Claude Opus 4.6 & 4.7 Reasoning Dataset

What problem does it solve

“"No refusals or safety hedging — dataset teaches capability, not alignment." — angrygiraffe, dataset card (huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k, verified 2026-05-24)”

You know that feeling when you want to fine-tune an open-source model to reason more carefully but generating high-quality chain-of-thought training data costs hundreds or thousands of dollars in API calls and weeks of prompt engineering? Building a diverse SFT corpus covering coding, science, humanities, law, and creative writing — with coherent deliberation in every assistant turn — from scratch is a months-long project. This dataset drops 8,706 ready-to-use examples covering 28 categories, formatted for SFT trainers with <think> blocks pre-included, at zero generation cost to you.

machine-learningfine-tuningllmdatasetchain-of-thoughtreasoningsft

How it works

The dataset ships as four JSONL files you download from HuggingFace. Each JSON object has a category tag, a model tag (claude-opus-4-6 or claude-opus-4-7), and a messages array. Every assistant message opens with a <think>...</think> block — typically 150–500 words of deliberation — followed by the actual response. You load it with HuggingFace's datasets library and point your SFT trainer (TRL, Unsloth, or Axolotl) at it with train_on_responses_only enabled so gradients flow through the <think> block and response but not through user turns. The system prompts are domain-specific expert personas (5,814 unique prompts) rather than generic boilerplate, which pushes fine-tuned models toward domain-calibrated depth.

Key takeaways

✦

01

28-category breadth in one download: coding leads with 1,628 examples and 2.5M tokens, followed by humanities (862 examples) and science (737 examples) — you get the full range without stitching together multiple datasets.

⟁

02

5,814 unique domain-specific system prompts: each prompt sets a real expert persona (e.g., database performance consultant) instead of a generic assistant prompt, pushing fine-tuned models toward domain-calibrated depth.

⊕

03

Dual-model provenance with per-example tags: claude-opus-4-6 (4,675 examples / 6.3M tokens) and claude-opus-4-7 (4,031 examples / 10.7M tokens) are tagged separately, so you can train on one model's style or both.

◈

04

Four pre-split JSONL files (full / instruct-only / roleplay-only / coding+math): pick the split matching your training goal without writing your own category filter.

∞

05

Multi-turn conversations included: 3,454 of 8,706 examples (39.7%) are multi-turn, teaching the model to handle follow-ups and revisions rather than only one-shot answers.

◎

06

Synthetic <think> blocks in every assistant turn: 150–500 words of deliberation per example — per the dataset card, not Claude's actual reasoning, but prompted text mimicking expected reasoning patterns.

✺

07

No refusals or safety hedging by design: the full response range without alignment constraints — explicitly a capability-maximalist corpus, not a safety-neutral one.

Should you care?

Who it’s for

If you are fine-tuning an open-source LLM — Qwen, Llama, Mistral — and need a broad SFT corpus with synthetic chain-of-thought across diverse domains, this covers 28 categories without generation cost. You need familiarity with HuggingFace datasets, an SFT training framework (TRL, Unsloth, Axolotl), and GPU access. This is not for you if you need legally clear training data for commercial deployment, production-grade quality assurance, or real chain-of-thought distillation — for the last goal, lordx64's dataset uses actual extended-thinking API traces and is structurally stronger.

Worth exploring

Worth exploring for experimental fine-tuning where broad domain coverage matters more than verified quality — 4,445 monthly downloads and 6+ downstream models signal active community use. Treat it as prototype material: no human review, no benchmark results, no reproducibility info, and a real legal risk under Anthropic's ToS for commercial use. If genuine reasoning distillation is the goal, lordx64's dataset (real API traces, 8,124 samples) is the structurally sounder choice.

6 more sections · unlock free

Developer playbook

Tech stack, code snippet, sentiment, alternatives.

PM playbook

Adoption angles, user fit, positioning.

CEO playbook

Traction signals, ROI, build vs buy.

Deep-dive insight

Full long-form analysis, no fluff.

Easy mode

Core idea, fast — when you need the gist.

Pro mode

Technical nuance, edge cases, tradeoffs.

Sign in free — unlock all 6

Claude Opus 4.6 & 4.7 Reasoning Dataset

Underrated tools. Unfiltered takes.