R&D advanced 3 min read May 24, 2026
Public Preview Sign in free for the full digest →

Claude Opus 4.6 & 4.7 Reasoning Dataset

“The <think> blocks in this free 8.7k-example SFT dataset are synthetic text — not Claude's actual reasoning — and training a general LLM on them likely violates Anthropic's ToS.”

Claude Opus 4.6 & 4.7 Reasoning Dataset
Source · huggingface.co

“"No refusals or safety hedging — dataset teaches capability, not alignment." — angrygiraffe, dataset card (huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k, verified 2026-05-24)”

You know that feeling when you want to fine-tune an open-source model to reason more carefully but generating high-quality chain-of-thought training data costs hundreds or thousands of dollars in API calls and weeks of prompt engineering? Building a diverse SFT corpus covering coding, science, humanities, law, and creative writing — with coherent deliberation in every assistant turn — from scratch is a months-long project. This dataset drops 8,706 ready-to-use examples covering 28 categories, formatted for SFT trainers with <think> blocks pre-included, at zero generation cost to you.

machine-learningfine-tuningllmdatasetchain-of-thoughtreasoningsft

The dataset ships as four JSONL files you download from HuggingFace. Each JSON object has a category tag, a model tag (claude-opus-4-6 or claude-opus-4-7), and a messages array. Every assistant message opens with a <think>...</think> block — typically 150–500 words of deliberation — followed by the actual response. You load it with HuggingFace's datasets library and point your SFT trainer (TRL, Unsloth, or Axolotl) at it with train_on_responses_only enabled so gradients flow through the <think> block and response but not through user turns. The system prompts are domain-specific expert personas (5,814 unique prompts) rather than generic boilerplate, which pushes fine-tuned models toward domain-calibrated depth.

01
28-category breadth in one download: coding leads with 1,628 examples and 2.5M tokens, followed by humanities (862 examples) and science (737 examples) — you get the full range without stitching together multiple datasets.
02
5,814 unique domain-specific system prompts: each prompt sets a real expert persona (e.g., database performance consultant) instead of a generic assistant prompt, pushing fine-tuned models toward domain-calibrated depth.
03
Dual-model provenance with per-example tags: claude-opus-4-6 (4,675 examples / 6.3M tokens) and claude-opus-4-7 (4,031 examples / 10.7M tokens) are tagged separately, so you can train on one model's style or both.
04
Four pre-split JSONL files (full / instruct-only / roleplay-only / coding+math): pick the split matching your training goal without writing your own category filter.
05
Multi-turn conversations included: 3,454 of 8,706 examples (39.7%) are multi-turn, teaching the model to handle follow-ups and revisions rather than only one-shot answers.
06
Synthetic <think> blocks in every assistant turn: 150–500 words of deliberation per example — per the dataset card, not Claude's actual reasoning, but prompted text mimicking expected reasoning patterns.
07
No refusals or safety hedging by design: the full response range without alignment constraints — explicitly a capability-maximalist corpus, not a safety-neutral one.
Who it’s for

If you are fine-tuning an open-source LLM — Qwen, Llama, Mistral — and need a broad SFT corpus with synthetic chain-of-thought across diverse domains, this covers 28 categories without generation cost. You need familiarity with HuggingFace datasets, an SFT training framework (TRL, Unsloth, Axolotl), and GPU access. This is not for you if you need legally clear training data for commercial deployment, production-grade quality assurance, or real chain-of-thought distillation — for the last goal, lordx64's dataset uses actual extended-thinking API traces and is structurally stronger.

Worth exploring

Worth exploring for experimental fine-tuning where broad domain coverage matters more than verified quality — 4,445 monthly downloads and 6+ downstream models signal active community use. Treat it as prototype material: no human review, no benchmark results, no reproducibility info, and a real legal risk under Anthropic's ToS for commercial use. If genuine reasoning distillation is the goal, lordx64's dataset (real API traces, 8,124 samples) is the structurally sounder choice.

Developer playbook
Tech stack, code snippet, sentiment, alternatives.
PM playbook
Adoption angles, user fit, positioning.
CEO playbook
Traction signals, ROI, build vs buy.
Deep-dive insight
Full long-form analysis, no fluff.
Easy mode
Core idea, fast — when you need the gist.
Pro mode
Technical nuance, edge cases, tradeoffs.
Read the full digest
Go beyond the preview

Deep-dive insight, Easy and Pro modes, plus action playbooks — the full breakdown is one tap away.

Underrated tools. Unfiltered takes.

Read the full digest in the Snaplyze app for deep-dive insight, Easy and Pro modes, and the playbooks you can actually use.

Install Snaplyze →