Chain-of-Thought: The Prompt That Started It All

What problem does it solve?

"chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ~100B parameters." (Jason Wei et al., arXiv:2201.11903; source: huggingface.co/papers/2201.11903, verified: 2026-05-27)

You know that feeling when you give an AI a multi-step word problem and it confidently returns a wrong answer with no explanation of where it went wrong? Standard few-shot prompting, where you give the model a few input-output example pairs, flatlines on arithmetic, commonsense, and symbolic reasoning tasks no matter how big the model gets. Going from GPT-3 175B to PaLM 540B moved GSM8K accuracy from 15.6% to only 17.9% under standard prompting: roughly three times the parameters for a 2.3-point gain. The missing piece: a standard prompt gives the model no room to decompose a problem into intermediate steps before committing to a final answer.

How it works

Instead of showing the model (question, answer) pairs, you show it (question, reasoning chain, answer) triples. The reasoning chain is a natural-language walkthrough of intermediate steps, the kind of scratch work a student writes in the margin. Given a new question, the model generates its own reasoning chain before producing the final answer. The likely explanation is that large models already have the latent ability to work through a problem step by step, and the exemplars elicit it by establishing the output format. The paper uses 8 exemplars for most arithmetic benchmarks and 4 for the multiple-choice AQuA benchmark. All paper results use greedy decoding, with no sampling tricks required.
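To make the triple format concrete, here is a minimal sketch of a prompt builder. The `Exemplar` class and `build_cot_prompt` function are hypothetical names invented for this illustration, and the "Q: / A: / The answer is" layout is an assumption modeled on the style of the paper's Figure 1, not a required format. Only one exemplar is shown for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One (question, reasoning chain, answer) triple. Hypothetical helper type."""
    question: str
    reasoning: str  # natural-language intermediate steps
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], new_question: str) -> str:
    """Render the worked exemplars, then the new question with an open answer slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # Leave the final "A:" empty so the model writes its own reasoning chain.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


# A single exemplar in the style of the paper's Figure 1.
exemplars = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        reasoning=(
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        answer="11",
    ),
]

prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?",
)
print(prompt)
```

The resulting string is what you would send to whichever completion API you use, ideally with greedy decoding (temperature 0) to mirror the paper's setup.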

Key takeaways

01

Emergent accuracy jump at scale — PaLM 540B goes from 17.9% to 56.9% on GSM8K with CoT, a +39pp gain; applying the same technique to models below ~100B parameters makes performance worse, so check your model size before investing in exempl...

02

Zero fine-tuning required — 8 hand-crafted (question, reasoning, answer) triples in your prompt match or beat purpose-trained verifier models; no GPU, no training data pipeline, no model deployment changes

03

Cross-domain accuracy gains from one format change — arithmetic (GSM8K +39pp), commonsense reasoning (Sports Understanding 80.5% to 95.4%), and symbolic tasks (Last Letter Concatenation OOD: 0.0% to 63.0%) all improve with the same prompti...

04

Annotator-independent results: three different annotators wrote GSM8K exemplars for LaMDA 137B and reached 14.3%, 15.5%, and 17.6% accuracy respectively; all beat that model's 6.5% standard-prompting baseline, meaning your team does not need a specific expert author

05

Exemplar-order robustness — resampling and reordering exemplars changed scores by only 0.4–1.5 standard deviations on GSM8K; you do not need to hand-tune exemplar sequence

06

Interpretable output trace — the model writes its intermediate reasoning before the final answer, giving you a debug surface when answers are wrong; note that Anthropic's 2023 faithfulness study found larger models' stated traces may not r...
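Takeaway 06 treats the reasoning trace as a debug surface. One way to use it is to split a completion into individual steps plus a final answer, so a wrong answer can be traced to the step that went astray. The sketch below is illustrative only: `split_trace` is a hypothetical helper, and it assumes the completion ends with a "The answer is X." sentence, which holds only if your exemplars establish that convention.

```python
import re
from typing import Optional

# Assumption: the completion closes with "The answer is <value>."
ANSWER_RE = re.compile(r"The answer is\s+(.+?)\.?\s*$")
# Split on whitespace that follows sentence-ending punctuation.
# Decimals such as "3.5" are left intact because no space follows the dot.
STEP_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def split_trace(completion: str) -> tuple[list[str], Optional[str]]:
    """Return (reasoning steps, final answer); the answer is None if not found."""
    text = completion.strip()
    match = ANSWER_RE.search(text)
    if match is None:
        # No recognizable answer sentence: treat everything as reasoning.
        return STEP_SPLIT_RE.split(text), None
    reasoning = text[: match.start()].strip()
    steps = STEP_SPLIT_RE.split(reasoning) if reasoning else []
    return steps, match.group(1)


steps, answer = split_trace(
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)
for number, step in enumerate(steps, start=1):
    print(f"step {number}: {step}")
print("final answer:", answer)
```

Logging the steps next to the extracted answer shows what the model wrote, which is not necessarily what it computed, so treat the trace as a lead for debugging rather than proof.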

Should you care?

Who it’s for

If you build LLM applications that require multi-step reasoning (math tutors, code debuggers, legal clause analyzers, scientific Q&A systems), this paper is the conceptual foundation of your prompt strategy. The paper reports gains only at roughly 100B parameters and above, so expect meaningful improvements from large models (GPT-4, Claude, Gemini, LLaMA 3 70B+) rather than small ones; in its scaling experiments, CoT did not help smaller models and sometimes lowered accuracy. CoT is not useful for single-step classification, retrieval, or summarization. It shines on compositional tasks where the correctness of each intermediate step determines the final answer.
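The fit criteria above can be written down as a small routing check. This is a rough sketch under stated assumptions, not a rule from the paper: `should_use_cot`, the task labels, and the default threshold are all hypothetical, with the default taken from the paper's ~100B figure. Tune it against your own evaluations.

```python
# Task families where the paper reports CoT gains (labels are hypothetical).
COMPOSITIONAL_TASKS = {"arithmetic", "commonsense", "symbolic"}


def should_use_cot(
    task_type: str,
    model_params_b: float,
    min_params_b: float = 100.0,  # assumption: the paper's ~100B scale threshold
) -> bool:
    """Use CoT only for multi-step task families on sufficiently large models."""
    is_compositional = task_type in COMPOSITIONAL_TASKS
    is_large_enough = model_params_b >= min_params_b
    return is_compositional and is_large_enough


print(should_use_cot("arithmetic", model_params_b=540))     # large model, multi-step task
print(should_use_cot("arithmetic", model_params_b=8))       # small model
print(should_use_cot("summarization", model_params_b=540))  # single-step task
```

Keeping the threshold as a parameter makes it easy to lower once you have measured that a smaller model in your stack still benefits.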

Worth exploring

Yes, for conceptual grounding — the technique is already embedded in DSPy, LangChain, and Guidance, so you are unlikely to implement it from scratch. Study the paper to understand the ~100B parameter scale threshold, the error analysis showing 46% of LaMDA 137B failures are 'nearly correct' with a single flawed step, and the faithfulness gap documented by Anthropic (2023) before deciding how much to trust LLM reasoning traces in your production system.
