R&D intermediate 3 min read May 27, 2026

Chain-of-Thought: The Prompt That Started It All

“PaLM 540B: 17.9% on GSM8K with standard prompting. Add 8 worked examples with reasoning steps: 56.9%. Same model, same weights, zero retraining.”

Source · arxiv.org

“chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ~100B parameters.” (Jason Wei et al., arXiv:2201.11903; source: huggingface.co/papers/2201.11903, verified: 2026-05-27)

You know that feeling when you give an AI a multi-step word problem and it confidently returns a wrong answer with no explanation of where it went wrong? Standard few-shot prompting, where you give the model a few input-output example pairs, flatlines on arithmetic, commonsense, and symbolic reasoning tasks no matter how big the model gets. Scaling from GPT-3 175B to PaLM 540B improved GSM8K accuracy from 15.6% to only 17.9% under standard prompting: roughly three times the parameters for a gain of 2.3 percentage points. The missing piece: standard prompts gave models no mechanism to decompose problems into intermediate steps before committing to a final answer.

prompt-engineering · llm · chain-of-thought · reasoning · nlp · research-paper · few-shot-learning

Instead of showing the model (question, answer) pairs, you show it (question, reasoning chain, answer) triples. The reasoning chain is a natural-language walkthrough of intermediate steps — the kind of scratch work a student writes in the margin. When you give the model a new question, it generates its own reasoning chain before producing the final answer. The technique works because large models already have the latent capability to execute step-by-step calculations; the exemplars activate that capability by establishing the output format. For arithmetic tasks, 8 exemplars are enough; for multiple-choice tasks, 4 suffice. All paper results use greedy decoding — no sampling tricks required.
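The switch from pairs to triples can be sketched as a small prompt builder. This is a minimal illustration, not the paper's code: the `Exemplar` class, the two function names, the `Q:`/`A:` layout, and the pen-counting exemplar are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Exemplar:
    """One hand-written demonstration (hypothetical structure)."""
    question: str
    reasoning: str  # natural-language intermediate steps
    answer: str


def standard_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Few-shot prompt built from (question, answer) pairs only."""
    shots = [f"Q: {e.question}\nA: The answer is {e.answer}." for e in exemplars]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


def cot_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Few-shot prompt built from (question, reasoning chain, answer) triples."""
    shots = [
        f"Q: {e.question}\nA: {e.reasoning} The answer is {e.answer}."
        for e in exemplars
    ]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


# Illustrative exemplar written for this sketch, not taken from the paper.
DEMO = Exemplar(
    question="A shelf holds 4 boxes with 6 pens each. 5 pens are removed. "
             "How many pens remain?",
    reasoning="4 boxes of 6 pens is 24 pens. 24 - 5 = 19.",
    answer="19",
)

if __name__ == "__main__":
    print(cot_prompt([DEMO], "A crate has 3 rows of 8 eggs. 4 break. How many are left?"))
```

The only difference between the two builders is the reasoning text placed before the final answer in each shot. To approximate the greedy decoding described above, send the prompt with deterministic settings (for example a temperature of 0); the exact parameter name depends on your provider.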

01
Emergent accuracy jump at scale: PaLM 540B goes from 17.9% to 56.9% on GSM8K with CoT, a +39pp gain; applying the same technique to models below ~100B parameters makes performance worse, so check your model size before investing in exempl...
02
Zero fine-tuning required: 8 hand-crafted (question, reasoning, answer) triples in your prompt match or beat purpose-trained verifier models; no GPU, no training data pipeline, no model deployment changes
03
Cross-domain accuracy gains from one format change: arithmetic (GSM8K +39pp), commonsense reasoning (Sports Understanding 80.5% to 95.4%), and symbolic tasks (Last Letter Concatenation OOD: 0.0% to 63.0%) all improve with the same prompti...
04
Annotator-independent results: three different annotators wrote GSM8K exemplars and, on LaMDA 137B, produced 14.3%, 15.5%, and 17.6% accuracy respectively; all beat that model's standard-prompting baseline of 6.5%, meaning your team does not need a specific expert author
05
Exemplar-order robustness: resampling and reordering exemplars changed scores by only 0.4–1.5 standard deviations on GSM8K; you do not need to hand-tune exemplar sequence
06
Interpretable output trace: the model writes its intermediate reasoning before the final answer, giving you a debug surface when answers are wrong; note that Anthropic's 2023 faithfulness study found larger models' stated traces may not r...
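Two of the points above, the readable trace and the exemplar-order check, lend themselves to a short scoring sketch. It assumes completions end with a sentence of the form "The answer is N." and that you supply your own `eval_fn`, a hypothetical callable that runs your model with a given exemplar order and returns accuracy in percent. None of these names come from the paper.

```python
import random
import re
from statistics import mean, pstdev
from typing import Callable, Optional, Sequence, Tuple

# Assumed output convention: "... The answer is 19."
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the first stated numeric answer, or None if there is none."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def accuracy(completions: Sequence[str], gold: Sequence[str]) -> float:
    """Percent of completions whose extracted answer equals the gold string."""
    # Plain string comparison: "19" and "19.0" count as different answers.
    correct = sum(extract_answer(c) == g for c, g in zip(completions, gold))
    return 100.0 * correct / len(gold)


def order_spread(
    exemplars: Sequence[object],
    eval_fn: Callable[[Sequence[object]], float],
    n_orders: int = 5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and population std. dev. of eval_fn over shuffled exemplar orders."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_orders):
        shuffled = list(exemplars)
        rng.shuffle(shuffled)
        scores.append(eval_fn(shuffled))
    return mean(scores), pstdev(scores)
```

When `extract_answer` returns `None` or a wrong value, the text before the answer sentence is the trace to inspect. Treat it as a debugging aid, not as proof of how the model arrived at its output.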
Who it’s for

If you build LLM applications requiring multi-step reasoning (math tutors, code debuggers, legal clause analyzers, scientific Q&A systems), this paper is the conceptual foundation of your prompt strategy. You need to be working with large models (GPT-4, Claude, Gemini, LLaMA 3 70B+) to see meaningful gains: the paper places the threshold at ~100B parameters and reports that CoT reduces accuracy on smaller models. This is not useful if your task is single-step classification, retrieval, or summarization; CoT shines on compositional tasks where intermediate step correctness determines the final answer.

Worth exploring

Yes, for conceptual grounding — the technique is already embedded in DSPy, LangChain, and Guidance, so you are unlikely to implement it from scratch. Study the paper to understand the ~100B parameter scale threshold, the error analysis showing 46% of LaMDA 137B failures are 'nearly correct' with a single flawed step, and the faithfulness gap documented by Anthropic (2023) before deciding how much to trust LLM reasoning traces in your production system.
