"chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ~100B parameters." (Jason Wei et al., arXiv:2201.11903; source: huggingface.co/papers/2201.11903, verified: 2026-05-27)
You know that feeling when you give an AI a multi-step word problem and it confidently returns a wrong answer, with no explanation of where it went wrong? Standard few-shot prompting, which gives the model a few input-output example pairs, flatlines on arithmetic, commonsense, and symbolic reasoning tasks no matter how big the model gets. Moving from GPT-3 175B to PaLM 540B raised GSM8K accuracy under standard prompting from 15.6% to only 17.9%: roughly three times the parameters for a 2.3-point gain. The missing piece was that the prompt gave models no room to decompose a problem into intermediate steps before committing to a final answer.
Chain-of-thought prompting changes the exemplars: instead of showing the model (question, answer) pairs, you show it (question, reasoning chain, answer) triples. The reasoning chain is a natural-language walkthrough of the intermediate steps, the kind of scratch work a student writes in the margin. When you give the model a new question, it generates its own reasoning chain before producing the final answer. The technique appears to work because large models already have the latent capability to carry out step-by-step calculations; the exemplars activate that capability by establishing the output format. The paper uses 8 exemplars for most arithmetic benchmarks and 4 for the multiple-choice AQuA benchmark. All paper results use greedy decoding, so no sampling tricks are required.
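The triple format described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the `Exemplar` class and the `build_cot_prompt` and `extract_answer` helpers are hypothetical names, and the sketch assumes every exemplar ends with the phrase "The answer is X." so that the final answer can be parsed back out of the model's completion. The sample question and reasoning text follow the tennis-ball exemplar from the paper.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    """One (question, reasoning chain, answer) triple. Hypothetical helper type."""
    question: str
    reasoning: str  # natural-language intermediate steps
    answer: str


def build_cot_prompt(exemplars: List[Exemplar], new_question: str) -> str:
    """Format the triples as Q/A blocks, then append the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # Leave the last "A:" open so the model writes its own reasoning chain.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after 'The answer is', or None if the phrase is absent.

    Assumes the model imitates the exemplar format; real completions may not.
    """
    match = re.search(r"The answer is (.+?)\.(?:\s|$)", completion)
    return match.group(1) if match else None


if __name__ == "__main__":
    exemplars = [
        Exemplar(
            question=(
                "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                "Each can has 3 tennis balls. How many tennis balls does he have now?"
            ),
            reasoning=(
                "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
                "6 tennis balls. 5 + 6 = 11."
            ),
            answer="11",
        )
    ]
    prompt = build_cot_prompt(
        exemplars,
        "The cafeteria had 23 apples. If they used 20 to make lunch and "
        "bought 6 more, how many apples do they have?",
    )
    print(prompt)  # send this string to a model with greedy decoding
```

The prompt deliberately ends at an open `A:` so the completion starts with reasoning rather than a bare answer. Parsing on a fixed closing phrase is a design choice of this sketch; a production system would need a fallback for completions that drift from the format.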
If you build LLM applications that require multi-step reasoning (math tutors, code debuggers, legal clause analyzers, scientific Q&A systems), this paper is the conceptual foundation of your prompt strategy. You need a large model to see meaningful gains: the paper places the threshold at roughly 100B parameters, so in practice this means models in the 70B+ class such as GPT-4, Claude, Gemini, or LLaMA 3 70B+. Applying CoT to smaller models reduces accuracy, per the paper's scaling results, because small models produce fluent but illogical reasoning chains. The technique is not useful if your task is single-step classification, retrieval, or summarization; CoT shines on compositional tasks where the correctness of intermediate steps determines the final answer.
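As a rough illustration, the guidance above can be written as a small gate that decides whether to attach reasoning exemplars to a request. Everything here is an assumption made for the sketch: the `should_use_cot` function, the task labels, and the default cutoff (set to the paper's ~100B figure; pass 70 to follow the looser guidance in the text). A parameter count is only a proxy, so treat the result as a starting point to validate on your own task.

```python
# Hypothetical task labels for the compositional task families named in the text.
COMPOSITIONAL_TASKS = {"arithmetic", "commonsense", "symbolic"}


def should_use_cot(
    model_params_billions: float,
    task_type: str,
    min_params_billions: float = 100.0,  # assumption: the paper's ~100B threshold
) -> bool:
    """Return True when chain-of-thought exemplars are likely to help."""
    # Single-step tasks (classification, retrieval, summarization) gain little.
    if task_type not in COMPOSITIONAL_TASKS:
        return False
    # Below the threshold, the paper reports CoT can hurt rather than help.
    return model_params_billions >= min_params_billions


if __name__ == "__main__":
    print(should_use_cot(540, "arithmetic"))     # large model, multi-step task
    print(should_use_cot(8, "arithmetic"))       # small model, multi-step task
    print(should_use_cot(540, "summarization"))  # large model, single-step task
```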
Yes, the paper is worth reading for conceptual grounding. The technique is already embedded in DSPy, LangChain, and Guidance, so you are unlikely to implement it from scratch. Study the paper to understand three things before deciding how much to trust LLM reasoning traces in your production system: the ~100B parameter scale threshold, the error analysis showing that 46% of the LaMDA 137B failures examined were 'nearly correct' apart from a single flawed step, and the faithfulness gap later documented by Anthropic (2023).