R&D advanced 3 min read May 18, 2026

LLMs know they need a tool, and ignore it up to 54% of the time

“Up to 54% of the time, an LLM's hidden states correctly signal 'I need a tool' — and the output ignores it completely.”

Source · arxiv.org

“the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself.” (Cheng et al., abstract, https://arxiv.org/abs/2605.14038)

You know the feeling: an AI agent confidently gives a wrong answer to an arithmetic problem instead of reaching for a calculator, even though tools are available and the system prompt explicitly allows them. The usual assumption is that the model either recognized the need for a tool or did not. This paper argues that framing is wrong: recognition and action are two separate internal processes, and the failure lies in connecting them. Tool necessity is also not a universal property. Whether a query needs a tool depends on which specific model is handling it, so labels based on human judgment introduce noise that does not reflect any individual model's actual capability boundary.

llm · agents · tool-use · mechanistic-interpretability · research-paper · python · open-source

The paper trains lightweight linear probes on the hidden-state vectors at each layer of an LLM. One probe decodes whether the model's internal representation encodes 'this query needs a tool' (the cognition signal). A second probe decodes whether the model will actually emit a tool-call token (the action signal). The authors then measure the cosine similarity between the two probe directions at every layer. In early layers the directions partially align; in the late layers, where the next token is actually predicted, they become nearly orthogonal, so the direction that encodes tool need says almost nothing about the direction that drives the tool call. The paper also introduces a model-specific necessity label: a query is labeled 'tool-required for model X' only if model X fails that query consistently across 10 independent runs at temperature 0.7, which grounds the label in each model's actual failure rate.
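To make the probe comparison concrete, here is a minimal sketch in NumPy. It is not the authors' code. The hidden states are synthetic, the two "layers" are hand-built so that the cognition and action signals share one direction early and sit on separate axes late, and the action labels are drawn independently of the need labels purely to keep the toy simple. The names `train_linear_probe` and `cosine`, and all sizes and scales, are assumptions for illustration.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe by gradient descent and return its weight vector."""
    X = X - X.mean(axis=0)                      # center the features
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of label 1
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of the mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)                # gradient w.r.t. the bias
    return w

def cosine(u, v):
    """Cosine similarity between two probe directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for real hidden states (assumption: 2000 queries, 8 dimensions).
rng = np.random.default_rng(0)
n, d = 2000, 8
need = rng.integers(0, 2, n).astype(float)      # cognition label: "this query needs a tool"
act = rng.integers(0, 2, n).astype(float)       # action label: "a tool call is emitted"

# Toy "early" layer: both signals are mixed along the same axis.
early = rng.normal(size=(n, d))
early[:, 0] += 2.0 * (need + act)

# Toy "late" layer: each signal lives on its own axis.
late = rng.normal(size=(n, d))
late[:, 0] += 3.0 * need
late[:, 1] += 3.0 * act

sims = {}
for name, hidden in {"early": early, "late": late}.items():
    w_need = train_linear_probe(hidden, need)   # cognition probe
    w_act = train_linear_probe(hidden, act)     # action probe
    sims[name] = cosine(w_need, w_act)

print(sims)
```

On real models, `early` and `late` would be replaced by hidden states extracted per layer from an open-weight model, and the loop would run over every layer to produce the alignment curve the paper describes.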

01
Model-adaptive necessity labels — instead of a universal standard, you ground the label in whether THIS specific model fails the query across 10 runs; this removes noise from human or LLM-judge annotations that treat necessity as model-agn...
02
Layer-by-layer probe analysis — you get a full map of where cognition and action representations align versus diverge inside the model, not just a behavioral output count, so you can identify exactly which layer regime the translation fail...
03
Knowing-doing gap quantification with exact mismatch rates — 26.5–54.0% for arithmetic and 30.8–41.8% for TruthfulQA across four open-weight models give you a concrete baseline for how bad the translation problem actually is before any int...
04
Four-cell outcome breakdown shipped as raw data — the repo includes JSON files categorizing every query into (necessary, called), (necessary, not-called), (not-necessary, called), (not-necessary, not-called) per model, giving you a full co...
05
Cosine similarity as a mechanistic diagnostic — measuring the angle between cognition and action probe directions gives you a representation-level explanation for tool-use failures rather than treating behavioral mismatch as a black box
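The model-adaptive label and the four-cell breakdown from the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repo's actual schema: the record fields `run_correct` and `called_tool` are hypothetical, and "fails consistently" is read here as failing all 10 runs because the digest does not give the exact threshold. The final ratio is a simple illustrative share, not necessarily the mismatch metric the paper reports.

```python
from collections import Counter

def is_tool_required(run_correct, min_failures=None):
    """Label a query tool-required for one model from its no-tool runs.

    run_correct: list of bools, one per independent run without tools.
    min_failures: assumption; defaults to "fails every run".
    """
    if min_failures is None:
        min_failures = len(run_correct)
    failures = sum(1 for ok in run_correct if not ok)
    return failures >= min_failures

def four_cell_counts(records):
    """Count queries per (necessity, action) cell for one model."""
    counts = Counter()
    for r in records:
        necessity = "necessary" if is_tool_required(r["run_correct"]) else "not-necessary"
        action = "called" if r["called_tool"] else "not-called"
        counts[(necessity, action)] += 1
    return counts

# Hypothetical per-model records; real data would come from the repo's JSON files.
records = [
    {"id": "q1", "run_correct": [False] * 10, "called_tool": True},
    {"id": "q2", "run_correct": [False] * 10, "called_tool": False},
    {"id": "q3", "run_correct": [True] * 10, "called_tool": False},
    {"id": "q4", "run_correct": [True] * 9 + [False], "called_tool": True},
    {"id": "q5", "run_correct": [False] * 10, "called_tool": False},
]

counts = four_cell_counts(records)
necessary_total = counts[("necessary", "called")] + counts[("necessary", "not-called")]
# Illustrative share of necessary queries where the model did not call a tool.
missed_share = counts[("necessary", "not-called")] / necessary_total

for cell, count in sorted(counts.items()):
    print(cell, count)
print(f"necessary but not called: {missed_share:.2f}")
```

Passing a smaller `min_failures` (for example 8 of 10) shows how sensitive the labels are to the definition of "consistently", which is worth checking before comparing numbers across models.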
Who it’s for

If you build or evaluate LLM-based agents and you have seen unreliable tool-calling behavior that does not improve with better prompts or system instructions, this paper directly addresses why. It is also directly relevant if you work on mechanistic interpretability — the probe-based methodology extends to other agent failure modes. Not useful yet if you work exclusively with closed-source models like GPT-4, Gemini, or Claude, since the methodology requires white-box access to hidden states.

Worth exploring

Read this paper if you work on agent reliability — the knowing-doing gap reframes where to invest engineering effort. The code is public and the repo ships raw data for four models, so you can run the probe analysis on a new model without recreating the dataset from scratch. The main caveat: findings are on 4B–8B open-weight models across two narrow domains, so extrapolating to frontier-scale or diverse agentic tasks requires caution.
