"the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself." (Cheng et al., abstract, https://arxiv.org/abs/2605.14038)
You know that feeling when an AI agent confidently gives a wrong answer to an arithmetic problem instead of reaching for a calculator, even though tools are available and the system prompt explicitly allows them? The usual assumption is that the model either recognized the need for a tool or did not. This paper argues that framing is wrong: recognition and action are two separate internal processes, and most of the failure lies in the connection between them. Worse, tool necessity is not a universal property. Whether a query needs a tool depends on which specific model is handling it, so labels based on human judgment introduce noise that does not reflect any individual model's actual capability boundary.
The paper trains lightweight linear probes on the hidden-state vectors inside an LLM at each layer. One probe decodes whether the model's internal representation encodes 'this query needs a tool': the cognition signal. A second probe decodes whether the model will actually emit a tool-call token: the action signal. The authors then measure the cosine similarity between the two probe directions across all layers. In early layers the directions partially align; in the late layers where the next token is actually predicted, the directions become nearly orthogonal, meaning the two probes read out largely unrelated features of the hidden state. The paper also introduces a model-specific necessity label: a query is labeled 'tool-required for model X' only if model X fails that query consistently across 10 independent runs at temperature 0.7, grounding labels in each model's actual failure rate.
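To make the probe comparison and the labeling rule concrete, here is a minimal sketch on synthetic data. It is not the authors' code. The function names (fit_linear_probe, probe_alignment, tool_required) are hypothetical, the probe is a plain logistic regression fitted with NumPy gradient descent, and the hidden states are random vectors standing in for real activations. The default reading of "fails consistently" as "fails all runs" is also an assumption, exposed as a parameter.

```python
import numpy as np

def fit_linear_probe(hidden, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe on one layer's hidden states.

    hidden: (n_queries, d_model) array; labels: (n_queries,) binary array.
    Returns the unit-norm weight vector, i.e. the probe direction.
    """
    X = hidden - hidden.mean(axis=0)  # center features
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w / (np.linalg.norm(w) + 1e-12)

def probe_alignment(hidden_by_layer, cognition_labels, action_labels):
    """Cosine similarity between cognition and action probe directions, per layer."""
    sims = []
    for hidden in hidden_by_layer:
        w_cog = fit_linear_probe(hidden, cognition_labels)
        w_act = fit_linear_probe(hidden, action_labels)
        sims.append(float(w_cog @ w_act))  # both vectors are unit norm
    return sims

def tool_required(solved_without_tool, min_failures=None):
    """Model-specific necessity label from repeated runs of one query.

    solved_without_tool: one bool per run (True = correct without a tool).
    min_failures: assumed threshold; defaults to "every run failed".
    """
    runs = list(solved_without_tool)
    failures = sum(1 for ok in runs if not ok)
    threshold = len(runs) if min_failures is None else min_failures
    return failures >= threshold

# Synthetic stand-in for two layers (assumption: not real model activations).
rng = np.random.default_rng(0)
n, d = 2000, 8
cog = rng.integers(0, 2, n)  # "needs a tool" labels
act = rng.integers(0, 2, n)  # "emits a tool call" labels
e0, e1 = np.eye(d)[0], np.eye(d)[1]
# "Early" layer: both signals are written along the same axis.
shared = rng.normal(size=(n, d)) + 3.0 * np.outer(cog + act, e0)
# "Late" layer: the two signals live on orthogonal axes.
split = rng.normal(size=(n, d)) + 3.0 * np.outer(cog, e0) + 3.0 * np.outer(act, e1)

sims = probe_alignment([shared, split], cog, act)
print("cosine per layer:", sims)
print("tool required:", tool_required([False] * 10))
```

On this toy data the first cosine is high and the second is close to zero, which mirrors the early-versus-late pattern the paper describes. With a real model, the rows of each array would be hidden states extracted per layer for each query.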
If you build or evaluate LLM-based agents and you have seen unreliable tool-calling behavior that does not improve with better prompts or system instructions, this paper directly addresses why. It is also relevant if you work on mechanistic interpretability, since the probe-based methodology extends to other agent failure modes. It is not useful yet if you work exclusively with closed-source models like GPT-4, Gemini, or Claude, because the methodology requires white-box access to hidden states.
Read this paper if you work on agent reliability — the knowing-doing gap reframes where to invest engineering effort. The code is public and the repo ships raw data for four models, so you can run the probe analysis on a new model without recreating the dataset from scratch. The main caveat: findings are on 4B–8B open-weight models across two narrow domains, so extrapolating to frontier-scale or diverse agentic tasks requires caution.