LLMs know they need a tool and ignore it up to 54% of the time

What problem does it solve

"the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself." (Cheng et al., abstract, https://arxiv.org/abs/2605.14038)

You know that feeling when an AI agent confidently gives a wrong answer to an arithmetic problem instead of reaching for a calculator — even though tools are available and the system prompt explicitly allows them? The current assumption is that the model either recognized the need for a tool or did not. This paper shows that framing is wrong: recognition and action are two separate internal processes that fail to connect. Worse, tool necessity is not a universal property — whether a query needs a tool depends entirely on which specific model is handling it, so labeling based on human judgment introduces noise that does not reflect any individual model's actual capability boundary.


How it works

The paper trains lightweight linear probes on the hidden-state vectors at each layer of an LLM. One probe decodes whether the model's internal representation encodes 'this query needs a tool' (the cognition signal). A second probe decodes whether the model will actually emit a tool-call token (the action signal). The authors then measure the cosine similarity between the two probe directions at every layer. In early layers the directions partially align; in the late layers, where the next token is actually predicted, they become nearly orthogonal, so the direction that encodes recognition says almost nothing about the direction that drives the tool call.

The paper also introduces a model-specific necessity label: a query is labeled 'tool-required for model X' only if model X fails that query consistently across 10 independent runs at temperature 0.7, which grounds the label in each model's actual failure rate.
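The probe comparison described above can be sketched on toy data. This is a minimal illustration, not the authors' code: the hidden states are synthetic Gaussian vectors, the probe is a plain logistic regression fitted by gradient descent, and every name (`train_linear_probe`, `probe_alignment`, `synthetic_layer`) is hypothetical. With a real model you would replace the synthetic arrays with per-layer hidden states and the paper's cognition and action labels.

```python
import numpy as np


def train_linear_probe(X, y, steps=300, lr=0.5):
    """Fit a logistic-regression probe by gradient descent.

    Returns the weight vector, i.e. the probe direction in hidden-state space.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30.0, 30.0)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-logits))
        err = p - y
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w


def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def probe_alignment(hidden, needs_tool, calls_tool):
    """Train a cognition probe and an action probe, then compare directions."""
    w_cognition = train_linear_probe(hidden, needs_tool)
    w_action = train_linear_probe(hidden, calls_tool)
    return cosine(w_cognition, w_action)


def synthetic_layer(rng, cog_dir, act_dir, n=2000, noise=0.5):
    """Toy stand-in for one layer: labels depend on chosen directions (assumption)."""
    X = rng.normal(size=(n, cog_dir.shape[0]))
    needs = (X @ cog_dir + noise * rng.normal(size=n) > 0).astype(float)
    calls = (X @ act_dir + noise * rng.normal(size=n) > 0).astype(float)
    return X, needs, calls


rng = np.random.default_rng(0)
dim = 16
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]

# Toy "aligned" layer: cognition and action share one direction.
cos_aligned = probe_alignment(*synthetic_layer(rng, e0, e0))
# Toy "diverged" layer: the two signals live on orthogonal directions.
cos_diverged = probe_alignment(*synthetic_layer(rng, e0, e1))

print(f"aligned layer cosine:  {cos_aligned:.2f}")
print(f"diverged layer cosine: {cos_diverged:.2f}")
```

Running the same comparison layer by layer on real hidden states would give the kind of alignment curve the paper reports; the toy layers here only show what the two extremes look like.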

Key takeaways

01. Model-adaptive necessity labels: instead of a universal standard, you ground the label in whether THIS specific model fails the query across 10 runs; this removes noise from human or LLM-judge annotations that treat necessity as model-agn...

02. Layer-by-layer probe analysis: you get a full map of where cognition and action representations align versus diverge inside the model, not just a behavioral output count, so you can identify exactly which layer regime the translation fail...

03. Knowing-doing gap quantification with exact mismatch rates: 26.5–54.0% for arithmetic and 30.8–41.8% for TruthfulQA across four open-weight models give you a concrete baseline for how bad the translation problem actually is before any int...

04. Four-cell outcome breakdown shipped as raw data: the repo includes JSON files categorizing every query into (necessary, called), (necessary, not-called), (not-necessary, called), (not-necessary, not-called) per model, giving you a full co...

05. Cosine similarity as a mechanistic diagnostic: measuring the angle between cognition and action probe directions gives you a representation-level explanation for tool-use failures rather than treating behavioral mismatch as a black box
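The model-specific label and the four-cell breakdown from the takeaways can be sketched together. This is an illustration under stated assumptions, not the repo's format: the record fields (`no_tool_runs_correct`, `called_tool`) are hypothetical, "fails consistently" is read as failing every run (adjustable through the hypothetical `min_fail_rate` parameter), and the mismatch rate is taken to be the share of queries in the two off-diagonal cells.

```python
from collections import Counter


def is_tool_required(run_correct, min_fail_rate=1.0):
    """Label a query as tool-required for one model.

    run_correct: booleans, one per independent run (assumed to be the
    10 runs at temperature 0.7). min_fail_rate=1.0 encodes the assumption
    that "consistently" means every run failed.
    """
    fails = sum(1 for ok in run_correct if not ok)
    return fails / len(run_correct) >= min_fail_rate


def four_cell(records):
    """Count queries per (necessity, action) cell for one model."""
    cells = Counter()
    for rec in records:
        necessary = is_tool_required(rec["no_tool_runs_correct"])
        key = (
            "necessary" if necessary else "not-necessary",
            "called" if rec["called_tool"] else "not-called",
        )
        cells[key] += 1
    return cells


def mismatch_rate(cells):
    """Share of queries where necessity and action disagree (assumed definition)."""
    total = sum(cells.values())
    mismatched = cells[("necessary", "not-called")] + cells[("not-necessary", "called")]
    return mismatched / total if total else 0.0


# Toy records for a single model; values are invented for illustration.
records = [
    {"no_tool_runs_correct": [False] * 10, "called_tool": True},
    {"no_tool_runs_correct": [False] * 10, "called_tool": False},
    {"no_tool_runs_correct": [True] * 10, "called_tool": False},
    {"no_tool_runs_correct": [False] * 9 + [True], "called_tool": True},
]

cells = four_cell(records)
rate = mismatch_rate(cells)
for cell, count in sorted(cells.items()):
    print(cell, count)
print(f"mismatch rate: {rate:.2f}")
```

Because the label is computed from the model's own runs, the same query can land in different cells for different models, which is the point of the model-adaptive labeling.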

Should you care?

Who it’s for

If you build or evaluate LLM-based agents and have seen unreliable tool-calling behavior that does not improve with better prompts or system instructions, this paper addresses why. It is also relevant if you work on mechanistic interpretability, since the probe-based methodology extends to other agent failure modes. It is not yet useful if you work exclusively with closed-source models like GPT-4, Gemini, or Claude, because the methodology requires white-box access to hidden states.

Worth exploring

Read this paper if you work on agent reliability — the knowing-doing gap reframes where to invest engineering effort. The code is public and the repo ships raw data for four models, so you can run the probe analysis on a new model without recreating the dataset from scratch. The main caveat: findings are on 4B–8B open-weight models across two narrow domains, so extrapolating to frontier-scale or diverse agentic tasks requires caution.
