LLM Cost Optimization: The Case for Model Routing Over Caching

What problem does it solve?

"Letting the gateway auto-route instead of users picking a model by hand dropped average cost per request by about a third" (Scott Breitenother and Sid Sijbrandij, Kilo co-founders, via ByteByteGo, June 8, 2026)

Your AI infrastructure bill can triple in a month with no new features shipping. Agent loops resend the full conversation history on every iteration (instructions, prior results, and intermediate reasoning), so a session that starts at a few thousand tokens can exceed 100,000 by the 12th loop. Teams often default every call to a frontier model like Claude Opus 4.7 ($5/M input, $25/M output), which means paying frontier prices to write commit messages, format output, and run background cleanup tasks. Prompt caching helps but does not fix this: caching reuses repeated prefixes, so request volume and non-cacheable context remain the dominant cost.
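To make the growth concrete, here is a rough back-of-the-envelope sketch. The starting size (4,000 tokens) and per-loop growth (9,000 tokens) are assumptions chosen only so that the 12th loop crosses 100,000 tokens as described above; the $5/M input price is the one quoted in the text. Output tokens are ignored for simplicity.

```python
# Back-of-the-envelope model of input-token growth in an agent loop.
# START_TOKENS and GROWTH_PER_LOOP are illustrative assumptions, not measurements.
START_TOKENS = 4_000       # assumed context size on the first loop
GROWTH_PER_LOOP = 9_000    # assumed tokens appended to the history per loop
INPUT_PRICE_PER_M = 5.00   # USD per million input tokens (frontier price quoted above)


def context_size(loop: int) -> int:
    """Tokens sent on a given loop (1-indexed), since the full history is resent."""
    return START_TOKENS + (loop - 1) * GROWTH_PER_LOOP


def cumulative_input_tokens(loops: int) -> int:
    """Total input tokens billed across all loops of one session."""
    return sum(context_size(i) for i in range(1, loops + 1))


def input_cost_usd(tokens: int, price_per_m: float = INPUT_PRICE_PER_M) -> float:
    """Input cost in USD for a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_m


if __name__ == "__main__":
    total = cumulative_input_tokens(12)
    print(f"loop 12 context: {context_size(12):,} tokens")
    print(f"session total:   {total:,} input tokens")
    print(f"input cost:      ${input_cost_usd(total):.2f}")
```

Under these assumed numbers, the 12th loop alone sends 103,000 input tokens and the whole session sends 642,000, which is about $3.21 of input per session at the frontier price. The point is the shape of the curve: because each loop resends everything, total input grows roughly quadratically with loop count.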


How it works

A two-component gateway sits between your agent and every model provider. The entry point normalizes requests into a unified format and handles translation between your code and each provider's API, so your application does not change when you swap underlying models. The decision layer reads the agent's current execution mode (planning, editing, debugging, or background) and looks up which tier that mode maps to in an externally served config table. Planning and debugging tasks route to frontier models, routine editing goes to an economical model, and background tasks like commit messages go to tiny free models. Because the config table updates without a code deploy, you can swap underlying providers without restarting your application.
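The two components described above can be sketched in a few lines. This is a minimal illustration, not Kilo's implementation: the names `UnifiedRequest`, `normalize`, and `route`, the tier labels, and the placeholder model names are all hypothetical, and the tables are hard-coded here where the described design would fetch them from an external config service. Falling back to a default tier for an unknown mode is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical config tables. In the described design these are served
# externally so they can change without a code deploy.
MODE_TO_TIER = {
    "planning": "top",
    "debugging": "top",
    "editing": "balanced",
    "background": "internal",
}
TIER_TO_MODEL = {
    "top": "frontier-model",         # placeholder model names
    "balanced": "economical-model",
    "internal": "tiny-model",
}
DEFAULT_TIER = "balanced"  # assumed fallback when the mode is missing or unknown


@dataclass(frozen=True)
class UnifiedRequest:
    """Provider-neutral request shape produced by the entry point."""
    mode: str
    messages: tuple


def normalize(raw: dict) -> UnifiedRequest:
    """Entry point: coerce a caller's request into the unified format."""
    mode = str(raw.get("mode", "")).strip().lower()
    return UnifiedRequest(mode=mode, messages=tuple(raw.get("messages", ())))


def route(request: UnifiedRequest,
          mode_to_tier: dict = MODE_TO_TIER,
          tier_to_model: dict = TIER_TO_MODEL) -> str:
    """Decision layer: execution mode -> tier -> concrete model name."""
    tier = mode_to_tier.get(request.mode, DEFAULT_TIER)
    return tier_to_model[tier]


if __name__ == "__main__":
    for mode in ("planning", "editing", "background", "unknown"):
        print(mode, "->", route(normalize({"mode": mode, "messages": []})))
```

Because `route` takes the tables as arguments, swapping a provider amounts to passing (or reloading) a different `tier_to_model` mapping; no call site changes.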

Key takeaways

01. Signal-based routing on execution mode: you route using the agent's own mode metadata (planning/editing/debugging/background) instead of training a difficulty classifier, sidestepping the bootstrapping paradox where evaluating complexity ...

02. Four-tier model hierarchy: top tier for planning and debugging with frontier models, balanced tier with one economical model for all work, free tier for zero-cost models, and an internal tier running tiny models for background tasks like ...

03. Dynamic mode-to-model config served externally: tier-to-model mappings update without code deploys, so you swap underlying providers without restarting your application

04. Bring Your Own Key (BYOK): you pass your existing API keys; the routing layer charges only for routing overhead, not token markup

05. Routing and caching as orthogonal cost levers: routing targets request difficulty (the non-cacheable portion of cost), addressing what prompt caching misses when agent context exceeds 100k tokens

06. OpenAI-compatible API drop-in: works as a replacement for standard AI SDK clients without rewriting existing call sites
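The "orthogonal levers" point in takeaway 05 can be illustrated with simple arithmetic. In this sketch, caching discounts only the cached share of the input, while routing changes the per-token price for the whole request. The $5/M frontier price comes from the text; the economical price ($1/M), the cached-read multiplier (0.10), and the 50% cacheable share are hypothetical values picked for illustration, not any provider's published rates.

```python
FRONTIER_PRICE_PER_M = 5.00    # USD per million input tokens, quoted in the text
ECONOMY_PRICE_PER_M = 1.00     # hypothetical economical-tier price
CACHED_READ_MULTIPLIER = 0.10  # hypothetical: cached tokens billed at 10% of list price


def input_cost_usd(input_tokens: int,
                   price_per_m: float,
                   cached_fraction: float = 0.0,
                   cached_multiplier: float = CACHED_READ_MULTIPLIER) -> float:
    """Input cost when a fraction of the prompt is served from the prompt cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable = uncached + cached * cached_multiplier
    return billable / 1_000_000 * price_per_m


if __name__ == "__main__":
    tokens = 100_000   # context size mentioned in the takeaway
    cacheable = 0.5    # assumed share of the prompt that is a reusable prefix
    scenarios = {
        "frontier, no cache": input_cost_usd(tokens, FRONTIER_PRICE_PER_M),
        "frontier, cached": input_cost_usd(tokens, FRONTIER_PRICE_PER_M, cacheable),
        "routed, no cache": input_cost_usd(tokens, ECONOMY_PRICE_PER_M),
        "routed, cached": input_cost_usd(tokens, ECONOMY_PRICE_PER_M, cacheable),
    }
    for name, cost in scenarios.items():
        print(f"{name:20s} ${cost:.3f}")
```

With these assumed numbers, a single 100,000-token request costs about $0.50 at frontier price, $0.275 with caching alone, $0.10 with routing alone, and $0.055 with both. The two discounts multiply, which is why they are best treated as independent levers.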

Should you care?

Who it’s for

If you're a backend or AI engineer maintaining an agentic system that runs multi-step loops (coding agents, research agents, document processing pipelines) and your token costs are growing faster than usage, this architecture applies directly. It's especially valuable if you've already applied prompt caching and still see high bills. It's not worth building if your system makes fewer than a few hundred LLM calls per day, or if your agent has no structured execution modes to route on.

Worth exploring

The underlying pattern is production-proven: Kilo ships this in a live coding agent and published real cost reduction telemetry. Signal-based routing is the right starting point because it requires no ML infrastructure — just a config table. The critical prerequisite is that your agent must expose structured execution modes; without mode metadata, you fall back to prediction-based classification with classifier training and maintenance overhead.
