R&D intermediate 3 min read Jun 12, 2026
Public Preview Sign in free for the full digest →

LLM Cost Optimization: The Case for Model Routing Over Caching

“Your agent paid $25 per million output tokens to write commit messages — a 3-line config table would have routed those to a free model.”

LLM Cost Optimization: The Case for Model Routing Over Caching
Source · blog.bytebytego.com

“"Letting the gateway auto-route instead of users picking a model by hand dropped average cost per request by about a third" — Scott Breitenother and Sid Sijbrandij, Kilo co-founders, via ByteByteGo (June 8, 2026)”

You know that feeling when your AI infrastructure bill triples in a month with no new features shipping? Agent loops resend the full conversation history on every iteration — instructions, prior results, and intermediate reasoning — so a session starting at a few thousand tokens can exceed 100,000 by the 12th loop. Teams default every call to a frontier model like Claude Opus 4.7 ($5/M input, $25/M output), which means you pay frontier prices to write commit messages, format output, and run background cleanup tasks. Adding prompt caching helps but doesn't fix this — caching reuses repeated prefixes, and volume plus non-cacheable context remains the dominant cost.

llmai-agentscost-optimizationmodel-routingsystem-designinfrastructuredevtools

A two-component gateway sits between your agent and every model provider. The entry point normalizes requests into a unified format, handling translation between your code and each provider's API — your application never changes when you swap underlying models. The decision layer reads the agent's current execution mode (planning, editing, debugging, background) and looks up which tier that mode maps to in an externally-served config table. Planning and debugging tasks route to frontier models; routine editing goes to an economical model; background tasks like commit messages go to tiny free models. The config table updates without a code deploy, so you can swap underlying providers without restarting your application.

01
Signal-based routing on execution mode — you route using the agent's own mode metadata (planning/editing/debugging/background) instead of training a difficulty classifier, sidestepping the bootstrapping paradox where evaluating complexity ...
02
Four-tier model hierarchy — top tier for planning and debugging with frontier models, balanced tier with one economical model for all work, free tier for zero-cost models, and an internal tier running tiny models for background tasks like ...
03
Dynamic mode-to-model config served externally — tier-to-model mappings update without code deploys, so you swap underlying providers without restarting your application
04
Bring Your Own Key (BYOK) — you pass your existing API keys; the routing layer charges only for routing overhead, not token markup
05
Routing and caching as orthogonal cost levers — routing targets request difficulty (the non-cacheable portion of cost), addressing what prompt caching misses when agent context exceeds 100k tokens
06
OpenAI-compatible API drop-in — works as a replacement for standard AI SDK clients without rewriting existing call sites
Who it’s for

If you're a backend or AI engineer maintaining an agentic system that runs multi-step loops — coding agents, research agents, document processing pipelines — and your token costs are growing faster than usage, this architecture applies directly. It's especially valuable if you've already applied prompt caching and still see high bills. Not worth building if your system makes fewer than a few hundred LLM calls per day, or if your agent has no structured execution modes to route on.

Worth exploring

The underlying pattern is production-proven: Kilo ships this in a live coding agent and published real cost reduction telemetry. Signal-based routing is the right starting point because it requires no ML infrastructure — just a config table. The critical prerequisite is that your agent must expose structured execution modes; without mode metadata, you fall back to prediction-based classification with classifier training and maintenance overhead.

Developer playbook
Tech stack, code snippet, sentiment, alternatives.
PM playbook
Adoption angles, user fit, positioning.
CEO playbook
Traction signals, ROI, build vs buy.
Deep-dive insight
Full long-form analysis, no fluff.
Easy mode
Core idea, fast — when you need the gist.
Pro mode
Technical nuance, edge cases, tradeoffs.
Read the full digest
Go beyond the preview

Deep-dive insight, Easy and Pro modes, plus action playbooks — the full breakdown is one tap away.

Underrated tools. Unfiltered takes.

Read the full digest in the Snaplyze app for deep-dive insight, Easy and Pro modes, and the playbooks you can actually use.

Install Snaplyze →