"Letting the gateway auto-route instead of users picking a model by hand dropped average cost per request by about a third" - Scott Breitenother and Sid Sijbrandij, Kilo co-founders, via ByteByteGo (June 8, 2026)
You know that feeling when your AI infrastructure bill triples in a month with no new features shipping? Agent loops resend the full conversation history on every iteration (instructions, prior results, and intermediate reasoning), so a session that starts at a few thousand tokens can exceed 100,000 by the 12th loop. Teams also default every call to a frontier model like Claude Opus 4.7 ($5/M input, $25/M output), which means paying frontier prices to write commit messages, format output, and run background cleanup tasks. Prompt caching helps but doesn't fix this: it only reuses repeated prefixes, so volume and non-cacheable context remain the dominant costs.
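To make the growth concrete, here is a rough back-of-the-envelope sketch. The starting context (4,000 tokens) and the amount added per loop (9,000 tokens) are assumptions picked purely for illustration; only the $5/M input price comes from the text above, and output tokens and caching discounts are ignored.

```python
# Illustrative arithmetic only: start_tokens and added_per_loop are
# hypothetical values, not measurements from any real agent.
INPUT_PRICE_PER_M = 5.00  # USD per million input tokens (price quoted in the text)


def context_sizes(start_tokens: int, added_per_loop: int, loops: int) -> list[int]:
    """Input size sent on each loop when the full history is resent every time."""
    return [start_tokens + i * added_per_loop for i in range(loops)]


def input_cost_usd(sizes: list[int], price_per_m: float = INPUT_PRICE_PER_M) -> float:
    """Total input cost: each loop is billed for the whole context it resends."""
    return sum(sizes) * price_per_m / 1_000_000


sizes = context_sizes(start_tokens=4_000, added_per_loop=9_000, loops=12)
print(sizes[-1])                        # context size on the 12th loop: 103000
print(sum(sizes))                       # input tokens billed over the session: 642000
print(round(input_cost_usd(sizes), 2))  # input cost in USD at the frontier price: 3.21
```

Under these assumed numbers, the last request is only about 103,000 tokens, but the session as a whole is billed for roughly six times that, because every earlier loop also paid for the history it resent.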
A two-component gateway sits between your agent and every model provider. The first component, the entry point, normalizes requests into a unified format and handles translation between your code and each provider's API, so your application never changes when you swap underlying models. The second, the decision layer, reads the agent's current execution mode (planning, editing, debugging, background) and looks up which tier that mode maps to in an externally served config table. Planning and debugging tasks route to frontier models; routine editing goes to an economical model; background tasks like commit messages go to tiny free models. Because the config table updates without a code deploy, you can swap underlying providers without restarting your application.
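The decision layer described above can be sketched in a few lines. This is a minimal illustration, not Kilo's implementation: the names `UnifiedRequest`, `MODE_TO_TIER`, `TIER_TO_MODEL`, and `route` are hypothetical, the model identifiers are placeholders, and falling back to the frontier tier for an unknown mode is an assumption made here for safety. In a real gateway the two tables would be fetched from an external config service rather than defined inline.

```python
from dataclasses import dataclass

# Hypothetical mode-to-tier table. In the design described above it would be
# served from outside the application so it can change without a deploy.
MODE_TO_TIER = {
    "planning": "frontier",
    "debugging": "frontier",
    "editing": "economical",
    "background": "free",
}

# Placeholder identifiers, not real provider model names.
TIER_TO_MODEL = {
    "frontier": "frontier-model",
    "economical": "economical-model",
    "free": "tiny-free-model",
}


@dataclass(frozen=True)
class UnifiedRequest:
    """Provider-agnostic request shape produced by the entry point (assumed)."""
    mode: str
    messages: tuple[str, ...]


def route(
    request: UnifiedRequest,
    mode_to_tier: dict[str, str] = MODE_TO_TIER,
    tier_to_model: dict[str, str] = TIER_TO_MODEL,
    default_tier: str = "frontier",  # assumption: unknown modes get the safest tier
) -> str:
    """Pick a model from the agent's execution mode via two table lookups."""
    tier = mode_to_tier.get(request.mode, default_tier)
    return tier_to_model[tier]


print(route(UnifiedRequest("planning", ("Design the migration",))))
print(route(UnifiedRequest("background", ("Write a commit message",))))
```

Because routing is just a lookup, swapping the model behind a tier means changing one entry in the served table; the calling code and the `route` function stay the same.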
If you're a backend or AI engineer maintaining an agentic system that runs multi-step loops — coding agents, research agents, document processing pipelines — and your token costs are growing faster than usage, this architecture applies directly. It's especially valuable if you've already applied prompt caching and still see high bills. Not worth building if your system makes fewer than a few hundred LLM calls per day, or if your agent has no structured execution modes to route on.
The underlying pattern is production-proven: Kilo ships this in a live coding agent and published real cost-reduction telemetry. Signal-based routing is the right starting point because it requires no ML infrastructure, just a config table. The critical prerequisite is that your agent must expose structured execution modes; without mode metadata, you fall back to prediction-based classification, which brings classifier training and maintenance overhead.