Overview
Model routing is the art of matching each request to the right LLM at the right cost. Not every task needs GPT-4 — simple refactorings can run on lightweight models, while architectural decisions need the best reasoning available. Smart routing reduces costs by 5-10x without sacrificing quality.
Routing Strategies
Tier-Based Routing
The simplest approach: define model tiers and route by task complexity.
| Tier | Model | Cost (per 1M input tokens) | When to Use |
|---|---|---|---|
| Primary | Kimi K2.6 (Cloudflare) | ~$0.50 | Default for all tasks |
| Fallback 1 | Claude Sonnet 4 | ~$3.00 | Complex reasoning, architecture |
| Fallback 2 | GPT-4o | ~$2.50 | General coding, broad knowledge |
| Local | Qwen 32B (Ollama) | $0 | Sensitive/offline work |
Intent-Based Routing
More sophisticated: analyze the prompt intent and route to the model best suited for that intent.
- Code generation → Kimi K2.6 (fast, cheap, good at structured output)
- Architecture review → Claude Sonnet 4 (best reasoning)
- Refactoring → Kimi K2.6 or GPT-4o (pattern matching)
- Debugging → Claude Sonnet 4 (best error analysis)
- Documentation → Kimi K2.6 (cheap for long-form text)
- Security audit → Claude Sonnet 4 (thorough analysis)
Fallback Chains
When a model fails (rate limit, timeout, 5xx), automatically retry with the next model in the chain. Open SWE's ModelFallbackMiddleware handles this at the middleware layer.
| Provider | Retryable Errors | Typical Fallback |
|---|---|---|
| Anthropic (Claude) | 529, 429, 5xx, timeout | OpenAI (GPT-4o) |
| OpenAI (GPT) | 429, 5xx, timeout | Anthropic (Claude) |
| Cloudflare (Kimi) | 429, 5xx, timeout | Together AI or Anthropic |
| Local (Ollama) | OOM, timeout | Cloud provider |
LiteLLM Proxy
LiteLLM is the industry-standard solution for unified model routing. It exposes a single OpenAI-compatible API and routes to 100+ providers.
- Unified API — Call any model through OpenAI-compatible interface.
- Budget tracking — Per-key, per-team, per-model cost limits.
- Rate limiting — Token-per-minute and request-per-minute controls.
- Fallback chains — Configurable retry logic across providers.
- Load balancing — Distribute traffic across multiple API keys.
Cost Optimization
With smart routing, a typical agent task costs $0.01-0.05 instead of $0.10-0.50:
| Strategy | Savings | Implementation |
|---|---|---|
| Kimi K2.6 as primary | 6x cheaper than Claude | Route 80% of tasks to Kimi |
| Caching repetitions | 2-5x for repeated patterns | Semantic cache with LiteLLM |
| Shorter prompts | 1.5-3x | Repo map instead of full files |
| Batching tool calls | 1.2-2x | Parallel tool execution |
xCoder's Approach
xCoder uses a three-tier model strategy:
- LiteLLM proxy — First-class backend (Issue #72). All model calls go through LiteLLM for unified routing and cost tracking.
- Intent-based routing — A lightweight classifier (running on Kimi K2.6 itself) analyzes the prompt and selects the optimal model.
- Team overrides — Per-repo and per-user model preferences stored in PostgreSQL.
- Cloudflare/Dial credits — Kimi K2.6 via Cloudflare startup credits makes the primary tier essentially free for early usage.
Kimi K2.6 advantage