Model Routing

LLM routing strategies: cost optimization, intent-based fallbacks, provider unification, and the Kimi K2.6 advantage.

Overview

Model routing is the art of matching each request to the right LLM at the right cost. Not every task needs GPT-4 — simple refactorings can run on lightweight models, while architectural decisions need the best reasoning available. Smart routing reduces costs by 5-10x without sacrificing quality.

Routing Strategies

Tier-Based Routing

The simplest approach: define model tiers and route by task complexity.

TierModelCost (per 1M input tokens)When to Use
PrimaryKimi K2.6 (Cloudflare)~$0.50Default for all tasks
Fallback 1Claude Sonnet 4~$3.00Complex reasoning, architecture
Fallback 2GPT-4o~$2.50General coding, broad knowledge
LocalQwen 32B (Ollama)$0Sensitive/offline work

Intent-Based Routing

More sophisticated: analyze the prompt intent and route to the model best suited for that intent.

  • Code generation → Kimi K2.6 (fast, cheap, good at structured output)
  • Architecture review → Claude Sonnet 4 (best reasoning)
  • Refactoring → Kimi K2.6 or GPT-4o (pattern matching)
  • Debugging → Claude Sonnet 4 (best error analysis)
  • Documentation → Kimi K2.6 (cheap for long-form text)
  • Security audit → Claude Sonnet 4 (thorough analysis)

Fallback Chains

When a model fails (rate limit, timeout, 5xx), automatically retry with the next model in the chain. Open SWE's ModelFallbackMiddleware handles this at the middleware layer.

ProviderRetryable ErrorsTypical Fallback
Anthropic (Claude)529, 429, 5xx, timeoutOpenAI (GPT-4o)
OpenAI (GPT)429, 5xx, timeoutAnthropic (Claude)
Cloudflare (Kimi)429, 5xx, timeoutTogether AI or Anthropic
Local (Ollama)OOM, timeoutCloud provider

LiteLLM Proxy

LiteLLM is the industry-standard solution for unified model routing. It exposes a single OpenAI-compatible API and routes to 100+ providers.

  • Unified API — Call any model through OpenAI-compatible interface.
  • Budget tracking — Per-key, per-team, per-model cost limits.
  • Rate limiting — Token-per-minute and request-per-minute controls.
  • Fallback chains — Configurable retry logic across providers.
  • Load balancing — Distribute traffic across multiple API keys.

Cost Optimization

With smart routing, a typical agent task costs $0.01-0.05 instead of $0.10-0.50:

StrategySavingsImplementation
Kimi K2.6 as primary6x cheaper than ClaudeRoute 80% of tasks to Kimi
Caching repetitions2-5x for repeated patternsSemantic cache with LiteLLM
Shorter prompts1.5-3xRepo map instead of full files
Batching tool calls1.2-2xParallel tool execution

xCoder's Approach

xCoder uses a three-tier model strategy:

  • LiteLLM proxy — First-class backend (Issue #72). All model calls go through LiteLLM for unified routing and cost tracking.
  • Intent-based routing — A lightweight classifier (running on Kimi K2.6 itself) analyzes the prompt and selects the optimal model.
  • Team overrides — Per-repo and per-user model preferences stored in PostgreSQL.
  • Cloudflare/Dial credits — Kimi K2.6 via Cloudflare startup credits makes the primary tier essentially free for early usage.

Kimi K2.6 advantage

Kimi K2.6 via Cloudflare Workers AI costs ~$0.50 per million input tokens vs Claude Sonnet 4 at ~$3.00. For an average agent session using 50K input tokens, that's $0.025 vs $0.15 — a 6x cost reduction with negligible quality difference for coding tasks.