xCoder // the shock collar for your coding agent

LLM routing strategies: cost optimization, intent-based fallbacks, provider unification, and the Kimi K2.6 advantage.

Overview

Model routing is the art of matching each request to the right LLM at the right cost. Not every task needs GPT-4 — simple refactorings can run on lightweight models, while architectural decisions need the best reasoning available. Smart routing reduces costs by 5-10x without sacrificing quality.

Routing Strategies

Tier-Based Routing

The simplest approach: define model tiers and route by task complexity.

Tier	Model	Cost (per 1M input tokens)	When to Use
Primary	Kimi K2.6 (Cloudflare)	~$0.50	Default for all tasks
Fallback 1	Claude Sonnet 4	~$3.00	Complex reasoning, architecture
Fallback 2	GPT-4o	~$2.50	General coding, broad knowledge
Local	Qwen 32B (Ollama)	$0	Sensitive/offline work

Intent-Based Routing

More sophisticated: analyze the prompt intent and route to the model best suited for that intent.

Code generation → Kimi K2.6 (fast, cheap, good at structured output)
Architecture review → Claude Sonnet 4 (best reasoning)
Refactoring → Kimi K2.6 or GPT-4o (pattern matching)
Debugging → Claude Sonnet 4 (best error analysis)
Documentation → Kimi K2.6 (cheap for long-form text)
Security audit → Claude Sonnet 4 (thorough analysis)

Fallback Chains

When a model fails (rate limit, timeout, 5xx), automatically retry with the next model in the chain. Open SWE's ModelFallbackMiddleware handles this at the middleware layer.

Provider	Retryable Errors	Typical Fallback
Anthropic (Claude)	529, 429, 5xx, timeout	OpenAI (GPT-4o)
OpenAI (GPT)	429, 5xx, timeout	Anthropic (Claude)
Cloudflare (Kimi)	429, 5xx, timeout	Together AI or Anthropic
Local (Ollama)	OOM, timeout	Cloud provider

LiteLLM Proxy

LiteLLM is the industry-standard solution for unified model routing. It exposes a single OpenAI-compatible API and routes to 100+ providers.

Unified API — Call any model through OpenAI-compatible interface.
Budget tracking — Per-key, per-team, per-model cost limits.
Rate limiting — Token-per-minute and request-per-minute controls.
Fallback chains — Configurable retry logic across providers.
Load balancing — Distribute traffic across multiple API keys.

Cost Optimization

With smart routing, a typical agent task costs $0.01-0.05 instead of $0.10-0.50:

Strategy	Savings	Implementation
Kimi K2.6 as primary	6x cheaper than Claude	Route 80% of tasks to Kimi
Caching repetitions	2-5x for repeated patterns	Semantic cache with LiteLLM
Shorter prompts	1.5-3x	Repo map instead of full files
Batching tool calls	1.2-2x	Parallel tool execution

xCoder's Approach

xCoder uses a three-tier model strategy:

LiteLLM proxy — First-class backend (Issue #72). All model calls go through LiteLLM for unified routing and cost tracking.
Intent-based routing — A lightweight classifier (running on Kimi K2.6 itself) analyzes the prompt and selects the optimal model.
Team overrides — Per-repo and per-user model preferences stored in PostgreSQL.
Cloudflare/Dial credits — Kimi K2.6 via Cloudflare startup credits makes the primary tier essentially free for early usage.

Kimi K2.6 advantage

Kimi K2.6 via Cloudflare Workers AI costs ~$0.50 per million input tokens vs Claude Sonnet 4 at ~$3.00. For an average agent session using 50K input tokens, that's $0.025 vs $0.15 — a 6x cost reduction with negligible quality difference for coding tasks.

Model Routing