# Local Router
The local router automatically routes simple, short queries to a local model (Ollama or vLLM) and complex or long queries to a cloud provider. This reduces cloud API costs while maintaining high quality for tasks that require it.
## Routing logic
For each incoming message, the router evaluates two signals:
- Complexity score — a heuristic score from 0–1 based on query length, vocabulary complexity, and the presence of reasoning keywords (e.g. "analyze", "compare", "design"). Queries above the complexity threshold are routed to cloud.
- Context length — if the assembled context (system prompt + memory + history) exceeds the context length threshold, the query is routed to cloud (local models typically have shorter context windows).
If either threshold is exceeded, the query goes to the cloud provider. If both are below threshold, it goes to the local model.
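The decision rule above can be sketched as follows. This is a minimal illustration of the either/or threshold check, not the actual implementation; the names are hypothetical.

```typescript
interface RoutingConfig {
  complexityThreshold: number;     // e.g. 0.6
  contextLengthThreshold: number;  // e.g. 4096 tokens
}

type Route = "local" | "cloud";

// Route to cloud if EITHER signal exceeds its threshold;
// only when both are at or below threshold does the query stay local.
function chooseRoute(
  complexity: number,
  contextTokens: number,
  cfg: RoutingConfig
): Route {
  if (complexity > cfg.complexityThreshold) return "cloud";
  if (contextTokens > cfg.contextLengthThreshold) return "cloud";
  return "local";
}
```

Note that a short, simple query can still be routed to cloud if the assembled context (system prompt + memory + history) is large.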
## Configuration
```yaml
localModels:
  enabled: true
  complexityThreshold: 0.6       # Queries scoring above 0.6 go to cloud
  contextLengthThreshold: 4096   # Assembled context > 4096 tokens goes to cloud
  localProvider: ollama
  localModel: llama3.2
  fallbackProvider: openai
  fallbackModel: gpt-4o
```

## Complexity scoring
The complexity score is computed from several features:
| Feature | Weight | Description |
|---|---|---|
| Token count | 0.2 | Longer queries score higher |
| Reasoning keywords | 0.3 | Presence of words like "analyze", "compare", "synthesize" |
| Multi-step indicators | 0.25 | Presence of words like "first", "then", "finally", "step" |
| Technical vocabulary | 0.25 | Presence of domain-specific technical terms |
## Routing examples
| Query | Score | Route |
|---|---|---|
| "What's the weather today?" | 0.05 | Local |
| "Summarize this file" | 0.2 | Local |
| "Analyze the performance implications of switching from IVFFlat to HNSW indexing in pgvector at our scale" | 0.85 | Cloud |
| "Design a migration strategy to move from a monolith to microservices" | 0.9 | Cloud |
## Monitoring routing decisions

Every routing decision emits a `routing.decision` event with the complexity score, chosen provider, and reason. You can monitor this in the cost dashboard to see what percentage of queries are being served locally.
```bash
# Get routing stats for the last 24 hours
GET /costs/routing?period=day

# Example response:
# {
#   "totalRequests": 1204,
#   "localRequests": 876,
#   "cloudRequests": 328,
#   "estimatedSavings": "$4.32"
# }
```

In this example, 876 of 1,204 requests (72.8%) were served locally and 328 (27.2%) went to cloud.

💡 **Tip:** Start with a higher `complexityThreshold` (e.g. 0.8) and lower it gradually as you gain confidence in your local model's output quality. Monitor response quality alongside the cost savings.