Inference

Local Router

The local router automatically routes simple, short queries to a local model (Ollama or vLLM) and complex or long queries to a cloud provider. This reduces cloud API costs while maintaining high quality for tasks that require it.

Routing logic

For each incoming message, the router evaluates two signals:

  1. Complexity score — a heuristic score from 0–1 based on query length, vocabulary complexity, and the presence of reasoning keywords (e.g. "analyze", "compare", "design"). Queries above the complexity threshold are routed to cloud.
  2. Context length — if the assembled context (system prompt + memory + history) exceeds the context length threshold, the query is routed to cloud (local models typically have shorter context windows).

If either threshold is exceeded, the query goes to the cloud provider. If both are below threshold, it goes to the local model.
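The decision above can be sketched in a few lines of TypeScript. This is a minimal illustration, assuming the two thresholds come from the localModels config; `chooseRoute` and `RoutingConfig` are illustrative names, not the actual API:

```typescript
// Illustrative types/names — not the shipped implementation.
interface RoutingConfig {
  complexityThreshold: number;    // e.g. 0.6
  contextLengthThreshold: number; // e.g. 4096 tokens
}

function chooseRoute(
  complexityScore: number, // heuristic score in [0, 1]
  contextTokens: number,   // assembled context size in tokens
  cfg: RoutingConfig
): "local" | "cloud" {
  // Either signal exceeding its threshold forces the cloud provider.
  if (complexityScore > cfg.complexityThreshold) return "cloud";
  if (contextTokens > cfg.contextLengthThreshold) return "cloud";
  // Both below threshold: serve from the local model.
  return "local";
}

const cfg: RoutingConfig = { complexityThreshold: 0.6, contextLengthThreshold: 4096 };
console.log(chooseRoute(0.2, 1200, cfg)); // "local"
console.log(chooseRoute(0.85, 1200, cfg)); // "cloud" (complexity too high)
console.log(chooseRoute(0.2, 8000, cfg)); // "cloud" (context too long)
```

Note the asymmetry: the cloud path is the safe default, so a query only stays local when *both* signals clear their thresholds.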

Configuration

```yaml
localModels:
  enabled: true
  complexityThreshold: 0.6      # Queries scoring above 0.6 go to cloud
  contextLengthThreshold: 4096  # Assembled context > 4096 tokens goes to cloud
  localProvider: ollama
  localModel: llama3.2
  fallbackProvider: openai
  fallbackModel: gpt-4o
```

Complexity scoring

The complexity score is computed from several features:

| Feature | Weight | Description |
| --- | --- | --- |
| Token count | 0.2 | Longer queries score higher |
| Reasoning keywords | 0.3 | Presence of words like "analyze", "compare", "synthesize" |
| Multi-step indicators | 0.25 | Presence of words like "first", "then", "finally", "step" |
| Technical vocabulary | 0.25 | Presence of domain-specific technical terms |
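A weighted sum over these features might look like the sketch below. The weights match the table; the keyword lists, the length normalization, and the stubbed technical-vocabulary feature are assumptions for illustration, not the actual extractors:

```typescript
// Keyword lists and normalization are assumed for illustration.
const REASONING_WORDS = ["analyze", "compare", "synthesize", "design", "evaluate"];
const MULTI_STEP_WORDS = ["first", "then", "finally", "step"];

function complexityScore(query: string): number {
  const words = query.toLowerCase().split(/\s+/);
  // Token count feature: assume saturation at ~50 words.
  const lengthFeature = Math.min(words.length / 50, 1);
  // 1 if any keyword from the list appears, else 0.
  const has = (list: string[]) => (list.some((w) => words.includes(w)) ? 1 : 0);
  // Technical-vocabulary detection is stubbed out here for brevity.
  const technicalFeature = 0;
  return (
    0.2 * lengthFeature +
    0.3 * has(REASONING_WORDS) +
    0.25 * has(MULTI_STEP_WORDS) +
    0.25 * technicalFeature
  );
}
```

With this sketch, a short chit-chat query scores near zero, while a query containing "analyze" immediately picks up the 0.3 reasoning weight.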

Routing examples

| Query | Score | Route |
| --- | --- | --- |
| "What's the weather today?" | 0.05 | Local |
| "Summarize this file" | 0.2 | Local |
| "Analyze the performance implications of switching from IVFFlat to HNSW indexing in pgvector at our scale" | 0.85 | Cloud |
| "Design a migration strategy to move from a monolith to microservices" | 0.9 | Cloud |

Monitoring routing decisions

Every routing decision emits a `routing.decision` event with the complexity score, chosen provider, and reason. You can monitor this in the cost dashboard to see what percentage of queries are being served locally.

```bash
# Get routing stats for the last 24 hours
GET /costs/routing?period=day

# Example response:
# {
#   "totalRequests": 1204,
#   "localRequests": 876,        # 72.8%
#   "cloudRequests": 328,        # 27.2%
#   "estimatedSavings": "$4.32"
# }
```

💡 Start with a higher `complexityThreshold` (0.8) and lower it gradually as you gain confidence in your local model's output quality. Monitor response quality alongside the cost savings.