# Performance Tuning Handbook
This guide covers the most impactful performance levers in Open Astra, with data-backed trade-off guidance for each one.
## Where time goes in a turn
| Phase | Typical time | Tunable? |
|---|---|---|
| Memory retrieval | 80–300 ms | Yes — topK, tiers, boosting |
| Context assembly | 10–30 ms | Limited |
| LLM inference (TTFT) | 300–2000 ms | Yes — provider, model size, caching |
| Tool execution | 0–5000 ms | Yes — batching, async dispatch |
| Post-turn save | 50–200 ms (async) | Yes — async, selective extraction |
## Batching tool calls vs. latency
The trade-off: Batching reduces total tool execution time by running independent calls in parallel, but adds a small wait window to form the batch. For latency-sensitive agents, set maxWaitMs low or disable batching. For throughput-optimized agents (e.g. batch research), larger batches win.
| Scenario | Recommended setting |
|---|---|
| Interactive chat | maxBatchSize: 3, maxWaitMs: 20 |
| Bulk research swarm | maxBatchSize: 12, maxWaitMs: 100 |
| Sequential pipeline | Batching disabled |
```yaml
tools:
  batching:
    enabled: true
    maxBatchSize: 8   # group up to 8 independent calls
    maxWaitMs: 50     # wait up to 50 ms to form a batch
    strategies:
      file_read: parallel    # all reads in one batch
      web_fetch: parallel
      db_query: sequential   # DB queries are sequential within a batch
```

See Batching for the full config reference.
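The mechanism can be sketched as a dispatcher that groups independent zero-argument calls and runs each group in a thread pool. This is illustrative only, not Open Astra's implementation; a real dispatcher would also close a batch once maxWaitMs elapses instead of waiting for the full list:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batched(calls, max_batch_size=8):
    """Run independent zero-arg tool calls in parallel batches.

    Sketch only: a live dispatcher would also close each batch after
    maxWaitMs, trading a short wait window for parallel execution.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_batch_size) as pool:
        for start in range(0, len(calls), max_batch_size):
            batch = calls[start:start + max_batch_size]
            # pool.map preserves input order, so results line up with calls
            results.extend(pool.map(lambda call: call(), batch))
    return results
```

With max_batch_size=1 this degenerates to sequential execution, which is the behavior the "Sequential pipeline" row above asks for by disabling batching.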
## Memory tiers vs. retrieval accuracy
The trade-off: Searching more memory tiers improves recall but increases latency and token consumption. Searching fewer tiers is faster but may miss relevant context.
| Configuration | Latency | Context quality | Token cost |
|---|---|---|---|
| All 5 tiers, topK=10 each | 250–400 ms | High | High |
| Tiers 1+4+5, topK=5 each | 100–180 ms | Good | Medium |
| Tier 1 only, topK=3 | 40–80 ms | Basic | Low |
For most production agents, a combination of tiers 1, 3, and 4 with conservative topK covers 95% of relevant context at half the cost of searching all tiers.
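As a rough way to compare these configurations, context-token cost scales with the total number of retrieved items. A back-of-the-envelope sketch, where the 120-tokens-per-item average is an illustrative assumption rather than an Open Astra constant:

```python
AVG_TOKENS_PER_ITEM = 120  # illustrative average; real items vary widely

def retrieval_token_cost(topk_by_tier):
    """Approximate context tokens consumed by a retrieval configuration."""
    return AVG_TOKENS_PER_ITEM * sum(topk_by_tier.values())

full = retrieval_token_cost({f"tier{i}": 10 for i in range(1, 6)})  # all 5 tiers, topK=10
lean = retrieval_token_cost({"tier1": 5, "tier4": 5, "tier5": 5})   # tiers 1+4+5, topK=5
```

Under this model the lean configuration consumes 1800 tokens of context versus 6000 for the full sweep, which is why trimming tiers and topK pays off quickly.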
## RRF weights vs. relevance
The trade-off: Reciprocal Rank Fusion combines Typesense (keyword/hybrid) and pgvector (semantic) results. Higher Typesense weight favors exact matches; higher pgvector weight favors semantic similarity.
- Technical/code queries: Boost Typesense (0.7/0.3) — exact identifiers matter.
- Conversational queries: Balance equally (0.5/0.5) — meaning over exact words.
- Long-form research: Boost pgvector (0.35/0.65) — concepts span varied phrasings.
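The weights plug directly into the fusion score. Assuming the standard weighted-RRF formula, where each backend contributes weight / (rrfK + rank) for every document it returns (a sketch, not Open Astra's exact implementation):

```python
def rrf_fuse(typesense_ranking, pgvector_ranking, k=60, w_ts=0.6, w_pv=0.4):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc ids.

    Sketch of the standard weighted-RRF formula. Each backend adds
    weight / (k + rank) to a document's score; ranks start at 1.
    """
    scores = {}
    for weight, ranking in ((w_ts, typesense_ranking), (w_pv, pgvector_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With the default 0.6/0.4 split, when the two backends disagree on two documents, the Typesense leader edges out the pgvector leader, which is exactly the "exact matches favored" behavior described above.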
```yaml
memory:
  search:
    rrfK: 60          # default — good general-purpose balance
    weights:
      typesense: 0.6  # boost keyword/hybrid search
      pgvector: 0.4   # downweight pure semantic
    topK:
      tier1: 3  # session messages
      tier2: 5  # daily notes
      tier3: 1  # user profile (always injected in full)
      tier4: 8  # knowledge graph
      tier5: 4  # procedural memory
```

## Prompt caching and semantic cache
Prompt caching (provider-level) and semantic caching (query-level) are complementary and should both be enabled in production. Together they can eliminate 60–80% of LLM calls on typical workloads with repeated system prompts or similar user queries.
```yaml
inference:
  caching:
    enabled: true
    providers:
      - anthropic  # supports prompt caching natively
      - openai     # gpt-4o supports cached prefixes

memory:
  semanticCache:
    enabled: true
    similarityThreshold: 0.92  # higher = stricter match required
    ttlSeconds: 3600           # cache entries expire after 1 hour
```

See Prompt Caching and Semantic Cache for configuration details.
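The semantic cache lookup can be pictured as a nearest-neighbor check against the similarity threshold. A minimal sketch, assuming cosine similarity over query embeddings; the class and method names here are hypothetical, not the Open Astra API:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Hypothetical sketch of a query-level semantic cache.

    Returns a stored response when a new query embedding is at least
    `threshold`-similar to a cached one; otherwise the caller falls
    through to a real LLM call.
    """

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def lookup(self, embedding):
        best_response, best_sim = None, self.threshold
        for cached_embedding, response in self._entries:
            sim = cosine(embedding, cached_embedding)
            if sim >= best_sim:  # best match at or above the threshold wins
                best_response, best_sim = response, sim
        return best_response

    def store(self, embedding, response):
        self._entries.append((embedding, response))
```

Raising similarityThreshold toward 1.0 shrinks the hit rate but reduces the risk of serving a cached answer to a subtly different question.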
## Async dispatch for non-blocking tools
Long-running tools (web crawls, code indexing, report generation) should use Async Dispatch. The agent dispatches the job, continues responding, and polls for the result on the next turn or via webhook — eliminating the tool call from the critical path.
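A minimal sketch of the dispatch-then-poll pattern, using a background thread; the class and method names are illustrative, not the Open Astra API:

```python
import threading
import uuid

class AsyncDispatcher:
    """Illustrative dispatch-then-poll sketch (not the Open Astra API).

    dispatch() starts the job on a worker thread and returns a job id
    immediately, so the agent can keep responding; poll() reports the
    job's status on a later turn (or a webhook could push it).
    """

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def dispatch(self, job_fn):
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}

        def worker():
            result = job_fn()
            with self._lock:
                self._jobs[job_id] = {"status": "done", "result": result}

        threading.Thread(target=worker, daemon=True).start()
        return job_id  # returned at once; the tool call leaves the critical path

    def poll(self, job_id):
        with self._lock:
            return dict(self._jobs[job_id])
```

Because dispatch() returns as soon as the worker starts, a five-second crawl no longer adds five seconds to the turn; the agent pays only the cost of one extra poll on a later turn.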