# Performance Tuning Handbook
This guide covers the most impactful performance levers in Open Astra, with data-backed trade-off guidance for each one.
## Where time goes in a turn
| Phase | Typical time | Tunable? |
|---|---|---|
| Memory retrieval | 80–300 ms | Yes — topK, tiers, boosting |
| Context assembly | 10–30 ms | Limited |
| LLM inference (TTFT) | 300–2000 ms | Yes — provider, model size, caching |
| Tool execution | 0–5000 ms | Yes — batching, async dispatch |
| Post-turn save | 50–200 ms (async) | Yes — async, selective extraction |
## Batching tool calls vs. latency
The trade-off: Batching reduces total tool execution time by running independent calls in parallel, but adds a small wait window to form the batch. For latency-sensitive agents, set maxWaitMs low or disable batching. For throughput-optimized agents (e.g. batch research), larger batches win.
| Scenario | Recommended setting |
|---|---|
| Interactive chat | maxBatchSize: 3, maxWaitMs: 20 |
| Bulk research swarm | maxBatchSize: 12, maxWaitMs: 100 |
| Sequential pipeline | Batching disabled |
```yaml
tools:
  batching:
    enabled: true
    maxBatchSize: 8   # group up to 8 independent calls
    maxWaitMs: 50     # wait up to 50 ms to form a batch
    strategies:
      file_read: parallel    # all reads in one batch
      web_fetch: parallel
      db_query: sequential   # DB queries are sequential within a batch
```

See Batching for the full config reference.
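The mechanism can be sketched as a dispatcher that groups independent zero-argument calls and runs each group in a thread pool. This is illustrative only, not Open Astra's implementation; a real dispatcher would also close a batch once maxWaitMs elapses instead of waiting for the full list:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batched(calls, max_batch_size=8):
    """Run independent zero-arg tool calls in parallel batches.

    Sketch only: a live dispatcher would also close each batch after
    maxWaitMs, trading a short wait window for parallel execution.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_batch_size) as pool:
        for start in range(0, len(calls), max_batch_size):
            batch = calls[start:start + max_batch_size]
            # pool.map preserves input order, so results line up with calls
            results.extend(pool.map(lambda call: call(), batch))
    return results
```

With max_batch_size=1 this degenerates to sequential execution, which is the behavior the "Sequential pipeline" row above asks for by disabling batching.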
## Memory tiers vs. retrieval accuracy
The trade-off: Searching more memory tiers improves recall but increases latency and token consumption. Searching fewer tiers is faster but may miss relevant context.
| Configuration | Latency | Context quality | Token cost |
|---|---|---|---|
| All 5 tiers, topK=10 each | 250–400 ms | High | High |
| Tiers 1+4+5, topK=5 each | 100–180 ms | Good | Medium |
| Tier 1 only, topK=3 | 40–80 ms | Basic | Low |
For most production agents, a combination of tiers 1, 3, and 4 with conservative topK covers 95% of relevant context at half the cost of searching all tiers.
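As a rough way to compare these configurations, context-token cost scales with the total number of retrieved items. A back-of-the-envelope sketch, where the 120-tokens-per-item average is an illustrative assumption rather than an Open Astra constant:

```python
AVG_TOKENS_PER_ITEM = 120  # illustrative average; real items vary widely

def retrieval_token_cost(topk_by_tier):
    """Approximate context tokens consumed by a retrieval configuration."""
    return AVG_TOKENS_PER_ITEM * sum(topk_by_tier.values())

full = retrieval_token_cost({f"tier{i}": 10 for i in range(1, 6)})  # all 5 tiers, topK=10
lean = retrieval_token_cost({"tier1": 5, "tier4": 5, "tier5": 5})   # tiers 1+4+5, topK=5
```

Under this model the lean configuration consumes 1800 tokens of context versus 6000 for the full sweep, which is why trimming tiers and topK pays off quickly.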
## RRF weights vs. relevance
The trade-off: Reciprocal Rank Fusion combines Typesense (keyword/hybrid) and pgvector (semantic) results. Higher Typesense weight favors exact matches; higher pgvector weight favors semantic similarity.
- Technical/code queries: Boost Typesense (0.7/0.3) — exact identifiers matter.
- Conversational queries: Balance equally (0.5/0.5) — meaning over exact words.
- Long-form research: Boost pgvector (0.35/0.65) — concepts span varied phrasings.
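The weights plug directly into the fusion score. Assuming the standard weighted-RRF formula, where each backend contributes weight / (rrfK + rank) for every document it returns (a sketch, not Open Astra's exact implementation):

```python
def rrf_fuse(typesense_ranking, pgvector_ranking, k=60, w_ts=0.6, w_pv=0.4):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc ids.

    Sketch of the standard weighted-RRF formula. Each backend adds
    weight / (k + rank) to a document's score; ranks start at 1.
    """
    scores = {}
    for weight, ranking in ((w_ts, typesense_ranking), (w_pv, pgvector_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With the default 0.6/0.4 split, when the two backends disagree on two documents, the Typesense leader edges out the pgvector leader, which is exactly the "exact matches favored" behavior described above.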
```yaml
memory:
  search:
    rrfK: 60          # default — good general-purpose balance
    weights:
      typesense: 0.6  # boost keyword/hybrid search
      pgvector: 0.4   # downweight pure semantic
    topK:
      tier1: 3  # session messages
      tier2: 5  # daily notes
      tier3: 1  # user profile (always injected in full)
      tier4: 8  # knowledge graph
      tier5: 4  # procedural memory
```

## Prompt caching and semantic cache
Prompt caching (provider-level) and semantic caching (query-level) are complementary and should both be enabled in production. Together they can eliminate 60–80% of LLM calls on typical workloads with repeated system prompts or similar user queries.
```yaml
inference:
  caching:
    enabled: true
    providers:
      - anthropic  # supports prompt caching natively
      - openai     # gpt-4o supports cached prefixes

memory:
  semanticCache:
    enabled: true
    similarityThreshold: 0.92  # higher = stricter match required
    ttlSeconds: 3600           # cache entries expire after 1 hour
```

See Prompt Caching and Semantic Cache for configuration details.
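The semantic cache lookup can be pictured as a nearest-neighbor check against the similarity threshold. A minimal sketch, assuming cosine similarity over query embeddings; the class and method names here are hypothetical, not the Open Astra API:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Hypothetical sketch of a query-level semantic cache.

    Returns a stored response when a new query embedding is at least
    `threshold`-similar to a cached one; otherwise the caller falls
    through to a real LLM call.
    """

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def lookup(self, embedding):
        best_response, best_sim = None, self.threshold
        for cached_embedding, response in self._entries:
            sim = cosine(embedding, cached_embedding)
            if sim >= best_sim:  # best match at or above the threshold wins
                best_response, best_sim = response, sim
        return best_response

    def store(self, embedding, response):
        self._entries.append((embedding, response))
```

Raising similarityThreshold toward 1.0 shrinks the hit rate but reduces the risk of serving a cached answer to a subtly different question.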
## Async dispatch for non-blocking tools
Long-running tools (web crawls, code indexing, report generation) should use Async Dispatch. The agent dispatches the job, continues responding, and polls for the result on the next turn or via webhook — eliminating the tool call from the critical path.
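A minimal sketch of the dispatch-then-poll pattern, using a background thread; the class and method names are illustrative, not the Open Astra API:

```python
import threading
import uuid

class AsyncDispatcher:
    """Illustrative dispatch-then-poll sketch (not the Open Astra API).

    dispatch() starts the job on a worker thread and returns a job id
    immediately, so the agent can keep responding; poll() reports the
    job's status on a later turn (or a webhook could push it).
    """

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def dispatch(self, job_fn):
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}

        def worker():
            result = job_fn()
            with self._lock:
                self._jobs[job_id] = {"status": "done", "result": result}

        threading.Thread(target=worker, daemon=True).start()
        return job_id  # returned at once; the tool call leaves the critical path

    def poll(self, job_id):
        with self._lock:
            return dict(self._jobs[job_id])
```

Because dispatch() returns as soon as the worker starts, a five-second crawl no longer adds five seconds to the turn; the agent pays only the cost of one extra poll on a later turn.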