Performance Tuning Handbook

This guide covers the most impactful performance levers in Open Astra, with data-backed trade-off guidance for each one.

Where time goes in a turn

| Phase | Typical time | Tunable? |
| --- | --- | --- |
| Memory retrieval | 80–300 ms | Yes — topK, tiers, boosting |
| Context assembly | 10–30 ms | Limited |
| LLM inference (TTFT) | 300–2000 ms | Yes — provider, model size, caching |
| Tool execution | 0–5000 ms | Yes — batching, async dispatch |
| Post-turn save | 50–200 ms (async) | Yes — async, selective extraction |

Batching tool calls vs. latency

The trade-off: Batching reduces total tool execution time by running independent calls in parallel, but adds a small wait window to form the batch. For latency-sensitive agents, set maxWaitMs low or disable batching. For throughput-optimized agents (e.g. batch research), larger batches win.

| Scenario | Recommended setting |
| --- | --- |
| Interactive chat | maxBatchSize: 3, maxWaitMs: 20 |
| Bulk research swarm | maxBatchSize: 12, maxWaitMs: 100 |
| Sequential pipeline | Batching disabled |
```yaml
tools:
  batching:
    enabled: true
    maxBatchSize: 8          # group up to 8 independent calls
    maxWaitMs: 50            # wait up to 50 ms to form a batch
    strategies:
      file_read: parallel    # all reads in one batch
      web_fetch: parallel
      db_query: sequential   # DB queries run sequentially within a batch
```

See Batching for the full config reference.
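As a rough sketch of how the size-based flush behaves, the dispatcher groups independent calls and runs each group concurrently. The function and helper names below are illustrative, not Open Astra APIs, and the maxWaitMs timer is omitted for brevity:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batched(calls, max_batch_size=8):
    """Run independent zero-arg callables in parallel batches.

    The real dispatcher also flushes a partial batch once maxWaitMs
    elapses; this sketch only models the size-based flush.
    """
    results = []
    for start in range(0, len(calls), max_batch_size):
        batch = calls[start:start + max_batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # Each batch runs concurrently; batches run one after another,
            # and result order matches call order.
            results.extend(pool.map(lambda call: call(), batch))
    return results
```

Total wall time per batch is roughly the slowest call in it, which is why independent reads and fetches benefit most.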

Memory tiers vs. retrieval accuracy

The trade-off: Searching more memory tiers improves recall but increases latency and token consumption. Searching fewer tiers is faster but may miss relevant context.

| Configuration | Latency | Context quality | Token cost |
| --- | --- | --- | --- |
| All 5 tiers, topK=10 each | 250–400 ms | High | High |
| Tiers 1+4+5, topK=5 each | 100–180 ms | Good | Medium |
| Tier 1 only, topK=3 | 40–80 ms | Basic | Low |

For most production agents, searching tiers 1, 4, and 5 with conservative topK values covers roughly 95% of relevant context at about half the cost of searching all tiers.
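The fan-out itself is simple to picture. In the sketch below, `search_tier` stands in for the real per-tier search call and `retrieve` is an illustrative name, not an Open Astra API:

```python
def retrieve(query, search_tier, tier_topk):
    """Query each configured tier for its topK hits and merge by score.

    search_tier(tier, query, k) -> list of (score, snippet) pairs.
    Dropping tiers or shrinking k cuts latency and token cost at the
    price of recall, as in the table above.
    """
    hits = []
    for tier, k in tier_topk.items():
        hits.extend(search_tier(tier, query, k))
    # Best context first, regardless of which tier produced it.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```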

RRF weights vs. relevance

The trade-off: Reciprocal Rank Fusion combines Typesense (keyword/hybrid) and pgvector (semantic) results. Higher Typesense weight favors exact matches; higher pgvector weight favors semantic similarity.

  • Technical/code queries: Boost Typesense (0.7/0.3) — exact identifiers matter.
  • Conversational queries: Balance equally (0.5/0.5) — meaning over exact words.
  • Long-form research: Boost pgvector (0.35/0.65) — concepts span varied phrasings.
```yaml
memory:
  search:
    rrfK: 60                 # default — good general-purpose balance
    weights:
      typesense: 0.6         # boost keyword/hybrid search
      pgvector: 0.4          # downweight pure semantic
    topK:
      tier1: 3               # session messages
      tier2: 5               # daily notes
      tier3: 1               # user profile (always injected in full)
      tier4: 8               # knowledge graph
      tier5: 4               # procedural memory
```
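The fusion step is the standard weighted RRF formula: each source contributes weight / (rrfK + rank) for every document it returns, so documents that both sources rank well compound their scores. A minimal sketch (the function name is illustrative):

```python
def rrf_fuse(ranked_lists, weights, rrf_k=60):
    """Weighted Reciprocal Rank Fusion.

    ranked_lists: {source: [doc_id, ...]} in rank order, best first.
    A document's score is the sum of weight / (rrf_k + rank) over
    every source that returned it.
    """
    scores = {}
    for source, docs in ranked_lists.items():
        weight = weights[source]
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With the 0.6/0.4 weights above, a document ranked second by Typesense and first by pgvector narrowly beats one ranked first by Typesense and third by pgvector, because agreement across both sources compounds.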

Prompt caching and semantic cache

Prompt caching (provider-level) and semantic caching (query-level) are complementary and should both be enabled in production. Together they can eliminate 60–80% of LLM calls on typical workloads with repeated system prompts or similar user queries.

```yaml
inference:
  caching:
    enabled: true
    providers:
      - anthropic            # supports prompt caching natively
      - openai               # gpt-4o supports cached prefixes

memory:
  semanticCache:
    enabled: true
    similarityThreshold: 0.92   # higher = stricter match required
    ttlSeconds: 3600            # cache entries expire after 1 hour
```

See Prompt Caching and Semantic Cache for configuration details.
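A minimal model of the semantic-cache lookup, assuming an embedding-similarity function is supplied by the caller. The class and method names are illustrative, not the shipped implementation:

```python
import time

class SemanticCache:
    """Serve a cached answer when a new query is similar enough to a
    previously answered one and the entry is still within its TTL."""

    def __init__(self, similarity, threshold=0.92, ttl_seconds=3600):
        self.similarity = similarity      # e.g. cosine over embeddings
        self.threshold = threshold        # higher = stricter match
        self.ttl = ttl_seconds
        self.entries = []                 # (query, answer, stored_at)

    def put(self, query, answer, now=None):
        stored_at = now if now is not None else time.time()
        self.entries.append((query, answer, stored_at))

    def get(self, query, now=None):
        now = now if now is not None else time.time()
        best, best_sim = None, self.threshold
        for cached_query, answer, stored_at in self.entries:
            if now - stored_at > self.ttl:
                continue                  # expired entry
            sim = self.similarity(query, cached_query)
            if sim >= best_sim:           # must clear the threshold
                best, best_sim = answer, sim
        return best                       # None on a cache miss
```

Every hit skips the LLM call entirely, which is where the bulk of the 60–80% savings comes from on workloads with repetitive queries.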

Async dispatch for non-blocking tools

Long-running tools (web crawls, code indexing, report generation) should use Async Dispatch. The agent dispatches the job, continues responding, and polls for the result on the next turn or via webhook — eliminating the tool call from the critical path.
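The dispatch-then-poll pattern can be sketched with a background thread. Names here are illustrative, and the webhook path mentioned above is omitted:

```python
import threading
import uuid

class AsyncDispatcher:
    """Kick off a long-running tool, return a job id immediately, and
    let a later turn poll for the result off the critical path."""

    def __init__(self):
        self.results = {}
        self.threads = {}

    def dispatch(self, tool_fn, *args):
        job_id = str(uuid.uuid4())

        def run():
            self.results[job_id] = tool_fn(*args)

        thread = threading.Thread(target=run)
        self.threads[job_id] = thread
        thread.start()
        return job_id          # the agent keeps responding in the meantime

    def poll(self, job_id):
        """Non-blocking check; None until the job completes."""
        return self.results.get(job_id)

    def wait(self, job_id):
        """Block until the job finishes (useful in tests)."""
        self.threads[job_id].join()
        return self.results[job_id]
```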