Learning & Best Practices

Testing Strategies

Testing AI agents requires a different approach than testing deterministic code. This guide covers unit testing individual agents with mocked dependencies, integration testing swarm behavior under concurrent conditions, and end-to-end testing full workflows against a running gateway.

The agent testing pyramid

| Layer | What it tests | Speed | LLM calls? |
| --- | --- | --- | --- |
| Unit | Agent config, budget enforcement, context assembly, tool selection | Fast (<100 ms) | No — mocked |
| Integration | Swarm coordination, blackboard reads/writes, conflict resolution | Medium (1–5 s) | Optional |
| E2E | Full turn lifecycle, memory persistence, channel delivery | Slow (5–30 s) | Yes |

Unit testing agents

Use createTestAgent from the testing package. It mounts the full agent runtime with mocked inference, memory, and tools. This lets you test budget enforcement, context injection, and tool selection logic without real LLM calls.

```typescript
import { describe, it, expect } from 'vitest'
import { createTestAgent } from '@open-astra/testing'

describe('researcher agent', () => {
  it('respects budget limits', async () => {
    const agent = createTestAgent({
      id: 'researcher',
      budget: { maxToolCallsPerTurn: 3 }
    })

    // Mock memory to return empty results
    agent.mockMemory({ results: [] })

    // Run a turn that would normally trigger many tool calls
    const result = await agent.run('Research the history of the internet')

    expect(result.toolCallCount).toBeLessThanOrEqual(3)
    expect(result.status).toBe('completed')
  })

  it('injects workspace context', async () => {
    const agent = createTestAgent({ id: 'assistant' })
    agent.mockWorkspace({ name: 'ACME Corp', domain: 'acme.com' })

    const result = await agent.run('Who am I working for?')

    expect(result.content).toContain('ACME Corp')
  })
})
```

Key things to unit test:

  • Budget limits are enforced (maxToolCallsPerTurn, maxCostUsdPerTurn)
  • Workspace template variables resolve correctly
  • Required tools are available and called in the expected order
  • Memory injection includes expected tier results
  • Ethical check blocks expected harmful prompts
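To make the first bullet concrete, here is a minimal sketch of how a per-turn budget might be enforced inside an agent loop. The `TurnBudget` and `BudgetTracker` names are illustrative, not part of the actual runtime:

```typescript
// Illustrative sketch of per-turn budget enforcement (not the real runtime).
interface TurnBudget {
  maxToolCallsPerTurn: number
  maxCostUsdPerTurn: number
}

class BudgetTracker {
  private toolCalls = 0
  private costUsd = 0

  constructor(private budget: TurnBudget) {}

  // Returns false once another tool call would exceed the per-turn limit.
  canCallTool(): boolean {
    return this.toolCalls < this.budget.maxToolCallsPerTurn
  }

  recordToolCall(costUsd: number): void {
    this.toolCalls += 1
    this.costUsd += costUsd
    if (this.costUsd > this.budget.maxCostUsdPerTurn) {
      throw new Error(`Turn cost $${this.costUsd} exceeds budget`)
    }
  }
}
```

A unit test then only needs to drive the tracker past its limits and assert that `canCallTool` flips and the cost ceiling throws.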

Integration testing swarms

Swarm integration tests should specifically exercise concurrent write scenarios — the most common source of bugs in multi-agent systems. createTestSwarm lets you simulate concurrent blackboard writes and verify that the coordinating agent handles them correctly.

```typescript
import { describe, it, expect } from 'vitest'
import { createTestSwarm } from '@open-astra/testing'

describe('code review swarm', () => {
  it('resolves conflicts on the blackboard', async () => {
    const swarm = createTestSwarm({
      agents: ['reviewer-a', 'reviewer-b', 'synthesizer'],
      blackboard: {}
    })

    // Simulate conflicting reviews written concurrently
    await swarm.simulateConcurrent([
      () => swarm.agent('reviewer-a').writeTo('review', { verdict: 'approve' }),
      () => swarm.agent('reviewer-b').writeTo('review', { verdict: 'reject', reason: 'missing tests' }),
    ])

    const synthesis = await swarm.agent('synthesizer').run('Synthesize the reviews')

    // Synthesizer must handle conflict — not silently drop one verdict
    expect(synthesis.content).toContain('missing tests')
    expect(synthesis.blackboardReads).toContain('review')
  })
})
```

Key swarm scenarios to test:

  • Conflicting writes to the same blackboard key
  • Agent completing before a dependency is ready (required: true reads)
  • Swarm completing with one agent timed out
  • Meta-controller correctly aggregating results
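The first scenario hinges on the blackboard not silently dropping writes. Here is a minimal sketch of a conflict-aware blackboard that keeps every write to a key so a synthesizer can see all of them; the names are illustrative, not the library's implementation:

```typescript
// Illustrative sketch: a blackboard that records every write to a key,
// so concurrent conflicting writes stay visible instead of last-write-wins.
type BlackboardEntry = { agentId: string; value: unknown; writtenAt: number }

class ConflictAwareBlackboard {
  private entries = new Map<string, BlackboardEntry[]>()

  write(key: string, agentId: string, value: unknown): void {
    const list = this.entries.get(key) ?? []
    list.push({ agentId, value, writtenAt: Date.now() })
    this.entries.set(key, list)
  }

  // All writes for a key; more than one signals a conflict to resolve.
  readAll(key: string): BlackboardEntry[] {
    return this.entries.get(key) ?? []
  }

  hasConflict(key: string): boolean {
    return this.readAll(key).length > 1
  }
}
```

An integration test can then assert that after two concurrent writes, `hasConflict('review')` is true and both verdicts are still readable.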

End-to-end workflow testing

E2E tests run against a real gateway (typically a local test instance) and verify the full turn lifecycle: message in → inference → tool calls → memory write → response out. These are slower and make real LLM calls, so keep them focused on critical paths.

```typescript
import { test, expect } from '@playwright/test'
import { AstraTestClient } from '@open-astra/testing/e2e'

test('end-to-end: user message triggers memory write', async () => {
  const client = new AstraTestClient({ baseUrl: 'http://localhost:3000' })

  // Authenticate and create a session
  const token = await client.auth('test@example.com', 'password')
  const session = await client.createSession({ agentId: 'assistant' })

  // Send a message that should be remembered
  const response = await client.sendMessage(session.id, 'My name is Jordan')
  expect(response.content).toBeTruthy()

  // Wait for async post-turn save
  await client.waitForMemoryWrite(session.id, { key: 'name', value: 'Jordan' })

  // Verify memory was persisted
  const profile = await client.getUserProfile()
  expect(profile.name).toBe('Jordan')
})
```
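A helper like waitForMemoryWrite is typically a poll-until-deadline loop around an API check. A generic sketch follows; the signature and defaults are assumptions for illustration, not the library's API:

```typescript
// Illustrative polling helper in the spirit of waitForMemoryWrite:
// retries an async check until it passes or a deadline expires.
async function waitFor(
  check: () => Promise<boolean>,
  { timeoutMs = 10_000, intervalMs = 250 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    if (await check()) return
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error(`Condition not met within ${timeoutMs} ms`)
}
```

Polling with an explicit deadline keeps E2E tests deterministic: a missing memory write fails with a timeout error instead of hanging the suite.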

Key E2E scenarios to cover:

  • New user session creates memory entries
  • Follow-up questions use memory from previous turns
  • Channel message (Telegram/Slack) routes to correct agent
  • Quota exceeded returns correct error response
  • Approval required flow blocks and resumes correctly

Testing memory in isolation

Memory bugs are subtle — retrieved context that's wrong, stale, or missing is hard to catch without dedicated memory tests. Write separate tests that:

  1. Insert known entries into each tier
  2. Run a query that should retrieve them
  3. Assert exact entries appear in the assembled context

Use agent.mockMemory({ results: [] }) in unit tests to isolate agent logic from memory retrieval, and cover retrieval itself in a dedicated memory integration suite that runs directly against a test database.
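The three steps above can be sketched against a fake in-memory store. The FakeMemoryStore API here is an assumption for illustration, not the real memory service:

```typescript
// Illustrative memory-retrieval test against a fake in-memory store.
type MemoryEntry = { tier: string; key: string; value: string }

class FakeMemoryStore {
  private entries: MemoryEntry[] = []

  insert(entry: MemoryEntry): void {
    this.entries.push(entry)
  }

  // Naive retrieval: substring match on key or value.
  query(text: string): MemoryEntry[] {
    return this.entries.filter(
      (e) => e.key.includes(text) || e.value.includes(text)
    )
  }
}

// 1. Insert known entries into each tier.
const store = new FakeMemoryStore()
store.insert({ tier: 'profile', key: 'name', value: 'Jordan' })
store.insert({ tier: 'episodic', key: 'last-topic', value: 'testing' })

// 2. Run a query that should retrieve one of them.
const results = store.query('Jordan')
// 3. Assert the exact entry appears in the results.
```

The real integration suite would swap the fake store for a test database, but the assertion shape stays the same: known entries in, exact entries out.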

CI recommendations

  • Run unit tests on every commit — fast and free.
  • Run integration tests on pull requests — catch swarm regressions early.
  • Run E2E tests on merge to main — use a test workspace with a low-cost model (gpt-4o-mini) to keep costs manageable.
  • Set a per-run budget in the test workspace config to prevent runaway test costs.
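One way to wire this split in CI is a Vitest workspace that separates suites by file pattern. The file names and glob patterns below are illustrative, not a prescribed layout:

```typescript
// vitest.workspace.ts: one possible split of unit and integration suites
// by file pattern, so CI can run them at different pipeline stages.
import { defineWorkspace } from 'vitest/config'

export default defineWorkspace([
  // Fast, mocked; run on every commit.
  { test: { name: 'unit', include: ['src/**/*.unit.test.ts'] } },
  // Swarm coordination; run on pull requests with a longer timeout.
  {
    test: {
      name: 'integration',
      include: ['src/**/*.int.test.ts'],
      testTimeout: 10_000,
    },
  },
  // E2E runs separately via Playwright, gated to merges to main.
])
```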