Learning & Best Practices

Testing Strategies

Testing AI agents requires a different approach than testing deterministic code. This guide covers unit testing individual agents with mocked dependencies, integration testing swarm behavior under concurrent conditions, and end-to-end testing full workflows against a running gateway.

The agent testing pyramid

| Layer | What it tests | Speed | LLM calls? |
| --- | --- | --- | --- |
| Unit | Agent config, budget enforcement, context assembly, tool selection | Fast (<100 ms) | No — mocked |
| Integration | Swarm coordination, blackboard reads/writes, conflict resolution | Medium (1–5 s) | Optional |
| E2E | Full turn lifecycle, memory persistence, channel delivery | Slow (5–30 s) | Yes |

Unit testing agents

Use createTestAgent from the testing package. It mounts the full agent runtime with mocked inference, memory, and tools. This lets you test budget enforcement, context injection, and tool selection logic without real LLM calls.

```typescript
import { describe, it, expect } from 'vitest'
import { createTestAgent } from '@open-astra/testing'

describe('researcher agent', () => {
  it('respects budget limits', async () => {
    const agent = createTestAgent({
      id: 'researcher',
      budget: { maxToolCallsPerTurn: 3 }
    })

    // Mock memory to return empty results
    agent.mockMemory({ results: [] })

    // Run a turn that would normally trigger many tool calls
    const result = await agent.run('Research the history of the internet')

    expect(result.toolCallCount).toBeLessThanOrEqual(3)
    expect(result.status).toBe('completed')
  })

  it('injects workspace context', async () => {
    const agent = createTestAgent({ id: 'assistant' })
    agent.mockWorkspace({ name: 'ACME Corp', domain: 'acme.com' })

    const result = await agent.run('Who am I working for?')

    expect(result.content).toContain('ACME Corp')
  })
})
```

Key things to unit test:

  • Budget limits are enforced (maxToolCallsPerTurn, maxCostUsdPerTurn)
  • Workspace template variables resolve correctly
  • Required tools are available and called in the expected order
  • Memory injection includes expected tier results
  • Ethical check blocks expected harmful prompts
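To make the first bullet concrete, here is a minimal sketch of how a per-turn budget might be enforced inside an agent loop. The `TurnBudget` and `BudgetTracker` names are illustrative, not part of the actual runtime:

```typescript
// Illustrative sketch of per-turn budget enforcement (not the real runtime).
interface TurnBudget {
  maxToolCallsPerTurn: number
  maxCostUsdPerTurn: number
}

class BudgetTracker {
  private toolCalls = 0
  private costUsd = 0

  constructor(private budget: TurnBudget) {}

  // Returns false once another tool call would exceed the per-turn limit.
  canCallTool(): boolean {
    return this.toolCalls < this.budget.maxToolCallsPerTurn
  }

  recordToolCall(costUsd: number): void {
    this.toolCalls += 1
    this.costUsd += costUsd
    if (this.costUsd > this.budget.maxCostUsdPerTurn) {
      throw new Error(`Turn cost $${this.costUsd} exceeds budget`)
    }
  }
}
```

A unit test then only needs to drive the tracker past its limits and assert that `canCallTool` flips and the cost ceiling throws.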

Integration testing swarms

Swarm integration tests should specifically exercise concurrent write scenarios — the most common source of bugs in multi-agent systems. createTestSwarm lets you simulate concurrent blackboard writes and verify that the coordinating agent handles them correctly.

```typescript
import { describe, it, expect } from 'vitest'
import { createTestSwarm } from '@open-astra/testing'

describe('code review swarm', () => {
  it('resolves conflicts on the blackboard', async () => {
    const swarm = createTestSwarm({
      agents: ['reviewer-a', 'reviewer-b', 'synthesizer'],
      blackboard: {}
    })

    // Simulate conflicting reviews written concurrently
    await swarm.simulateConcurrent([
      () => swarm.agent('reviewer-a').writeTo('review', { verdict: 'approve' }),
      () => swarm.agent('reviewer-b').writeTo('review', { verdict: 'reject', reason: 'missing tests' }),
    ])

    const synthesis = await swarm.agent('synthesizer').run('Synthesize the reviews')

    // Synthesizer must handle conflict — not silently drop one verdict
    expect(synthesis.content).toContain('missing tests')
    expect(synthesis.blackboardReads).toContain('review')
  })
})
```

Key swarm scenarios to test:

  • Conflicting writes to the same blackboard key
  • Agent completing before a dependency is ready (required: true reads)
  • Swarm completing with one agent timed out
  • Meta-controller correctly aggregating results
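The first scenario hinges on the blackboard not silently dropping writes. Here is a minimal sketch of a conflict-aware blackboard that keeps every write to a key so a synthesizer can see all of them; the names are illustrative, not the library's implementation:

```typescript
// Illustrative sketch: a blackboard that records every write to a key,
// so concurrent conflicting writes stay visible instead of last-write-wins.
type BlackboardEntry = { agentId: string; value: unknown; writtenAt: number }

class ConflictAwareBlackboard {
  private entries = new Map<string, BlackboardEntry[]>()

  write(key: string, agentId: string, value: unknown): void {
    const list = this.entries.get(key) ?? []
    list.push({ agentId, value, writtenAt: Date.now() })
    this.entries.set(key, list)
  }

  // All writes for a key; more than one signals a conflict to resolve.
  readAll(key: string): BlackboardEntry[] {
    return this.entries.get(key) ?? []
  }

  hasConflict(key: string): boolean {
    return this.readAll(key).length > 1
  }
}
```

An integration test can then assert that after two concurrent writes, `hasConflict('review')` is true and both verdicts are still readable.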

End-to-end workflow testing

E2E tests run against a real gateway (typically a local test instance) and verify the full turn lifecycle: message in → inference → tool calls → memory write → response out. These are slower and make real LLM calls, so keep them focused on critical paths.

```typescript
import { test, expect } from '@playwright/test'
import { AstraTestClient } from '@open-astra/testing/e2e'

test('end-to-end: user message triggers memory write', async () => {
  const client = new AstraTestClient({ baseUrl: 'http://localhost:3000' })

  // Authenticate and create a session
  const token = await client.auth('test@example.com', 'password')
  const session = await client.createSession({ agentId: 'assistant' })

  // Send a message that should be remembered
  const response = await client.sendMessage(session.id, 'My name is Jordan')
  expect(response.content).toBeTruthy()

  // Wait for async post-turn save
  await client.waitForMemoryWrite(session.id, { key: 'name', value: 'Jordan' })

  // Verify memory was persisted
  const profile = await client.getUserProfile()
  expect(profile.name).toBe('Jordan')
})
```
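A helper like waitForMemoryWrite is typically a poll-until-deadline loop around an API check. A generic sketch follows; the signature and defaults are assumptions for illustration, not the library's API:

```typescript
// Illustrative polling helper in the spirit of waitForMemoryWrite:
// retries an async check until it passes or a deadline expires.
async function waitFor(
  check: () => Promise<boolean>,
  { timeoutMs = 10_000, intervalMs = 250 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    if (await check()) return
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error(`Condition not met within ${timeoutMs} ms`)
}
```

Polling with an explicit deadline keeps E2E tests deterministic: a missing memory write fails with a timeout error instead of hanging the suite.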

Key E2E scenarios to cover:

  • New user session creates memory entries
  • Follow-up questions use memory from previous turns
  • Channel message (Telegram/Slack) routes to correct agent
  • Quota exceeded returns correct error response
  • Approval required flow blocks and resumes correctly

Testing memory in isolation

Memory bugs are subtle — retrieved context that's wrong, stale, or missing is hard to catch without dedicated memory tests. Write separate tests that:

  1. Insert known entries into each tier
  2. Run a query that should retrieve them
  3. Assert exact entries appear in the assembled context

Use agent.mockMemory({ results: [] }) in unit tests to isolate agent logic from memory retrieval, and cover retrieval itself in a dedicated memory integration suite that runs directly against a test database.
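The three steps above can be sketched against a fake in-memory store. The FakeMemoryStore API here is an assumption for illustration, not the real memory service:

```typescript
// Illustrative memory-retrieval test against a fake in-memory store.
type MemoryEntry = { tier: string; key: string; value: string }

class FakeMemoryStore {
  private entries: MemoryEntry[] = []

  insert(entry: MemoryEntry): void {
    this.entries.push(entry)
  }

  // Naive retrieval: substring match on key or value.
  query(text: string): MemoryEntry[] {
    return this.entries.filter(
      (e) => e.key.includes(text) || e.value.includes(text)
    )
  }
}

// 1. Insert known entries into each tier.
const store = new FakeMemoryStore()
store.insert({ tier: 'profile', key: 'name', value: 'Jordan' })
store.insert({ tier: 'episodic', key: 'last-topic', value: 'testing' })

// 2. Run a query that should retrieve one of them.
const results = store.query('Jordan')
// 3. Assert the exact entry appears in the results.
```

The real integration suite would swap the fake store for a test database, but the assertion shape stays the same: known entries in, exact entries out.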

CI recommendations

  • Run unit tests on every commit — fast and free.
  • Run integration tests on pull requests — catch swarm regressions early.
  • Run E2E tests on merge to main — use a test workspace with a low-cost model (gpt-4o-mini) to keep costs manageable.
  • Set a per-run budget in the test workspace config to prevent runaway test costs.
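One way to wire this split in CI is a Vitest workspace that separates suites by file pattern. The file names and glob patterns below are illustrative, not a prescribed layout:

```typescript
// vitest.workspace.ts: one possible split of unit and integration suites
// by file pattern, so CI can run them at different pipeline stages.
import { defineWorkspace } from 'vitest/config'

export default defineWorkspace([
  // Fast, mocked; run on every commit.
  { test: { name: 'unit', include: ['src/**/*.unit.test.ts'] } },
  // Swarm coordination; run on pull requests with a longer timeout.
  {
    test: {
      name: 'integration',
      include: ['src/**/*.int.test.ts'],
      testTimeout: 10_000,
    },
  },
  // E2E runs separately via Playwright, gated to merges to main.
])
```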