Backup & Recovery Runbook

This runbook covers backup procedures for PostgreSQL and Typesense, workspace-scoped memory restore, RTO/RPO targets, and platform-specific guidance for DigitalOcean Managed Databases.

RTO / RPO targets

| Scenario | RPO (data loss tolerance) | RTO (downtime tolerance) |
| --- | --- | --- |
| Full instance failure | Up to 24 hours (daily backup) | 30–60 min (restore + verify) |
| Accidental workspace deletion | Up to 24 hours (daily backup) | 15–30 min (workspace restore) |
| Corrupt memory data | Up to 24 hours (daily backup) | 10–20 min (table-level restore) |
| DigitalOcean PITR (Point-in-Time Recovery) | Up to 1 minute (WAL streaming) | 20–40 min (cluster rebuild) |

The default daily backup schedule achieves 24-hour RPO. Enable WAL archiving or use DigitalOcean's PITR to reduce RPO to minutes.
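
The 24-hour RPO only holds if the daily job actually ran. A minimal sketch of a freshness check follows; it assumes dumps land in a local directory as `*.dump` files, and the function name and directory are illustrative rather than part of Astra.

```shell
# Succeeds when at least one *.dump file in the directory was modified less
# than max_age_min minutes ago (default 1440 = 24 hours), fails otherwise.
# The directory layout and *.dump naming are assumptions for this sketch.
backup_within_rpo() {
  local dir="$1" max_age_min="${2:-1440}"
  local recent
  # find prints only dumps younger than the threshold; one hit is enough
  recent=$(find "$dir" -maxdepth 1 -type f -name '*.dump' \
    -mmin "-$max_age_min" -print 2>/dev/null | head -n 1)
  [ -n "$recent" ]
}

# Example use from a monitoring job:
#   backup_within_rpo /var/backups/astra 1440 || echo "RPO breached" >&2
```

A non-zero exit can be wired into whatever alerting is already in place.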

What to back up

| System | Contains | Backup method | Criticality |
| --- | --- | --- | --- |
| PostgreSQL | All agent data: sessions, turns, memory, knowledge graph, workspace config | pg_dump / PITR | Critical |
| Typesense | Search index: a derived copy of memory_entries and documents | Collection export (JSONL) | High (can be rebuilt from PG, but takes time) |
| astra.yml | Agent, skill, tool, channel configuration | Git version control | High |
| Environment variables / secrets | API keys, DB credentials | Secrets manager snapshot | Critical |

Typesense data is a derived index; it can be fully rebuilt from PostgreSQL using npx astra reindex. Back it up to reduce recovery time, but PostgreSQL is the source of truth.

PostgreSQL backup

Use pg_dump for logical backups. Schedule daily runs with a cron job and store the output in object storage (S3, DigitalOcean Spaces):

bash
# Full logical backup (all workspaces)
pg_dump \
  --format=custom \
  --compress=9 \
  --no-acl \
  --no-owner \
  "$DATABASE_URL" \
  -f backup-$(date +%Y%m%d-%H%M%S).dump

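# Workspace-scoped backup (single tenant)
# pg_dump has no --where option and cannot filter rows, so this dumps the
# workspace-bearing tables in full (all workspaces). Rows are filtered by
# workspace_id at restore time; treat the file as multi-tenant data.
pg_dump \
  --format=custom \
  --no-acl \
  --no-owner \
  --table='memory_entries' \
  --table='knowledge_graph_*' \
  --table='sessions' \
  --table='turns' \
  "$DATABASE_URL" \
  -f ws-acme-$(date +%Y%m%d).dump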
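
The daily cron run and object-storage upload mentioned above could be wrapped roughly as follows. This is a sketch under assumptions: `BACKUP_DIR`, `BACKUP_BUCKET`, `S3_ENDPOINT`, and the script path in the crontab comment are hypothetical names, and the upload uses the AWS CLI, which can also talk to DigitalOcean Spaces when given `--endpoint-url`.

```shell
# Daily backup wrapper intended to be called from cron, for example:
#   0 2 * * * /opt/astra/bin/daily-backup.sh >> /var/log/astra-backup.log 2>&1
# BACKUP_DIR, BACKUP_BUCKET and S3_ENDPOINT are assumed names for this sketch.
daily_backup() {
  local dir="${BACKUP_DIR:-/var/backups/astra}"
  local file
  file="$dir/backup-$(date +%Y%m%d-%H%M%S).dump"

  mkdir -p "$dir" || return 1

  # Dump first; upload only if the dump succeeded.
  pg_dump --format=custom --compress=9 --no-acl --no-owner \
    "$DATABASE_URL" -f "$file" || { echo "pg_dump failed" >&2; return 1; }

  # --endpoint-url is only added when S3_ENDPOINT is set (needed for Spaces).
  aws s3 cp "$file" "s3://$BACKUP_BUCKET/postgres/" \
    ${S3_ENDPOINT:+--endpoint-url "$S3_ENDPOINT"} >&2 \
    || { echo "upload failed" >&2; return 1; }

  # Print the local path so callers can log or verify it.
  echo "$file"
}
```

Returning non-zero on either failure lets cron mail or a monitoring wrapper notice a missed backup.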

PostgreSQL restore

bash
# Full restore to a new database
pg_restore \
  --clean \
  --if-exists \
  --no-acl \
  --no-owner \
  -d "$DATABASE_URL" \
  backup-20260101-120000.dump

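# Restore a single workspace's memory to a running instance
# 1. Restore into a scratch database. pg_restore recreates tables under their
#    original names, so it cannot load into a staging table on the live database.
#    STAGING_DATABASE_URL is an assumed variable pointing at an empty database
#    with the same extensions installed (for example pgvector).
pg_restore \
  --no-owner \
  --table=memory_entries \
  --table=knowledge_graph_nodes \
  --table=knowledge_graph_edges \
  -d "$STAGING_DATABASE_URL" \
  ws-acme-20260101.dump

# 2. Copy the workspace's rows into a staging table on the live database
psql "$DATABASE_URL" -c \
  "CREATE TABLE memory_entries_staging (LIKE memory_entries INCLUDING DEFAULTS)"
psql "$STAGING_DATABASE_URL" -c \
  "COPY (SELECT * FROM memory_entries WHERE workspace_id = 'ws_acme') TO STDOUT" \
  | psql "$DATABASE_URL" -c "COPY memory_entries_staging FROM STDIN"

# 3. Promote staging rows to live (in psql), then drop the staging table
INSERT INTO memory_entries SELECT * FROM memory_entries_staging
  WHERE workspace_id = 'ws_acme'
  ON CONFLICT (id) DO NOTHING;
DROP TABLE memory_entries_staging;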

Restoring a single workspace's memory

To restore one workspace without touching others:

  1. Restore the dump into a scratch database using --table flags. pg_restore recreates tables under their original names, so do not point it at the live database.
  2. Filter rows by workspace_id and insert into live tables with ON CONFLICT DO NOTHING to avoid overwriting newer data.
  3. Re-trigger search indexing for the workspace: npx astra reindex --workspace ws_acme.
  4. Verify memory is accessible: send a test agent turn and check retrieved memory results.
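
Before sending a test turn in step 4, a quick row count can confirm that the workspace has memory rows at all. The sketch below assumes only what this runbook shows (a `memory_entries` table with a `workspace_id` column); the function names are illustrative.

```shell
# Print the number of memory_entries rows for one workspace.
# The SQL is fed on stdin because psql's :'var' interpolation is not
# applied to commands passed with -c.
workspace_memory_count() {
  echo "SELECT count(*) FROM memory_entries WHERE workspace_id = :'ws';" \
    | psql "$DATABASE_URL" -X -At -v ws="$1"
}

# Fails (non-zero) when the workspace has no memory rows after the restore.
verify_workspace_restore() {
  local workspace="$1" count
  count=$(workspace_memory_count "$workspace") || return 1
  echo "memory_entries rows for $workspace: $count"
  [ "$count" -gt 0 ]
}

# Example: verify_workspace_restore ws_acme
```

A non-zero count does not prove retrieval works end to end, so the test agent turn is still worth doing.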

Typesense backup

Export each collection as JSONL. Run the export together with the PostgreSQL dump and store the files side by side, so both reflect the same point in time:

bash
# Export all collections (run against your Typesense host)
for collection in $(curl -s "http://localhost:8108/collections" \
    -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" | jq -r '.[].name'); do
  curl -s "http://localhost:8108/collections/$collection/documents/export" \
    -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
    > "typesense-$collection-$(date +%Y%m%d).jsonl"
  echo "Exported $collection"
done
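
An export that was cut short still produces a file, so it can help to compare the exported line count with the document count Typesense reports. The following is a sketch, not part of Astra; `check_export` is a hypothetical helper, and the expected count is read from the `num_documents` field of the collection endpoint.

```shell
# Compare the number of exported documents with an expected count.
# Each JSONL line is one document, so non-empty lines are counted.
check_export() {
  local file="$1" expected="$2" actual
  [ -f "$file" ] || { echo "$file is missing" >&2; return 1; }
  # grep -c exits non-zero when it counts 0 lines, hence the || true
  actual=$(grep -c . "$file" || true)
  if [ "$actual" -ne "$expected" ]; then
    echo "$file: expected $expected documents, found $actual" >&2
    return 1
  fi
  echo "$file: $actual documents"
}

# Example, reusing the host and key from the export loop:
#   expected=$(curl -s "http://localhost:8108/collections/memory_ws_acme" \
#     -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" | jq -r '.num_documents')
#   check_export "typesense-memory_ws_acme-20260101.jsonl" "$expected"
```

If writes continue during the export the two numbers can differ slightly, so a mismatch is a prompt to look closer rather than proof of a bad backup.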

Typesense restore

bash
# Re-import a collection (collection must exist with correct schema first)
curl -X POST "http://localhost:8108/collections/memory_ws_acme/documents/import?action=upsert" \
  -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
  -H "Content-Type: text/plain" \
  --data-binary @typesense-memory_ws_acme-20260101.jsonl

If the collection schema has changed since the backup was taken, recreate the collection with the new schema before importing. The collection schema is defined in src/search/collections.ts.

DigitalOcean Managed Database specifics

DigitalOcean Managed PostgreSQL provides automated daily backups and optional Point-in-Time Recovery (PITR). PITR is available on Business-tier clusters and allows restoring to any second within the last 7 days.

bash
# DigitalOcean Managed DB: list the automated backups available for a cluster
doctl databases backups <database-id>

# Set the weekly maintenance window. This does not trigger a backup;
# before a risky migration, take a manual pg_dump instead.
doctl databases maintenance-window update <database-id> \
  --day saturday --hour "02:00"

# Restore to a new cluster from a backup
# (confirm flag names and the expected timestamp format with
#  `doctl databases create --help` for your doctl version)
doctl databases create astra-restore \
  --engine pg \
  --version 16 \
  --restore-from-cluster-name astra-db \
  --restore-from-timestamp "2026-01-01T12:00:00Z"

Key DO-specific notes:

  • pgvector is pre-installed on DigitalOcean Managed PostgreSQL 14+. You do not need to install it manually after restore.
  • PITR restores create a new cluster — update DATABASE_URL in your environment after the restore cluster is ready.
  • Connection pooling (PgBouncer) is a separate service — after restoring to a new cluster, update the pooler to point to the new host.
  • Backups are stored in the same region as your cluster. For disaster recovery across regions, add a read replica in another region or ship pg_dump output to a Spaces bucket in a different region.

Recommended backup schedule

| Frequency | Method | Retention |
| --- | --- | --- |
| Hourly | WAL archiving to S3/Spaces (if enabled) | 7 days |
| Daily | pg_dump + Typesense JSONL export | 30 days |
| Weekly | Full pg_dump snapshot | 90 days |
| Before migrations | Manual pg_dump | Keep until migration verified |
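
For copies kept on local disk, the retention windows above could be enforced with a small prune step. This sketch assumes daily and weekly backups live in separate directories, which is an illustrative layout; for object storage, bucket lifecycle rules are usually the better tool.

```shell
# Delete backup files older than a retention window, printing each one.
# Only *.dump and *.jsonl files directly inside the directory are touched.
prune_backups() {
  local dir="$1" days="$2"
  # -mtime +N matches files last modified more than N full days ago
  find "$dir" -maxdepth 1 -type f \( -name '*.dump' -o -name '*.jsonl' \) \
    -mtime "+$days" -print -exec rm -f {} +
}

# Example, matching the schedule above (directory names are assumptions):
#   prune_backups /var/backups/astra/daily 30
#   prune_backups /var/backups/astra/weekly 90
```

Pre-migration dumps are better kept outside these directories, since their retention depends on the migration being verified rather than on age.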

Recovery checklist

  1. Stop the gateway (docker compose down astra) to prevent writes during restore.
  2. Restore PostgreSQL from backup.
  3. Run migrations to ensure schema is current: npx drizzle-kit migrate.
  4. Restore Typesense collections, or rebuild from PG: npx astra reindex.
  5. Restart the gateway: docker compose up -d astra.
  6. Run npx astra doctor and verify all agents report healthy.
  7. Send a test message to a representative agent and verify memory retrieval is working.
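
Steps 1 to 6 of the checklist can be chained so that a failure stops the sequence instead of restarting the gateway on a half-restored database. The sketch below reuses the commands from this runbook as-is; the function name is hypothetical, and it takes the Typesense path of rebuilding from PostgreSQL.

```shell
# Run checklist steps 1-6 in order and stop at the first failure.
# Step 7 (sending a test message) stays manual.
recover_from_dump() {
  local dump="$1"
  [ -f "$dump" ] || { echo "dump not found: $dump" >&2; return 1; }

  # && ensures the gateway is only restarted after restore, migrations
  # and reindex have all succeeded.
  docker compose down astra \
    && pg_restore --clean --if-exists --no-acl --no-owner \
         -d "$DATABASE_URL" "$dump" \
    && npx drizzle-kit migrate \
    && npx astra reindex \
    && docker compose up -d astra \
    && npx astra doctor
}

# Example: recover_from_dump backup-20260101-120000.dump
```

If the function fails midway, the gateway stays down, which is the safe state for investigating before retrying.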