PII Detection

Open Astra scans agent turn content for Personally Identifiable Information (PII) and secrets before persisting it to memory. The scanner runs as part of the post-turn save pipeline and can be configured per agent to block, redact, or allow sensitive data.

What is detected

The scanner uses two complementary techniques: regex pattern matching for known formats and Shannon entropy analysis for high-entropy secrets.

PII patterns

text

# PII types detected
EMAIL         — email@example.com
PHONE_E164    — +14155552671  (international format)
PHONE_US      — (415) 555-2671  (US format)
SSN           — 123-45-6789  (validated: no 000/666/9xx prefix)
CREDIT_CARD   — 4111 1111 1111 1111  (Luhn-validated)
IP_ADDRESS    — 192.168.1.1  (validated: each octet 0-255)
PASSPORT      — AB1234567

Credit card numbers are additionally validated with the Luhn algorithm; numbers that fail the checksum are not flagged. IP addresses are validated to ensure each octet is in the range 0-255.
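The two validation steps can be sketched as follows. This is a minimal illustration, not the Open Astra implementation: the function names and the accepted card length of 13 to 19 digits are assumptions.

```typescript
// Hypothetical helpers; names are illustrative, not taken from the Open Astra source.

// Luhn checksum: walking from the rightmost digit, double every second digit,
// subtract 9 from any result above 9, and require the total to be divisible by 10.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, ""); // ignore spaces and dashes
  if (!/^\d{13,19}$/.test(digits)) return false; // assumed length range
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// IPv4 check: exactly four decimal octets, each in the range 0-255.
function isValidIpv4(candidate: string): boolean {
  const parts = candidate.split(".");
  if (parts.length !== 4) return false;
  return parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}
```

With these helpers, the sample card number 4111 1111 1111 1111 passes the checksum, while 256.1.1.1 is rejected as an IP address because its first octet is out of range.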

Secret patterns

text

# Secret types detected
aws-access-key  — AKIA + 16 uppercase alphanumeric chars
jwt             — three base64url segments (eyJ...)
stripe-key      — sk_live_ or sk_test_ prefix + 24+ chars
high-entropy    — any 20+ char token with Shannon entropy > 3.5 bits/char

Shannon entropy analysis catches encoded secrets (base64 API keys, hex tokens, bcrypt hashes) that don't match a known prefix pattern. Any token of 20+ characters with entropy above 3.5 bits per character is treated as a potential secret.
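A rough sketch of how the known-prefix patterns and the entropy fallback might combine is shown here. The regular expressions are reconstructed from the descriptions above and may differ from those in the scanner; `classifyToken`, `SECRET_PATTERNS`, and `MIN_TOKEN_LENGTH` are hypothetical names, and the assumption that a Stripe key body is alphanumeric is mine.

```typescript
// Illustrative patterns built from the descriptions above. The exact
// expressions used by the real scanner may differ.
const SECRET_PATTERNS: Array<{ type: string; regex: RegExp }> = [
  { type: "aws-access-key", regex: /^AKIA[A-Z0-9]{16}$/ },
  { type: "jwt", regex: /^eyJ[\w-]+\.[\w-]+\.[\w-]+$/ },
  // Assumption: the 24+ characters after the prefix are alphanumeric.
  { type: "stripe-key", regex: /^sk_(?:live|test)_[A-Za-z0-9]{24,}$/ },
];

const MIN_TOKEN_LENGTH = 20; // hypothetical constant name
const ENTROPY_THRESHOLD = 3.5; // bits per character

// Shannon entropy in bits per character: H = -sum(p * log2(p)) over the
// relative frequency p of each distinct character in the token.
function shannonEntropy(token: string): number {
  if (token.length === 0) return 0;
  const counts = new Map<string, number>();
  for (let i = 0; i < token.length; i++) {
    const ch = token.charAt(i);
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  let entropy = 0;
  counts.forEach((n) => {
    const p = n / token.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

// Returns the secret type for a single token, or null if nothing matches.
function classifyToken(token: string): string | null {
  for (const { type, regex } of SECRET_PATTERNS) {
    if (regex.test(token)) return type;
  }
  if (token.length >= MIN_TOKEN_LENGTH && shannonEntropy(token) > ENTROPY_THRESHOLD) {
    return "high-entropy";
  }
  return null;
}
```

For intuition, a 32-character hex string in which every hex digit appears equally often scores 4 bits per character and is flagged, whereas a token made of one repeated character scores 0 and is not.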

Detection policy

Set piiDetectionPolicy per agent. The policy applies to content written to memory tiers during post-turn save:

yaml

agents:
  - id: assistant
    piiDetectionPolicy: redact   # block | redact | allow (default)

# block  — turns containing PII are rejected with an error before memory write
# redact — PII is replaced with [REDACTED:TYPE] before persisting to memory
# allow  — PII passes through unchanged (not recommended for production)
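
The three policies amount to a small decision step before the memory write. The sketch below shows one way to express it; `ScanResult`, `SaveDecision`, and `decideSave` are hypothetical names, not part of the Open Astra API.

```typescript
type PiiDetectionPolicy = "block" | "redact" | "allow";

// Hypothetical shape of a scan outcome: whether anything was found, plus
// the content with typed markers already substituted.
interface ScanResult {
  hasPii: boolean;
  redacted: string;
}

type SaveDecision =
  | { action: "write"; content: string }
  | { action: "reject"; reason: string };

// Decide what, if anything, is persisted to memory for a turn.
function decideSave(
  content: string,
  scan: ScanResult,
  policy: PiiDetectionPolicy = "allow", // "allow" is the documented default
): SaveDecision {
  if (!scan.hasPii || policy === "allow") {
    return { action: "write", content }; // unchanged content
  }
  if (policy === "block") {
    return { action: "reject", reason: "turn contains PII" };
  }
  return { action: "write", content: scan.redacted }; // policy === "redact"
}
```

Modeling the outcome as a tagged union keeps the block case explicit, so a caller cannot accidentally write content from a rejected turn.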

How redaction works

Redacted content replaces each detected PII value with a typed marker. The original value is never stored. Markers include the PII type so the redacted text remains readable:

text

# Input turn content:
"My name is Alex, email me at alex@example.com and call +14155552671"

# After redaction (stored in memory):
"My name is Alex, email me at [REDACTED:EMAIL] and call [REDACTED:PHONE_E164]"

For secrets, the marker preserves the first 4 characters of the value for debugging, for example [REDACTED:AKIA***].
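As a minimal sketch of the marker substitution, the following reproduces the example above for two PII types and builds the secret marker. The regular expressions are simplified assumptions, and `redactPii` and `secretMarker` are hypothetical names.

```typescript
// Simplified, assumed patterns for two of the PII types listed earlier.
const PII_PATTERNS: Array<{ type: string; regex: RegExp }> = [
  { type: "EMAIL", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { type: "PHONE_E164", regex: /\+[1-9]\d{7,14}\b/g },
];

// Replace every match with a typed marker so the original value is dropped.
function redactPii(content: string): string {
  let out = content;
  for (const { type, regex } of PII_PATTERNS) {
    out = out.replace(regex, `[REDACTED:${type}]`);
  }
  return out;
}

// Secret markers keep only the first 4 characters of the detected value.
function secretMarker(secret: string): string {
  return `[REDACTED:${secret.slice(0, 4)}***]`;
}

const stored = redactPii(
  "My name is Alex, email me at alex@example.com and call +14155552671",
);
console.log(stored);
```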

Where it runs

PII detection runs in post-turn-save.ts, after inference completes and before memory entries are written. It does not inspect the content sent to the model (the live inference context), only the content being persisted to the memory tiers. This means:

Users can still reference PII in conversation; the agent sees it during the turn.
With the redact or block policy, the PII is not stored in memory and will not appear in future retrieved context.
With the block policy, the entire turn save is rejected, which may prevent the conversation from being recorded.

Reliability guarantee

The scanner is designed to never throw. If pattern matching or entropy analysis fails for any reason, the scanner returns hasPotentialSecrets: false and the original content passes through unchanged. This ensures a scanner bug cannot break the memory pipeline. The trade-off is that the scanner fails open: when it errors, content is persisted unscanned.
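The never-throw behavior can be pictured as a wrapper around the fallible scanning logic. This is a sketch under assumptions: only the hasPotentialSecrets field comes from the description above, while the `content` field, `SecretScanResult`, `safeScan`, and the `scanUnsafe` callback are hypothetical.

```typescript
// Assumed result shape; only `hasPotentialSecrets` is named in the docs.
interface SecretScanResult {
  hasPotentialSecrets: boolean;
  content: string;
}

// `scanUnsafe` is a hypothetical stand-in for the pattern and entropy checks,
// either of which could throw on unexpected input.
function safeScan(
  content: string,
  scanUnsafe: (c: string) => SecretScanResult,
): SecretScanResult {
  try {
    return scanUnsafe(content);
  } catch {
    // Fail open: report no secrets and pass the original content through.
    return { hasPotentialSecrets: false, content };
  }
}
```

Catching at the outermost boundary means a failure in any individual pattern degrades to an unscanned save rather than an aborted one.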

Limitations

Detection is heuristic: the entropy check will produce false positives on high-entropy non-secrets (e.g. long UUIDs, checksums). If needed, tune it by adjusting ENTROPY_THRESHOLD in src/secrets/scanner.ts.
PII inside embedded documents (PDFs, images) is not scanned — only the text content of agent turns.
Detection runs on persisted content only, not on the real-time agent context window.