Multimodal Input
The multimodal input handler accepts image, audio, PDF, and video inputs, routes each file to the appropriate processing pipeline, and injects the processed output into the agent's context.
Supported formats
| Type | Formats | Max size | Processing pipeline |
|---|---|---|---|
| Image | png, jpg, webp, gif | 20 MB | Resize → base64 encode → vision model routing |
| Audio | mp3, wav, m4a | 25 MB | Whisper transcription → text injection |
| PDF | pdf | 50 MB | Text extraction → chunking → context injection |
| Video | mp4, mov | 100 MB | Frame sampling → vision model routing |
How routing works
When a file is received, the handler performs MIME type detection from the file header bytes (not the extension). The detected MIME type determines which pipeline is selected. After processing, the output is injected into the agent's context window as structured content before the model is invoked.
- MIME type detection — file header bytes are read to determine the true content type
- Pipeline selection — the detected MIME type maps to one of four processing pipelines
- Processing — the file passes through the selected pipeline (transcription, extraction, encoding, or frame sampling)
- Context injection — processed output is appended to the agent's context as a structured content block before inference
Sending multimodal input
Send files alongside a message by posting to /chat with a multipart/form-data body. The files array accepts any number of attachments up to the configured maxFileSizeMb per file.
curl -X POST http://localhost:3000/chat \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
-F 'message=Summarize the attached document and describe the diagram' \
-F 'agentId=research-agent' \
-F 'files=@report.pdf' \
  -F 'files=@architecture.png'
Image handling
Images are resized to fit within the model's maximum image dimension before encoding. The resized image is base64-encoded and passed directly to the vision model. Animated GIFs are sampled at the first frame only.
| Step | Details |
|---|---|
| Resize | Longest edge capped at 2048 px; aspect ratio preserved |
| Encode | Base64-encoded as data:image/<type>;base64,... |
| Routing | Injected as a vision content block; model must support vision |
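The resize arithmetic and data-URL encoding from the table can be sketched like this. The function names are illustrative; only the 2048 px cap and the data:image/<type>;base64 format come from the table above.

```python
import base64

MAX_EDGE = 2048  # longest-edge cap from the table above

def fit_dimensions(width: int, height: int, max_edge: int = MAX_EDGE) -> tuple[int, int]:
    """Scale down so the longest edge is at most max_edge, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already within bounds; no resize
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

def to_data_url(image_bytes: bytes, subtype: str) -> str:
    """Base64-encode resized image bytes as a data: URL for the vision block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{subtype};base64,{encoded}"
```

For example, a 4096×1024 image is scaled to 2048×512 before encoding.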
Audio handling
Audio files are sent to the Whisper transcription endpoint. The returned transcript is injected into the agent context as a plain text block prefixed with [Transcript]. Language detection is automatic; pass language in the request to override.
PDF handling
PDFs are processed with a text extraction layer. Extracted text is split into overlapping chunks and injected as sequential context blocks. Scanned PDFs without embedded text fall back to OCR processing when available.
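The overlapping-chunk step can be sketched as follows; the chunk size and overlap values here are illustrative, not the handler's defaults.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted PDF text into overlapping chunks.

    Consecutive chunks share `overlap` characters so that sentences
    spanning a boundary appear intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk is then injected as its own sequential context block.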
Configuration
multimodal:
enabled: true
maxFileSizeMb: 50
allowedTypes:
- image/png
- image/jpeg
- image/webp
- image/gif
- audio/mpeg
- audio/wav
- audio/mp4
- application/pdf
- video/mp4
- video/quicktime
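Upload validation against this configuration can be sketched as below. The config is represented as a plain dict (in the service it would be loaded from the YAML above), and validate_upload is a hypothetical name; note that the per-type size limits in the formats table may apply in addition to the global maxFileSizeMb cap.

```python
# Mirror of the YAML configuration above, as a Python dict for the sketch.
CONFIG = {
    "enabled": True,
    "maxFileSizeMb": 50,
    "allowedTypes": {
        "image/png", "image/jpeg", "image/webp", "image/gif",
        "audio/mpeg", "audio/wav", "audio/mp4",
        "application/pdf", "video/mp4", "video/quicktime",
    },
}

def validate_upload(mime_type: str, size_bytes: int, config: dict = CONFIG) -> None:
    """Reject uploads the configuration does not allow."""
    if not config["enabled"]:
        raise ValueError("multimodal input is disabled")
    if mime_type not in config["allowedTypes"]:
        raise ValueError(f"type not allowed: {mime_type}")
    if size_bytes > config["maxFileSizeMb"] * 1024 * 1024:
        raise ValueError("file exceeds maxFileSizeMb")
```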