Multimodal Input
The multimodal input handler accepts image, audio, PDF, and video inputs, routes each file to the appropriate processing pipeline, and injects the processed output into the agent's context.
Supported formats
| Type | Formats | Max size | Processing pipeline |
|---|---|---|---|
| Image | png, jpg, webp, gif | 20 MB | Resize → base64 encode → vision model routing |
| Audio | mp3, wav, m4a | 25 MB | Whisper transcription → text injection |
| PDF | pdf | 50 MB | Text extraction → chunking → context injection |
| Video | mp4, mov | 100 MB | Frame sampling → vision model routing |
How routing works
When a file is received, the handler performs MIME type detection from the file header bytes (not the extension). The detected MIME type determines which pipeline is selected. After processing, the output is injected into the agent's context window as structured content before the model is invoked.
- MIME type detection — file header bytes are read to determine the true content type
- Pipeline selection — the detected MIME type maps to one of four processing pipelines
- Processing — the file passes through the selected pipeline (transcription, extraction, encoding, or frame sampling)
- Context injection — processed output is appended to the agent's context as a structured content block before inference
Sending multimodal input
Send files alongside a message by posting to /chat with a multipart/form-data body. The files array accepts any number of attachments up to the configured maxFileSizeMb per file.
curl -X POST http://localhost:3000/chat \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
-F 'message=Summarize the attached document and describe the diagram' \
-F 'agentId=research-agent' \
-F 'files=@report.pdf' \
  -F 'files=@architecture.png'
Image handling
Images are resized to fit within the model's maximum image dimension before encoding. The resized image is base64-encoded and passed directly to the vision model. Animated GIFs are sampled at the first frame only.
| Step | Details |
|---|---|
| Resize | Longest edge capped at 2048 px; aspect ratio preserved |
| Encode | Base64-encoded as data:image/<type>;base64,... |
| Routing | Injected as a vision content block; model must support vision |
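The resize arithmetic and data-URL encoding from the table can be sketched like this. The function names are illustrative; only the 2048 px cap and the data:image/<type>;base64 format come from the table above.

```python
import base64

MAX_EDGE = 2048  # longest-edge cap from the table above

def fit_dimensions(width: int, height: int, max_edge: int = MAX_EDGE) -> tuple[int, int]:
    """Scale down so the longest edge is at most max_edge, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already within bounds; no resize
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

def to_data_url(image_bytes: bytes, subtype: str) -> str:
    """Base64-encode resized image bytes as a data: URL for the vision block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{subtype};base64,{encoded}"
```

For example, a 4096×1024 image is scaled to 2048×512 before encoding.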
Audio handling
Audio files are sent to the Whisper transcription endpoint. The returned transcript is injected into the agent context as a plain text block prefixed with [Transcript]. Language detection is automatic; pass language in the request to override.
PDF handling
PDFs are processed with a text extraction layer. Extracted text is split into overlapping chunks and injected as sequential context blocks. Scanned PDFs without embedded text fall back to OCR processing when available.
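The overlapping-chunk step can be sketched as follows; the chunk size and overlap values here are illustrative, not the handler's defaults.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted PDF text into overlapping chunks.

    Consecutive chunks share `overlap` characters so that sentences
    spanning a boundary appear intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk is then injected as its own sequential context block.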
Configuration
multimodal:
enabled: true
maxFileSizeMb: 50
allowedTypes:
- image/png
- image/jpeg
- image/webp
- image/gif
- audio/mpeg
- audio/wav
- audio/mp4
- application/pdf
- video/mp4
- video/quicktime
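Upload validation against this configuration can be sketched as below. The config is represented as a plain dict (in the service it would be loaded from the YAML above), and validate_upload is a hypothetical name; note that the per-type size limits in the formats table may apply in addition to the global maxFileSizeMb cap.

```python
# Mirror of the YAML configuration above, as a Python dict for the sketch.
CONFIG = {
    "enabled": True,
    "maxFileSizeMb": 50,
    "allowedTypes": {
        "image/png", "image/jpeg", "image/webp", "image/gif",
        "audio/mpeg", "audio/wav", "audio/mp4",
        "application/pdf", "video/mp4", "video/quicktime",
    },
}

def validate_upload(mime_type: str, size_bytes: int, config: dict = CONFIG) -> None:
    """Reject uploads the configuration does not allow."""
    if not config["enabled"]:
        raise ValueError("multimodal input is disabled")
    if mime_type not in config["allowedTypes"]:
        raise ValueError(f"type not allowed: {mime_type}")
    if size_bytes > config["maxFileSizeMb"] * 1024 * 1024:
        raise ValueError("file exceeds maxFileSizeMb")
```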