Tools

Multimodal Input

The multimodal input handler accepts image, audio, PDF, and video inputs and routes them to the appropriate processing pipeline before injecting them into agent context.

Supported formats

| Type  | Formats             | Max size | Processing pipeline                            |
|-------|---------------------|----------|------------------------------------------------|
| Image | png, jpg, webp, gif | 20 MB    | Resize → base64 encode → vision model routing  |
| Audio | mp3, wav, m4a       | 25 MB    | Whisper transcription → text injection         |
| PDF   | pdf                 | 50 MB    | Text extraction → chunking → context injection |
| Video | mp4, mov            | 100 MB   | Frame sampling → vision model routing          |

How routing works

When a file is received, the handler performs MIME type detection from the file header bytes (not the extension). The detected MIME type determines which pipeline is selected. After processing, the output is injected into the agent's context window as structured content before the model is invoked.

  1. MIME type detection — file header bytes are read to determine the true content type
  2. Pipeline selection — the detected MIME type maps to one of four processing pipelines
  3. Processing — the file passes through the selected pipeline (transcription, extraction, encoding, or frame sampling)
  4. Context injection — processed output is appended to the agent's context as a structured content block before inference
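Steps 1 and 2 can be sketched as follows. The magic-byte signatures are standard for these formats, but the function name and the pipeline map are illustrative, not the handler's actual API:

```typescript
// Step 1: sniff the true content type from the file's header bytes.
// Hypothetical sketch; signatures shown are the standard magic numbers.
function sniffMime(bytes: Uint8Array): string | null {
  const startsWith = (sig: number[], offset = 0) =>
    sig.every((b, i) => bytes[offset + i] === b);

  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "image/png";       // \x89PNG
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image/gif";       // GIF8
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "application/pdf"; // %PDF
  if (startsWith([0x52, 0x49, 0x46, 0x46])) {                         // RIFF container
    if (startsWith([0x57, 0x45, 0x42, 0x50], 8)) return "image/webp"; // ....WEBP
    if (startsWith([0x57, 0x41, 0x56, 0x45], 8)) return "audio/wav";  // ....WAVE
  }
  if (startsWith([0x66, 0x74, 0x79, 0x70], 4)) return "video/mp4";    // ....ftyp
  return null;
}

// Step 2: map the detected MIME type to one of the four pipelines.
const pipelineFor: Record<string, string> = {
  "image/png": "vision", "image/jpeg": "vision",
  "image/webp": "vision", "image/gif": "vision",
  "audio/wav": "transcription", "audio/mpeg": "transcription",
  "application/pdf": "extraction",
  "video/mp4": "frame-sampling",
};
```

Because detection reads header bytes rather than trusting the extension, a renamed file (say, a PDF uploaded as `report.png`) still routes to the correct pipeline.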

Sending multimodal input

Send files alongside a message by posting to `/chat` with a `multipart/form-data` body. The `files` array accepts any number of attachments up to the configured `maxFileSizeMb` per file.

```bash
curl -X POST http://localhost:3000/chat \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -F 'message=Summarize the attached document and describe the diagram' \
  -F 'agentId=research-agent' \
  -F 'files=@report.pdf' \
  -F 'files=@architecture.png'
```

Image handling

Images are resized to fit within the model's maximum image dimension before encoding. The resized image is base64-encoded and passed directly to the vision model. Animated GIFs are sampled at the first frame only.

| Step    | Details                                                      |
|---------|--------------------------------------------------------------|
| Resize  | Longest edge capped at 2048 px; aspect ratio preserved       |
| Encode  | Base64-encoded as `data:image/<type>;base64,...`             |
| Routing | Injected as a vision content block; model must support vision |
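The resize step above amounts to simple scaling math: cap the longest edge and scale the other proportionally. A minimal sketch (the function name is hypothetical):

```typescript
// Scale (width, height) so the longest edge fits within maxEdge,
// preserving aspect ratio. Images already within bounds are untouched.
function fitWithinMaxEdge(
  width: number,
  height: number,
  maxEdge = 2048
): [number, number] {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return [width, height]; // no resize needed
  const scale = maxEdge / longest;
  return [Math.round(width * scale), Math.round(height * scale)];
}

// e.g. a 4096×3072 image scales to 2048×1536
```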

Audio handling

Audio files are sent to the Whisper transcription endpoint. The returned transcript is injected into the agent context as a plain text block prefixed with [Transcript]. Language detection is automatic; pass language in the request to override.
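The resulting context block might be shaped like this. This is a sketch of the injection format only; the function name and block shape are assumptions, though the `[Transcript]` prefix matches the behavior described above:

```typescript
// Wrap a Whisper transcript as a plain text context block,
// prefixed with [Transcript] as described above.
function toTranscriptBlock(transcript: string): { type: "text"; text: string } {
  return { type: "text", text: `[Transcript] ${transcript.trim()}` };
}
```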

PDF handling

PDFs are processed with a text extraction layer. Extracted text is split into overlapping chunks and injected as sequential context blocks. Scanned PDFs without embedded text fall back to OCR processing when available.
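Overlapping chunking means each chunk repeats the tail of the previous one, so sentences that straddle a boundary survive intact in at least one chunk. A minimal sketch; the chunk size and overlap values here are illustrative, not the service's actual defaults:

```typescript
// Split extracted text into overlapping chunks. Each chunk starts
// (chunkSize - overlap) characters after the previous one.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```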

Configuration

```yaml
multimodal:
  enabled: true
  maxFileSizeMb: 50
  allowedTypes:
    - image/png
    - image/jpeg
    - image/webp
    - image/gif
    - audio/mpeg
    - audio/wav
    - audio/mp4
    - application/pdf
    - video/mp4
    - video/quicktime
```
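A hypothetical validation routine mirroring this config, rejecting files whose MIME type is not allowlisted or whose size exceeds `maxFileSizeMb` (the interface and function names are illustrative):

```typescript
interface MultimodalConfig {
  enabled: boolean;
  maxFileSizeMb: number;
  allowedTypes: string[];
}

// Returns null when the upload is acceptable, or a rejection reason.
function validateUpload(
  cfg: MultimodalConfig,
  mime: string,
  sizeBytes: number
): string | null {
  if (!cfg.enabled) return "multimodal input is disabled";
  if (!cfg.allowedTypes.includes(mime)) return `unsupported type: ${mime}`;
  if (sizeBytes > cfg.maxFileSizeMb * 1024 * 1024) {
    return `file exceeds ${cfg.maxFileSizeMb} MB limit`;
  }
  return null;
}
```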