Summary
Add video understanding capability through frame extraction (FFmpeg) and multi-image processing with vision models. This is the emerging-maturity tier of multimodal — viable for async/batch processing, not real-time streaming.
Requirements
- Video content block support in
ChatCompletionRequest
VideoPreprocessor service/middleware: accepts video, extracts N representative frames via FFmpeg/PyAV, passes as multi-image payload to vision model
- Configurable frame extraction strategy (uniform sampling, keyframe detection, scene change)
- Video upload integration with the file upload endpoint (
/v1/files)
- Size limits and format validation (MP4, WebM, MOV)
FIPS Considerations
No blockers. FFmpeg and PyAV are not cryptographic modules. Video codecs are unaffected by FIPS mode. Media at rest encrypted via MinIO SSE (AES-256-GCM).
Implementation Notes
Implement as optional middleware or as an agent-plane tool. Not a core framework requirement — deploy as a separate preprocessing service for deployments that need it. Best vision model for video frames: Qwen3-VL on vLLM (native video support, 256K context). Part of the multimodal initiative. Production maturity: Emerging — suitable for batch workflows, not real-time.
Companion Issues
Companion issues will be filed on fips-agents/gateway-template, fips-agents/ui-template, fips-agents/fips-agents-cli, and fips-agents/examples.
Size
M
Summary
Add video understanding capability through frame extraction (FFmpeg) and multi-image processing with vision models. This is the emerging-maturity tier of multimodal — viable for async/batch processing, not real-time streaming.
Requirements
ChatCompletionRequestVideoPreprocessorservice/middleware: accepts video, extracts N representative frames via FFmpeg/PyAV, passes as multi-image payload to vision model/v1/files)FIPS Considerations
No blockers. FFmpeg and PyAV are not cryptographic modules. Video codecs are unaffected by FIPS mode. Media at rest encrypted via MinIO SSE (AES-256-GCM).
Implementation Notes
Implement as optional middleware or as an agent-plane tool. Not a core framework requirement — deploy as a separate preprocessing service for deployments that need it. Best vision model for video frames: Qwen3-VL on vLLM (native video support, 256K context). Part of the multimodal initiative. Production maturity: Emerging — suitable for batch workflows, not real-time.
Companion Issues
Companion issues will be filed on fips-agents/gateway-template, fips-agents/ui-template, fips-agents/fips-agents-cli, and fips-agents/examples.
Size
M