Skip to content

feat: Video input via frame extraction and vision model processing #103

@rdwj

Description

@rdwj

Summary

Add video understanding capability through frame extraction (FFmpeg) and multi-image processing with vision models. This is the emerging-maturity tier of multimodal — viable for async/batch processing, not real-time streaming.

Requirements

  • Video content block support in ChatCompletionRequest
  • VideoPreprocessor service/middleware: accepts video, extracts N representative frames via FFmpeg/PyAV, passes as multi-image payload to vision model
  • Configurable frame extraction strategy (uniform sampling, keyframe detection, scene change)
  • Video upload integration with the file upload endpoint (/v1/files)
  • Size limits and format validation (MP4, WebM, MOV)

FIPS Considerations

No blockers. FFmpeg and PyAV are not cryptographic modules. Video codecs are unaffected by FIPS mode. Media at rest encrypted via MinIO SSE (AES-256-GCM).

Implementation Notes

Implement as optional middleware or as an agent-plane tool. Not a core framework requirement — deploy as a separate preprocessing service for deployments that need it. Best vision model for video frames: Qwen3-VL on vLLM (native video support, 256K context). Part of the multimodal initiative. Production maturity: Emerging — suitable for batch workflows, not real-time.

Companion Issues

Companion issues will be filed on fips-agents/gateway-template, fips-agents/ui-template, fips-agents/fips-agents-cli, and fips-agents/examples.

Size

M

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions