Replace MoviePy with ffmpeg for 10-100x performance improvement #15
cyberb wants to merge 65 commits into
- Parallel downloads (4 concurrent) instead of sequential
- Increase download chunk size from 8KB to 1MB
- Replace MoviePy video processing with direct ffmpeg calls
- Use ffmpeg concat demuxer with -c copy (no re-encoding)
- Normalize segments to common resolution for reliable concat
- Handle gaps with lightweight ffmpeg-generated black segments
- Merge audio-only tracks using ffmpeg filter_complex
- Remove moviepy and numpy dependencies
- Add ffmpeg to Dockerfile
- Bump version to 2.0.0

Fixes motattack#8
Some MTS Link recordings store only audio in the direct mp4 files, while the HLS delivery endpoint has both video and audio streams. Check the HLS playlist for a video track and download via ffmpeg when detected.
The old _merge_audio_tracks passed all audio files (up to 63+) as simultaneous inputs to a single ffmpeg amix command, which required ffmpeg to hold all delayed audio streams in memory for the full recording duration — causing OOM kills on long recordings. Now audio tracks are pre-delayed individually, then mixed via tree reduction in batches of 8. Also routes all subprocess calls through _run_ffmpeg which logs stderr on failure instead of silently swallowing it.
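The tree reduction described above can be sketched as a pure planning function. This is an illustrative sketch, not the PR's code: `plan_tree_reduce` and the intermediate output names are hypothetical, and the ffmpeg call that would mix each batch is left out.

```python
from typing import List, Tuple

Job = Tuple[List[str], str]  # (inputs to mix, name of the mixed output)


def plan_tree_reduce(tracks: List[str], batch_size: int = 8) -> List[List[Job]]:
    """Plan mixing rounds so no single mix sees more than batch_size inputs."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rounds: List[List[Job]] = []
    current = list(tracks)
    round_no = 0
    while len(current) > 1:
        jobs: List[Job] = []
        for i in range(0, len(current), batch_size):
            batch = current[i:i + batch_size]
            # Hypothetical naming scheme for intermediate files.
            jobs.append((batch, f"round{round_no}_batch{i // batch_size}"))
        rounds.append(jobs)
        # The outputs of this round become the inputs of the next one.
        current = [out for _, out in jobs]
        round_no += 1
    return rounds
```

Under these assumptions, 63 tracks reduce in two rounds: eight batches first, then one final mix of the eight intermediates.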
Recordings with multiple simultaneous feeds (webcam + screen share) have segments with overlapping timestamps. The old code laid them out sequentially, turning a 3hr recording into 10+ hours of concat video. This also caused the audio merge WAVs to be padded to 10hrs each, requiring ~400GB of disk. Added _deduplicate_overlapping() which keeps only the longest segment per time window (186 -> 7 segments in a real test case). Also pass total_duration from the API to _merge_audio_tracks so WAVs are padded to the correct recording length, not the (potentially inflated) concat file duration.
The previous approach materialized each audio track as a full-duration WAV (~1.8GB each for a 3hr recording). With 63 tracks that's ~113GB, filling the disk and crashing with "No space left on device". Now audio tracks are mixed in batches directly with adelay inside the ffmpeg filter graph, outputting compressed m4a (~15MB each). No intermediate WAVs are created. Batch results are tree-reduced and intermediates are deleted immediately after each round.
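A filter graph of the kind described above, with `adelay` applied per input followed by one `amix`, could be assembled like this. It is a minimal sketch: `build_delay_mix_filter`, the `[d0]`/`[mix]` labels, and the stereo assumption (one delay value per channel) are illustrative, not taken from the PR.

```python
from typing import List


def build_delay_mix_filter(offsets_sec: List[float]) -> str:
    """Build a filter_complex string that delays each audio input, then mixes them."""
    if not offsets_sec:
        raise ValueError("need at least one input")
    parts = []
    labels = []
    for i, offset in enumerate(offsets_sec):
        ms = max(0, int(round(offset * 1000)))
        # adelay takes one delay per channel; repeating it assumes stereo input.
        parts.append(f"[{i}:a]adelay={ms}|{ms}[d{i}]")
        labels.append(f"[d{i}]")
    mix_inputs = "".join(labels)
    parts.append(f"{mix_inputs}amix=inputs={len(labels)}:duration=longest[mix]")
    return ";".join(parts)
```

The resulting string would be passed to ffmpeg via `-filter_complex`, with `-map "[mix]"` selecting the mixed output.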
- Add _validate_downloaded_file() to check files with ffprobe after download
- Re-download corrupt files (missing moov atom) up to 2 retries
- Validate existing cached files on disk, re-download if corrupt
- Add _is_valid_media() in processor to skip corrupt files during classification
- Audio batch mixing catches errors and skips failed batches instead of crashing
- If all audio batches fail, output video without audio overlay
Extract presentation.update events from the MTS API to get slide images and their timestamps. Download pre-rendered slide JPGs and composite them with the webcam video in a 1280x720 layout:
- Left 960px: presentation slide
- Top-right 320x180: webcam
- Slides are pre-encoded as 1fps video segments and concatenated into a single track, then overlaid with the webcam in one pass.

Recordings without presentations are unaffected (existing behavior).
Some recordings have tiny thumbnail-sized video segments (192x108) as the first file. The old code used the first segment's resolution for all normalization, resulting in a blurry output. Now scans all segments and picks the largest, with a 640x360 floor.
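The selection rule above can be illustrated with a small helper. This is a sketch under assumptions: `pick_target_resolution` is a hypothetical name, "largest" is taken to mean largest pixel area, and the floor is applied by falling back to 640x360 whenever the winner is smaller in either dimension.

```python
from typing import Iterable, Tuple

Size = Tuple[int, int]  # (width, height)


def pick_target_resolution(sizes: Iterable[Size], floor: Size = (640, 360)) -> Size:
    """Return the largest segment resolution, never going below the floor."""
    sizes = list(sizes)
    if not sizes:
        return floor
    # Assumption: "largest" means the largest pixel area.
    width, height = max(sizes, key=lambda s: s[0] * s[1])
    if width < floor[0] or height < floor[1]:
        return floor
    return (width, height)
```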
When multiple webcams overlap at the same timestamp, the old code kept the longest segment (often a random participant). Now tracks conference ID from the API and prefers the user with the most total segments across the recording — typically the presenter/instructor. Falls back to longest segment when conf_id is unavailable.
The -loop 1 -framerate 1 -t approach could produce millions of frames for long-duration slides (e.g., last slide staying up for 3 hours), causing ffmpeg to spin for hours and write gigabytes. Now uses -frames:v to strictly cap frame count to match duration at 1fps.
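A command built along the lines described above might look as follows. This is a hedged sketch: `slide_segment_args`, the rounding rule (round up, at least one frame), and the `-pix_fmt yuv420p` choice are assumptions; only `-loop`, `-framerate`, and `-frames:v` come from the commit message.

```python
import math
from typing import List


def slide_segment_args(image_path: str, duration_sec: float, out_path: str,
                       fps: int = 1) -> List[str]:
    """Build ffmpeg arguments that turn one slide image into a fixed-length clip."""
    # Cap the output by frame count so a long-lived slide cannot run away.
    frames = max(1, math.ceil(duration_sec * fps))
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-framerate", str(fps), "-i", image_path,
        "-frames:v", str(frames),
        "-pix_fmt", "yuv420p",
        out_path,
    ]
```

A slide that stays up for three hours then yields exactly 10800 frames at 1fps.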
Detects h264_nvenc at startup and uses it for all encoding steps if available. Falls back to libx264 CPU encoding if no GPU. Massively reduces CPU load and encoding time on systems with NVIDIA GPUs, while keeping the CPU cool.
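One way to perform such a startup check is to look for `h264_nvenc` in the output of `ffmpeg -encoders`. The sketch below makes that assumption; the function names are hypothetical, and an encoder being listed only shows it was compiled in, so a real check would likely also attempt a short test encode.

```python
import subprocess


def parse_has_nvenc(encoders_output: str) -> bool:
    """Return True if an `ffmpeg -encoders` listing mentions h264_nvenc."""
    return any("h264_nvenc" in line.split() for line in encoders_output.splitlines())


def pick_h264_encoder() -> str:
    """Prefer NVENC when ffmpeg lists it, otherwise fall back to libx264."""
    try:
        listing = subprocess.run(
            ["ffmpeg", "-hide_banner", "-encoders"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # ffmpeg missing or failing: use the CPU encoder name as the safe default.
        return "libx264"
    return "h264_nvenc" if parse_has_nvenc(listing) else "libx264"
```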
The overlay step was CPU-bound (97°C). Now uses hwupload_cuda, scale_cuda, and overlay_cuda to do the compositing entirely on GPU. Falls back to CPU filters if CUDA overlay is not available.
Two changes:
1. Swap inputs in slide compositing so webcam (25fps) drives the output frame clock instead of the slide track (1fps). Fixes choppy webcam playback in presentation videos.
2. For recordings without presentation slides that have multiple concurrent webcams (ПЗ sessions), composite all active webcams into a grid layout using xstack instead of discarding all but one. Grid size adapts to the number of concurrent webcams (2x1, 2x2, 3x3 etc). Audio from all participants is mixed.
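The adaptive grid size can be computed with integer arithmetic. This is an illustrative sketch; `grid_shape` is a hypothetical helper and the near-square rule is an assumption that happens to reproduce the 2x1, 2x2, and 3x3 examples above.

```python
import math
from typing import Tuple


def grid_shape(count: int) -> Tuple[int, int]:
    """Return (columns, rows) of the smallest near-square grid holding count cells."""
    if count < 1:
        raise ValueError("count must be at least 1")
    cols = math.isqrt(count - 1) + 1   # ceil(sqrt(count)) without floats
    rows = -(-count // cols)           # ceil(count / cols)
    return cols, rows
```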
…ipeline
Webcam inputs may lack audio tracks, causing ffmpeg to fail with 'Stream specifier :a matches no streams'. Since _merge_audio_tracks handles all audio separately, the grid step should output video only.
The old scoring picked the conference with the most segments, which favored participants toggling their cameras (many short segments) over the presenter (few long segments). Also had a window-shrinking bug where replacing a long segment with a short higher-ranked one let subsequent segments leak through. New approach: identify the main conference by total recorded duration, keep its segments, and fill gaps from other conferences.
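The difference between counting segments and summing duration is easy to see in a small sketch. `pick_main_conference` and the `(conf_id, duration)` tuple shape are assumptions made for illustration; gap filling from other conferences is not shown.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def pick_main_conference(segments: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Return the conference ID with the largest total recorded duration."""
    totals = defaultdict(float)
    for conf_id, duration_sec in segments:
        totals[conf_id] += duration_sec
    if not totals:
        return None
    return max(totals, key=totals.get)
```

A participant with three short segments loses to a presenter with one long segment, which is the behaviour the new scoring is after.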
Extracts ADMIN role from userlist events, maps to conference IDs via conference.add events, and passes is_admin flag through the download pipeline. Dedup now prefers ADMIN conferences (the presenter), falling back to total duration when no admin is found. Also fixes download_chunks_parallel to preserve the is_admin flag.
…ebcam layout
Dedup gap-fill: clamp "other" conference segments to actual gap boundaries instead of using raw file duration, preventing timeline overflow.
Compile: skip segments starting before current_time (safety net for overlaps), and truncate segments via -t so they can't overflow into the next segment.
Slide composite: scale webcam proportionally to 320px wide (was fixed 320x180), so portrait webcams render at a usable size instead of being squished.
Grid fix: cell dimensions from integer division could be odd, causing
ffmpeg's scale filter to round up and produce dimensions larger than
the pad target ("Padded dimensions cannot be smaller than input").
Now forces even dimensions and uses min() to cap scale output.
Audio fix: amix divides volume by number of inputs at each stage.
After 3 levels of mixing (batch->reduce->overlay), audio was
attenuated to near-silence (-91 dB). Added volume=N compensation
after each amix to restore original loudness.
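The grid fix above (even cell dimensions, scale output capped with min()) can be sketched as follows. The helper names and the rounding-down choice are assumptions for illustration, not the PR's exact code.

```python
from typing import Tuple


def even(value: int) -> int:
    """Round down to the nearest even number, with a minimum of 2."""
    return max(2, value - (value % 2))


def cell_size(target_w: int, target_h: int, cols: int, rows: int) -> Tuple[int, int]:
    """Size of one grid cell, forced to even dimensions."""
    return even(target_w // cols), even(target_h // rows)


def fit_inside(src_w: int, src_h: int, cell_w: int, cell_h: int) -> Tuple[int, int]:
    """Scale a source to fit a cell, never exceeding the cell (the pad target)."""
    scale = min(cell_w / src_w, cell_h / src_h)
    width = min(cell_w, even(int(src_w * scale)))
    height = min(cell_h, even(int(src_h * scale)))
    return width, height
```

Because the scaled size can never exceed the cell, the subsequent pad step always has room for its input.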
normalize=0 already prevents amix from dividing by N, so the volume=N multiplier was over-amplifying (~x112 across 3 pipeline stages), turning noise from silent tracks into interference. Also filter out silent audio-only segments (<-80 dB) before mixing so they don't waste processing time or add noise floor.
-80 dB was filtering out participant microphone audio that sits around -80 to -60 dB. Only -91 dB is true digital silence.
With normalize=0 and no volume=N, mixing silent segments with real audio just gives real audio. The filter was incorrectly dropping participant microphone tracks. Removing it simplifies the pipeline and ensures all audio-only segments are included.
When slides + multiple webcams are present, analyzes audio levels per participant to detect who is talking. Switches the right-side webcam to show the active speaker, defaulting to presenter when nobody else talks. Uses 2s analysis windows with 4s minimum hold to prevent flickering.
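The hold logic can be sketched independently of the audio analysis. This assumes the analysis step has already produced, for each 2s window, the loudest participant or `None` for silence; `switch_speakers` and its parameters are hypothetical names.

```python
from typing import List, Optional


def switch_speakers(loudest_per_window: List[Optional[str]], presenter: str,
                    window_sec: float = 2.0, min_hold_sec: float = 4.0) -> List[str]:
    """Choose who is shown in each window, holding every choice for a minimum time."""
    hold_windows = max(1, int(round(min_hold_sec / window_sec)))
    shown: List[str] = []
    current = presenter
    held = hold_windows  # lets the very first switch happen immediately
    for loudest in loudest_per_window:
        # Nobody talking means we want to return to the presenter.
        wanted = loudest if loudest is not None else presenter
        if wanted != current and held >= hold_windows:
            current = wanted
            held = 0
        shown.append(current)
        held += 1
    return shown
```

With 2s windows and a 4s hold, a speaker who talks for a single window cannot displace whoever was shown one window earlier.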
NVENC + complex overlay filter on 3+ hour videos consumes ~7GB, triggering OOM killer. libx264 uses ~300MB for the same operation. All other encoding steps still use NVENC.
The 720p cap + fast preset still OOM-kills on 3.5h recordings with many segments (e.g. 1197678196: 125 chunks, 23 participants, 34 segments).
Split compositing into 30-min chunks so ffmpeg never holds the full video in memory. This allows using NVENC again (faster) and restores 720p resolution cap. Each chunk is composited independently then concatenated with stream copy.
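Chunk boundaries for this kind of split can be computed up front. A minimal sketch, assuming each chunk is described by a start offset and a length in seconds; `split_into_chunks` is a hypothetical helper.

```python
from typing import List, Tuple


def split_into_chunks(total_sec: float, chunk_sec: float = 1800.0) -> List[Tuple[float, float]]:
    """Return (start, length) pairs covering total_sec in chunks of at most chunk_sec."""
    if chunk_sec <= 0:
        raise ValueError("chunk_sec must be positive")
    chunks = []
    start = 0.0
    while start < total_sec:
        # The last chunk is shortened to end exactly at total_sec.
        chunks.append((start, min(chunk_sec, total_sec - start)))
        start += chunk_sec
    return chunks
```

Each pair maps naturally onto seek and duration arguments for one compositing run, and the chunk outputs are then joined with stream copy as described above.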
_get_video_encoder_fast() set _NVENC_AVAILABLE directly, bypassing _detect_gpu(). This left _CUDA_OVERLAY_AVAILABLE as None, so compositing always used CPU overlay even with CUDA support available.
Each participant's audio segments are analyzed by independent ffmpeg calls. Running 4 in parallel instead of sequentially speeds up speaker detection ~4x on multi-core systems.
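Running independent analyses four at a time fits `ThreadPoolExecutor` well, since each worker mostly waits on an external ffmpeg process. The sketch below assumes a caller-supplied `analyze` callable; `analyze_all` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def analyze_all(participants: List[str], analyze: Callable[[str], T],
                max_workers: int = 4) -> Dict[str, T]:
    """Run analyze() for each participant, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with participants.
        results = list(pool.map(analyze, participants))
    return dict(zip(participants, results))
```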
Speaker switching can produce segments whose combined duration exceeds the original recording. Cap the concat at total_duration from the API to ensure the output matches the expected length.
Split monolithic processor.py (1915 lines) into 6 focused classes:
- FFmpegRunner: ffmpeg execution, GPU detection, encoder selection
- MediaProber: file probing, duration, streams, audio levels
- GridCompositor: multi-webcam grid layout
- SlideCompositor: presentation slide overlay
- AudioMerger: batched audio mixing with tree-reduce
- SegmentBuilder: normalize, gaps, dedup, admin detection

VideoProcessor composes all classes via constructor injection. No static methods, no underscore prefixes on public methods. processor.py is now a thin orchestrator with backward-compatible module-level functions. 33 tests across 5 test files, all passing. Added requirements-dev.txt with pytest.
@cyberb Awesome work on the refactoring and adding 33 tests! The class decomposition with dependency injection is perfect for testing. Two suggestions for future iterations:
If the current tests mock FFmpegRunner and MediaProber (which makes sense for fast unit tests), consider adding a few integration tests with Testcontainers to verify that complex ffmpeg filter graphs work with real binaries. Benefits:
Trade-off: slower — can be marked @pytest.mark.integration and run separately.
Adding pytest-cov would help track which parts of the pipeline are well-tested vs. untested. Example setup:
pytest --cov=mtslinker --cov-report=term --cov-report=html
This gives visibility into coverage for audio mixing, grid composition, slide overlay, and edge case handling.
Both are just ideas for the roadmap, not blockers for this PR. Thanks again for the massive effort on this!
GitHub Actions integration suggestion
I see you added requirements-dev.txt, which is a great first step toward CI. Here's a complete setup you could add in a future PR if you want automated testing on every push. Create .github/workflows/ci.yml with:
name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install ffmpeg
        run: sudo apt-get update && sudo apt-get install -y ffmpeg
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Run tests with coverage
        run: pytest --cov=mtslinker --cov-report=xml --cov-report=term
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
Notes:
Not a blocker, just a template for when you're ready to automate the test suite.
Grid composite already includes audio from input 0 (sorted so audio-bearing webcam is first). Extracting the same webcam audio again for _merge_audio_tracks caused the voice to play twice. Audio-only tracks already cover other participants.
@cyberb Let's wrap this up. Amazing work! I think we should cap this PR at the current functionality and move any remaining edge cases to follow-up PRs or issues. This will let us merge the massive performance improvements now rather than chasing the last 1% of fringe scenarios indefinitely. Here's a summary of what's been accomplished:
Performance
Architecture
Edge cases handled
Testing
Security
What's left (can be follow-up issues/PRs)
This PR is already a massive win. Let's merge it and iterate on the rest in smaller, focused PRs. 🚀
Grid input 0 already has audio, but other webcams' audio was lost. Now extracts audio from all webcam files EXCEPT the one used as input 0 in each grid segment. This captures all voices without echo.

Added test_audio_pipeline.py with integration tests:
- Grid audio no echo (same source not duplicated)
- Grid takes audio from first input
- Audio merge preserves timing with adelay
- Segment duration matches plan
- Black segments have silent audio stream
@cyberb Have you looked at the actual MTS Link player JavaScript code in the browser? I'm wondering if we could simplify (or even eliminate) most of the complex reconstruction logic by understanding how the official client does it. The browser player must have a source of truth for:
If we can find where the player builds its internal playlist, we could replicate that logic instead of heuristically fixing:
Grid input 0 already has audio — other webcams need extraction. Tracks which path is input 0 per grid segment and excludes only those. Added test_audio_pipeline.py: echo, timing, duration, silence tests.
Replace all guesswork (overlap detection, dedup, speaker switching) with StreamTimeline that builds playback windows from API mediasession events, matching exactly what the MTS-Link web player does.
- Add StreamTimeline class with dataclasses (MediaSession, TimeWindow, GridSource, AudioTrack, DownloadChunk, SlideEvent)
- GridCompositor now mixes all audio streams inline via amix
- Remove dedup strategy, overlap heuristics, webcam audio extraction
- Remove dead code: deduplicate, extract_admin_conf_ids, is_valid, is_silent, analyze_audio_levels, legacy compat wrappers
- Fix .gitignore (was too broad, ignored tests/)
- 49 tests passing
yokidjo left a comment
@cyberb StreamTimeline approach looks great. Code is cleaner, logic matches the actual player, tests pass.
@motattack LGTM. Ready for merge.
Audio-only streams were downloaded as raw binary from the storage URL, which returns valid MP4 containers with silent audio (-91 dB). The real audio lives in the HLS playlist variants. Now tries HLS first for all streams (video and audio-only), falling back to direct download only if HLS is unavailable.
The variable was removed in the mediasession rewrite (9a8a657) but the logging line still referenced it. Strategy is now always 'timeline'.
When multiple streams are active:
- Screenshare → main area, admin → PIP overlay
- Admin (no screenshare) → main area, participant → PIP overlay
- No admin/screenshare → fall back to grid

Also fix grid xstack to always output exact target resolution, preventing concat corruption from mismatched segment sizes.
Split GridCompositor into smaller classes:
- GridLayout: xstack grid compositing
- PresenterLayout: main + PIP overlay compositing
- GridCompositor: backward-compatible facade delegating to both
- _build_audio_filter: shared audio mixing helper
- _even: shared utility

Add 12 new tests covering presenter layout (main-only, PIP, extra audio, resolution consistency), audio filter builder, grid resolution matching, and facade backward compat. 61 tests passing.
The final amix step assumed the video always has an audio stream and that it matches the mixed audio's 44100 Hz sample rate. HLS sources come in at 48000 Hz, causing "Invalid argument" in the filter graph. Now checks for audio presence and resamples before mixing.
The old heuristic (has_video && !has_audio = screenshare) was wrong — it matched webcams with muted mics. The API provides explicit stream.screensharing data on mediasession.add events. Now uses that to correctly identify screen share streams for presenter layout.
I think I am done with this, feel free to take it or leave it, thanks!
@motattack I don't have write access to this repository, so I cannot merge this PR myself. GitHub API returns 403 Forbidden with the message "Must have push access", confirming that I lack the necessary permissions. @cyberb has done an enormous amount of work here. Please merge this Pull Request yourself, as only you (or someone else with write/admin access) can do it. Thanks!
anullsrc is infinite; with -c:v copy the video muxes faster than realtime so -shortest alone lets ffmpeg's interleaving buffer grow until av_interleaved_write_frame fails with Cannot allocate memory. Probe the input duration and pass -t to bound the silent track, falling back to -shortest-only when the probe yields no duration. Strengthen test_ensure_audio_adds_silent to assert the output has an audio stream and that the -t bound does not truncate the video.
The bc9b12a -t output bound did not help: with -c:v copy the muxer receives all video packets at once while anullsrc audio is generated lazily, so it buffers the whole copied segment in RAM to interleave, dying with av_interleaved_write_frame: Cannot allocate memory on long segments. Add -max_interleave_delta 0 so packets are written without buffering, and make anullsrc a finite input (-t before -i) so the fallback path is bounded too. Add regression tests asserting the command keeps both safeguards; the existing real-ffmpeg test uses a short clip and cannot trigger the length-proportional OOM.
Root cause of the av_interleaved_write_frame OOM: when a manifest segment
has source_offset > source file duration (planner asks normalize to seek
past end-of-source), ffmpeg decodes zero frames, writes a valid 262-byte
empty container, and exits 0. ensure_audio then sees duration=0, takes
the unbounded anullsrc branch with -c:v copy, and since no video packets
ever arrive, -shortest never fires — the muxer buffers silent audio
forever until RAM+swap are exhausted. The previous fix tuned ffmpeg
muxer flags but never closed the duration=0 vector that bypassed them.
- normalize(): after ffmpeg.run, probe output and raise
CalledProcessError if duration <= 0.
- ensure_audio(): raise CalledProcessError on non-positive-duration
input and remove the unbounded anullsrc fallback entirely (it had
no safe semantics).
- processor.execute() video branch: wrap normalize+ensure_audio in
the same try/except → generate_black fallback already used by the
grid and presenter branches; a seek-past-EOF segment becomes a
correctly-sized black gap instead of crashing the whole webinar.
Reproduction (segment 73, source duration 67.24s, source_offset 104.17s):
before: normalize produces 262B duration-0 file exit 0; ensure_audio
runs unbounded and hits ENOMEM. After: normalize raises with a clear
message naming the input and seek offset; processor falls back to black.
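The guard and fallback described above can be sketched with injected callables so the control flow is visible without running ffmpeg. The function names and message text below are hypothetical; only the use of `CalledProcessError` and the black-segment fallback come from the commit message.

```python
import subprocess
from typing import Callable, List, Optional


def require_positive_duration(path: str, duration: Optional[float],
                              cmd: List[str]) -> float:
    """Raise CalledProcessError when a probed output has no usable duration."""
    if duration is None or duration <= 0:
        raise subprocess.CalledProcessError(
            returncode=1, cmd=cmd,
            stderr=f"{path}: output has non-positive duration ({duration})",
        )
    return duration


def normalize_or_black(normalize: Callable[[str], str],
                       generate_black: Callable[[str], str],
                       segment: str) -> str:
    """Try to normalize a segment; substitute a black gap if that fails."""
    try:
        return normalize(segment)
    except subprocess.CalledProcessError:
        return generate_black(segment)
```

With this shape, an empty container produced by seeking past end-of-source is rejected right after normalization and never reaches the audio step.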
The pycache was accidentally added before __pycache__/ was in .gitignore, so the rule never applied (gitignore is bypassed for already-tracked files). Untrack them and add *.pyc / *.egg-info/ so they stay out.
When the kernel OOM killer reaped ffmpeg mid slide-composite chunk (exit -9 / 137), the whole video failed. NVENC's pinned host buffers push long filter_complex graphs over the memory budget on some videos. Catch SIGKILL only (not generic ffmpeg errors), re-run that chunk with libx264, and stay on CPU for the remaining chunks of the same video so the next chunk does not pay the kill cost again. Non-OOM failures still propagate. Adds get_video_encoder_cpu() and a tests/test_slides.py spy for the retry path.
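The retry policy above can be sketched with the chunk runner passed in as a callable. This is an illustration under assumptions: `composite_chunks` and `run_chunk` are hypothetical names, and a kill is recognised by return code -9 (as reported by Python's subprocess) or 137 (as reported by a shell).

```python
import subprocess
from typing import Callable, List, Sequence

OOM_KILL_CODES = (-9, 137)  # SIGKILL as seen via subprocess or via a shell


def composite_chunks(chunks: Sequence[int],
                     run_chunk: Callable[[int, str], None],
                     gpu_encoder: str = "h264_nvenc",
                     cpu_encoder: str = "libx264") -> List[str]:
    """Run every chunk, switching to the CPU encoder after the first SIGKILL."""
    encoder = gpu_encoder
    used = []
    for chunk in chunks:
        try:
            run_chunk(chunk, encoder)
        except subprocess.CalledProcessError as exc:
            # Only a kill on the GPU path is retried; everything else propagates.
            if exc.returncode not in OOM_KILL_CODES or encoder == cpu_encoder:
                raise
            encoder = cpu_encoder  # stay on CPU for the rest of this video
            run_chunk(chunk, encoder)
        used.append(encoder)
    return used
```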
Summary
- … -c copy for near-instant concatenation
- … httpx and tqdm as Python deps. ffmpeg is the only new system requirement
- … amix filter with proper delay offsets
What changed
downloader.py: … download_chunks_parallel() using ThreadPoolExecutor, chunk size 8KB → 1MB
processor.py: …
webinar.py: …
requirements.txt: … moviepy
setup.py: … moviepy dep, bumped version to 2.0.0
Dockerfile: … ffmpeg package installation
Performance
Tested on a real 3-hour webinar (139 chunks, 106 video + 33 audio-only segments):
… -c copy (stream copy)
For recordings without audio-only tracks, the improvement is even larger since the audio mixing step (the slowest remaining part) is skipped entirely.
Requirements
ffmpeg and ffprobe must be installed (apt install ffmpeg / brew install ffmpeg)
Fixes #8
Test plan