diff --git a/DMP2026 Detailed Proposal.md b/DMP2026 Detailed Proposal.md new file mode 100644 index 0000000..db7650c --- /dev/null +++ b/DMP2026 Detailed Proposal.md @@ -0,0 +1,1038 @@ +# C4GT DMP 2026 — Project Proposal + +## Intelligent Closed Caption (CC) Suggestion Tool +**Organization:** PlanetRead +**Program:** Code for Good Tech (C4GT) — Digital Mentorship Programme 2026 +**Project Category:** AI / Accessibility / Regional Language Media + +--- + +## Table of Contents + +1. [Contributor Information](#1-contributor-information) +2. [Executive Summary](#2-executive-summary) +3. [Problem Statement](#3-problem-statement) +4. [Proposed Solution](#4-proposed-solution) +5. [Multi-Language Support](#5-multi-language-support) +6. [Deployment Strategy](#6-deployment-strategy) +7. [Technical Implementation Detail](#7-technical-implementation-detail) +8. [Goals and Deliverables](#8-goals-and-deliverables) +9. [Project Timeline — 16 Weeks](#9-project-timeline--16-weeks) +10. [Mid-Point Milestone](#10-mid-point-milestone) +11. [Expected Outcomes](#11-expected-outcomes) +12. [Future Work (Post-DMP)](#12-future-work-post-dmp) +13. [Why This Contributor](#13-why-this-contributor) +14. [Setup and Installation](#14-setup-and-installation) +15. [References](#15-references) + +--- + +## 1. Contributor Information + +| Field | Details | +|-------|---------| +| **Project Title** | Intelligent Closed Caption (CC) Suggestion Tool | +| **Organization** | PlanetRead | +| **Programme** | C4GT Digital Mentorship Programme 2026 | +| **Primary Language** | Python 3.10+ | +| **Deployment Targets** | CLI Tool + Production Web Application | +| **Contributor** | Aditi Prabakaran, 3rd Year CSE Student | +| **Tech Stack** | Python, TensorFlow, PyTorch, OpenCV, Flask, SQL, Supabase, Firebase | +| **Repository** | *(https://github.com/Aditi2k5/Intelligent-cc-generation)* | + +--- + +## 2. 
Executive Summary + +Closed Captioning (CC) for non-speech audio events — `[ Glass Breaking ]`, `[ Laughter ]`, `[ Alarm ]` — is a critical accessibility requirement for deaf and hard-of-hearing audiences worldwide. For organizations like PlanetRead, which produce and distribute educational video content in Hindi and regional Indian languages at scale, manually adding these CC annotations is time-consuming, inconsistent, and expensive. Crucially, it has historically been done only in English — leaving Hindi, Tamil, Telugu, and other regional-language audiences without CC labels in their own language. + +This project delivers an **AI-powered, fully automated, multilingual CC Suggestion Tool** that: + +1. **Detects non-speech audio events** in any video file using YAMNet with a sliding-window approach, category-aware confidence boosting, and an aggressive blacklist tuned for Indian content. +2. **Assesses visual reactions** to those events by analysing facial expressions across a temporal window around each detected sound using MediaPipe Face Mesh. +3. **Makes an intelligent CC/no-CC decision** via a priority-aware weighted fusion engine that avoids over-captioning. +4. **Outputs subtitles in the user's chosen language** — including Hindi (Devanagari), Tamil, Telugu, Bengali, Marathi, Kannada, Malayalam, Gujarati, and Punjabi — making the tool genuinely useful for India's linguistic diversity. +5. **Ships in two deployment modes:** a full-featured production web application with a visual editor interface, and a standalone CLI tool for power users and automation pipelines. + +The tool is designed to reduce manual CC annotation effort by **an estimated 60–80%** while keeping humans in the loop for final quality checks. + +--- + +## 3. Problem Statement + +### 3.1 The Accessibility Gap in Indian Regional Media + +The vast majority of Hindi and regional-language video content — particularly educational and public-interest media — lacks non-speech CC annotations. 
The primary reason is operational: adding CC by hand requires a trained human editor to watch every second of video, decide whether a sound is narratively significant, and type a label at the correct timestamp. For a one-hour video, this typically takes 2–4 hours of editor time. + +### 3.2 The Language Exclusion Problem + +Even when non-speech CC exists in Indian media, it is almost universally written in English — `[ Screaming ]`, `[ Glass Breaking ]` — regardless of the language of the surrounding content. A hearing-impaired viewer watching a Tamil film or a Hindi educational video must read CC labels in a language that may not be their first. For a tool designed to improve literacy and accessibility, this is a significant shortcoming. A viewer whose primary language is Telugu should see `[ అరుపు ]` (screaming), not `[ Screaming ]`. + +### 3.3 The Over-Captioning Problem + +Existing automated tools tend to caption everything — including wind, background hum, or ambient traffic — producing CC files that are more distracting than helpful and that editors must heavily prune before use. The ideal tool should only suggest captions for sounds that genuinely affect the speaker or the scene, and should silently skip low-impact ambient noise. + +### 3.4 The Indian Content Challenge + +Standard sound classification models (trained primarily on English-language Western media) frequently misfire on Indian content: +- Tabla and dholak rhythms get classified as generic "drum" or "noise" +- Street sounds in Indian cities (autorickshaw horns, vendor calls) confuse vehicle/traffic classifiers +- Regional-language speech patterns affect voice activity detection around sound events +- Firecrackers during Diwali are misclassified as gunshots or explosions + +A robust solution for PlanetRead's use case must explicitly account for and mitigate these failure modes. 
+ +### 3.5 The Accessibility Gap in Tooling + +Currently, there is no open-source, freely available, production-quality tool that lets an Indian accessibility editor open a browser, upload a video, and receive a multilingual SRT file ready for review. The gap is not just in the ML models — it is in the complete workflow from upload to usable output. + +--- + +## 4. Proposed Solution + +### 4.1 Architecture Overview + +The tool is a four-module Python pipeline (three detection and decision modules plus a multilingual renderer) that accepts any video file and produces a multilingual subtitle-ready output: + +``` +Input Video + │ + ▼ +┌──────────────────────────────────────────────────────┐ +│ Module 1: Sound Event Detection │ +│ YAMNet + sliding window + blacklist filter │ +│ + priority-aware boost + temporal merge + cap │ +└─────────────────────────┬────────────────────────────┘ + │ [AudioEvent list] + ▼ +┌──────────────────────────────────────────────────────┐ +│ Module 2: Visual Reaction Detection │ +│ MediaPipe Face Mesh + temporal window │ +│ + EAR / MAR / Brow Raise + weighted aggregation │ +└─────────────────────────┬────────────────────────────┘ + │ [VisualScore per timestamp] + ▼ +┌──────────────────────────────────────────────────────┐ +│ Module 3: Fusion Decision Engine │ +│ Weighted fusion + priority threshold │ +│ + deduplication + SRT gap enforcement │ +└─────────────────────────┬────────────────────────────┘ + │ [CaptionEntry list] + ▼ +┌──────────────────────────────────────────────────────┐ +│ Module 4: Multilingual CC Renderer [NEW] │ +│ Language selector → translated CC labels │ +│ Script-aware rendering (Devanagari, Tamil, etc.) 
│ +└─────────────────────────┬────────────────────────────┘ + │ + ┌────────────────┼────────────────┐ + ▼ ▼ ▼ + output.srt report.json annotated frames + (chosen language) (full report) (with labels) +``` + +### 4.2 Module 1 — Sound Event Detection (Goal 1) + +**Status: Implemented and tested** + +The audio detection pipeline goes significantly beyond a naive single-pass YAMNet call: + +**Sliding window with 50% overlap:** +A 0.96-second analysis window slides across the audio track with a 0.48-second hop (50% overlap), so any sound lasting up to about 0.5 seconds falls entirely inside at least one window instead of being split across a window boundary. This is critical for short sounds like rat squeaks, glass breaking, and chair creaks that a single-pass approach misses. + +**21 semantic sound categories** spanning three priority tiers: + +| Priority | Categories | +|----------|-----------| +| HIGH | Scream, Explosion, Gunshot, Glass Breaking, Crash, Alarm/Siren | +| MEDIUM | Laughter, Applause, Crying, Knock, Doorbell, Phone, Dog, Cat, Rodent Squeak, Chair Creak, Footsteps, Door, Thunder, Music | +| LOW | Ambient/Background Noise | + +**Blacklist filter:** 28 YAMNet class substrings are pre-emptively discarded — including all vehicle/transport classes, rain, wind, crowd noise, and speech — which are the primary sources of false positives on Indian content. + +**Priority-aware confidence boosting:** Each category applies a multiplier (1.0×–1.9×) to the raw YAMNet score before thresholding. This compensates for YAMNet's documented under-confidence on rare or unusual sounds. + +**Temporal merging:** Events of the same category within 1.5 seconds are collapsed into a single event spanning the full detection window. + +**Per-category cap:** At most 8 events per category per clip prevents any single sound type from flooding the output. 
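
The post-processing steps described above (window placement, boosting, temporal merging, and the per-category cap) could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `BOOST` values, the clamp to 1.0, the gap-based merge rule, and the "keep the most confident events" cap policy are all assumptions, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

WINDOW_S = 0.96        # analysis window length (from the proposal)
HOP_S = 0.48           # hop size, i.e. 50% overlap (from the proposal)
MERGE_GAP_S = 1.5      # same-category events closer than this are merged
MAX_PER_CATEGORY = 8   # per-category cap per clip

# Hypothetical multipliers inside the stated 1.0x-1.9x range.
BOOST: Dict[str, float] = {"GLASS_BREAKING": 1.9, "LAUGHTER": 1.2, "MUSIC": 1.0}


@dataclass
class AudioEvent:
    category: str
    start: float
    end: float
    confidence: float


def window_starts(duration_s: float) -> List[float]:
    """Start times of every full analysis window that fits in the clip."""
    count = max(0, int((duration_s - WINDOW_S) / HOP_S) + 1)
    return [round(i * HOP_S, 2) for i in range(count)]


def boost(category: str, raw_score: float) -> float:
    """Apply the category multiplier; clamping to 1.0 is an assumption."""
    return min(1.0, raw_score * BOOST.get(category, 1.0))


def merge_events(events: List[AudioEvent]) -> List[AudioEvent]:
    """Collapse same-category events whose gap is at most MERGE_GAP_S."""
    merged: List[AudioEvent] = []
    for ev in sorted(events, key=lambda e: (e.category, e.start)):
        last: Optional[AudioEvent] = merged[-1] if merged else None
        if (last is not None and last.category == ev.category
                and ev.start - last.end <= MERGE_GAP_S):
            last.end = max(last.end, ev.end)
            last.confidence = max(last.confidence, ev.confidence)
        else:
            merged.append(AudioEvent(ev.category, ev.start, ev.end, ev.confidence))
    return sorted(merged, key=lambda e: e.start)


def cap_per_category(events: List[AudioEvent]) -> List[AudioEvent]:
    """Keep at most MAX_PER_CATEGORY events per category, most confident first."""
    kept: List[AudioEvent] = []
    counts: Dict[str, int] = {}
    for ev in sorted(events, key=lambda e: e.confidence, reverse=True):
        if counts.get(ev.category, 0) < MAX_PER_CATEGORY:
            counts[ev.category] = counts.get(ev.category, 0) + 1
            kept.append(ev)
    return sorted(kept, key=lambda e: e.start)
```

Measuring the merge distance from the end of one event to the start of the next is only one reading of "within 1.5 seconds"; an onset-to-onset rule would behave differently for long events.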
+ +### 4.3 Module 2 — Visual Reaction Detection (Goal 2 / Mid-Point Milestone) + +**Status: Implemented and tested** + +For each detected audio event, the visual module samples up to 8 frames across a temporal window of [timestamp − 0.5 s, timestamp + 1.2 s] and runs MediaPipe Face Mesh on each frame. + +**Three facial action features computed per frame:** + +- **Eye Aspect Ratio (EAR):** Measures how wide-open the eyes are. Wide eyes (EAR significantly above a 0.25 neutral baseline) indicate surprise or fear. +- **Mouth Aspect Ratio (MAR):** Measures mouth opening. Open mouth (MAR above a 0.05 baseline) indicates shock or laughter. +- **Brow Raise:** Normalised distance from eyebrow landmarks to eye top, relative to face height. Raised brows combined with wide eyes strongly indicate startle or surprise. + +**Temporal weighting:** Frames closer to the audio event receive exponentially higher weight, so a reaction occurring 0.2 s after the sound dominates over a neutral expression 1.0 s before it. + +**Multi-face support:** Up to 4 faces are tracked simultaneously. The most reactive face is used as the representative score, handling group reaction scenarios common in Indian educational videos. + +**Graceful degradation:** When no face is found, the module returns a zero visual score and the fusion engine automatically switches to audio-only mode. + +### 4.4 Module 3 — Fusion Decision Engine (Goal 3) + +**Status: Implemented and tested** + +**Weighted fusion formula:** +``` +fusion_score = 0.65 × audio_confidence + 0.35 × visual_reaction_score +``` + +**Priority-aware thresholds:** + +| Priority | Threshold | Rationale | +|----------|-----------|-----------| +| HIGH | 0.28 | Screams and explosions must not be missed | +| MEDIUM | 0.40 | Standard signal-to-noise balance | +| LOW | 0.60 | Ambient sounds require very strong evidence | + +**Audio-only fallback:** When no face is detected, all thresholds are reduced by 20% to maintain reasonable coverage. 
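
As a worked illustration of the fusion formula and thresholds above, the decision could be expressed as below. This is a sketch under stated assumptions: the function names are hypothetical, the comparison is taken to be inclusive (`>=`), and in audio-only mode the audio weight is assumed to stay at 0.65 while only the threshold is lowered by 20%.

```python
# Thresholds and weights are the values stated in the proposal.
THRESHOLDS = {"HIGH": 0.28, "MEDIUM": 0.40, "LOW": 0.60}
AUDIO_WEIGHT = 0.65
VISUAL_WEIGHT = 0.35
AUDIO_ONLY_REDUCTION = 0.20  # thresholds drop by 20% when no face is found


def fusion_score(audio_confidence: float, visual_reaction_score: float) -> float:
    """Weighted sum of the two modality scores."""
    return AUDIO_WEIGHT * audio_confidence + VISUAL_WEIGHT * visual_reaction_score


def should_caption(priority: str, audio_confidence: float,
                   visual_reaction_score: float, face_found: bool) -> bool:
    """Return True when the fused score clears the priority-specific threshold."""
    threshold = THRESHOLDS[priority]
    if not face_found:
        # Audio-only fallback: no visual evidence, so relax the threshold.
        visual_reaction_score = 0.0
        threshold *= 1.0 - AUDIO_ONLY_REDUCTION
    return fusion_score(audio_confidence, visual_reaction_score) >= threshold


# A MEDIUM-priority sound with audio confidence 0.5 scores 0.325.
# With a visible but unreactive face it misses the 0.40 threshold;
# with no face at all it clears the relaxed 0.32 threshold.
print(should_caption("MEDIUM", 0.5, 0.0, face_found=True))   # False
print(should_caption("MEDIUM", 0.5, 0.0, face_found=False))  # True
```

One consequence worth noting: because the visual term contributes at most 0.35, an audio-only score can never exceed 0.65, so the relaxed LOW threshold (0.48) still demands a fairly confident audio detection.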
+ +**Temporal deduplication:** Two accepted captions of the same category within 3 seconds are compared by fusion score; the lower-scoring one is suppressed. + +**SRT gap enforcement:** Guarantees no two subtitle entries overlap and maintains a minimum 0.3-second gap between any two entries. + +--- + +## 5. Multi-Language Support + +### 5.1 Overview + +This is a new, dedicated module (Module 4) that transforms the English-language CC labels produced by the fusion engine into the user's chosen output language. It runs as the final step of the pipeline, after all ML processing is complete, and adds zero latency to the detection phase. + +### 5.2 Supported Languages + +The tool will support **10 languages** at launch (English plus nine Indian languages), covering over 90% of India's population: + +| Code | Language | Script | Script Name | Sample CC | +|------|----------|--------|-------------|-----------| +| `en` | English | Latin | — | `[ Screaming ]` | +| `hi` | Hindi | Devanagari | देवनागरी | `[ चीख ]` | +| `ta` | Tamil | Tamil | தமிழ் | `[ கத்துகிறார்கள் ]` | +| `te` | Telugu | Telugu | తెలుగు | `[ అరుపు ]` | +| `bn` | Bengali | Bengali | বাংলা | `[ চিৎকার ]` | +| `mr` | Marathi | Devanagari | देवनागरी | `[ ओरडणे ]` | +| `kn` | Kannada | Kannada | ಕನ್ನಡ | `[ ಕಿರುಚಾಡು ]` | +| `ml` | Malayalam | Malayalam | മലയാളം | `[ നിലവിളി ]` | +| `gu` | Gujarati | Gujarati | ગુજરાતી | `[ ચીખ ]` | +| `pa` | Punjabi | Gurmukhi | ਗੁਰਮੁਖੀ | `[ ਚੀਕ ]` | + +### 5.3 Translation Architecture + +**Static translation dictionary (primary method):** +The CC label set is finite and controlled — there are exactly 21 sound categories, each with one display string. 
Rather than using a live translation API (which introduces latency, cost, and network dependency), all translations are stored as a static dictionary in `translations.py`: + +```python +# translations.py (excerpt) +CC_LABELS = { + "SCREAM": { + "en": "[ Screaming ]", + "hi": "[ चीख ]", + "ta": "[ கத்துகிறார்கள் ]", + "te": "[ అరుపు ]", + "bn": "[ চিৎকার ]", + "mr": "[ ओरडणे ]", + "kn": "[ ಕಿರುಚಾಡು ]", + "ml": "[ നിലവിളി ]", + "gu": "[ ચીખ ]", + "pa": "[ ਚੀਕ ]", + }, + "LAUGHTER": { + "en": "[ Laughter ]", + "hi": "[ हँसी ]", + "ta": "[ சிரிப்பு ]", + "te": "[ నవ్వు ]", + "bn": "[ হাসি ]", + "mr": "[ हास्य ]", + "kn": "[ ನಗು ]", + "ml": "[ ചിരി ]", + "gu": "[ હાસ્ય ]", + "pa": "[ ਹਾਸਾ ]", + }, + # ... all 21 categories translated into all 10 languages +} +``` +**Fallback chain:** If a translation for a specific category is missing in the chosen language, the tool falls back to English, logs a warning, and adds a note to the JSON report flagging the missing translation. + +### 5.4 CLI Usage + +```bash +# English (default) +python main.py --video clip.mp4 + +# Hindi subtitles +python main.py --video clip.mp4 --lang hi + +# Tamil subtitles +python main.py --video clip.mp4 --lang ta + +# Telugu subtitles +python main.py --video clip.mp4 --lang te + +# List all supported languages +python main.py --list-languages +``` + +### 5.5 Web App Usage + +In the web application, the language selector is a prominent dropdown on the upload page: + +``` +[ Upload Video ] ───────────────────────────────────────────── + Select output language: [ Hindi ▾ ] + ┌──────────────────────┐ + │ ✓ Hindi │ + │ Tamil │ + │ Telugu │ + │ Bengali │ + │ Marathi │ + │ Kannada │ + │ Malayalam │ + │ Gujarati │ + │ Punjabi │ + │ English │ + └──────────────────────┘ + [ Generate CC ] +``` + +The selected language is passed through the entire pipeline and applied at the rendering step. The preview panel in the editor interface shows labels in the chosen script. 
The downloaded SRT file contains the correct Unicode characters for the selected language. + +--- + +## 6. Deployment Strategy + +The tool ships in two distinct deployment modes targeting different user types. Both share the same underlying Python pipeline — the deployment layer is a thin wrapper over the core modules. + +### 6.1 Deployment Mode A — Production Web Application + +#### 6.1.1 Target User + +Accessibility editors, content teams, and non-technical users at PlanetRead, broadcasters, EdTech companies, and government accessibility departments who need a zero-install, browser-based workflow. + +#### 6.1.2 Technology Stack + +| Layer | Technology | Reason | +|-------|-----------|--------| +| Backend API | **FastAPI** (Python) | Async, fast, auto-generates OpenAPI docs, native Python so pipeline imports work directly | +| Task queue | **Celery + Redis** | Video processing is long-running (30 s – 5 min); tasks must be async and non-blocking | +| Frontend | **React + Tailwind CSS** | Component-based, responsive, handles Indian script rendering out of the box | +| File storage | **Local filesystem** (dev) / **AWS S3** (prod) | S3 for scalable video upload/download; pre-signed URLs for security | +| Database | **PostgreSQL** | Stores job history, user feedback, and review decisions | +| Deployment | **Docker Compose** | Single `docker-compose up` starts API + worker + Redis + DB + frontend | +| Hosting | **Railway / Render / AWS EC2** | PaaS options for zero-downtime deployment | + +#### 6.1.3 User Workflow + +``` +User opens browser + │ + ▼ +┌─────────────────────────────────────────────────────┐ +│ UPLOAD PAGE │ +│ ┌──────────────────────────────────────────────┐ │ +│ │ Drop video file here (MP4, AVI, MKV, MOV) │ │ +│ └──────────────────────────────────────────────┘ │ +│ Output language: [ Hindi ▾ ] │ +│ Processing mode: ● Full (Audio + Visual) │ +│ ○ Audio-only (faster) │ +│ [ Generate Captions ] │ +└─────────────────────────────────────────────────────┘ + │ 
POST /api/jobs (upload + params) + ▼ +┌─────────────────────────────────────────────────────┐ +│ PROCESSING PAGE │ +│ Job ID: cc-2026-05-09-001 │ +│ │ +│ ████████████░░░░░░░░ 63% │ +│ Stage: Visual Reaction Detection … │ +│ Detected so far: 3 events │ +└─────────────────────────────────────────────────────┘ + │ GET /api/jobs/{id}/status (polling) + ▼ +┌─────────────────────────────────────────────────────┐ +│ REVIEW PAGE │ +│ ┌─────────────────────────────────────────────┐ │ +│ │ 0:02.4 [ चीख ] fusion=0.70 ✓│ │ +│ │ 0:05.7 [ हँसी ] fusion=0.61 ✓│ │ +│ │ 0:08.1 [ शीशा टूटना ] fusion=0.54 ✓│ │ +│ └─────────────────────────────────────────────┘ │ +│ [ Accept All ] [ Edit Selected ] [ Download SRT ]│ +└─────────────────────────────────────────────────────┘ +``` + +#### 6.1.4 API Design + +``` +POST /api/jobs Upload video + start processing job +GET /api/jobs/{id}/status Poll job progress (0–100%) + stage name +GET /api/jobs/{id}/result Retrieve completed caption entries +PATCH /api/jobs/{id}/review Submit editor Accept/Reject/Edit decisions +GET /api/jobs/{id}/srt Download the final SRT file +GET /api/languages List all supported output languages +GET /api/health Health check for load balancer +``` + +All endpoints return JSON. The SRT download endpoint returns `Content-Type: text/plain; charset=utf-8` with `Content-Disposition: attachment; filename="output.srt"`. 
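
To make the SRT download concrete, the body returned by `GET /api/jobs/{id}/srt` could be produced by a renderer along these lines. This is a minimal standard-library sketch, not the project's `srt_writer.py`: `CaptionEntry` is simplified to three fields, and `srt_timestamp` and `render_srt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionEntry:
    # Simplified stand-in for the pipeline's CaptionEntry dataclass.
    start: float  # seconds
    end: float    # seconds
    text: str     # already-translated label, e.g. "[ चीख ]"


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render_srt(entries: List[CaptionEntry]) -> bytes:
    """Render numbered SRT cues and encode them as UTF-8 bytes."""
    blocks = []
    for index, entry in enumerate(sorted(entries, key=lambda e: e.start), start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(entry.start)} --> {srt_timestamp(entry.end)}\n"
            f"{entry.text}\n"
        )
    # Cues are separated by a blank line; UTF-8 keeps Indian scripts intact.
    return "\n".join(blocks).encode("utf-8")


if __name__ == "__main__":
    demo = [CaptionEntry(5.76, 7.76, "[ हँसी ]"), CaptionEntry(2.4, 4.4, "[ चीख ]")]
    print(render_srt(demo).decode("utf-8"))
```

Encoding once at the boundary means the same bytes can be written to disk by the CLI or returned by the API with `charset=utf-8`, so both deployment modes emit identical files.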
+ +#### 6.1.5 Job Processing Flow + +``` +Upload → FastAPI endpoint + → Save video to /uploads/{job_id}/input.mp4 + → Create job record in PostgreSQL (status: QUEUED) + → Push task to Celery queue via Redis + → Return {job_id, status: "queued"} + +Celery worker picks up task: + → Update status: PROCESSING, stage: "audio_detection" + → Run Module 1 (sound_detector.py) + → Update status: PROCESSING, stage: "visual_analysis", progress: 40% + → Run Module 2 (visual_detector.py) + → Update status: PROCESSING, stage: "fusion", progress: 80% + → Run Module 3 (fusion_engine.py) + → Apply language translation (translations.py) + → Write output.srt + report.json + → Update status: COMPLETE, progress: 100% + +Client polls GET /status until complete, then redirects to review page. +``` + +#### 6.1.6 Editor Review Interface + +The review page presents each CC suggestion as a card with: +- Timestamp and duration (synced to a video player embed) +- CC label in the chosen language +- Audio/visual/fusion score breakdown (expandable) +- Three action buttons: **✓ Accept**, **✗ Reject**, **✎ Edit** + +The "Edit" button opens an inline text field where the editor can correct or rephrase the CC text in the chosen language, including typing in Devanagari or other scripts via the OS input method. + +All decisions are saved to PostgreSQL and used as future training data for threshold calibration. + +--- + +### 6.2 Deployment Mode B — CLI Tool + +#### 6.2.1 Target User + +Developers, researchers, pipeline integrators, and power users who need to process videos programmatically, in batch, or as part of a larger automation workflow. + +#### 6.2.2 Installation + +```bash +pip install cc-suggestion-tool +``` + +The package is published to PyPI with all dependencies declared in `setup.cfg`. ffmpeg is documented as a system dependency in the README. 
+ +#### 6.2.3 Full CLI Reference + +``` +usage: cctools [-h] COMMAND [OPTIONS] + +Commands: + run Process a single video file + batch Process all videos in a folder + review Open interactive review session for a completed job + languages List all supported output languages + version Print version and dependency info + +──────────────────────────────────────────────────── +cctools run [OPTIONS] + + --video PATH Input video file (required) + --lang CODE Output language [default: en] + Choices: en hi ta te bn mr kn ml gu pa + --output DIR Output directory [default: demo_results/] + --no-visual Skip visual reaction detection (faster) + --no-frames Skip annotated frame export + --debug Enable DEBUG-level logs + --format {srt,sls} Output subtitle format [default: srt] + --threshold FLOAT Override fusion threshold (0.0–1.0) + +Examples: + cctools run --video lecture.mp4 --lang hi + cctools run --video clip.mp4 --lang ta --no-visual --output /tmp/out/ + cctools run --video movie.mp4 --debug + +──────────────────────────────────────────────────── +cctools batch [OPTIONS] + + --input-dir DIR Folder containing video files (required) + --output-dir DIR Root folder for results [default: batch_results/] + --lang CODE Output language [default: en] + --workers INT Parallel worker processes [default: CPU count] + --no-visual Skip visual detection for all videos + --format {srt,sls} Output subtitle format [default: srt] + +Examples: + cctools batch --input-dir /videos/ --lang hi --workers 4 + cctools batch --input-dir /videos/ --lang te --output-dir /out/ + +──────────────────────────────────────────────────── +cctools languages + + Prints a table of all supported language codes and scripts: + en English Latin + hi Hindi Devanagari + ta Tamil Tamil + te Telugu Telugu + ... 
(all 10) +``` + +#### 6.2.4 Terminal Output Example + +``` +╔══════════════════════════════════════════════════════════╗ +║ Intelligent CC Suggestion Tool • PlanetRead / C4GT ║ +║ Language: Hindi (हिन्दी) • Format: SRT ║ +╚══════════════════════════════════════════════════════════╝ + +[1/3] Running sound event detection … +[2/3] Running visual reaction detection … +[3/3] Running fusion engine … + +════════════════════════════════════════════════════════════ + Pipeline complete in 18.3 s + Audio events detected : 4 + Captions emitted : 3 + Language : Hindi (hi) + + Caption preview: + ────────────────────────────────────────────────────── + 2.40s → 4.40s [ चीख ] fusion=0.703 + 5.76s → 7.76s [ हँसी ] fusion=0.612 + 9.12s → 11.12s [ शीशा टूटना ] fusion=0.541 + ────────────────────────────────────────────────────── + SRT → demo_results/output.srt + JSON → demo_results/report.json +════════════════════════════════════════════════════════════ +``` + +#### 6.2.5 Batch Processing Output + +``` +cctools batch --input-dir /videos/ --lang hi --workers 4 + +Processing 12 videos with 4 workers … + + [1/12] lecture_01.mp4 ✓ 3 captions (22.1 s) + [2/12] classroom_demo.mp4 ✓ 1 caption (14.6 s) + [3/12] interview_segment.mp4 ✓ 5 captions (31.4 s) + ... 
+ [12/12] community_video.mp4 ✓ 2 captions (19.8 s) + +Batch complete in 3m 41s +Total videos processed : 12 +Total captions emitted : 38 +Output directory : batch_results/ +Batch summary report : batch_results/batch_summary.json +``` + +#### 6.2.6 Programmatic Python API + +The CLI is a thin wrapper over a Python API that can be imported directly: + +```python +from cc_suggestion_tool import CCPipeline + +pipeline = CCPipeline(lang="hi", output_dir="results/") +result = pipeline.run("video.mp4") + +print(result.captions) +# [ +# CaptionEntry(start=2.4, end=4.4, text="[ चीख ]", fusion=0.703), +# CaptionEntry(start=5.76, end=7.76, text="[ हँसी ]", fusion=0.612), +# ] + +pipeline.write_srt(result, "output.srt") +pipeline.write_json(result, "report.json") +``` + +--- + +### 6.3 Shared Infrastructure + +Both deployment modes share the same underlying pipeline. The only difference is the layer above it: + +``` +┌─────────────────────────┐ ┌─────────────────────────┐ +│ Web Application │ │ CLI Tool │ +│ FastAPI + React │ │ cctools run / batch │ +│ Celery + Redis │ │ Python API │ +└────────────┬────────────┘ └────────────┬────────────┘ + │ │ + └──────────┬───────────────────┘ + ▼ + ┌──────────────────────────────┐ + │ Core Pipeline (shared) │ + │ sound_detector.py │ + │ visual_detector.py │ + │ fusion_engine.py │ + │ translations.py │ + │ srt_writer.py │ + └──────────────────────────────┘ +``` + +This means every bug fix, model improvement, or new language added to the core pipeline automatically benefits both deployment modes. + +--- + +## 7. 
Technical Implementation Detail + +### 7.1 Complete Tech Stack + +| Component | Technology | Mode | +|-----------|-----------|------| +| Audio classification | YAMNet (TF Hub) | Both | +| Audio extraction | ffmpeg + soundfile + librosa | Both | +| Face mesh | MediaPipe Face Mesh | Both | +| Video decoding | OpenCV | Both | +| Language translation | Static dictionary (translations.py) | Both | +| Output format | SRT + JSON | Both | +| Testing | pytest (100+ tests) | Both | +| Backend API | FastAPI | Web only | +| Task queue | Celery + Redis | Web only | +| Frontend | React + Tailwind CSS | Web only | +| Database | PostgreSQL | Web only | +| Containerisation | Docker + Docker Compose | Web only | +| CLI packaging | setuptools + PyPI | CLI only | +| Batch processing | multiprocessing.Pool | CLI only | + +### 7.2 Complete Project File Structure + +``` +cc_suggestion_tool/ +│ +├── config.py # All tunable parameters +├── main.py # CLI entry point +├── requirements.txt # Python dependencies +├── setup.cfg # PyPI package config +│ +├── modules/ +│ ├── sound_detector.py # Module 1: YAMNet audio detection +│ ├── visual_detector.py # Module 2: MediaPipe face reaction +│ ├── fusion_engine.py # Module 3: Weighted decision engine +│ └── translations.py # Module 4: Multilingual CC labels ← NEW +│ +├── utils/ +│ ├── logger.py # Colour-coded structured logging +│ └── srt_writer.py # SRT + JSON output writers +│ +├── tests/ +│ ├── conftest.py # Shared fixtures and markers +│ ├── test_sound_detector.py # 34 unit tests +│ ├── test_visual_detector.py # 19 unit tests +│ ├── test_fusion_engine.py # 47 unit tests +│ └── test_translations.py # 20 unit tests ← NEW +│ +├── webapp/ ← NEW +│ ├── api/ +│ │ ├── main.py # FastAPI application +│ │ ├── routes/ +│ │ │ ├── jobs.py # Job CRUD endpoints +│ │ │ ├── review.py # Editor review endpoints +│ │ │ └── languages.py # Language listing endpoint +│ │ ├── models.py # SQLAlchemy ORM models +│ │ ├── tasks.py # Celery task definitions +│ │ └── 
schemas.py # Pydantic request/response schemas +│ │ +│ └── frontend/ +│ ├── src/ +│ │ ├── App.jsx +│ │ ├── pages/ +│ │ │ ├── Upload.jsx # Video upload + language picker +│ │ │ ├── Processing.jsx # Progress bar + live stage updates +│ │ │ └── Review.jsx # Caption cards + Accept/Reject/Edit +│ │ └── components/ +│ │ ├── LanguagePicker.jsx +│ │ ├── CaptionCard.jsx +│ │ └── ScoreBreakdown.jsx +│ └── package.json +│ +├── Dockerfile # Pipeline + API container +├── docker-compose.yml # Full stack: API + worker + Redis + DB + frontend +└── demo_results/ # Auto-created per pipeline run + ├── output.srt + ├── report.json + ├── pipeline.log + └── frames/ +``` + +### 7.3 Key Design Decisions + +**Single config file:** Every threshold, weight, window size, and category definition lives in `config.py`. A reviewer, mentor, or future contributor can change the pipeline's entire behaviour without opening any module file. New sound categories with translations in all 10 languages can be added in one block with no code changes. + +**Static translation dictionary:** All 21 × 10 = 210 CC translations are stored as a plain Python dictionary. This is offline, zero-latency, auditable, and correctable by a PlanetRead editor with no coding knowledge. + +**Shared core pipeline:** The web app and CLI are both thin wrappers over the same `modules/` directory. There is no code duplication between deployment modes. + +**Dataclass-driven data flow:** `AudioEvent`, `VisualScore`, `FaceFrameScore`, and `CaptionEntry` are Python dataclasses with explicit types. This makes inter-module contracts clear and enables `asdict()` serialisation for JSON. + +**Mock-safe test design:** All 100+ tests stub TensorFlow, MediaPipe, and OpenCV before importing modules. The test suite runs in approximately 1 second with no internet connection and no ML model downloads required. + +**Graceful degradation chain:** If ffmpeg absent → librosa fallback. If Module 2 fails → audio-only mode. 
If language missing → English fallback. If no events → empty SRT written. The pipeline always produces output. + +--- + +## 8. Goals and Deliverables + +### Goal 1 — Sound Event Detection Module ✅ Completed + +| Deliverable | Status | +|------------|--------| +| Audio extraction via ffmpeg with librosa fallback | Done | +| YAMNet sliding-window inference (50% overlap) | Done | +| 28-entry blacklist for Indian content false positives | Done | +| 21 semantic sound categories across 3 priority tiers | Done | +| Priority-aware confidence boosting (1.0×–1.9×) | Done | +| Temporal merging and per-category capping | Done | +| 34 unit tests with full mock coverage | Done | + +### Goal 2 — Visual Reaction Detection Module ✅ Completed (Mid-Point Milestone) + +| Deliverable | Status | +|------------|--------| +| Temporal sampling window around each audio event | Done | +| MediaPipe Face Mesh landmark extraction | Done | +| EAR (Eye Aspect Ratio) computation | Done | +| MAR (Mouth Aspect Ratio) computation | Done | +| Brow Raise computation | Done | +| Temporal weighting | Done | +| Multi-face support (up to 4 faces) | Done | +| 19 unit tests with synthetic geometry | Done | + +### Goal 3 — CC Decision Engine & SRT/SLS Output ✅ Completed + +| Deliverable | Status | +|------------|--------| +| Weighted fusion formula (65/35 audio/visual split) | Done | +| Priority-aware thresholds (HIGH/MEDIUM/LOW) | Done | +| Audio-only fallback mode | Done | +| Temporal deduplication | Done | +| SRT gap enforcement | Done | +| Annotated frame export | Done | +| SRT + JSON output writers | Done | +| 47 unit tests including integration tests | Done | +| CLI entry point with all flags | Done | + +### Goal 4 — Multi-Language Support 🔲 Planned (DMP Phase 2) + +| Deliverable | Status | +|------------|--------| +| `translations.py` with all 21 categories × 10 languages | Planned | +| `--lang CODE` CLI flag | Planned | +| Language listing command (`cctools languages`) | Planned | +| UTF-8 SRT output 
for all Indian scripts | Planned | +| Language picker in web UI | Planned | +| 20 unit tests for translation module | Planned | +| Fallback to English when translation missing | Planned | + +### Goal 5 — Production Web Application 🔲 Planned (DMP Phase 3–4) + +| Deliverable | Status | +|------------|--------| +| FastAPI backend with async job processing | Planned | +| Celery + Redis task queue | Planned | +| PostgreSQL job database | Planned | +| React + Tailwind CSS frontend | Planned | +| Upload / Processing / Review page flow | Planned | +| Editor Accept/Reject/Edit interface | Planned | +| SRT download endpoint | Planned | +| Docker Compose deployment config | Planned | + +### Goal 6 — CLI Package & Batch Processing 🔲 Planned (DMP Phase 3) + +| Deliverable | Status | +|------------|--------| +| `cctools` CLI command via PyPI install | Planned | +| `cctools batch` with parallel workers | Planned | +| Python API for programmatic use | Planned | +| Progress bars via tqdm | Planned | +| Batch summary JSON report | Planned | + +--- + +## 9. Project Timeline — 16 Weeks + +### Phase 0 — Pre-DMP Work Already Completed + +| Area | Completed Work | +|------|---------------| +| Module 1 | YAMNet pipeline with blacklist, boost, merge, cap | +| Module 2 | MediaPipe face mesh with EAR/MAR/Brow + temporal aggregation | +| Module 3 | Fusion engine with priority thresholds, dedup, SRT enforcement | +| Testing | 100 unit tests, all passing, run in ~1 second | +| CLI | `main.py` with `--video`, `--output`, `--no-visual`, `--debug` flags | +| Config | `config.py` with all tunable parameters | + +--- + +### Phase 1 — Validation & Baseline (Weeks 1–2) + +**Objective:** Establish a quantitative performance baseline on real content. + +| Week | Tasks | +|------|-------| +| **Week 1** | Run the pipeline on 15 sample Hindi/regional video clips (Creative Commons or PlanetRead-provided). Document all false positives and false negatives from Module 1. 
Build a simple annotation spreadsheet: video, timestamp, ground-truth CC, predicted CC, correct Y/N. | +| **Week 2** | Analyse Module 1 errors. Extend the blacklist with observed Indian false-positive classes. Validate Module 2 on clips with varying lighting and face angles. Compute baseline precision, recall, and F1 score. Write the baseline evaluation report. | + +**Milestone:** Baseline evaluation report with precision/recall/F1 on ≥ 15 test clips. + +--- + +### Phase 2 — Multi-Language Support (Weeks 3–5) + +**Objective:** Deliver complete multilingual CC output as a production-ready feature. + +| Week | Tasks | +|------|-------| +| **Week 3** | Create `modules/translations.py`. Populate all 21 sound category labels for English, Hindi, and Tamil (highest priority languages). Write the 20-unit test suite for the translation module. Add `--lang` flag to `main.py`. Verify SRT files render correctly in VLC for all three languages. | +| **Week 4** | Add translations for Telugu, Bengali, Marathi. Test SRT files in Subtitle Edit and YouTube's caption upload flow. Verify Devanagari and Bengali scripts render correctly on Windows, macOS, and Android. Add `cctools languages` command. | +| **Week 5** | Add translations for Kannada, Malayalam, Gujarati, Punjabi. Test all 10 languages end-to-end. Write the fallback logic (missing translation → English + warning). Integrate the language selector into the CLI's `--help` output. | + +**Milestone:** All 10 languages producing correct, UTF-8 SRT output verified on 3 platforms. + +--- + +### Phase 3 — CLI Package & Batch Processing (Weeks 6–8) + +**Objective:** Ship a proper installable CLI package with batch support. + +| Week | Tasks | +|------|-------| +| **Week 6** | Refactor `main.py` into a proper `CCPipeline` class with a clean Python API. Write `setup.cfg` and `pyproject.toml` for PyPI packaging. Publish a test release to TestPyPI. 
Verify `pip install cc-suggestion-tool && cctools run --video test.mp4` works on a clean virtual environment. | +| **Week 7** | Implement `cctools batch` with `multiprocessing.Pool` for parallel video processing. Add `tqdm` progress bars for both single-video and batch modes. Write the batch summary JSON report (total videos, total captions, per-video stats, duration). | +| **Week 8** | Performance profiling on a 10-minute video. Implement optional spectrogram-based silence skipping in Module 1 to reduce processing time by 20–40% on sparse audio. Document processing speed benchmarks in the README. Target: < 0.5× real-time on CPU. | + +**Milestone:** `pip install cc-suggestion-tool` works cleanly; a 12-video batch completes in under 5 minutes on CPU. + +--- + +### Phase 4 — Web Application (Weeks 9–13) + +**Objective:** Build and deploy the production web application. + +| Week | Tasks | +|------|-------| +| **Week 9** | Set up FastAPI project structure. Implement `POST /api/jobs` (upload + enqueue) and `GET /api/jobs/{id}/status` (polling). Set up Celery + Redis. Write the Celery task that wraps the CCPipeline. Test the async job flow with curl. | +| **Week 10** | Implement `GET /api/jobs/{id}/result`, `PATCH /api/jobs/{id}/review`, and `GET /api/jobs/{id}/srt` endpoints. Set up PostgreSQL with SQLAlchemy. Write Pydantic schemas for all request/response bodies. Add OpenAPI documentation. | +| **Week 11** | Build the React frontend: Upload page with drag-and-drop, language picker dropdown, and processing mode selector. Build the Processing page with a polling progress bar showing the current stage name and live event count. | +| **Week 12** | Build the Review page: caption cards in the chosen language script, video player embed synced to timestamps, Accept/Reject/Edit buttons, SRT download button. Build the `ScoreBreakdown` expandable component showing audio/visual/fusion scores. | +| **Week 13** | Write `Dockerfile` and `docker-compose.yml` for the full stack. 
Deploy to Railway or Render. End-to-end test: upload a Hindi video on the live URL, select Hindi, download the SRT, verify content. Run a load test with 5 concurrent jobs. | + +**Milestone:** Live deployment at a public URL; 5 concurrent jobs processing correctly; Hindi SRT downloads working. + +--- + +### Phase 5 — Indian Content Adaptation & Editor Feedback (Weeks 14–15) + +**Objective:** Tune the tool specifically for Indian content and collect real editor feedback. + +| Week | Tasks | +|------|-------| +| **Week 14** | Add 8 India-specific sound categories: Dhol/Dholak, Tabla, Shehnai, Firecrackers (Diwali), Autorickshaw Horn, Conch Shell (Shankha), Crowd Chanting, Train Whistle. Add translations for all new categories in all 10 languages. Test on public-domain Indian media clips. Curate a 50-clip Hindi/regional benchmark dataset with manual CC ground truth. | +| **Week 15** | Conduct a structured usability session with 2–3 PlanetRead editors using the web application. Collect 100+ Accept/Reject/Edit decisions. Analyse patterns: which categories over-caption, which under-caption. Adjust thresholds in `config.py`. Measure editor acceptance rate (target: ≥ 80%). | + +**Milestone:** 50-clip benchmark dataset published; editor acceptance rate ≥ 80% measured on review session data. + +--- + +### Phase 6 — Final Hardening & Submission (Week 16) + +**Objective:** Polish, document, and submit all deliverables. + +| Week | Tasks | +|------|-------| +| **Week 16** | Write complete README covering both deployment modes, all CLI flags, and web app workflow. Write contributor guide for future DMP participants. Record a 5-minute video walkthrough. Final precision/recall report comparing Phase 1 baseline to final numbers. Tag a v1.0.0 release on GitHub. Submit all deliverables to the C4GT portal. | + +**Final Milestone:** v1.0.0 tagged; all deliverables submitted; live web app running; CLI installable from PyPI. + +--- + +## 10. 
Mid-Point Milestone + +As specified in the project brief, the mid-point milestone is the completion of **Goal 1 (Sound Event Detection)** and **Goal 2 (Visual Reaction Detection)**. + +**Both goals are already fully implemented** as of programme start, with verifiable evidence: + +| Evidence | Detail | +|----------|--------| +| `modules/sound_detector.py` | 453 lines — full YAMNet pipeline | +| `modules/visual_detector.py` | 425 lines — MediaPipe + EAR/MAR/Brow | +| `tests/test_sound_detector.py` | 34 passing tests | +| `tests/test_visual_detector.py` | 19 passing tests | +| `modules/fusion_engine.py` | 543 lines — full fusion + SRT output | +| `tests/test_fusion_engine.py` | 47 passing tests | +| `main.py` | Complete CLI, 367 lines | + +This means the DMP period is used entirely for **multi-language support, web deployment, batch processing, Indian content adaptation, editor feedback, and production polish** — rather than catching up on core functionality. + +--- + +## 11. Expected Outcomes + +### 11.1 Primary Deliverables + +| # | Deliverable | Description | +|---|------------|-------------| +| 1 | **Production pipeline** | Tested 4-module Python pipeline (3 implemented + translations module) | +| 2 | **10-language SRT output** | CC labels in English, Hindi, Tamil, Telugu, Bengali, Marathi, Kannada, Malayalam, Gujarati, Punjabi | +| 3 | **Production web app** | FastAPI + React web application, Docker-deployed, publicly accessible | +| 4 | **CLI tool on PyPI** | `pip install cc-suggestion-tool` → `cctools run --video clip.mp4 --lang hi` | +| 5 | **Batch processing** | `cctools batch` processes a folder with parallel workers | +| 6 | **120+ unit tests** | pytest suite running in < 10 seconds, 100% pass rate | +| 7 | **50-clip benchmark dataset** | Hindi/regional content with manual CC ground truth — public release | +| 8 | **JSON event report** | Full structured report for analytics and audit | +| 9 | **Editor review interface** | Web UI for Accept/Reject/Edit with 
decision storage | +| 10 | **Full documentation** | README, contributor guide, architecture diagram, 5-minute demo video | + +### 11.2 Quantitative Targets + +| Metric | Target | +|--------|--------| +| Precision (valid CC / total suggested) | ≥ 75% | +| Recall (valid CC / total ground-truth CCs) | ≥ 70% | +| Editor acceptance rate | ≥ 80% | +| Processing speed (CPU laptop) | < 0.5× real-time | +| Language coverage | 10 languages, 21 categories each | +| Translation accuracy (reviewed by native speakers) | ≥ 95% | +| Test suite pass rate | 100% | +| Test suite run time | < 10 seconds | +| Web app concurrent job capacity | ≥ 5 simultaneous | +| Batch processing speed (CPU, 4 workers) | ≥ 3 videos/minute for 30-second clips | + +### 11.3 Acceptance Criteria Mapping + +**Criterion 1: Detect non-speech audio events** +→ Module 1 detects 21 categories (expandable to 29 with Indian sounds) with sliding-window YAMNet, boost, blacklist, and merge. Validated with 34 unit tests. + +**Criterion 2: Assess speaker/scene reaction** +→ Module 2 computes EAR + MAR + Brow Raise across a temporal window. Validated with 19 unit tests. + +**Criterion 3: Produce CC-annotated SRT avoiding over-captioning** +→ Module 3 applies priority-aware fusion thresholds with deduplication. Module 4 applies the chosen language. Validated with 47 unit tests. Output importable into any standard subtitle tool. + +--- + +## 12. 
Future Work (Post-DMP) + +| Enhancement | Description | Complexity | +|-------------|------------|------------| +| **Real-time streaming mode** | Process live video stream; output CC events via WebSocket | High | +| **YAMNet fine-tuning** | Fine-tune on curated Indian sound dataset (dholak, shehnai, firecrackers) | High | +| **LLM-enhanced CC text** | Generate contextually richer captions: `[ Chair creaking as host shifts nervously ]` | Medium | +| **SLS integration** | Direct API integration with PlanetRead's Same Language Subtitling pipeline | Medium | +| **More regional languages** | Add Odia, Assamese, Sindhi, Kashmiri, Urdu (Nastaliq script) | Medium | +| **Confidence calibration** | Platt scaling on editor decision data for calibrated probability output | Medium | +| **Mobile application** | React Native app for on-device processing of short clips | High | +| **Speaker diarisation** | Detect who is reacting (speaker A vs speaker B) in multi-person content | High | +| **Emotion classification** | Classify the type of reaction (fear, joy, surprise, disgust) for richer CC text | Medium | +| **Community translation portal** | Web interface for native speakers to review and correct translations | Low | + +--- + +## 13. 
Why This Contributor + +### 13.1 Work Already Done + +Before submitting my proposal, I have already: + +- Implemented all three core detection modules from scratch (5 Python files, ~1,600 lines of production-quality, well-commented code) +- Written 100 unit tests that all pass with mocked dependencies in under 1 second +- Designed a category system with 21 sound classes, 3 priority tiers, and 28 blacklist entries specifically tuned for Indian content failure modes +- Implemented a temporal visual reaction window, EAR/MAR/Brow facial scoring, and weighted temporal aggregation across up to 4 simultaneous faces +- Built a full CLI with graceful degradation (Module 2 failure → audio-only; no face detected → auto-fallback; missing library → clear error message) +- Produced annotated output frames, a JSON event report, and a valid SRT file on real test clips + +### 13.2 Architectural Thinking + +The deployment strategy (shared core pipeline, thin wrapper for web and CLI) is not an afterthought — it is a deliberate architectural decision that prevents code duplication and ensures that every improvement benefits both deployment modes simultaneously. The static translation dictionary design was chosen over a live translation API specifically to avoid per-request cost and network dependency, making the tool viable for PlanetRead's offline and low-bandwidth use cases. + +### 13.3 Understanding of PlanetRead's Mission + +PlanetRead's core mission is Same Language Subtitling (SLS) for literacy improvement in India — a domain where every on-screen text element must be accurate enough to not mislead a learner reading along, sparse enough to not distract from the primary speech subtitle, and culturally calibrated to Indian audio environments. This proposal addresses all three requirements through the blacklist system, priority-tier thresholds, multi-language output, and the planned Indian content adaptation phase. 
+ +### 13.4 Commitment + +I am available for the full 16-week DMP 2026 programme and commit to: +- Weekly progress updates via the C4GT platform +- Bi-weekly mentor sync calls +- All code submitted via pull request with passing CI checks +- A public GitHub repository with Issues tracking progress against this timeline +- Final submission including working demo URL, PyPI package, documentation, and benchmark dataset + +--- + +## 14. Setup and Installation + +### CLI + +```bash +# Install system dependency +sudo apt install ffmpeg # Ubuntu/Debian +brew install ffmpeg # macOS + +# Install the tool +pip install cc-suggestion-tool + +# Run on a video in Hindi +cctools run --video lecture.mp4 --lang hi + +# Run on a video in Tamil (audio-only mode) +cctools run --video clip.mp4 --lang ta --no-visual + +# Batch process a folder in Telugu +cctools batch --input-dir /videos/ --lang te --workers 4 + +# List all supported languages +cctools languages +``` + +### Web Application (Docker) + +```bash +git clone https://github.com//cc-suggestion-tool.git +cd cc-suggestion-tool + +# Start the full stack (API + worker + Redis + DB + frontend) +docker-compose up --build + +# Open in browser +open http://localhost:3000 +``` + +### Development Setup + +```bash +git clone https://github.com//cc-suggestion-tool.git +cd cc-suggestion-tool +pip install -r requirements.txt + +# Run on a video +python main.py --video clip.mp4 --lang hi + +# Run the test suite (no model downloads required — all mocked) +python -m pytest tests/ -v +``` + +--- + +## 15. 
References + +- PlanetRead — Same Language Subtitling: https://planetread.org +- YAMNet (Yet Another Mobile Network): https://tfhub.dev/google/yamnet/1 +- AudioSet (Google): Gemmeke et al., 2017 — https://research.google.com/audioset/ +- MediaPipe Face Mesh: Kartynnik et al., 2019 — https://google.github.io/mediapipe/solutions/face_mesh +- PANNs (Pretrained Audio Neural Networks): Kong et al., 2020 — https://github.com/qiuqiangkong/audioset_tagging_cnn +- C4GT DMP 2026: https://codeforgovtech.in + +--- + +*Proposal submitted for C4GT DMP 2026 | PlanetRead | Intelligent CC Suggestion Tool* +*Version: 2.0 | Date: May 2026* diff --git a/README.md b/README.md new file mode 100644 index 0000000..894f47c --- /dev/null +++ b/README.md @@ -0,0 +1,356 @@ +# [DMP 2026] PlanetRead - Intelligent Closed Caption Suggestion Tool + +Three-module, multi-modal pipeline that generates non-speech closed captions from video. + +This project detects important non-dialog sounds (for example, scream, glass break, door slam, laughter), verifies human reaction around those moments, and writes final subtitle suggestions as an SRT file. + +## Why this project matters + +Many subtitle pipelines focus heavily on speech but under-represent meaningful non-speech events. This can reduce accessibility and context for deaf and hard-of-hearing audiences. + +This project addresses that gap by: + +1. Detecting candidate sound events from audio. +2. Checking visual reaction around those timestamps. +3. Fusing both signals to decide whether a caption should be emitted. + +The result is a stronger captioning signal than audio-only heuristics for many cinematic/social scenes. + +## Problem it solves + +Traditional approaches often produce one of two failures: + +1. Too many captions: ambient or low-value sounds flood subtitles. +2. Too few captions: brief but important sound effects are missed. 
+ +This pipeline reduces both by combining category-aware audio confidence with visual reaction evidence, then applying priority-based thresholds. + +## High-level architecture + +1. Module 1 ([modules/sound_detector.py](modules/sound_detector.py)) +2. Module 2 ([modules/visual_detector.py](modules/visual_detector.py)) +3. Module 3 ([modules/fusion_engine.py](modules/fusion_engine.py)) +4. Output writers ([utils/srt_writer.py](utils/srt_writer.py)) +5. Pipeline orchestrator ([main.py](main.py)) +6. Tunable parameters ([config.py](config.py)) + +## Flow diagram + +```mermaid +flowchart TD + A[Input Video] --> B[main.py CLI + Orchestration] + B --> C[Module 1: SoundEventDetector] + C --> C1[Audio extraction + ffmpeg/librosa] + C1 --> C2[YAMNet sliding-window inference] + C2 --> C3[Category mapping + blacklist + boost] + C3 --> C4[AudioEvent list + timestamps] + + C4 --> D[Module 2: VisualReactionDetector] + D --> D1[Frame sampling around each timestamp] + D1 --> D2[MediaPipe Face Mesh] + D2 --> D3[EAR + MAR + Brow features] + D3 --> D4[Temporal weighted aggregation] + D4 --> D5[VisualScore per timestamp] + + C4 --> E[Module 3: FusionEngine] + D5 --> E + E --> E1[Weighted fusion score] + E1 --> E2[Priority thresholds HIGH/MEDIUM/LOW] + E2 --> E3[Dedup + SRT gap enforcement] + E3 --> F[CaptionEntry list] + + F --> G1[output.srt] + F --> G2[report.json] + F --> G3[Annotated frames] +``` + +## Full directory structure (workspace snapshot) + +```text +Intelligent-cc-generation/ +├── demo_results/ +│ ├── frames/ +│ │ ├── frame_00000_t0.00s_music.jpg +│ │ ├── frame_00817_t34.08s_rat_squeak.jpg +│ │ ├── frame_00851_t35.52s_door.jpg +│ │ ├──(...remaining annotated frames of video) +│ ├── final_results.json +│ ├── output.srt +│ ├── pipeline.log +│ └── report.json +├── modules/ +│ ├── fusion_engine.py +│ ├── sound_detector.py +│ └── visual_detector.py +├── utils/ +│ ├── __init__.py +│ ├── logger.py +│ └── srt_writer.py +├── config.py +├── fight.mp4 +├── main.py +├── README.md 
+└── requirements.txt +``` + +## End-to-end pipeline (actual implementation) + +1. `main.py` parses CLI args (`--video`, `--output`, `--no-visual`, `--debug`, `--no-frames`). +2. Module 1 extracts mono 16 kHz audio and runs YAMNet in sliding windows. +3. Module 1 maps YAMNet classes to project categories, applies boost factors, filters blacklist classes, merges close duplicates, and caps per-category events. +4. Module 2 receives Module 1 timestamps and samples frames in a local temporal window around each event. +5. Module 2 computes face-based reaction features (EAR, MAR, brow raise), normalizes and aggregates into per-event visual scores. +6. Module 3 computes fused scores using weighted audio + visual signals and priority-dependent thresholds. +7. Module 3 de-duplicates temporally similar accepted captions and enforces SRT timeline gaps. +8. Writers generate: +9. `output.srt` +10. `report.json` +11. Optional annotated reaction frames in `demo_results/frames/`. + +## Deep scan of the 3 modules + +### Module 1: Sound Event Detection + +Source: [modules/sound_detector.py](modules/sound_detector.py) + +Core logic: + +1. Extract audio from video with ffmpeg + soundfile (fallback to librosa). +2. Resample/standardize to 16 kHz mono. +3. Sliding-window inference: +4. Window: `0.96s` (`AUDIO_WINDOW_SEC`) +5. Hop: `0.48s` (`AUDIO_HOP_SEC`) +6. Run YAMNet and keep top classes above `YAMNET_RAW_THRESHOLD`. +7. Discard known noisy classes via `YAMNET_BLACKLIST`. +8. Map class tokens to `SOUND_CATEGORIES` entries. +9. Apply category boost and `AUDIO_EMIT_THRESHOLD`. +10. Merge near events (`EVENT_MERGE_GAP_SEC`) and cap category floods (`MAX_EVENTS_PER_CATEGORY`). + +Output: list of `AudioEvent` dataclasses with timestamp, category, display label, priority, confidence, and raw class evidence. + +Why this design helps: + +1. Better capture of short events than one-shot full-clip inference. +2. 
More controllable false-positive behavior through category map, blacklist, and thresholds. + +### Module 2: Visual Reaction Detection + +Source: [modules/visual_detector.py](modules/visual_detector.py) + +Core logic: + +1. For each audio timestamp, open a temporal window: +2. Start: `t - VISUAL_WINDOW_BEFORE_SEC` +3. End: `t + VISUAL_WINDOW_AFTER_SEC` +4. Uniformly sample up to `VISUAL_MAX_FRAMES_PER_WINDOW` frames. +5. Use MediaPipe Face Mesh per frame. +6. Compute facial reaction primitives: +7. EAR (eye opening) +8. MAR (mouth opening) +9. Brow raise +10. Normalize with baselines from config and clamp to `[0, 1]`. +11. Compute per-frame composite score from weighted sub-scores. +12. Aggregate with temporal weighting (frames near the event are weighted more). +13. Return `VisualScore` with `reaction_score`, valid frame count, confidence tier, and note. + +Output: `Dict[timestamp, VisualScore]`. + +Why this design helps: + +1. Reactions are time-distributed, not single-frame; temporal aggregation is more robust. +2. Better resilience against isolated bad frames or brief face-tracking misses. + +### Module 3: Fusion Decision Engine + +Source: [modules/fusion_engine.py](modules/fusion_engine.py) + +Core logic: + +1. For each audio event, locate matching visual score (with float tolerance fallback). +2. Compute fusion score: + +`fusion = FUSION_AUDIO_WEIGHT * audio_conf + FUSION_VISUAL_WEIGHT * visual_reaction` + +3. Apply priority threshold (`FUSION_THRESHOLD`): +4. `HIGH`: lenient +5. `MEDIUM`: moderate +6. `LOW`: strict +7. If no faces are detected globally, switch to audio-only mode and reduce thresholds by 20%. +8. Accept/reject each candidate based on score vs threshold. +9. De-duplicate same-category events within `CAPTION_DEDUP_SEC`. +10. Build caption entries with priority-informed durations. +11. Enforce non-overlap and minimum subtitle gap (`SRT_MIN_GAP_SEC`). +12. Optionally annotate frames for accepted captions. 
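The accept/reject step in the core logic above can be sketched as follows. This is a minimal illustration, not the code in `modules/fusion_engine.py`: the weights and thresholds are the values from `config.py`, while the function names `fuse` and `should_caption`, the `AUDIO_ONLY_THRESHOLD_SCALE` constant, and the choice to treat a missing visual score as 0.0 are assumptions made for the example.

```python
from typing import Optional

# Values copied from config.py in this repository.
FUSION_AUDIO_WEIGHT = 0.65
FUSION_VISUAL_WEIGHT = 0.35
FUSION_THRESHOLD = {"HIGH": 0.28, "MEDIUM": 0.40, "LOW": 0.60}

# Assumption: "reduce thresholds by 20%" is a plain multiplication by 0.8.
AUDIO_ONLY_THRESHOLD_SCALE = 0.8


def fuse(audio_conf: float, visual_reaction: Optional[float]) -> float:
    """Weighted fusion score; a missing visual score counts as 0.0 (assumption)."""
    visual = 0.0 if visual_reaction is None else visual_reaction
    return FUSION_AUDIO_WEIGHT * audio_conf + FUSION_VISUAL_WEIGHT * visual


def should_caption(
    audio_conf: float,
    visual_reaction: Optional[float],
    priority: str,
    audio_only: bool = False,
) -> bool:
    """Compare the fused score against the threshold for the event's priority tier."""
    threshold = FUSION_THRESHOLD[priority]
    if audio_only:
        # No usable faces: ignore the visual term and relax the threshold.
        threshold *= AUDIO_ONLY_THRESHOLD_SCALE
        visual_reaction = None
    return fuse(audio_conf, visual_reaction) >= threshold


if __name__ == "__main__":
    for tier in ("HIGH", "MEDIUM", "LOW"):
        print(tier, should_caption(0.5, 0.2, tier))
```

With these values, an event with audio confidence 0.5 and visual reaction 0.2 fuses to about 0.395, which clears the `HIGH` threshold (0.28) but falls just short of `MEDIUM` (0.40). That is the intended asymmetry: critical sounds pass on modest evidence, while lower-priority sounds need more.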
+ +Output: list of final `CaptionEntry` objects for SRT/JSON reporting. + +Why this design helps: + +1. Keeps critical sounds sensitive while suppressing low-value noise. +2. De-duplication and gap handling make subtitle output more readable. + +## Output Images + +Annotated output frames (images not reproduced here): `frame_01668_t69.60s_door`, `frame_04522_t188.64s_glass_break`, `frame_05581_t232.80s_glass_break`. + +## Features + +1. Three-stage audio-visual fusion architecture. +2. Priority-aware emission logic. +3. Audio-only fallback when faces are absent or visual step is skipped. +4. Config-driven thresholds and category mapping. +5. Structured machine-readable report output (`report.json`). +6. Optional visual debugging via frame annotation. +7. CLI flags for quick experimentation and debugging. + +## Tech stack + +1. Python 3.x +2. TensorFlow 2.15 + TensorFlow Hub (YAMNet) +3. MediaPipe Face Mesh +4. OpenCV +5. NumPy +6. ffmpeg (system or `imageio-ffmpeg` fallback path used by code) +7. soundfile / librosa for audio handling + +Dependencies tracked in [requirements.txt](requirements.txt). + +## Project structure (important files) + +1. [main.py](main.py): CLI and orchestration of all modules. +2. [config.py](config.py): all thresholds, windows, weights, categories, and output defaults. +3. [modules/sound_detector.py](modules/sound_detector.py): Module 1. +4. [modules/visual_detector.py](modules/visual_detector.py): Module 2. +5. [modules/fusion_engine.py](modules/fusion_engine.py): Module 3. +6. [utils/srt_writer.py](utils/srt_writer.py): SRT + JSON writers. +7. [demo_results/](demo_results/): generated outputs. + +## Setup + +### 1. Create and activate virtual environment + +Windows PowerShell: + +```powershell +python -m venv .venv +.\.venv\Scripts\Activate.ps1 +``` + +### 2. Install dependencies + +```powershell +pip install -r requirements.txt +``` + +### 3. Ensure ffmpeg is available + +The code first checks system ffmpeg. If unavailable, it attempts to use `imageio-ffmpeg` as a fallback. 
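A lookup along these lines would match that behaviour. It is a sketch under assumptions rather than the repository's actual code: `find_ffmpeg` is a hypothetical helper name, and the only third-party call used is `imageio_ffmpeg.get_ffmpeg_exe()`, which returns the path of the binary bundled with the `imageio-ffmpeg` package.

```python
import shutil


def find_ffmpeg() -> str:
    """Return a path to an ffmpeg executable, preferring the system install."""
    # 1) Look for ffmpeg on PATH (handles ffmpeg.exe on Windows as well).
    system_ffmpeg = shutil.which("ffmpeg")
    if system_ffmpeg:
        return system_ffmpeg

    # 2) Fall back to the binary shipped with the optional imageio-ffmpeg package.
    try:
        import imageio_ffmpeg
    except ImportError as exc:
        raise RuntimeError(
            "ffmpeg was not found on PATH and imageio-ffmpeg is not installed"
        ) from exc
    return imageio_ffmpeg.get_ffmpeg_exe()


if __name__ == "__main__":
    print("Using ffmpeg at:", find_ffmpeg())
```

Running the snippet directly is a quick way to confirm which binary would be picked up before starting a long pipeline run.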
+ +## How to run + +### Default run + +```powershell +python main.py +``` + +Uses default video path from [main.py](main.py) and default output directory from [config.py](config.py). + +### Run on a specific video + +```powershell +python main.py --video path\to\clip.mp4 +``` + +### Audio-only mode (skip Module 2) + +```powershell +python main.py --video path\to\clip.mp4 --no-visual +``` + +### Custom output directory + +```powershell +python main.py --video path\to\clip.mp4 --output demo_results +``` + +### Debug logs and no frame dumps + +```powershell +python main.py --video path\to\clip.mp4 --debug --no-frames +``` + +## Output artifacts + +1. `demo_results/output.srt`: final subtitle suggestions. +2. `demo_results/report.json`: full report with metadata, audio events, visual scores, and final captions. +3. `demo_results/frames/*.jpg`: optional annotated frames. +4. `demo_results/pipeline.log` (if enabled by logger setup): processing details. + +## Key configuration knobs + +All in [config.py](config.py): + +1. Audio sensitivity: +2. `YAMNET_RAW_THRESHOLD` +3. `AUDIO_EMIT_THRESHOLD` +4. `EVENT_MERGE_GAP_SEC` +5. Visual analysis window: +6. `VISUAL_WINDOW_BEFORE_SEC` +7. `VISUAL_WINDOW_AFTER_SEC` +8. `VISUAL_MAX_FRAMES_PER_WINDOW` +9. Fusion behavior: +10. `FUSION_AUDIO_WEIGHT` +11. `FUSION_VISUAL_WEIGHT` +12. `FUSION_THRESHOLD` +13. Subtitle timeline: +14. `SRT_DISPLAY_DURATION` +15. `SRT_MIN_GAP_SEC` + +## Advantages + +1. Multi-modal decisioning improves over simple audio-only triggers. +2. Priority-aware thresholds preserve critical event recall. +3. Config-first design makes tuning easy without changing code. +4. Explainable behavior with intermediate artifacts and logs. +5. Graceful fallbacks (visual skip/error handling and audio-only mode). + +## Limitations + +1. YAMNet class biases may affect niche/regional sound coverage. +2. Visual detection depends on clear, detectable faces; occlusion/low light hurts performance. +3. 
Heuristic facial scoring (EAR/MAR/brow) is lightweight but not emotion-model level. +4. Fixed thresholds can require per-domain tuning. +5. Real-time/long-duration optimization is limited; current flow is clip-oriented. +6. No built-in benchmark harness in this repository for precision/recall/F1 tracking. + +## Future work + +1. Add stronger audio backbones (for example, PANNs/BEATs or fine-tuned models). +2. Add broader visual cues (pose/body motion) beyond face-only reactions. +3. Introduce adaptive thresholds by scene context and sound type. +4. Build quantitative evaluation suite with labeled datasets. +5. Add batch processing and lightweight service API/GUI. +6. Expand language/domain tuning for local/regional content. +7. Add confidence calibration and uncertainty reporting in outputs. + +## Practical use cases + +1. Accessibility-first subtitle assistance for edited video content. +2. Pre-captioning support for post-production teams. +3. Assistive indexing of high-salience non-speech events. +4. Educational demos of multi-modal event fusion. + +## Troubleshooting + +1. No captions generated: +2. Lower `AUDIO_EMIT_THRESHOLD` and/or `FUSION_THRESHOLD` in [config.py](config.py). +3. Verify input has audible non-speech events. +4. Visual scores all low or missing: +5. Try `--no-visual` to verify audio path independently. +6. Check lighting/face visibility in source video. +7. Audio extraction fails: +8. Ensure ffmpeg is installed or fallback dependencies are available. + +## Current status + +This repository already includes the complete 3-module flow (audio detection, visual reaction scoring, and fusion-based final caption emission) executed by [main.py](main.py). 
diff --git a/config.py b/config.py new file mode 100644 index 0000000..8798e09 --- /dev/null +++ b/config.py @@ -0,0 +1,260 @@ +OUTPUT_DIR = "demo_results" +FRAMES_DIR = "demo_results/frames" +LOG_LEVEL = "INFO" + +# ───────────────────────────────────────────── +# MODULE 1 — Sound Event Detection +# ───────────────────────────────────────────── +YAMNET_MODEL_PATH = "https://tfhub.dev/google/yamnet/1" + +# Sliding window for audio analysis +AUDIO_WINDOW_SEC = 0.96 # YAMNet native window (960 ms) +AUDIO_HOP_SEC = 0.48 # 50 % overlap → smoother detection +AUDIO_SAMPLE_RATE = 16000 # YAMNet requires 16 kHz mono + +# Minimum raw YAMNet confidence to even consider a class +YAMNET_RAW_THRESHOLD = 0.10 + +# After priority boosting, minimum score to emit an event +AUDIO_EMIT_THRESHOLD = 0.25 + +# Merge nearby events of the same category within this window (seconds) +EVENT_MERGE_GAP_SEC = 1.5 + +# Maximum events per category across the whole clip (avoids flooding) +MAX_EVENTS_PER_CATEGORY = 8 + +# Default caption duration when no end-time is available (seconds) +DEFAULT_CAPTION_DURATION_SEC = 2.0 + +SOUND_CATEGORIES = { + "SCREAM": { + "display": "[ Screaming ]", + "priority": "HIGH", + "boost": 1.8, + "yamnet": ["scream", "shout", "yell", "shriek", "wail", "cry"], + }, + "EXPLOSION": { + "display": "[ Explosion ]", + "priority": "HIGH", + "boost": 1.9, + "yamnet": ["explosion", "bang", "burst", "blast", "boom"], + }, + "GUNSHOT": { + "display": "[ Gunshot ]", + "priority": "HIGH", + "boost": 1.9, + "yamnet": ["gunshot", "gunfire", "shot", "ricochet", "firearm"], + }, + "GLASS_BREAK": { + "display": "[ Glass Breaking ]", + "priority": "HIGH", + "boost": 1.7, + "yamnet": ["glass", "shatter", "breaking"], + }, + "CRASH": { + "display": "[ Crash ]", + "priority": "HIGH", + "boost": 1.6, + "yamnet": ["crash", "collision", "impact", "smash"], + }, + "ALARM": { + "display": "[ Alarm / Siren ]", + "priority": "HIGH", + "boost": 1.7, + "yamnet": ["alarm", "siren", "beep", "buzzer", 
"alert", "horn"], + }, + + "LAUGHTER": { + "display": "[ Laughter ]", + "priority": "MEDIUM", + "boost": 1.4, + "yamnet": ["laugh", "giggle", "chuckle", "cackle"], + }, + "APPLAUSE": { + "display": "[ Applause ]", + "priority": "MEDIUM", + "boost": 1.4, + "yamnet": ["applause", "clapping", "clap"], + }, + "CRYING": { + "display": "[ Crying ]", + "priority": "MEDIUM", + "boost": 1.5, + "yamnet": ["crying", "sobbing", "weeping", "whimper"], + }, + "KNOCK": { + "display": "[ Knocking ]", + "priority": "MEDIUM", + "boost": 1.3, + "yamnet": ["knock", "tap", "rap", "pound"], + }, + "DOORBELL": { + "display": "[ Doorbell ]", + "priority": "MEDIUM", + "boost": 1.5, + "yamnet": ["doorbell", "ding dong", "bell"], + }, + "PHONE": { + "display": "[ Phone Ringing ]", + "priority": "MEDIUM", + "boost": 1.4, + "yamnet": ["telephone", "ringtone", "phone", "mobile"], + }, + + "DOG": { + "display": "[ Dog Barking ]", + "priority": "MEDIUM", + "boost": 1.3, + "yamnet": ["dog", "bark", "howl", "growl", "whine"], + }, + "CAT": { + "display": "[ Cat ]", + "priority": "MEDIUM", + "boost": 1.2, + "yamnet": ["cat", "meow", "purr", "hiss"], + }, + "RAT_SQUEAK": { + "display": "[ Squeak / Rodent ]", + "priority": "MEDIUM", + "boost": 1.5, + "yamnet": ["squeak", "squeal", "rodent", "mouse", "rat"], + }, + + "CHAIR_CREAK": { + "display": "[ Creaking ]", + "priority": "MEDIUM", + "boost": 1.4, + "yamnet": ["creak", "squeak", "crunch", "groan", "grind"], + }, + "FOOTSTEPS": { + "display": "[ Footsteps ]", + "priority": "MEDIUM", + "boost": 1.2, + "yamnet": ["footstep", "walk", "stomp", "running"], + }, + "DOOR": { + "display": "[ Door ]", + "priority": "MEDIUM", + "boost": 1.3, + "yamnet": ["door", "slam", "close", "open"], + }, + "THUNDER": { + "display": "[ Thunder ]", + "priority": "MEDIUM", + "boost": 1.5, + "yamnet": ["thunder", "lightning", "storm"], + }, + "MUSIC": { + "display": "[ Music ]", + "priority": "MEDIUM", + "boost": 1.0, + "yamnet": [ + "music", "song", "melody", "beat", 
"drum", "guitar", "piano", + "sitar", "tabla", "flute", "violin", "instrument", + ], + }, + + "AMBIENT": { + "display": "[ Background Noise ]", + "priority": "LOW", + "boost": 0.5, + "yamnet": [ + "silence", "noise", "hum", "murmur", "ambient", + "white noise", "static", + ], + }, +} +YAMNET_BLACKLIST = [ + "vehicle", "car", "truck", "motorcycle", "bus", "train", "aircraft", + "engine", "bicycle", "traffic", "road", "rail", "boat", "ship", + "rain", "drizzle", "thunder shower", # handled separately above + "wind", "rustle", "leaves", + "television", "radio", # usually speech channel bleed + "printer", "keyboard", # office ambient + "air conditioning", "fan", "vacuum cleaner", + "crowd", "hubbub", # too generic + "speech", "conversation", "narration", # speech — not CC + "singing", # usually part of lyrics track +] + +# ───────────────────────────────────────────── +# MODULE 2 — Visual Reaction Detection +# ───────────────────────────────────────────── + +# Seconds to sample around each audio event timestamp +VISUAL_WINDOW_BEFORE_SEC = 0.5 +VISUAL_WINDOW_AFTER_SEC = 1.2 + +# Max frames sampled per window (evenly spaced) +VISUAL_MAX_FRAMES_PER_WINDOW = 8 + +# Minimum number of valid (face-detected) frames to trust a score +VISUAL_MIN_VALID_FRAMES = 2 + +# MediaPipe Face Mesh confidence thresholds +MEDIAPIPE_DETECTION_CONFIDENCE = 0.5 +MEDIAPIPE_TRACKING_CONFIDENCE = 0.5 + +# Eye Aspect Ratio (EAR) — wide eyes = surprise / fear +EYE_LANDMARKS = { + "left": {"top": 159, "bottom": 145, "inner": 133, "outer": 33}, + "right": {"top": 386, "bottom": 374, "inner": 362, "outer": 263}, +} +# Mouth Aspect Ratio (MAR) — open mouth = surprise / reaction +MOUTH_LANDMARKS = { + "top": 13, "bottom": 14, "left": 78, "right": 308, + "top2": 12, "bottom2": 15, +} +# Brow Raise — upper brow vs eye-corner distance +BROW_LANDMARKS = { + "left_brow": [70, 63, 105, 66, 107], + "right_brow": [300, 293, 334, 296, 336], + "left_eye_top": 159, + "right_eye_top": 386, +} + +# Baseline EAR 
(neutral relaxed eye) +EAR_BASELINE = 0.25 +# EAR delta that counts as "wide eyes" reaction +EAR_REACTION_DELTA = 0.05 + +# Baseline MAR (closed mouth) +MAR_BASELINE = 0.05 +# MAR delta that counts as "open mouth" reaction +MAR_REACTION_DELTA = 0.06 + +# Brow raise threshold (normalized by face height) +BROW_RAISE_THRESHOLD = 0.02 + +# Weights for combining sub-scores into the final visual reaction score +VISUAL_WEIGHT_EAR = 0.35 +VISUAL_WEIGHT_MAR = 0.40 +VISUAL_WEIGHT_BROW = 0.25 + +# ───────────────────────────────────────────── +# MODULE 3 — Fusion Decision Engine +# ───────────────────────────────────────────── + +# Weight given to audio confidence vs visual reaction score +FUSION_AUDIO_WEIGHT = 0.65 +FUSION_VISUAL_WEIGHT = 0.35 + +# Priority-tier overrides: lower threshold → easier to emit caption +FUSION_THRESHOLD = { + "HIGH": 0.28, # scream, explosion → lenient + "MEDIUM": 0.40, # laughter, animal → normal + "LOW": 0.60, # ambient → strict +} + +# Suppress duplicate captions of the same category within N seconds +CAPTION_DEDUP_SEC = 3.0 + +# SRT subtitle display duration (seconds) per priority tier +SRT_DISPLAY_DURATION = { + "HIGH": 2.5, + "MEDIUM": 2.0, + "LOW": 1.5, +} + +SRT_MIN_GAP_SEC = 0.3 \ No newline at end of file diff --git a/demo-module2/module2.py b/demo-module2/module2.py new file mode 100644 index 0000000..bcaa0b3 --- /dev/null +++ b/demo-module2/module2.py @@ -0,0 +1,95 @@ +import cv2 +import mediapipe as mp +import numpy as np +import json +import os +from typing import List, Dict + +class VisualReactionAnalyzer: + """Module 2: Speaker/Scene Reaction Detection""" + + def __init__(self): + print("Module 2 - Visual Reaction Analyzer") + + self.mp_face_mesh = mp.solutions.face_mesh + self.face_mesh = self.mp_face_mesh.FaceMesh( + max_num_faces=3, + refine_landmarks=True, + min_detection_confidence=0.4, + min_tracking_confidence=0.4 + ) + + def analyze_frame(self, frame): + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + results = 
self.face_mesh.process(rgb) + + if not results.multi_face_landmarks: + return {"reaction_score": 0.0, "expression": "no_face", "confidence": 0.0} + + landmarks = results.multi_face_landmarks[0].landmark + + # Metrics: vertical landmark gaps in normalized image coordinates + eye_open = abs(landmarks[159].y - landmarks[145].y) + abs(landmarks[386].y - landmarks[374].y) + mouth_open = abs(landmarks[13].y - landmarks[14].y) + brow_raise = abs(landmarks[70].y - landmarks[300].y) # height difference between left (70) and right (300) brow points + + # Score: weighted sum of the three metrics, capped at 1.0 + score = (eye_open * 8 + mouth_open * 12 + brow_raise * 6) + reaction_score = min(1.0, score) + + expression = "strong_surprise" if reaction_score > 0.65 else \ + "moderate_reaction" if reaction_score > 0.35 else "neutral" + + return { + "reaction_score": round(reaction_score, 3), + "expression": expression, + "eye_openness": round(float(eye_open), 3), + "mouth_openness": round(float(mouth_open), 3), + "confidence": round(reaction_score, 3) + } + + def analyze_video_at_timestamps(self, video_path: str, timestamps: List[float] = None): + if timestamps is None: + timestamps = [1, 2, 3, 4, 5, 6] + + os.makedirs("module2_results", exist_ok=True) + cap = cv2.VideoCapture(video_path) + fps = cap.get(cv2.CAP_PROP_FPS) # read once; reused for every timestamp below + print(f"Video: {video_path}") + print(f"FPS: {fps:.2f}\n") + + results = {} + + for ts in timestamps: + frame_no = int(ts * fps) + cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no) + ret, frame = cap.read() + + if not ret or frame is None: + print(f"❌ Could not read frame at {ts}s") + continue + + data = self.analyze_frame(frame) + results[ts] = data + + # Save an annotated copy of the sampled frame + annotated = frame.copy() + text = f"{ts}s | {data['reaction_score']} | {data['expression']}" + cv2.putText(annotated, text, (20, 60), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 3) + + cv2.imwrite(f"module2_results/frame_{ts}s.jpg", annotated) + print(f"✅ {ts}s → Score: {data['reaction_score']} | {data['expression']}") + + cap.release() + + with open("module2_results/results.json", "w") as f: + json.dump(results, f, indent=2) + + print("\nModule 2 
Completed") + print("Results saved in 'module2_results' folder") + return results + + +if __name__ == "__main__": + analyzer = VisualReactionAnalyzer() + analyzer.analyze_video_at_timestamps("sample_video.mp4") \ No newline at end of file diff --git a/demo-module2/module2_results/frame_1s.jpg b/demo-module2/module2_results/frame_1s.jpg new file mode 100644 index 0000000..46e5447 Binary files /dev/null and b/demo-module2/module2_results/frame_1s.jpg differ diff --git a/demo-module2/module2_results/frame_2s.jpg b/demo-module2/module2_results/frame_2s.jpg new file mode 100644 index 0000000..d09c04e Binary files /dev/null and b/demo-module2/module2_results/frame_2s.jpg differ diff --git a/demo-module2/module2_results/frame_3s.jpg b/demo-module2/module2_results/frame_3s.jpg new file mode 100644 index 0000000..02f77f0 Binary files /dev/null and b/demo-module2/module2_results/frame_3s.jpg differ diff --git a/demo-module2/module2_results/frame_4s.jpg b/demo-module2/module2_results/frame_4s.jpg new file mode 100644 index 0000000..2bc66ed Binary files /dev/null and b/demo-module2/module2_results/frame_4s.jpg differ diff --git a/demo-module2/module2_results/frame_5s.jpg b/demo-module2/module2_results/frame_5s.jpg new file mode 100644 index 0000000..f7c7f6e Binary files /dev/null and b/demo-module2/module2_results/frame_5s.jpg differ diff --git a/demo-module2/module2_results/frame_6s.jpg b/demo-module2/module2_results/frame_6s.jpg new file mode 100644 index 0000000..0d48718 Binary files /dev/null and b/demo-module2/module2_results/frame_6s.jpg differ diff --git a/demo-module2/module2_results/results.json b/demo-module2/module2_results/results.json new file mode 100644 index 0000000..23a67db --- /dev/null +++ b/demo-module2/module2_results/results.json @@ -0,0 +1,44 @@ +{ + "1": { + "reaction_score": 0.679, + "expression": "strong_surprise", + "eye_openness": 0.027, + "mouth_openness": 0.037, + "confidence": 0.679 + }, + "2": { + "reaction_score": 0.68, + "expression": 
"strong_surprise", + "eye_openness": 0.031, + "mouth_openness": 0.028, + "confidence": 0.68 + }, + "3": { + "reaction_score": 0.201, + "expression": "neutral", + "eye_openness": 0.017, + "mouth_openness": 0.0, + "confidence": 0.201 + }, + "4": { + "reaction_score": 0.471, + "expression": "moderate_reaction", + "eye_openness": 0.014, + "mouth_openness": 0.017, + "confidence": 0.471 + }, + "5": { + "reaction_score": 0.763, + "expression": "strong_surprise", + "eye_openness": 0.017, + "mouth_openness": 0.035, + "confidence": 0.763 + }, + "6": { + "reaction_score": 0.65, + "expression": "moderate_reaction", + "eye_openness": 0.016, + "mouth_openness": 0.022, + "confidence": 0.65 + } +} \ No newline at end of file diff --git a/demo-module2/requirements.txt b/demo-module2/requirements.txt new file mode 100644 index 0000000..9e5ac25 --- /dev/null +++ b/demo-module2/requirements.txt @@ -0,0 +1,3 @@ +mediapipe +opencv-python +numpy \ No newline at end of file diff --git a/demo-module2/sample_video.mp4 b/demo-module2/sample_video.mp4 new file mode 100644 index 0000000..ca99ac2 Binary files /dev/null and b/demo-module2/sample_video.mp4 differ diff --git a/demo_results/frames/frame_00000_t0.00s_music.jpg b/demo_results/frames/frame_00000_t0.00s_music.jpg new file mode 100644 index 0000000..26588dd Binary files /dev/null and b/demo_results/frames/frame_00000_t0.00s_music.jpg differ diff --git a/demo_results/frames/frame_00817_t34.08s_rat_squeak.jpg b/demo_results/frames/frame_00817_t34.08s_rat_squeak.jpg new file mode 100644 index 0000000..71df20c Binary files /dev/null and b/demo_results/frames/frame_00817_t34.08s_rat_squeak.jpg differ diff --git a/demo_results/frames/frame_00851_t35.52s_door.jpg b/demo_results/frames/frame_00851_t35.52s_door.jpg new file mode 100644 index 0000000..a989f78 Binary files /dev/null and b/demo_results/frames/frame_00851_t35.52s_door.jpg differ diff --git a/demo_results/frames/frame_01438_t60.00s_alarm.jpg 
b/demo_results/frames/frame_01438_t60.00s_alarm.jpg new file mode 100644 index 0000000..74284be Binary files /dev/null and b/demo_results/frames/frame_01438_t60.00s_alarm.jpg differ diff --git a/demo_results/frames/frame_01461_t60.96s_door.jpg b/demo_results/frames/frame_01461_t60.96s_door.jpg new file mode 100644 index 0000000..719f0e3 Binary files /dev/null and b/demo_results/frames/frame_01461_t60.96s_door.jpg differ diff --git a/demo_results/frames/frame_01668_t69.60s_door.jpg b/demo_results/frames/frame_01668_t69.60s_door.jpg new file mode 100644 index 0000000..85c18bc Binary files /dev/null and b/demo_results/frames/frame_01668_t69.60s_door.jpg differ diff --git a/demo_results/frames/frame_02301_t96.00s_door.jpg b/demo_results/frames/frame_02301_t96.00s_door.jpg new file mode 100644 index 0000000..6475083 Binary files /dev/null and b/demo_results/frames/frame_02301_t96.00s_door.jpg differ diff --git a/demo_results/frames/frame_03072_t128.16s_explosion.jpg b/demo_results/frames/frame_03072_t128.16s_explosion.jpg new file mode 100644 index 0000000..2ed5f1b Binary files /dev/null and b/demo_results/frames/frame_03072_t128.16s_explosion.jpg differ diff --git a/demo_results/frames/frame_03383_t141.12s_music.jpg b/demo_results/frames/frame_03383_t141.12s_music.jpg new file mode 100644 index 0000000..30cdf6b Binary files /dev/null and b/demo_results/frames/frame_03383_t141.12s_music.jpg differ diff --git a/demo_results/frames/frame_03429_t143.04s_glass_break.jpg b/demo_results/frames/frame_03429_t143.04s_glass_break.jpg new file mode 100644 index 0000000..4d07f3b Binary files /dev/null and b/demo_results/frames/frame_03429_t143.04s_glass_break.jpg differ diff --git a/demo_results/frames/frame_03521_t146.88s_explosion.jpg b/demo_results/frames/frame_03521_t146.88s_explosion.jpg new file mode 100644 index 0000000..aa6ed14 Binary files /dev/null and b/demo_results/frames/frame_03521_t146.88s_explosion.jpg differ diff --git 
a/demo_results/frames/frame_03671_t153.12s_crash.jpg b/demo_results/frames/frame_03671_t153.12s_crash.jpg new file mode 100644 index 0000000..2330be2 Binary files /dev/null and b/demo_results/frames/frame_03671_t153.12s_crash.jpg differ diff --git a/demo_results/frames/frame_03705_t154.56s_rat_squeak.jpg b/demo_results/frames/frame_03705_t154.56s_rat_squeak.jpg new file mode 100644 index 0000000..4d5f00d Binary files /dev/null and b/demo_results/frames/frame_03705_t154.56s_rat_squeak.jpg differ diff --git a/demo_results/frames/frame_03786_t157.92s_glass_break.jpg b/demo_results/frames/frame_03786_t157.92s_glass_break.jpg new file mode 100644 index 0000000..0ef191a Binary files /dev/null and b/demo_results/frames/frame_03786_t157.92s_glass_break.jpg differ diff --git a/demo_results/frames/frame_03832_t159.84s_music.jpg b/demo_results/frames/frame_03832_t159.84s_music.jpg new file mode 100644 index 0000000..65226b8 Binary files /dev/null and b/demo_results/frames/frame_03832_t159.84s_music.jpg differ diff --git a/demo_results/frames/frame_03866_t161.28s_knock.jpg b/demo_results/frames/frame_03866_t161.28s_knock.jpg new file mode 100644 index 0000000..9f419b8 Binary files /dev/null and b/demo_results/frames/frame_03866_t161.28s_knock.jpg differ diff --git a/demo_results/frames/frame_03878_t161.76s_explosion.jpg b/demo_results/frames/frame_03878_t161.76s_explosion.jpg new file mode 100644 index 0000000..e2b425f Binary files /dev/null and b/demo_results/frames/frame_03878_t161.76s_explosion.jpg differ diff --git a/demo_results/frames/frame_03947_t164.64s_music.jpg b/demo_results/frames/frame_03947_t164.64s_music.jpg new file mode 100644 index 0000000..87beaf5 Binary files /dev/null and b/demo_results/frames/frame_03947_t164.64s_music.jpg differ diff --git a/demo_results/frames/frame_04074_t169.92s_crash.jpg b/demo_results/frames/frame_04074_t169.92s_crash.jpg new file mode 100644 index 0000000..a9b20ee Binary files /dev/null and 
b/demo_results/frames/frame_04074_t169.92s_crash.jpg differ diff --git a/demo_results/frames/frame_04085_t170.40s_explosion.jpg b/demo_results/frames/frame_04085_t170.40s_explosion.jpg new file mode 100644 index 0000000..44c45d5 Binary files /dev/null and b/demo_results/frames/frame_04085_t170.40s_explosion.jpg differ diff --git a/demo_results/frames/frame_04166_t173.76s_explosion.jpg b/demo_results/frames/frame_04166_t173.76s_explosion.jpg new file mode 100644 index 0000000..ca199c1 Binary files /dev/null and b/demo_results/frames/frame_04166_t173.76s_explosion.jpg differ diff --git a/demo_results/frames/frame_04177_t174.24s_crash.jpg b/demo_results/frames/frame_04177_t174.24s_crash.jpg new file mode 100644 index 0000000..075d468 Binary files /dev/null and b/demo_results/frames/frame_04177_t174.24s_crash.jpg differ diff --git a/demo_results/frames/frame_04396_t183.36s_explosion.jpg b/demo_results/frames/frame_04396_t183.36s_explosion.jpg new file mode 100644 index 0000000..ee7c3f0 Binary files /dev/null and b/demo_results/frames/frame_04396_t183.36s_explosion.jpg differ diff --git a/demo_results/frames/frame_04488_t187.20s_explosion.jpg b/demo_results/frames/frame_04488_t187.20s_explosion.jpg new file mode 100644 index 0000000..6c32cc9 Binary files /dev/null and b/demo_results/frames/frame_04488_t187.20s_explosion.jpg differ diff --git a/demo_results/frames/frame_04522_t188.64s_glass_break.jpg b/demo_results/frames/frame_04522_t188.64s_glass_break.jpg new file mode 100644 index 0000000..3f5e9b1 Binary files /dev/null and b/demo_results/frames/frame_04522_t188.64s_glass_break.jpg differ diff --git a/demo_results/frames/frame_04672_t194.88s_door.jpg b/demo_results/frames/frame_04672_t194.88s_door.jpg new file mode 100644 index 0000000..686f897 Binary files /dev/null and b/demo_results/frames/frame_04672_t194.88s_door.jpg differ diff --git a/demo_results/frames/frame_04718_t196.80s_music.jpg b/demo_results/frames/frame_04718_t196.80s_music.jpg new file mode 100644 
index 0000000..b3095c5 Binary files /dev/null and b/demo_results/frames/frame_04718_t196.80s_music.jpg differ diff --git a/demo_results/frames/frame_04729_t197.28s_glass_break.jpg b/demo_results/frames/frame_04729_t197.28s_glass_break.jpg new file mode 100644 index 0000000..086eac9 Binary files /dev/null and b/demo_results/frames/frame_04729_t197.28s_glass_break.jpg differ diff --git a/demo_results/frames/frame_04787_t199.68s_alarm.jpg b/demo_results/frames/frame_04787_t199.68s_alarm.jpg new file mode 100644 index 0000000..f54bdfe Binary files /dev/null and b/demo_results/frames/frame_04787_t199.68s_alarm.jpg differ diff --git a/demo_results/frames/frame_04799_t200.16s_doorbell.jpg b/demo_results/frames/frame_04799_t200.16s_doorbell.jpg new file mode 100644 index 0000000..b6480bd Binary files /dev/null and b/demo_results/frames/frame_04799_t200.16s_doorbell.jpg differ diff --git a/demo_results/frames/frame_04822_t201.12s_glass_break.jpg b/demo_results/frames/frame_04822_t201.12s_glass_break.jpg new file mode 100644 index 0000000..10be66b Binary files /dev/null and b/demo_results/frames/frame_04822_t201.12s_glass_break.jpg differ diff --git a/demo_results/frames/frame_04914_t204.96s_glass_break.jpg b/demo_results/frames/frame_04914_t204.96s_glass_break.jpg new file mode 100644 index 0000000..65329c0 Binary files /dev/null and b/demo_results/frames/frame_04914_t204.96s_glass_break.jpg differ diff --git a/demo_results/frames/frame_05040_t210.24s_explosion.jpg b/demo_results/frames/frame_05040_t210.24s_explosion.jpg new file mode 100644 index 0000000..7214754 Binary files /dev/null and b/demo_results/frames/frame_05040_t210.24s_explosion.jpg differ diff --git a/demo_results/frames/frame_05075_t211.68s_music.jpg b/demo_results/frames/frame_05075_t211.68s_music.jpg new file mode 100644 index 0000000..f8c21ae Binary files /dev/null and b/demo_results/frames/frame_05075_t211.68s_music.jpg differ diff --git a/demo_results/frames/frame_05167_t215.52s_music.jpg 
b/demo_results/frames/frame_05167_t215.52s_music.jpg new file mode 100644 index 0000000..c1e946c Binary files /dev/null and b/demo_results/frames/frame_05167_t215.52s_music.jpg differ diff --git a/demo_results/frames/frame_05581_t232.80s_glass_break.jpg b/demo_results/frames/frame_05581_t232.80s_glass_break.jpg new file mode 100644 index 0000000..2008f2b Binary files /dev/null and b/demo_results/frames/frame_05581_t232.80s_glass_break.jpg differ diff --git a/demo_results/frames/frame_05616_t234.24s_music.jpg b/demo_results/frames/frame_05616_t234.24s_music.jpg new file mode 100644 index 0000000..6873466 Binary files /dev/null and b/demo_results/frames/frame_05616_t234.24s_music.jpg differ diff --git a/demo_results/frames/frame_05765_t240.48s_ambient.jpg b/demo_results/frames/frame_05765_t240.48s_ambient.jpg new file mode 100644 index 0000000..df2708c Binary files /dev/null and b/demo_results/frames/frame_05765_t240.48s_ambient.jpg differ diff --git a/demo_results/output.srt b/demo_results/output.srt new file mode 100644 index 0000000..b0aa76b --- /dev/null +++ b/demo_results/output.srt @@ -0,0 +1,132 @@ +1 +00:00:00,000 --> 00:00:02,000 +[ Music ] + +2 +00:00:34,080 --> 00:00:35,220 +[ Squeak / Rodent ] + +3 +00:00:35,520 --> 00:00:38,400 +[ Door ] + +4 +00:01:00,000 --> 00:01:00,660 +[ Alarm / Siren ] + +5 +00:01:00,960 --> 00:01:02,960 +[ Door ] + +6 +00:01:09,600 --> 00:01:11,600 +[ Door ] + +7 +00:01:36,000 --> 00:01:38,000 +[ Door ] + +8 +00:02:08,160 --> 00:02:10,660 +[ Explosion ] + +9 +00:02:21,120 --> 00:02:22,740 +[ Music ] + +10 +00:02:23,040 --> 00:02:25,540 +[ Glass Breaking ] + +11 +00:02:26,880 --> 00:02:29,380 +[ Explosion ] + +12 +00:02:39,840 --> 00:02:41,460 +[ Music ] + +13 +00:02:41,760 --> 00:02:44,260 +[ Explosion ] + +14 +00:02:44,640 --> 00:02:46,640 +[ Music ] + +15 +00:02:49,920 --> 00:02:50,120 +[ Crash ] + +16 +00:02:50,400 --> 00:02:52,900 +[ Explosion ] + +17 +00:02:53,760 --> 00:02:53,960 +[ Explosion ] + +18 +00:02:54,240 --> 
00:02:56,740 +[ Crash ] + +19 +00:03:03,360 --> 00:03:05,860 +[ Explosion ] + +20 +00:03:07,200 --> 00:03:08,340 +[ Explosion ] + +21 +00:03:08,640 --> 00:03:11,140 +[ Glass Breaking ] + +22 +00:03:14,880 --> 00:03:16,500 +[ Door ] + +23 +00:03:16,800 --> 00:03:17,000 +[ Music ] + +24 +00:03:17,280 --> 00:03:19,380 +[ Glass Breaking ] + +25 +00:03:19,680 --> 00:03:19,880 +[ Alarm / Siren ] + +26 +00:03:20,160 --> 00:03:20,820 +[ Doorbell ] + +27 +00:03:21,120 --> 00:03:23,620 +[ Glass Breaking ] + +28 +00:03:24,960 --> 00:03:27,460 +[ Glass Breaking ] + +29 +00:03:30,240 --> 00:03:31,380 +[ Explosion ] + +30 +00:03:31,680 --> 00:03:33,680 +[ Music ] + +31 +00:03:35,520 --> 00:03:37,520 +[ Music ] + +32 +00:03:52,800 --> 00:03:53,940 +[ Glass Breaking ] + +33 +00:03:54,240 --> 00:03:56,240 +[ Music ] + diff --git a/demo_results/pipeline.log b/demo_results/pipeline.log new file mode 100644 index 0000000..7d9d0e0 --- /dev/null +++ b/demo_results/pipeline.log @@ -0,0 +1,45 @@ +2026-05-09 07:56:55,301 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead. + +2026-05-09 07:57:03,078 INFO absl Using C:\Users\ADITIP~1\AppData\Local\Temp\tfhub_modules to cache modules. +2026-05-09 07:57:03,280 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\resolver.py:120: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead. + +2026-05-09 07:57:03,288 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\module_v2.py:126: The name tf.saved_model.load_v2 is deprecated. Please use tf.compat.v2.saved_model.load instead. + +2026-05-09 07:57:06,657 INFO absl Fingerprint not found. Saved model loading will continue. 
+2026-05-09 07:57:06,658 INFO absl path_and_singleprint metric could not be logged. Saved model loading will continue. +2026-05-09 07:57:49,650 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead. + +2026-05-09 07:57:50,718 INFO absl Using C:\Users\ADITIP~1\AppData\Local\Temp\tfhub_modules to cache modules. +2026-05-09 07:57:50,912 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\resolver.py:120: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead. + +2026-05-09 07:57:50,917 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\module_v2.py:126: The name tf.saved_model.load_v2 is deprecated. Please use tf.compat.v2.saved_model.load instead. + +2026-05-09 07:57:54,838 INFO absl Fingerprint not found. Saved model loading will continue. +2026-05-09 07:57:54,838 INFO absl path_and_singleprint metric could not be logged. Saved model loading will continue. +2026-05-09 07:58:19,248 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead. + +2026-05-09 07:58:20,192 INFO absl Using C:\Users\ADITIP~1\AppData\Local\Temp\tfhub_modules to cache modules. +2026-05-09 07:58:20,362 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\resolver.py:120: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead. 
+ +2026-05-09 07:58:20,368 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\module_v2.py:126: The name tf.saved_model.load_v2 is deprecated. Please use tf.compat.v2.saved_model.load instead. + +2026-05-09 07:58:23,385 INFO absl Fingerprint not found. Saved model loading will continue. +2026-05-09 07:58:23,386 INFO absl path_and_singleprint metric could not be logged. Saved model loading will continue. +2026-05-09 08:02:00,787 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead. + +2026-05-09 08:02:03,986 INFO absl Using C:\Users\ADITIP~1\AppData\Local\Temp\tfhub_modules to cache modules. +2026-05-09 08:02:04,559 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\resolver.py:120: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead. + +2026-05-09 08:02:04,574 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\module_v2.py:126: The name tf.saved_model.load_v2 is deprecated. Please use tf.compat.v2.saved_model.load instead. + +2026-05-09 08:02:18,619 INFO absl Fingerprint not found. Saved model loading will continue. +2026-05-09 08:02:18,620 INFO absl path_and_singleprint metric could not be logged. Saved model loading will continue. +2026-05-09 08:05:16,302 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead. + +2026-05-09 08:05:23,135 INFO absl Using C:\Users\ADITIP~1\AppData\Local\Temp\tfhub_modules to cache modules. 
+2026-05-09 08:05:24,375 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\resolver.py:120: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead. + +2026-05-09 08:05:24,399 WARNING tensorflow From C:\Users\Aditi P\Desktop\Intelligent-cc-generation\.venv\Lib\site-packages\tensorflow_hub\module_v2.py:126: The name tf.saved_model.load_v2 is deprecated. Please use tf.compat.v2.saved_model.load instead. + +2026-05-09 08:05:49,941 INFO absl Fingerprint not found. Saved model loading will continue. +2026-05-09 08:05:49,942 INFO absl path_and_singleprint metric could not be logged. Saved model loading will continue. diff --git a/demo_results/report.json b/demo_results/report.json new file mode 100644 index 0000000..ff4f4b1 --- /dev/null +++ b/demo_results/report.json @@ -0,0 +1,4682 @@ +{ + "meta": { + "tool": "Intelligent CC Suggestion Tool", + "version": "2.0.0", + "created_at": "2026-05-09T02:52:49.453153Z", + "video_path": "C:\\Users\\Aditi P\\Desktop\\Intelligent-cc-generation\\fight.mp4" + }, + "summary": { + "total_audio_events": 43, + "total_visual_windows": 43, + "total_captions": 33 + }, + "audio_events": [ + { + "timestamp_sec": 0.0, + "end_sec": 22.56, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.9986, + "raw_class": "Music", + "raw_score": 0.9986 + }, + { + "timestamp_sec": 34.08, + "end_sec": 35.04, + "category": "RAT_SQUEAK", + "display_label": "[ Squeak / Rodent ]", + "priority": "MEDIUM", + "confidence": 0.6757, + "raw_class": "Rodents, rats, mice", + "raw_score": 0.4505 + }, + { + "timestamp_sec": 35.52, + "end_sec": 38.4, + "category": "DOOR", + "display_label": "[ Door ]", + "priority": "MEDIUM", + "confidence": 0.6978, + "raw_class": "Sliding door", + "raw_score": 0.5368 + }, + { + "timestamp_sec": 60.0, + "end_sec": 60.96, + "category": "ALARM", + "display_label": "[ Alarm / Siren ]", + "priority": "HIGH", + 
"confidence": 0.5991, + "raw_class": "Alarm", + "raw_score": 0.3524 + }, + { + "timestamp_sec": 60.96, + "end_sec": 61.92, + "category": "DOOR", + "display_label": "[ Door ]", + "priority": "MEDIUM", + "confidence": 0.8742, + "raw_class": "Slam", + "raw_score": 0.6725 + }, + { + "timestamp_sec": 69.6, + "end_sec": 70.56, + "category": "DOOR", + "display_label": "[ Door ]", + "priority": "MEDIUM", + "confidence": 0.6825, + "raw_class": "Sliding door", + "raw_score": 0.525 + }, + { + "timestamp_sec": 72.96, + "end_sec": 73.92, + "category": "DOG", + "display_label": "[ Dog Barking ]", + "priority": "MEDIUM", + "confidence": 0.2935, + "raw_class": "Dog", + "raw_score": 0.2257 + }, + { + "timestamp_sec": 96.0, + "end_sec": 96.96, + "category": "DOOR", + "display_label": "[ Door ]", + "priority": "MEDIUM", + "confidence": 0.8374, + "raw_class": "Sliding door", + "raw_score": 0.6442 + }, + { + "timestamp_sec": 128.16, + "end_sec": 129.6, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 0.7665, + "raw_class": "Boom", + "raw_score": 0.4034 + }, + { + "timestamp_sec": 141.12, + "end_sec": 153.12, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.94, + "raw_class": "Music", + "raw_score": 0.94 + }, + { + "timestamp_sec": 143.04, + "end_sec": 144.48, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Glass", + "raw_score": 0.7823 + }, + { + "timestamp_sec": 146.88, + "end_sec": 148.32, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 0.8671, + "raw_class": "Boom", + "raw_score": 0.4564 + }, + { + "timestamp_sec": 153.12, + "end_sec": 154.08, + "category": "CRASH", + "display_label": "[ Crash ]", + "priority": "HIGH", + "confidence": 0.3395, + "raw_class": "Smash, crash", + "raw_score": 0.2122 + }, + { + "timestamp_sec": 154.56, + "end_sec": 155.52, + 
"category": "RAT_SQUEAK", + "display_label": "[ Squeak / Rodent ]", + "priority": "MEDIUM", + "confidence": 0.4128, + "raw_class": "Ratchet, pawl", + "raw_score": 0.2752 + }, + { + "timestamp_sec": 157.92, + "end_sec": 158.88, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 0.424, + "raw_class": "Glass", + "raw_score": 0.2494 + }, + { + "timestamp_sec": 159.84, + "end_sec": 161.28, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.8905, + "raw_class": "Music", + "raw_score": 0.8905 + }, + { + "timestamp_sec": 161.28, + "end_sec": 162.24, + "category": "KNOCK", + "display_label": "[ Knocking ]", + "priority": "MEDIUM", + "confidence": 0.3843, + "raw_class": "Scrape", + "raw_score": 0.2956 + }, + { + "timestamp_sec": 161.76, + "end_sec": 168.0, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Boom", + "raw_score": 0.5893 + }, + { + "timestamp_sec": 164.64, + "end_sec": 174.24, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.7932, + "raw_class": "Music", + "raw_score": 0.7932 + }, + { + "timestamp_sec": 169.92, + "end_sec": 170.88, + "category": "CRASH", + "display_label": "[ Crash ]", + "priority": "HIGH", + "confidence": 0.8198, + "raw_class": "Smash, crash", + "raw_score": 0.5124 + }, + { + "timestamp_sec": 170.4, + "end_sec": 171.36, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Boom", + "raw_score": 0.8836 + }, + { + "timestamp_sec": 173.76, + "end_sec": 174.72, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Boom", + "raw_score": 0.923 + }, + { + "timestamp_sec": 174.24, + "end_sec": 175.2, + "category": "CRASH", + "display_label": "[ Crash ]", + "priority": "HIGH", + "confidence": 0.6233, 
+ "raw_class": "Smash, crash", + "raw_score": 0.3896 + }, + { + "timestamp_sec": 183.36, + "end_sec": 184.8, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 0.8607, + "raw_class": "Boom", + "raw_score": 0.453 + }, + { + "timestamp_sec": 187.2, + "end_sec": 190.56, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Boom", + "raw_score": 0.8139 + }, + { + "timestamp_sec": 188.64, + "end_sec": 189.6, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 0.9322, + "raw_class": "Shatter", + "raw_score": 0.5484 + }, + { + "timestamp_sec": 194.88, + "end_sec": 196.8, + "category": "DOOR", + "display_label": "[ Door ]", + "priority": "MEDIUM", + "confidence": 0.6888, + "raw_class": "Sliding door", + "raw_score": 0.5298 + }, + { + "timestamp_sec": 196.8, + "end_sec": 197.76, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.6886, + "raw_class": "Music", + "raw_score": 0.6886 + }, + { + "timestamp_sec": 197.28, + "end_sec": 198.24, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 0.6661, + "raw_class": "Glass", + "raw_score": 0.3918 + }, + { + "timestamp_sec": 199.68, + "end_sec": 200.64, + "category": "ALARM", + "display_label": "[ Alarm / Siren ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Beep, bleep", + "raw_score": 0.6253 + }, + { + "timestamp_sec": 200.16, + "end_sec": 201.12, + "category": "DOORBELL", + "display_label": "[ Doorbell ]", + "priority": "MEDIUM", + "confidence": 0.6458, + "raw_class": "Doorbell", + "raw_score": 0.4305 + }, + { + "timestamp_sec": 201.12, + "end_sec": 202.08, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 0.577, + "raw_class": "Breaking", + "raw_score": 0.3394 + }, + { + 
"timestamp_sec": 204.96, + "end_sec": 205.92, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Shatter", + "raw_score": 0.8445 + }, + { + "timestamp_sec": 205.92, + "end_sec": 206.88, + "category": "RAT_SQUEAK", + "display_label": "[ Squeak / Rodent ]", + "priority": "MEDIUM", + "confidence": 0.2612, + "raw_class": "Accelerating, revving, vroom", + "raw_score": 0.1742 + }, + { + "timestamp_sec": 206.88, + "end_sec": 207.84, + "category": "DOOR", + "display_label": "[ Door ]", + "priority": "MEDIUM", + "confidence": 0.2949, + "raw_class": "Sliding door", + "raw_score": 0.2269 + }, + { + "timestamp_sec": 210.24, + "end_sec": 212.16, + "category": "EXPLOSION", + "display_label": "[ Explosion ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Boom", + "raw_score": 0.7201 + }, + { + "timestamp_sec": 211.68, + "end_sec": 213.6, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.9598, + "raw_class": "Music", + "raw_score": 0.9598 + }, + { + "timestamp_sec": 215.52, + "end_sec": 231.36, + "category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.9978, + "raw_class": "Music", + "raw_score": 0.9978 + }, + { + "timestamp_sec": 217.44, + "end_sec": 218.4, + "category": "AMBIENT", + "display_label": "[ Background Noise ]", + "priority": "LOW", + "confidence": 0.3082, + "raw_class": "Thump, thud", + "raw_score": 0.6163 + }, + { + "timestamp_sec": 229.92, + "end_sec": 230.88, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 0.7312, + "raw_class": "Glass", + "raw_score": 0.4301 + }, + { + "timestamp_sec": 232.8, + "end_sec": 234.24, + "category": "GLASS_BREAK", + "display_label": "[ Glass Breaking ]", + "priority": "HIGH", + "confidence": 1.0, + "raw_class": "Glass", + "raw_score": 0.6721 + }, + { + "timestamp_sec": 234.24, + "end_sec": 240.96, + 
"category": "MUSIC", + "display_label": "[ Music ]", + "priority": "MEDIUM", + "confidence": 0.9996, + "raw_class": "Music", + "raw_score": 0.9996 + }, + { + "timestamp_sec": 240.48, + "end_sec": 241.30175, + "category": "AMBIENT", + "display_label": "[ Background Noise ]", + "priority": "LOW", + "confidence": 0.5, + "raw_class": "Silence", + "raw_score": 1.0 + } + ], + "visual_scores": { + "0.0": { + "query_time_sec": 0.0, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 0, + "time_sec": 0.0, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4, + "time_sec": 0.17142857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 8, + "time_sec": 0.34285714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 12, + "time_sec": 0.5142857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 16, + "time_sec": 0.6857142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 20, + "time_sec": 0.8571428571428572, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 24, + "time_sec": 1.0285714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 28, + "time_sec": 1.2, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": 
"low", + "note": "no_face_detected" + }, + "34.08": { + "query_time_sec": 34.08, + "reaction_score": 0.2321, + "num_valid_frames": 3, + "peak_frame_time": 34.308571428571426, + "frame_scores": [ + { + "frame_no": 805, + "time_sec": 33.58, + "ear_score": 0.1712, + "mar_score": 0.041, + "brow_score": 0.3649, + "composite": 0.1675, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 810, + "time_sec": 33.82285714285714, + "ear_score": 0.0288, + "mar_score": 0.0724, + "brow_score": 0.68, + "composite": 0.209, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 816, + "time_sec": 34.065714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 822, + "time_sec": 34.308571428571426, + "ear_score": 0.3446, + "mar_score": 0.1212, + "brow_score": 0.4896, + "composite": 0.2915, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 828, + "time_sec": 34.55142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 834, + "time_sec": 34.794285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 840, + "time_sec": 35.03714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 845, + "time_sec": 35.28, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "medium", + "note": "ok" + }, + "35.52": { + "query_time_sec": 35.52, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 839, + "time_sec": 35.02, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + 
"face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 845, + "time_sec": 35.26285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 851, + "time_sec": 35.50571428571429, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 857, + "time_sec": 35.74857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 862, + "time_sec": 35.99142857142858, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 868, + "time_sec": 36.23428571428572, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 874, + "time_sec": 36.477142857142866, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 880, + "time_sec": 36.720000000000006, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "60.0": { + "query_time_sec": 60.0, + "reaction_score": 0.1164, + "num_valid_frames": 6, + "peak_frame_time": 59.74285714285714, + "frame_scores": [ + { + "frame_no": 1426, + "time_sec": 59.5, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1432, + "time_sec": 59.74285714285714, + "ear_score": 0.1735, + "mar_score": 0.045, + "brow_score": 0.2148, + "composite": 0.1324, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1438, + "time_sec": 59.98571428571429, + "ear_score": 
0.1677, + "mar_score": 0.0199, + "brow_score": 0.22, + "composite": 0.1216, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1444, + "time_sec": 60.22857142857143, + "ear_score": 0.1049, + "mar_score": 0.0191, + "brow_score": 0.2416, + "composite": 0.1048, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1449, + "time_sec": 60.471428571428575, + "ear_score": 0.1215, + "mar_score": 0.0207, + "brow_score": 0.2304, + "composite": 0.1084, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1455, + "time_sec": 60.714285714285715, + "ear_score": 0.1169, + "mar_score": 0.0163, + "brow_score": 0.2382, + "composite": 0.107, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1461, + "time_sec": 60.95714285714286, + "ear_score": 0.1026, + "mar_score": 0.0182, + "brow_score": 0.2395, + "composite": 0.103, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1467, + "time_sec": 61.2, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "high", + "note": "ok" + }, + "60.96": { + "query_time_sec": 60.96, + "reaction_score": 0.1082, + "num_valid_frames": 3, + "peak_frame_time": 60.46, + "frame_scores": [ + { + "frame_no": 1449, + "time_sec": 60.46, + "ear_score": 0.1367, + "mar_score": 0.0363, + "brow_score": 0.2107, + "composite": 0.115, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1455, + "time_sec": 60.70285714285714, + "ear_score": 0.1253, + "mar_score": 0.0237, + "brow_score": 0.23, + "composite": 0.1109, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1461, + "time_sec": 60.94571428571429, + "ear_score": 0.1032, + "mar_score": 0.019, + "brow_score": 0.2406, + "composite": 0.1039, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1467, + "time_sec": 61.18857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + 
"face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1472, + "time_sec": 61.431428571428576, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1478, + "time_sec": 61.674285714285716, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1484, + "time_sec": 61.91714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1490, + "time_sec": 62.160000000000004, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "medium", + "note": "ok" + }, + "69.6": { + "query_time_sec": 69.6, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 1656, + "time_sec": 69.1, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1662, + "time_sec": 69.34285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1668, + "time_sec": 69.58571428571427, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1674, + "time_sec": 69.82857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1680, + "time_sec": 70.07142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1685, + "time_sec": 70.31428571428572, + "ear_score": 0.0, + "mar_score": 0.0, + 
"brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1691, + "time_sec": 70.55714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1697, + "time_sec": 70.8, + "ear_score": 0.0, + "mar_score": 0.1457, + "brow_score": 0.5558, + "composite": 0.1972, + "face_detected": true, + "num_faces": 1 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "72.96": { + "query_time_sec": 72.96, + "reaction_score": 0.2334, + "num_valid_frames": 7, + "peak_frame_time": 73.91714285714285, + "frame_scores": [ + { + "frame_no": 1737, + "time_sec": 72.46, + "ear_score": 0.2015, + "mar_score": 0.1597, + "brow_score": 0.4664, + "composite": 0.251, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1743, + "time_sec": 72.70285714285714, + "ear_score": 0.1761, + "mar_score": 0.1394, + "brow_score": 0.4792, + "composite": 0.2372, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1748, + "time_sec": 72.94571428571427, + "ear_score": 0.1834, + "mar_score": 0.1066, + "brow_score": 0.4952, + "composite": 0.2306, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 1754, + "time_sec": 73.18857142857142, + "ear_score": 0.2186, + "mar_score": 0.0888, + "brow_score": 0.399, + "composite": 0.2118, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 1760, + "time_sec": 73.43142857142857, + "ear_score": 0.1331, + "mar_score": 0.1057, + "brow_score": 0.4856, + "composite": 0.2103, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 1766, + "time_sec": 73.67428571428572, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 1772, + "time_sec": 73.91714285714285, + "ear_score": 0.2483, + "mar_score": 0.2244, + "brow_score": 0.5864, + "composite": 0.3233, + "face_detected": 
true, + "num_faces": 1 + }, + { + "frame_no": 1778, + "time_sec": 74.16, + "ear_score": 0.2529, + "mar_score": 0.0842, + "brow_score": 0.5925, + "composite": 0.2703, + "face_detected": true, + "num_faces": 1 + } + ], + "confidence": "high", + "note": "ok" + }, + "96.0": { + "query_time_sec": 96.0, + "reaction_score": 0.1957, + "num_valid_frames": 5, + "peak_frame_time": 96.95714285714286, + "frame_scores": [ + { + "frame_no": 2289, + "time_sec": 95.5, + "ear_score": 0.0687, + "mar_score": 0.0194, + "brow_score": 0.4988, + "composite": 0.1565, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 2295, + "time_sec": 95.74285714285715, + "ear_score": 0.2328, + "mar_score": 0.0171, + "brow_score": 0.3664, + "composite": 0.18, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 2301, + "time_sec": 95.98571428571428, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 2307, + "time_sec": 96.22857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 2313, + "time_sec": 96.47142857142858, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 2318, + "time_sec": 96.71428571428572, + "ear_score": 0.2237, + "mar_score": 0.0664, + "brow_score": 0.2213, + "composite": 0.1602, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 2324, + "time_sec": 96.95714285714286, + "ear_score": 0.9336, + "mar_score": 0.0879, + "brow_score": 0.2292, + "composite": 0.4192, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 2330, + "time_sec": 97.2, + "ear_score": 0.1662, + "mar_score": 0.2008, + "brow_score": 0.2019, + "composite": 0.189, + "face_detected": true, + "num_faces": 2 + } + ], + "confidence": "high", + "note": "ok" + }, + "128.16": { + 
"query_time_sec": 128.16, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 3060, + "time_sec": 127.66, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3066, + "time_sec": 127.90285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3072, + "time_sec": 128.1457142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3078, + "time_sec": 128.38857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3084, + "time_sec": 128.63142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3089, + "time_sec": 128.8742857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3095, + "time_sec": 129.11714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3101, + "time_sec": 129.35999999999999, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "141.12": { + "query_time_sec": 141.12, + "reaction_score": 0.3057, + "num_valid_frames": 2, + "peak_frame_time": 142.07714285714286, + "frame_scores": [ + { + "frame_no": 3371, + "time_sec": 140.62, + "ear_score": 0.2287, + "mar_score": 0.2208, + "brow_score": 0.1308, + "composite": 0.2011, + "face_detected": true, + 
"num_faces": 2 + }, + { + "frame_no": 3377, + "time_sec": 140.86285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3383, + "time_sec": 141.1057142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3388, + "time_sec": 141.34857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3394, + "time_sec": 141.59142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3400, + "time_sec": 141.8342857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3406, + "time_sec": 142.07714285714286, + "ear_score": 0.2742, + "mar_score": 0.956, + "brow_score": 0.3527, + "composite": 0.5666, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3412, + "time_sec": 142.32, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "ok" + }, + "143.04": { + "query_time_sec": 143.04, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 3417, + "time_sec": 142.54, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3423, + "time_sec": 142.78285714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.4211, + "composite": 0.1053, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3429, + "time_sec": 143.0257142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + 
"composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3435, + "time_sec": 143.26857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3440, + "time_sec": 143.51142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3446, + "time_sec": 143.7542857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3452, + "time_sec": 143.99714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3458, + "time_sec": 144.23999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "146.88": { + "query_time_sec": 146.88, + "reaction_score": 0.1395, + "num_valid_frames": 3, + "peak_frame_time": 146.8657142857143, + "frame_scores": [ + { + "frame_no": 3509, + "time_sec": 146.38, + "ear_score": 0.0178, + "mar_score": 0.0, + "brow_score": 0.2992, + "composite": 0.081, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3515, + "time_sec": 146.62285714285713, + "ear_score": 0.089, + "mar_score": 0.0131, + "brow_score": 0.316, + "composite": 0.1154, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3521, + "time_sec": 146.8657142857143, + "ear_score": 0.0, + "mar_score": 0.2268, + "brow_score": 0.3428, + "composite": 0.1764, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3527, + "time_sec": 147.10857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 
3532, + "time_sec": 147.35142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3538, + "time_sec": 147.5942857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3544, + "time_sec": 147.83714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3550, + "time_sec": 148.07999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "medium", + "note": "ok" + }, + "153.12": { + "query_time_sec": 153.12, + "reaction_score": 0.1448, + "num_valid_frames": 6, + "peak_frame_time": 153.8342857142857, + "frame_scores": [ + { + "frame_no": 3659, + "time_sec": 152.62, + "ear_score": 0.121, + "mar_score": 0.0663, + "brow_score": 0.3267, + "composite": 0.1506, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3665, + "time_sec": 152.86285714285714, + "ear_score": 0.0345, + "mar_score": 0.0197, + "brow_score": 0.4639, + "composite": 0.1359, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3670, + "time_sec": 153.1057142857143, + "ear_score": 0.0, + "mar_score": 0.0356, + "brow_score": 0.3151, + "composite": 0.093, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3676, + "time_sec": 153.34857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3682, + "time_sec": 153.59142857142857, + "ear_score": 0.0829, + "mar_score": 0.0596, + "brow_score": 0.3437, + "composite": 0.1388, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3688, + "time_sec": 153.8342857142857, + "ear_score": 0.2527, + "mar_score": 0.639, + 
"brow_score": 0.211, + "composite": 0.3968, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3694, + "time_sec": 154.07714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3699, + "time_sec": 154.32, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.374, + "composite": 0.0935, + "face_detected": true, + "num_faces": 2 + } + ], + "confidence": "high", + "note": "ok" + }, + "154.56": { + "query_time_sec": 154.56, + "reaction_score": 0.1974, + "num_valid_frames": 6, + "peak_frame_time": 154.5457142857143, + "frame_scores": [ + { + "frame_no": 3693, + "time_sec": 154.06, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3699, + "time_sec": 154.30285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.374, + "composite": 0.0935, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 3705, + "time_sec": 154.5457142857143, + "ear_score": 0.343, + "mar_score": 0.259, + "brow_score": 0.274, + "composite": 0.2922, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3711, + "time_sec": 154.78857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3717, + "time_sec": 155.03142857142856, + "ear_score": 0.2547, + "mar_score": 0.0952, + "brow_score": 0.2822, + "composite": 0.1978, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3722, + "time_sec": 155.2742857142857, + "ear_score": 0.0984, + "mar_score": 0.0401, + "brow_score": 0.397, + "composite": 0.1498, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3728, + "time_sec": 155.51714285714286, + "ear_score": 0.0, + "mar_score": 0.0292, + "brow_score": 0.3642, + "composite": 0.1027, + "face_detected": true, + "num_faces": 1 + }, + { 
+ "frame_no": 3734, + "time_sec": 155.76, + "ear_score": 0.0, + "mar_score": 0.1204, + "brow_score": 0.3821, + "composite": 0.1437, + "face_detected": true, + "num_faces": 1 + } + ], + "confidence": "high", + "note": "ok" + }, + "157.92": { + "query_time_sec": 157.92, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 3774, + "time_sec": 157.42, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3780, + "time_sec": 157.66285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3785, + "time_sec": 157.90571428571428, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3791, + "time_sec": 158.14857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3797, + "time_sec": 158.39142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3803, + "time_sec": 158.63428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3809, + "time_sec": 158.87714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3815, + "time_sec": 159.11999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "159.84": { + "query_time_sec": 159.84, + "reaction_score": 0.1703, + 
"num_valid_frames": 7, + "peak_frame_time": 160.5542857142857, + "frame_scores": [ + { + "frame_no": 3820, + "time_sec": 159.34, + "ear_score": 0.0434, + "mar_score": 0.0203, + "brow_score": 0.4781, + "composite": 0.1428, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3826, + "time_sec": 159.58285714285714, + "ear_score": 0.0547, + "mar_score": 0.0123, + "brow_score": 0.478, + "composite": 0.1436, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3831, + "time_sec": 159.8257142857143, + "ear_score": 0.0203, + "mar_score": 0.0, + "brow_score": 0.4926, + "composite": 0.1302, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3837, + "time_sec": 160.06857142857143, + "ear_score": 0.0605, + "mar_score": 0.0, + "brow_score": 0.4523, + "composite": 0.1342, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3843, + "time_sec": 160.31142857142856, + "ear_score": 0.0821, + "mar_score": 0.0, + "brow_score": 0.4582, + "composite": 0.1433, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3849, + "time_sec": 160.5542857142857, + "ear_score": 0.5758, + "mar_score": 0.4383, + "brow_score": 0.4147, + "composite": 0.4805, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 3855, + "time_sec": 160.79714285714286, + "ear_score": 0.1453, + "mar_score": 0.39, + "brow_score": 0.5114, + "composite": 0.3347, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3861, + "time_sec": 161.04, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "high", + "note": "ok" + }, + "161.28": { + "query_time_sec": 161.28, + "reaction_score": 0.1947, + "num_valid_frames": 3, + "peak_frame_time": 160.78, + "frame_scores": [ + { + "frame_no": 3854, + "time_sec": 160.78, + "ear_score": 0.1571, + "mar_score": 0.3727, + "brow_score": 0.5103, + "composite": 0.3316, + "face_detected": true, + "num_faces": 1 + }, + { 
+ "frame_no": 3860, + "time_sec": 161.02285714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3866, + "time_sec": 161.2657142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3872, + "time_sec": 161.50857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3878, + "time_sec": 161.75142857142856, + "ear_score": 0.0, + "mar_score": 0.1089, + "brow_score": 0.3972, + "composite": 0.1428, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3883, + "time_sec": 161.9942857142857, + "ear_score": 0.0, + "mar_score": 0.0441, + "brow_score": 0.2058, + "composite": 0.0691, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3889, + "time_sec": 162.23714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3895, + "time_sec": 162.48, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "medium", + "note": "ok" + }, + "161.76": { + "query_time_sec": 161.76, + "reaction_score": 0.1137, + "num_valid_frames": 2, + "peak_frame_time": 161.74571428571429, + "frame_scores": [ + { + "frame_no": 3866, + "time_sec": 161.26, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3872, + "time_sec": 161.50285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3878, + "time_sec": 161.74571428571429, + "ear_score": 0.0, + "mar_score": 0.1089, + "brow_score": 0.3972, 
+ "composite": 0.1428, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3883, + "time_sec": 161.98857142857142, + "ear_score": 0.0, + "mar_score": 0.0441, + "brow_score": 0.2058, + "composite": 0.0691, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 3889, + "time_sec": 162.23142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3895, + "time_sec": 162.47428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3901, + "time_sec": 162.71714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3907, + "time_sec": 162.95999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "ok" + }, + "164.64": { + "query_time_sec": 164.64, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 3935, + "time_sec": 164.14, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3941, + "time_sec": 164.38285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3947, + "time_sec": 164.62571428571428, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3952, + "time_sec": 164.8685714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3958, + "time_sec": 165.11142857142855, + 
"ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3964, + "time_sec": 165.35428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3970, + "time_sec": 165.59714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 3976, + "time_sec": 165.83999999999997, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "169.92": { + "query_time_sec": 169.92, + "reaction_score": 0.1633, + "num_valid_frames": 2, + "peak_frame_time": 169.90571428571428, + "frame_scores": [ + { + "frame_no": 4062, + "time_sec": 169.42, + "ear_score": 0.0, + "mar_score": 0.1067, + "brow_score": 0.3965, + "composite": 0.1418, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4067, + "time_sec": 169.66285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4073, + "time_sec": 169.90571428571428, + "ear_score": 0.2091, + "mar_score": 0.0561, + "brow_score": 0.3038, + "composite": 0.1715, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4079, + "time_sec": 170.14857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4085, + "time_sec": 170.39142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4091, + "time_sec": 170.63428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + 
"face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4096, + "time_sec": 170.87714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4102, + "time_sec": 171.11999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "ok" + }, + "170.4": { + "query_time_sec": 170.4, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4073, + "time_sec": 169.9, + "ear_score": 0.2091, + "mar_score": 0.0561, + "brow_score": 0.3038, + "composite": 0.1715, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4079, + "time_sec": 170.14285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4085, + "time_sec": 170.3857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4090, + "time_sec": 170.62857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4096, + "time_sec": 170.87142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4102, + "time_sec": 171.1142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4108, + "time_sec": 171.35714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4114, + "time_sec": 171.6, + "ear_score": 0.0, + "mar_score": 0.0, + 
"brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "173.76": { + "query_time_sec": 173.76, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4154, + "time_sec": 173.26, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4159, + "time_sec": 173.50285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4165, + "time_sec": 173.74571428571429, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4171, + "time_sec": 173.98857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4177, + "time_sec": 174.23142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4183, + "time_sec": 174.47428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4189, + "time_sec": 174.71714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4194, + "time_sec": 174.95999999999998, + "ear_score": 0.4157, + "mar_score": 0.6284, + "brow_score": 0.2143, + "composite": 0.4504, + "face_detected": true, + "num_faces": 1 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "174.24": { + "query_time_sec": 174.24, + "reaction_score": 0.3627, + "num_valid_frames": 2, + "peak_frame_time": 174.9542857142857, + 
"frame_scores": [ + { + "frame_no": 4165, + "time_sec": 173.74, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4171, + "time_sec": 173.98285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4177, + "time_sec": 174.2257142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4183, + "time_sec": 174.46857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4188, + "time_sec": 174.71142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4194, + "time_sec": 174.9542857142857, + "ear_score": 0.4157, + "mar_score": 0.6284, + "brow_score": 0.2143, + "composite": 0.4504, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4200, + "time_sec": 175.19714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4206, + "time_sec": 175.44, + "ear_score": 0.197, + "mar_score": 0.0121, + "brow_score": 0.2284, + "composite": 0.1309, + "face_detected": true, + "num_faces": 1 + } + ], + "confidence": "low", + "note": "ok" + }, + "183.36": { + "query_time_sec": 183.36, + "reaction_score": 0.1072, + "num_valid_frames": 3, + "peak_frame_time": 184.0742857142857, + "frame_scores": [ + { + "frame_no": 4384, + "time_sec": 182.86, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4390, + "time_sec": 183.10285714285715, + "ear_score": 0.017, + "mar_score": 0.0496, + 
"brow_score": 0.262, + "composite": 0.0913, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4395, + "time_sec": 183.3457142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4401, + "time_sec": 183.58857142857144, + "ear_score": 0.0957, + "mar_score": 0.0241, + "brow_score": 0.2967, + "composite": 0.1173, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 4407, + "time_sec": 183.83142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4413, + "time_sec": 184.0742857142857, + "ear_score": 0.0211, + "mar_score": 0.0803, + "brow_score": 0.3224, + "composite": 0.1201, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4419, + "time_sec": 184.31714285714287, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4425, + "time_sec": 184.56, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "medium", + "note": "ok" + }, + "187.2": { + "query_time_sec": 187.2, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4476, + "time_sec": 186.7, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4482, + "time_sec": 186.94285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4487, + "time_sec": 187.18571428571428, + "ear_score": 0.3856, + "mar_score": 0.4446, + "brow_score": 0.3134, + "composite": 0.3911, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4493, + 
"time_sec": 187.42857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4499, + "time_sec": 187.67142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4505, + "time_sec": 187.91428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4511, + "time_sec": 188.15714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4517, + "time_sec": 188.39999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "188.64": { + "query_time_sec": 188.64, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4510, + "time_sec": 188.14, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4516, + "time_sec": 188.38285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4522, + "time_sec": 188.62571428571428, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4528, + "time_sec": 188.8685714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4534, + "time_sec": 189.11142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + 
"face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4539, + "time_sec": 189.35428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4545, + "time_sec": 189.59714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4551, + "time_sec": 189.83999999999997, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "194.88": { + "query_time_sec": 194.88, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4660, + "time_sec": 194.38, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4666, + "time_sec": 194.62285714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4672, + "time_sec": 194.8657142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4677, + "time_sec": 195.10857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4683, + "time_sec": 195.35142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4689, + "time_sec": 195.5942857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4695, + "time_sec": 195.83714285714285, + "ear_score": 0.0, + 
"mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4701, + "time_sec": 196.07999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "196.8": { + "query_time_sec": 196.8, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4706, + "time_sec": 196.3, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4712, + "time_sec": 196.54285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4718, + "time_sec": 196.7857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4723, + "time_sec": 197.02857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4729, + "time_sec": 197.27142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4735, + "time_sec": 197.5142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4741, + "time_sec": 197.75714285714287, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4747, + "time_sec": 198.0, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": 
"no_face_detected" + }, + "197.28": { + "query_time_sec": 197.28, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4718, + "time_sec": 196.78, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4723, + "time_sec": 197.02285714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4729, + "time_sec": 197.2657142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4735, + "time_sec": 197.50857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4741, + "time_sec": 197.75142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4747, + "time_sec": 197.9942857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4752, + "time_sec": 198.23714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4758, + "time_sec": 198.48, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "199.68": { + "query_time_sec": 199.68, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4775, + "time_sec": 199.18, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + 
"num_faces": 0 + }, + { + "frame_no": 4781, + "time_sec": 199.42285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4787, + "time_sec": 199.6657142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4793, + "time_sec": 199.90857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4798, + "time_sec": 200.15142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4804, + "time_sec": 200.3942857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4810, + "time_sec": 200.63714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4816, + "time_sec": 200.88, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "200.16": { + "query_time_sec": 200.16, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4787, + "time_sec": 199.66, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4792, + "time_sec": 199.90285714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4798, + "time_sec": 200.1457142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + 
"composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4804, + "time_sec": 200.38857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4810, + "time_sec": 200.63142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4816, + "time_sec": 200.8742857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4821, + "time_sec": 201.11714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4827, + "time_sec": 201.35999999999999, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "201.12": { + "query_time_sec": 201.12, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4810, + "time_sec": 200.62, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4815, + "time_sec": 200.86285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4821, + "time_sec": 201.1057142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4827, + "time_sec": 201.34857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4833, + "time_sec": 201.59142857142857, + 
"ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4839, + "time_sec": 201.8342857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4845, + "time_sec": 202.07714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4850, + "time_sec": 202.32, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "204.96": { + "query_time_sec": 204.96, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4902, + "time_sec": 204.46, + "ear_score": 0.1854, + "mar_score": 0.0314, + "brow_score": 0.2063, + "composite": 0.129, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4907, + "time_sec": 204.70285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4913, + "time_sec": 204.9457142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4919, + "time_sec": 205.18857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4925, + "time_sec": 205.43142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4931, + "time_sec": 205.6742857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + 
"frame_no": 4937, + "time_sec": 205.91714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4942, + "time_sec": 206.16, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "205.92": { + "query_time_sec": 205.92, + "reaction_score": 0.0, + "num_valid_frames": 1, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4925, + "time_sec": 205.42, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4930, + "time_sec": 205.66285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4936, + "time_sec": 205.90571428571428, + "ear_score": 0.0, + "mar_score": 0.0087, + "brow_score": 0.2634, + "composite": 0.0693, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 4942, + "time_sec": 206.14857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4948, + "time_sec": 206.39142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4954, + "time_sec": 206.63428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4960, + "time_sec": 206.87714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4965, + "time_sec": 207.11999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 
0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "too_few_valid_frames" + }, + "206.88": { + "query_time_sec": 206.88, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 4948, + "time_sec": 206.38, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4953, + "time_sec": 206.62285714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4959, + "time_sec": 206.8657142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4965, + "time_sec": 207.10857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4971, + "time_sec": 207.35142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4977, + "time_sec": 207.5942857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4983, + "time_sec": 207.83714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 4988, + "time_sec": 208.07999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "210.24": { + "query_time_sec": 210.24, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 5028, + "time_sec": 209.74, + "ear_score": 
0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5034, + "time_sec": 209.98285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5040, + "time_sec": 210.2257142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5046, + "time_sec": 210.46857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5052, + "time_sec": 210.71142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5057, + "time_sec": 210.9542857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5063, + "time_sec": 211.19714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5069, + "time_sec": 211.44, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "211.68": { + "query_time_sec": 211.68, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 5063, + "time_sec": 211.18, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5069, + "time_sec": 211.42285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5074, + 
"time_sec": 211.6657142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5080, + "time_sec": 211.90857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5086, + "time_sec": 212.15142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5092, + "time_sec": 212.3942857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5098, + "time_sec": 212.63714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5104, + "time_sec": 212.88, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "215.52": { + "query_time_sec": 215.52, + "reaction_score": 0.1854, + "num_valid_frames": 2, + "peak_frame_time": 216.47714285714287, + "frame_scores": [ + { + "frame_no": 5155, + "time_sec": 215.02, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5161, + "time_sec": 215.26285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5166, + "time_sec": 215.5057142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5172, + "time_sec": 215.74857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + 
"face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5178, + "time_sec": 215.99142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5184, + "time_sec": 216.2342857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5190, + "time_sec": 216.47714285714287, + "ear_score": 0.0677, + "mar_score": 0.2093, + "brow_score": 0.3438, + "composite": 0.1934, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 5196, + "time_sec": 216.72, + "ear_score": 0.1314, + "mar_score": 0.1309, + "brow_score": 0.2963, + "composite": 0.1724, + "face_detected": true, + "num_faces": 1 + } + ], + "confidence": "low", + "note": "ok" + }, + "217.44": { + "query_time_sec": 217.44, + "reaction_score": 0.1627, + "num_valid_frames": 4, + "peak_frame_time": 216.94, + "frame_scores": [ + { + "frame_no": 5201, + "time_sec": 216.94, + "ear_score": 0.0606, + "mar_score": 0.2349, + "brow_score": 0.2696, + "composite": 0.1826, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 5207, + "time_sec": 217.18285714285713, + "ear_score": 0.0, + "mar_score": 0.2331, + "brow_score": 0.2658, + "composite": 0.1597, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 5213, + "time_sec": 217.4257142857143, + "ear_score": 0.112, + "mar_score": 0.1435, + "brow_score": 0.2436, + "composite": 0.1575, + "face_detected": true, + "num_faces": 1 + }, + { + "frame_no": 5218, + "time_sec": 217.66857142857143, + "ear_score": 0.1336, + "mar_score": 0.0897, + "brow_score": 0.3167, + "composite": 0.1618, + "face_detected": true, + "num_faces": 2 + }, + { + "frame_no": 5224, + "time_sec": 217.91142857142856, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5230, + "time_sec": 
218.1542857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5236, + "time_sec": 218.39714285714285, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5242, + "time_sec": 218.64, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "medium", + "note": "ok" + }, + "229.92": { + "query_time_sec": 229.92, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 5500, + "time_sec": 229.42, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5506, + "time_sec": 229.66285714285712, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5512, + "time_sec": 229.90571428571428, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5518, + "time_sec": 230.14857142857142, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5523, + "time_sec": 230.39142857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5529, + "time_sec": 230.63428571428568, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5535, + "time_sec": 230.87714285714284, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { 
+ "frame_no": 5541, + "time_sec": 231.11999999999998, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "232.8": { + "query_time_sec": 232.8, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 5569, + "time_sec": 232.3, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5575, + "time_sec": 232.54285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5581, + "time_sec": 232.7857142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5587, + "time_sec": 233.02857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5592, + "time_sec": 233.27142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5598, + "time_sec": 233.5142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5604, + "time_sec": 233.75714285714287, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5610, + "time_sec": 234.0, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "234.24": { + "query_time_sec": 234.24, + "reaction_score": 0.0, + "num_valid_frames": 
0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 5604, + "time_sec": 233.74, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5609, + "time_sec": 233.98285714285714, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5615, + "time_sec": 234.2257142857143, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5621, + "time_sec": 234.46857142857144, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5627, + "time_sec": 234.71142857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5633, + "time_sec": 234.9542857142857, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5639, + "time_sec": 235.19714285714286, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5644, + "time_sec": 235.44, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + }, + "240.48": { + "query_time_sec": 240.48, + "reaction_score": 0.0, + "num_valid_frames": 0, + "peak_frame_time": null, + "frame_scores": [ + { + "frame_no": 5753, + "time_sec": 239.98, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5758, + "time_sec": 240.1720595238095, + "ear_score": 0.0, + "mar_score": 0.0, + 
"brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5762, + "time_sec": 240.36411904761903, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5767, + "time_sec": 240.55617857142855, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5772, + "time_sec": 240.7482380952381, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5776, + "time_sec": 240.9402976190476, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5781, + "time_sec": 241.13235714285713, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + }, + { + "frame_no": 5785, + "time_sec": 241.32441666666665, + "ear_score": 0.0, + "mar_score": 0.0, + "brow_score": 0.0, + "composite": 0.0, + "face_detected": false, + "num_faces": 0 + } + ], + "confidence": "low", + "note": "no_face_detected" + } + }, + "captions": [ + { + "index": 1, + "start_sec": 0.0, + "end_sec": 2.0, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.9986, + "visual_score": 0.0, + "fusion_score": 0.6491 + }, + { + "index": 2, + "start_sec": 34.08, + "end_sec": 35.22, + "caption_text": "[ Squeak / Rodent ]", + "category": "RAT_SQUEAK", + "priority": "MEDIUM", + "audio_score": 0.6757, + "visual_score": 0.2321, + "fusion_score": 0.5204 + }, + { + "index": 3, + "start_sec": 35.52, + "end_sec": 38.4, + "caption_text": "[ Door ]", + "category": "DOOR", + "priority": "MEDIUM", + "audio_score": 0.6978, + "visual_score": 0.0, + "fusion_score": 0.4536 + }, + { + "index": 4, + "start_sec": 60.0, + "end_sec": 
60.66, + "caption_text": "[ Alarm / Siren ]", + "category": "ALARM", + "priority": "HIGH", + "audio_score": 0.5991, + "visual_score": 0.1164, + "fusion_score": 0.4302 + }, + { + "index": 5, + "start_sec": 60.96, + "end_sec": 62.96, + "caption_text": "[ Door ]", + "category": "DOOR", + "priority": "MEDIUM", + "audio_score": 0.8742, + "visual_score": 0.1082, + "fusion_score": 0.6061 + }, + { + "index": 6, + "start_sec": 69.6, + "end_sec": 71.6, + "caption_text": "[ Door ]", + "category": "DOOR", + "priority": "MEDIUM", + "audio_score": 0.6825, + "visual_score": 0.0, + "fusion_score": 0.4436 + }, + { + "index": 7, + "start_sec": 96.0, + "end_sec": 98.0, + "caption_text": "[ Door ]", + "category": "DOOR", + "priority": "MEDIUM", + "audio_score": 0.8374, + "visual_score": 0.1957, + "fusion_score": 0.6128 + }, + { + "index": 8, + "start_sec": 128.16, + "end_sec": 130.66, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 0.7665, + "visual_score": 0.0, + "fusion_score": 0.4982 + }, + { + "index": 9, + "start_sec": 141.12, + "end_sec": 142.74, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.94, + "visual_score": 0.3057, + "fusion_score": 0.718 + }, + { + "index": 10, + "start_sec": 143.04, + "end_sec": 145.54, + "caption_text": "[ Glass Breaking ]", + "category": "GLASS_BREAK", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 11, + "start_sec": 146.88, + "end_sec": 149.38, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 0.8671, + "visual_score": 0.1395, + "fusion_score": 0.6124 + }, + { + "index": 12, + "start_sec": 159.84, + "end_sec": 161.46, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.8905, + "visual_score": 0.1703, + "fusion_score": 0.6384 + }, + { + "index": 13, + "start_sec": 161.76, + "end_sec": 164.26, + 
"caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.1137, + "fusion_score": 0.6898 + }, + { + "index": 14, + "start_sec": 164.64, + "end_sec": 166.64, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.7932, + "visual_score": 0.0, + "fusion_score": 0.5156 + }, + { + "index": 15, + "start_sec": 169.92, + "end_sec": 170.12, + "caption_text": "[ Crash ]", + "category": "CRASH", + "priority": "HIGH", + "audio_score": 0.8198, + "visual_score": 0.1633, + "fusion_score": 0.59 + }, + { + "index": 16, + "start_sec": 170.4, + "end_sec": 172.9, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 17, + "start_sec": 173.76, + "end_sec": 173.96, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 18, + "start_sec": 174.24, + "end_sec": 176.74, + "caption_text": "[ Crash ]", + "category": "CRASH", + "priority": "HIGH", + "audio_score": 0.6233, + "visual_score": 0.3627, + "fusion_score": 0.5321 + }, + { + "index": 19, + "start_sec": 183.36, + "end_sec": 185.86, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 0.8607, + "visual_score": 0.1072, + "fusion_score": 0.597 + }, + { + "index": 20, + "start_sec": 187.2, + "end_sec": 188.34, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 21, + "start_sec": 188.64, + "end_sec": 191.14, + "caption_text": "[ Glass Breaking ]", + "category": "GLASS_BREAK", + "priority": "HIGH", + "audio_score": 0.9322, + "visual_score": 0.0, + "fusion_score": 0.6059 + }, + { + "index": 22, + "start_sec": 194.88, + "end_sec": 196.5, + 
"caption_text": "[ Door ]", + "category": "DOOR", + "priority": "MEDIUM", + "audio_score": 0.6888, + "visual_score": 0.0, + "fusion_score": 0.4477 + }, + { + "index": 23, + "start_sec": 196.8, + "end_sec": 197.0, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.6886, + "visual_score": 0.0, + "fusion_score": 0.4476 + }, + { + "index": 24, + "start_sec": 197.28, + "end_sec": 199.38, + "caption_text": "[ Glass Breaking ]", + "category": "GLASS_BREAK", + "priority": "HIGH", + "audio_score": 0.6661, + "visual_score": 0.0, + "fusion_score": 0.433 + }, + { + "index": 25, + "start_sec": 199.68, + "end_sec": 199.88, + "caption_text": "[ Alarm / Siren ]", + "category": "ALARM", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 26, + "start_sec": 200.16, + "end_sec": 200.82, + "caption_text": "[ Doorbell ]", + "category": "DOORBELL", + "priority": "MEDIUM", + "audio_score": 0.6458, + "visual_score": 0.0, + "fusion_score": 0.4198 + }, + { + "index": 27, + "start_sec": 201.12, + "end_sec": 203.62, + "caption_text": "[ Glass Breaking ]", + "category": "GLASS_BREAK", + "priority": "HIGH", + "audio_score": 0.577, + "visual_score": 0.0, + "fusion_score": 0.375 + }, + { + "index": 28, + "start_sec": 204.96, + "end_sec": 207.46, + "caption_text": "[ Glass Breaking ]", + "category": "GLASS_BREAK", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 29, + "start_sec": 210.24, + "end_sec": 211.38, + "caption_text": "[ Explosion ]", + "category": "EXPLOSION", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 30, + "start_sec": 211.68, + "end_sec": 213.68, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.9598, + "visual_score": 0.0, + "fusion_score": 0.6239 + }, + { + "index": 31, + "start_sec": 215.52, + "end_sec": 
217.52, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.9978, + "visual_score": 0.1854, + "fusion_score": 0.7135 + }, + { + "index": 32, + "start_sec": 232.8, + "end_sec": 233.94, + "caption_text": "[ Glass Breaking ]", + "category": "GLASS_BREAK", + "priority": "HIGH", + "audio_score": 1.0, + "visual_score": 0.0, + "fusion_score": 0.65 + }, + { + "index": 33, + "start_sec": 234.24, + "end_sec": 236.24, + "caption_text": "[ Music ]", + "category": "MUSIC", + "priority": "MEDIUM", + "audio_score": 0.9996, + "visual_score": 0.0, + "fusion_score": 0.6497 + } + ] +} \ No newline at end of file diff --git a/main.py b/main.py new file mode 100644 index 0000000..2b62519 --- /dev/null +++ b/main.py @@ -0,0 +1,301 @@ +from __future__ import annotations +import argparse +import sys +import time +from pathlib import Path + +DEFAULT_VIDEO_PATH = "fight.mp4" + + + +def _parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + prog="cc_tool", + description=( + "Intelligent CC Suggestion Tool — " + "generates non-speech closed captions for videos." + ), + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" + """, + ) + + parser.add_argument( + "--video", "-v", + required=False, + default=DEFAULT_VIDEO_PATH, + metavar="PATH", + help=( + "Path to the input video file (MP4, AVI, MKV, MOV, …). " + f"Defaults to {DEFAULT_VIDEO_PATH}." + ), + ) + parser.add_argument( + "--output", "-o", + default=None, + metavar="DIR", + help=( + "Output directory for SRT, JSON and annotated frames. " + "Defaults to config.OUTPUT_DIR ('demo_results/')." + ), + ) + parser.add_argument( + "--no-visual", + action="store_true", + help=( + "Skip Module 2 (visual reaction detection). " + "Faster, but captions are based on audio confidence alone." 
+ ), + ) + parser.add_argument( + "--debug", + action="store_true", + help="Enable DEBUG-level log messages (very verbose).", + ) + parser.add_argument( + "--no-frames", + action="store_true", + help="Skip saving annotated JPEG frames to the output folder.", + ) + + return parser.parse_args() + + +def _print_banner() -> None: + print( + "\n" + "╔══════════════════════════════════════════════════════════╗\n" + "║ Intelligent CC Suggestion Tool • PlanetRead / C4GT ║\n" + "║ Module 1: Sound Detection (YAMNet) ║\n" + "║ Module 2: Visual Reaction (MediaPipe Face Mesh) ║\n" + "║ Module 3: Fusion Engine + SRT Output ║\n" + "╚══════════════════════════════════════════════════════════╝\n" + ) + + +def _print_final_summary( + entries, + output_dir: Path, + elapsed: float, + audio_count: int, +) -> None: + srt_path = output_dir / "output.srt" + json_path = output_dir / "report.json" + frames_dir = output_dir / "frames" + frame_count = len(list(frames_dir.glob("*.jpg"))) if frames_dir.exists() else 0 + + print( + "\n" + "════════════════════════════════════════════════════════════\n" + f" Pipeline complete in {elapsed:.1f} s\n" + f" Audio events detected : {audio_count}\n" + f" Captions emitted : {len(entries)}\n" + f" Annotated frames : {frame_count}\n" + f" SRT → {srt_path}\n" + f" JSON → {json_path}\n" + "════════════════════════════════════════════════════════════\n" + ) + + if entries: + print(" Caption preview:") + print(" " + "─" * 54) + for e in entries: + print( + f" {e.start_sec:6.2f}s → {e.end_sec:6.2f}s " + f"{e.caption_text:<28} " + f"fusion={e.fusion_score:.3f}" + ) + print(" " + "─" * 54) + else: + print( + " ⚠ No captions were generated.\n" + " Try lowering FUSION_THRESHOLD or AUDIO_EMIT_THRESHOLD\n" + " in config.py, or check the pipeline.log for details.\n" + ) + +def run_pipeline( + video_path: str, + output_dir: str, + skip_visual: bool = False, + save_frames: bool = True, +) -> int: + + from modules.sound_detector import SoundEventDetector + from 
modules.visual_detector import VisualReactionDetector, VisualScore + from modules.fusion_engine import FusionEngine + from utils.srt_writer import write_srt, write_json_report + from utils.logger import get_logger, setup_file_logger + import config as cfg + + out_dir = Path(output_dir) + out_dir.mkdir(parents=True, exist_ok=True) + setup_file_logger(str(out_dir)) + + log = get_logger("main") + log.info("Video : %s", video_path) + log.info("Output : %s", out_dir) + log.info("Mode : %s", "audio-only" if skip_visual else "audio+visual") + + t_start = time.perf_counter() + + # ════════════════════════════════════════════════════════════════════════ + # MODULE 1 — Sound Event Detection + # ════════════════════════════════════════════════════════════════════════ + print("\n[1/3] Running sound event detection …") + try: + detector = SoundEventDetector() + audio_events = detector.detect(video_path) + except Exception as exc: + log.error("[M1] Fatal error during sound detection: %s", exc, + exc_info=True) + return 1 + + if not audio_events: + log.warning( + "No audio events passed the filter. " + "Check AUDIO_EMIT_THRESHOLD in config.py or verify the video " + "has a usable audio track." 
+ ) + # Write empty outputs so downstream tools don't crash + _write_empty_outputs(out_dir) + return 0 + + timestamps = [ev.timestamp_sec for ev in audio_events] + log.info("[M1] %d audio event(s) found at timestamps: %s", + len(audio_events), + [f"{t:.2f}s" for t in timestamps]) + + # ════════════════════════════════════════════════════════════════════════ + # MODULE 2 — Visual Reaction Detection + # ════════════════════════════════════════════════════════════════════════ + visual_scores: dict = {} + + if skip_visual: + log.info("[M2] Skipped (--no-visual flag set).") + print("[2/3] Visual reaction detection … SKIPPED (--no-visual)") + # Provide zero-score placeholders so Module 3 runs in audio-only mode + from modules.visual_detector import VisualScore + visual_scores = { + ts: VisualScore( + query_time_sec = ts, + reaction_score = 0.0, + num_valid_frames = 0, + peak_frame_time = None, + confidence = "low", + note = "skipped_by_user", + ) + for ts in timestamps + } + else: + print("[2/3] Running visual reaction detection …") + try: + analyser = VisualReactionDetector() + visual_scores = analyser.analyse(video_path, timestamps) + except Exception as exc: + log.error("[M2] Fatal error during visual analysis: %s", exc, + exc_info=True) + log.warning("[M2] Falling back to audio-only mode.") + from modules.visual_detector import VisualScore + visual_scores = { + ts: VisualScore( + query_time_sec = ts, + reaction_score = 0.0, + num_valid_frames = 0, + peak_frame_time = None, + confidence = "low", + note = "module2_error", + ) + for ts in timestamps + } + + # ════════════════════════════════════════════════════════════════════════ + # MODULE 3 — Fusion Decision Engine + # ════════════════════════════════════════════════════════════════════════ + print("[3/3] Running fusion engine …") + try: + engine = FusionEngine(output_dir=str(out_dir)) + entries = engine.decide( + audio_events = audio_events, + visual_scores = visual_scores, + video_path = video_path if save_frames 
else None, + ) + except Exception as exc: + log.error("[M3] Fatal error in fusion engine: %s", exc, exc_info=True) + return 1 + + srt_path = str(out_dir / "output.srt") + json_path = str(out_dir / "report.json") + + write_srt(entries, srt_path) + write_json_report( + entries = entries, + audio_events = audio_events, + visual_scores = visual_scores, + output_path = json_path, + video_path = video_path, + ) + + elapsed = time.perf_counter() - t_start + _print_final_summary(entries, out_dir, elapsed, len(audio_events)) + + return 0 + +def _write_empty_outputs(out_dir: Path) -> None: + """Write empty SRT and minimal JSON so callers don't get FileNotFoundError.""" + from utils.srt_writer import write_srt, write_json_report + write_srt([], str(out_dir / "output.srt")) + write_json_report( + entries = [], + audio_events = [], + visual_scores = {}, + output_path = str(out_dir / "report.json"), + ) + + +def _validate_video_path(path: str) -> str: + """Resolve and verify the video file exists; exit with a message if not.""" + p = Path(path).resolve() + if not p.exists(): + print(f"\n ✗ Video file not found: {p}", file=sys.stderr) + sys.exit(1) + if p.suffix.lower() not in { + ".mp4", ".avi", ".mkv", ".mov", ".webm", + ".flv", ".wmv", ".m4v", ".3gp", + }: + print( + f"\n ⚠ Warning: '{p.suffix}' may not be a supported video format.\n" + f" Supported: .mp4 .avi .mkv .mov .webm .flv .wmv .m4v .3gp\n" + f" Proceeding anyway — ffmpeg may still handle it.\n", + file=sys.stderr, + ) + return str(p) + +def main() -> None: + args = _parse_args() + + # ── Set log level before any module imports ────────────────────────────── + import logging + if args.debug: + logging.getLogger().setLevel(logging.DEBUG) + else: + logging.getLogger().setLevel(logging.INFO) + + _print_banner() + + video_path = _validate_video_path(args.video) + + import config as cfg + output_dir = args.output if args.output else cfg.OUTPUT_DIR + + exit_code = run_pipeline( + video_path = video_path, + output_dir = 
output_dir, + skip_visual = args.no_visual, + save_frames = not args.no_frames, + ) + + sys.exit(exit_code) + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/modules/fusion_engine.py b/modules/fusion_engine.py new file mode 100644 index 0000000..d081b99 --- /dev/null +++ b/modules/fusion_engine.py @@ -0,0 +1,404 @@ +from __future__ import annotations +import os +from collections import defaultdict +from dataclasses import dataclass +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +import cv2 +import numpy as np + +from modules.sound_detector import AudioEvent +from modules.visual_detector import VisualScore +from utils.logger import get_logger +from utils.srt_writer import CaptionEntry + +import config as cfg + +log = get_logger(__name__) + +@dataclass +class _Decision: + audio_event: AudioEvent + visual_score: VisualScore + fusion_score: float + threshold: float + accepted: bool + reject_reason: str + +class FusionEngine: + + def __init__( + self, + audio_weight: float = cfg.FUSION_AUDIO_WEIGHT, + visual_weight: float = cfg.FUSION_VISUAL_WEIGHT, + output_dir: str = cfg.OUTPUT_DIR, + ): + if abs(audio_weight + visual_weight - 1.0) > 1e-6: + raise ValueError( + f"audio_weight ({audio_weight}) + visual_weight " + f"({visual_weight}) must sum to 1.0" + ) + self.audio_weight = audio_weight + self.visual_weight = visual_weight + self.output_dir = Path(output_dir) + self.frames_dir = self.output_dir / "frames" + self.frames_dir.mkdir(parents=True, exist_ok=True) + + + def decide( + self, + audio_events: List[AudioEvent], + visual_scores: Dict[float, VisualScore], + video_path: Optional[str] = None, + ) -> List[CaptionEntry]: + if not audio_events: + log.warning("[M3] No audio events received — nothing to decide.") + return [] + + log.info("[M3] Evaluating %d candidate audio events …", + len(audio_events)) + + # Detect whether any window actually found a face → drives audio-only mode + has_visual = 
self._any_face_detected(visual_scores) + if not has_visual: + log.warning( + "[M3] No faces detected in any visual window. " + "Switching to audio-only fusion mode (thresholds lowered 20%)." + ) + + decisions = self._score_all_events( + audio_events, visual_scores, has_visual + ) + + self._log_decisions(decisions) + + accepted = [d for d in decisions if d.accepted] + log.info( + "[M3] %d / %d events accepted for captioning.", + len(accepted), len(decisions), + ) + + deduped = self._deduplicate(accepted) + if len(deduped) < len(accepted): + log.info("[M3] Deduplication removed %d duplicate(s).", + len(accepted) - len(deduped)) + + entries = self._build_caption_entries(deduped) + + entries = self._enforce_srt_gaps(entries) + + if video_path and Path(video_path).exists(): + self._annotate_frames(entries, video_path) + + log.info("[M3] Fusion complete — %d caption entries ready.", len(entries)) + return entries + + + def _score_all_events( + self, + audio_events: List[AudioEvent], + visual_scores: Dict[float, VisualScore], + has_visual: bool, + ) -> List[_Decision]: + decisions: List[_Decision] = [] + + for event in audio_events: + vscore = self._lookup_visual(event.timestamp_sec, visual_scores) + fusion, threshold = self._compute_fusion(event, vscore, has_visual) + accepted = fusion >= threshold + reject_reason = "" if accepted else ( + f"fusion {fusion:.3f} < threshold {threshold:.3f}" + ) + decisions.append(_Decision( + audio_event = event, + visual_score = vscore, + fusion_score = fusion, + threshold = threshold, + accepted = accepted, + reject_reason = reject_reason, + )) + + return decisions + + def _lookup_visual( + self, + timestamp: float, + visual_scores: Dict[float, VisualScore], + ) -> VisualScore: + + # Exact match first + if timestamp in visual_scores: + return visual_scores[timestamp] + + # Near match (floating-point drift tolerance) + for ts, vs in visual_scores.items(): + if abs(ts - timestamp) <= 0.05: + return vs + + # Nothing found — construct a
null VisualScore + return VisualScore( + query_time_sec = timestamp, + reaction_score = 0.0, + num_valid_frames = 0, + peak_frame_time = None, + confidence = "low", + note = "not_queried", + ) + + def _compute_fusion( + self, + event: AudioEvent, + vscore: VisualScore, + has_visual: bool, + ) -> Tuple[float, float]: + priority = event.priority + threshold = cfg.FUSION_THRESHOLD.get(priority, 0.45) + + if not has_visual: + # Pure audio mode + fusion = float(event.confidence) + threshold = threshold * 0.80 + else: + fusion = ( + self.audio_weight * event.confidence + + self.visual_weight * vscore.reaction_score + ) + + fusion = round(min(1.0, max(0.0, fusion)), 4) + return fusion, round(threshold, 4) + + def _log_decisions(self, decisions: List[_Decision]) -> None: + header = ( + f"{'Time':>7} {'Category':<16} {'Pri':<7} " + f"{'Audio':>6} {'Visual':>7} {'Fusion':>7} {'Thresh':>7} {'Decision'}" + ) + log.info("[M3] Decision table:\n %s", header) + + for d in decisions: + ev = d.audio_event + vs = d.visual_score + verdict = "✓ EMIT" if d.accepted else f"✗ SKIP ({d.reject_reason})" + row = ( + f"{ev.timestamp_sec:>6.2f}s " + f"{ev.category:<16} " + f"{ev.priority:<7} " + f"{ev.confidence:>6.3f} " + f"{vs.reaction_score:>7.3f} " + f"{d.fusion_score:>7.3f} " + f"{d.threshold:>7.3f} " + f"{verdict}" + ) + if d.accepted: + log.info(" %s", row) + else: + log.warning(" %s", row) + + + def _deduplicate(self, accepted: List[_Decision]) -> List[_Decision]: + by_category: Dict[str, List[_Decision]] = defaultdict(list) + for d in accepted: + by_category[d.audio_event.category].append(d) + + kept: List[_Decision] = [] + + for cat_decisions in by_category.values(): + # Sort by timestamp + cat_decisions.sort(key=lambda d: d.audio_event.timestamp_sec) + survivors: List[_Decision] = [] + + for current in cat_decisions: + suppress = False + for survivor in list(survivors):  # iterate over a copy; survivors is mutated below + gap = abs( + current.audio_event.timestamp_sec - + survivor.audio_event.timestamp_sec + ) + if gap <
cfg.CAPTION_DEDUP_SEC: + # Keep the higher-scoring one + if current.fusion_score > survivor.fusion_score: + survivors.remove(survivor) + # Will be added below + else: + suppress = True + break + + if not suppress: + survivors.append(current) + + kept.extend(survivors) + + # Re-sort by timestamp + kept.sort(key=lambda d: d.audio_event.timestamp_sec) + return kept + + + def _build_caption_entries( + self, + decisions: List[_Decision], + ) -> List[CaptionEntry]: + entries: List[CaptionEntry] = [] + + for idx, d in enumerate(decisions, start=1): + ev = d.audio_event + priority = ev.priority + duration = cfg.SRT_DISPLAY_DURATION.get(priority, 2.0) + + start_sec = ev.timestamp_sec + # Prefer the event's own end_sec if it gives a sensible duration + natural_dur = ev.end_sec - ev.timestamp_sec + if 0.3 <= natural_dur <= 5.0: + end_sec = ev.timestamp_sec + max(natural_dur, duration) + else: + end_sec = start_sec + duration + + entries.append(CaptionEntry( + index = idx, + start_sec = round(start_sec, 3), + end_sec = round(end_sec, 3), + caption_text = ev.display_label, + category = ev.category, + priority = priority, + audio_score = ev.confidence, + visual_score = d.visual_score.reaction_score, + fusion_score = d.fusion_score, + )) + + return entries + + def _enforce_srt_gaps( + self, + entries: List[CaptionEntry], + ) -> List[CaptionEntry]: + if len(entries) < 2: + return entries + + for i in range(len(entries) - 1): + curr = entries[i] + nxt = entries[i + 1] + required_end = nxt.start_sec - cfg.SRT_MIN_GAP_SEC + if curr.end_sec > required_end: + curr.end_sec = max( + curr.start_sec + 0.2, # keep at least 0.2 s visible + required_end, + ) + curr.end_sec = round(curr.end_sec, 3) + + return entries + + + def _annotate_frames( + self, + entries: List[CaptionEntry], + video_path: str, + ) -> None: + cap = cv2.VideoCapture(str(video_path)) + if not cap.isOpened(): + log.warning("[M3] Cannot open video for frame annotation: %s", + video_path) + return + + fps = 
cap.get(cv2.CAP_PROP_FPS) or 25.0 + total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) + + for entry in entries: + frame_no = min( + int(entry.start_sec * fps), + total_frames - 1, + ) + cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no) + ret, frame = cap.read() + if not ret or frame is None: + continue + + self._draw_annotation(frame, entry, frame_no) + + safe_cat = entry.category.lower().replace("/", "_") + filename = ( + f"frame_{frame_no:05d}_" + f"t{entry.start_sec:.2f}s_" + f"{safe_cat}.jpg" + ) + out_path = self.frames_dir / filename + cv2.imwrite(str(out_path), frame, [cv2.IMWRITE_JPEG_QUALITY, 90]) + log.debug("[M3] Annotated frame saved → %s", out_path) + + cap.release() + log.info("[M3] %d annotated frames saved to %s", + len(entries), self.frames_dir) + + @staticmethod + def _draw_annotation( + frame: np.ndarray, + entry: CaptionEntry, + frame_no: int, + ) -> None: + """ + Burn caption text, scores, and a timestamp into a video frame + in-place (modifies the numpy array directly). 
+ """ + h, w = frame.shape[:2] + + bar_h = max(52, h // 10) + overlay = frame.copy() + cv2.rectangle(overlay, (0, h - bar_h), (w, h), (0, 0, 0), -1) + cv2.addWeighted(overlay, 0.65, frame, 0.35, 0, frame) + + font = cv2.FONT_HERSHEY_DUPLEX + font_scale = max(0.6, w / 900) + thickness = max(1, int(font_scale * 1.5)) + + # Caption text centred + text = entry.caption_text + (tw, th), _ = cv2.getTextSize(text, font, font_scale, thickness) + tx = max(8, (w - tw) // 2) + ty = h - bar_h // 2 + th // 2 + cv2.putText(frame, text, (tx, ty), font, font_scale, + (255, 255, 255), thickness, cv2.LINE_AA) + + PRIORITY_COLOURS = { + "HIGH": (0, 80, 220), # red-ish + "MEDIUM": (30, 160, 30), # green + "LOW": (180, 100, 0), # blue-ish + } + badge_colour = PRIORITY_COLOURS.get(entry.priority, (100, 100, 100)) + badge_lines = [ + f"t={entry.start_sec:.2f}s frm#{frame_no}", + f"audio={entry.audio_score:.3f} visual={entry.visual_score:.3f}", + f"fusion={entry.fusion_score:.3f} [{entry.priority}]", + ] + small_scale = max(0.35, w / 1800) + small_thick = 1 + for i, line in enumerate(badge_lines): + y = 20 + i * 18 + cv2.putText(frame, line, (8, y), cv2.FONT_HERSHEY_SIMPLEX, + small_scale, badge_colour, small_thick, cv2.LINE_AA) + + + @staticmethod + def _any_face_detected(visual_scores: Dict[float, VisualScore]) -> bool: + """Return True if at least one visual window found a face.""" + return any( + vs.num_valid_frames > 0 + for vs in visual_scores.values() + ) + + + def summary(self, entries: List[CaptionEntry]) -> str: + """ + Return a human-readable summary string of the caption output. + Useful for the final console banner in main.py. + """ + if not entries: + return "No captions generated." 
+ + lines = ["Caption Summary:", "─" * 54] + for e in entries: + lines.append( + f" {e.start_sec:6.2f}s → {e.end_sec:6.2f}s " + f"{e.caption_text:<28} " + f"(fusion={e.fusion_score:.3f})" + ) + lines.append("─" * 54) + lines.append(f" Total: {len(entries)} caption(s)") + return "\n".join(lines) \ No newline at end of file diff --git a/modules/sound_detector.py b/modules/sound_detector.py new file mode 100644 index 0000000..96be2a1 --- /dev/null +++ b/modules/sound_detector.py @@ -0,0 +1,354 @@ +from __future__ import annotations +import os +import shutil +import subprocess +import tempfile +from dataclasses import dataclass, field +from pathlib import Path +from typing import Dict, List, Optional, Tuple +import numpy as np +from utils.logger import get_logger +import config as cfg + +log = get_logger(__name__) + + +@dataclass +class AudioEvent: + timestamp_sec: float # centre of the detection window + end_sec: float # estimated end time + category: str # key from SOUND_CATEGORIES + display_label: str # human-readable CC text + priority: str # HIGH | MEDIUM | LOW + confidence: float # boosted & normalised score [0, 1] + raw_class: str # best-matching YAMNet class name + raw_score: float # original YAMNet score before boost + + +def _build_lookup() -> Tuple[List[str], Dict[str, dict]]: + blacklist = [s.lower() for s in cfg.YAMNET_BLACKLIST] + + class_to_cat: Dict[str, dict] = {} + for cat_key, cat_info in cfg.SOUND_CATEGORIES.items(): + for token in cat_info["yamnet"]: + class_to_cat[token.lower()] = { + "key": cat_key, + "display": cat_info["display"], + "priority": cat_info["priority"], + "boost": cat_info["boost"], + } + + return blacklist, class_to_cat + + +_BLACKLIST_TOKENS, _CLASS_TO_CAT = _build_lookup() + + +class SoundEventDetector: + + def __init__(self, model_handle: str = cfg.YAMNET_MODEL_PATH): + self._model_handle = model_handle + self._yamnet = None # lazy-loaded + self._class_names = None + + + def detect(self, video_path: str) -> List[AudioEvent]: + 
+ log.info("[M1] Starting sound detection on: %s", video_path) + + # Step 1: extract audio waveform + waveform = self._extract_audio(video_path) + if waveform is None or len(waveform) == 0: + log.error("[M1] Could not extract audio from %s", video_path) + return [] + + duration_sec = len(waveform) / cfg.AUDIO_SAMPLE_RATE + log.info("[M1] Audio duration: %.2f s | samples: %d", + duration_sec, len(waveform)) + + # Step 2: load model once + self._load_model() + + # Step 3: sliding-window inference + raw_events = self._sliding_window_inference(waveform) + log.info("[M1] Sliding window produced %d candidate events", + len(raw_events)) + + # Step 4: filter, map, boost + filtered = self._filter_and_map(raw_events) + log.info("[M1] After filtering: %d events remain", len(filtered)) + + # Step 5: merge nearby duplicates + merged = self._merge_events(filtered) + log.info("[M1] After merging: %d events", len(merged)) + + # Step 6: cap per-category count + capped = self._cap_per_category(merged) + log.info("[M1] Final audio events: %d", len(capped)) + + for ev in capped: + log.debug( + " [M1] %.2fs %-15s conf=%.3f raw='%s'", + ev.timestamp_sec, ev.category, ev.confidence, ev.raw_class, + ) + + return sorted(capped, key=lambda e: e.timestamp_sec) + + + def _extract_audio(self, video_path: str) -> Optional[np.ndarray]: + video_path = str(Path(video_path).resolve()) + + try: + return self._extract_via_ffmpeg(video_path) + except Exception as exc: + log.warning("[M1] ffmpeg extraction failed (%s), trying librosa", exc) + + try: + import librosa + waveform, _ = librosa.load( + video_path, + sr=cfg.AUDIO_SAMPLE_RATE, + mono=True, + ) + return waveform.astype(np.float32) + except Exception as exc: + log.error("[M1] librosa fallback failed: %s", exc) + return None + + def _extract_via_ffmpeg(self, video_path: str) -> np.ndarray: + import soundfile as sf + + with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp: + tmp_path = tmp.name + + try: + ffmpeg_bin = 
shutil.which("ffmpeg") + if not ffmpeg_bin: + # Fallback to the bundled binary from imageio-ffmpeg when PATH has no ffmpeg. + try: + import imageio_ffmpeg + ffmpeg_bin = imageio_ffmpeg.get_ffmpeg_exe() + except Exception as exc: + raise RuntimeError( + "ffmpeg executable not found. Install ffmpeg or imageio-ffmpeg." + ) from exc + + cmd = [ + ffmpeg_bin, "-y", "-loglevel", "error", + "-i", video_path, + "-ar", str(cfg.AUDIO_SAMPLE_RATE), + "-ac", "1", + "-f", "wav", + tmp_path, + ] + result = subprocess.run( + cmd, capture_output=True, check=True, timeout=120 + ) + data, _ = sf.read(tmp_path, dtype="float32") + return data + finally: + try: + os.unlink(tmp_path) + except OSError: + pass + + + def _load_model(self) -> None: + if self._yamnet is not None: + return + log.info("[M1] Loading YAMNet from %s …", self._model_handle) + try: + import tensorflow_hub as hub + self._yamnet = hub.load(self._model_handle) + # Retrieve class names from the model asset + import csv + class_map_path = self._yamnet.class_map_path().numpy().decode() + with open(class_map_path, newline="", encoding="utf-8") as f: + reader = csv.reader(f) + next(reader, None) # skip header row + self._class_names = [row[2] for row in reader if len(row) >= 3] + + if not self._class_names: + raise RuntimeError(f"YAMNet class map is empty: {class_map_path}") + log.info("[M1] YAMNet loaded — %d classes", len(self._class_names)) + except ImportError as exc: + raise RuntimeError( + "tensorflow_hub is required for Module 1. 
" + "Run: pip install tensorflow tensorflow-hub" + ) from exc + + + def _sliding_window_inference( + self, + waveform: np.ndarray, + ) -> List[dict]: + import tensorflow as tf + + window_samples = int(cfg.AUDIO_WINDOW_SEC * cfg.AUDIO_SAMPLE_RATE) + hop_samples = int(cfg.AUDIO_HOP_SEC * cfg.AUDIO_SAMPLE_RATE) + + results = [] + n = len(waveform) + start = 0 + + while start < n: + end = start + window_samples + chunk = waveform[start:end] + + # Zero-pad the last chunk if needed + if len(chunk) < window_samples: + chunk = np.pad(chunk, (0, window_samples - len(chunk))) + + # YAMNet expects shape [num_samples] + scores_tensor, _, _ = self._yamnet( + tf.constant(chunk, dtype=tf.float32) + ) + + # scores_tensor shape: [num_patches, 521] + # Average across time patches for a single window-level score + mean_scores = tf.reduce_mean(scores_tensor, axis=0).numpy() + + # Top-5 classes + top_k = np.argsort(mean_scores)[::-1][:5] + top_pairs = [ + (self._class_names[i], float(mean_scores[i])) + for i in top_k + if float(mean_scores[i]) >= cfg.YAMNET_RAW_THRESHOLD + ] + + timestamp = start / cfg.AUDIO_SAMPLE_RATE + end_time = min(end / cfg.AUDIO_SAMPLE_RATE, + n / cfg.AUDIO_SAMPLE_RATE) + + if top_pairs: + results.append({ + "timestamp_sec": timestamp, + "end_sec": end_time, + "scores": top_pairs, + }) + + start += hop_samples + + return results + + + def _filter_and_map(self, raw_events: List[dict]) -> List[AudioEvent]: + events: List[AudioEvent] = [] + + for raw in raw_events: + best_event = self._best_category_match( + raw["scores"], + raw["timestamp_sec"], + raw["end_sec"], + ) + if best_event is not None: + events.append(best_event) + + return events + + def _best_category_match( + self, + scores: List[Tuple[str, float]], + timestamp_sec: float, + end_sec: float, + ) -> Optional[AudioEvent]: + best_score = 0.0 + best_cat = None + best_raw_cls = "" + best_raw_score = 0.0 + + for class_name, raw_score in scores: + cname_lower = class_name.lower() + + if any(bl in cname_lower 
for bl in _BLACKLIST_TOKENS): + continue + + matched_cat = None + for token, cat_info in _CLASS_TO_CAT.items(): + if token in cname_lower: + matched_cat = cat_info + break + + if matched_cat is None: + continue + + boosted = min(raw_score * matched_cat["boost"], 1.0) + + if boosted > best_score: + best_score = boosted + best_cat = matched_cat + best_raw_cls = class_name + best_raw_score = raw_score + + if best_cat is None: + return None + + if best_score < cfg.AUDIO_EMIT_THRESHOLD: + return None + + return AudioEvent( + timestamp_sec = timestamp_sec, + end_sec = end_sec, + category = best_cat["key"], + display_label = best_cat["display"], + priority = best_cat["priority"], + confidence = round(best_score, 4), + raw_class = best_raw_cls, + raw_score = round(best_raw_score, 4), + ) + + + def _merge_events(self, events: List[AudioEvent]) -> List[AudioEvent]: + if not events: + return events + + # Group by category + from collections import defaultdict + by_cat: Dict[str, List[AudioEvent]] = defaultdict(list) + for ev in events: + by_cat[ev.category].append(ev) + + merged_all: List[AudioEvent] = [] + + for cat_events in by_cat.values(): + cat_events.sort(key=lambda e: e.timestamp_sec) + groups: List[List[AudioEvent]] = [] + current_group: List[AudioEvent] = [cat_events[0]] + + for ev in cat_events[1:]: + gap = ev.timestamp_sec - current_group[-1].end_sec + if gap <= cfg.EVENT_MERGE_GAP_SEC: + current_group.append(ev) + else: + groups.append(current_group) + current_group = [ev] + groups.append(current_group) + + for group in groups: + # Representative = highest confidence + best = max(group, key=lambda e: e.confidence) + merged_all.append(AudioEvent( + timestamp_sec = group[0].timestamp_sec, + end_sec = group[-1].end_sec, + category = best.category, + display_label = best.display_label, + priority = best.priority, + confidence = best.confidence, + raw_class = best.raw_class, + raw_score = best.raw_score, + )) + + return merged_all + + def _cap_per_category(self, 
events: List[AudioEvent]) -> List[AudioEvent]: + from collections import defaultdict + by_cat: Dict[str, List[AudioEvent]] = defaultdict(list) + for ev in events: + by_cat[ev.category].append(ev) + + capped: List[AudioEvent] = [] + for cat_events in by_cat.values(): + cat_events.sort(key=lambda e: e.confidence, reverse=True) + capped.extend(cat_events[: cfg.MAX_EVENTS_PER_CATEGORY]) + + return capped \ No newline at end of file diff --git a/modules/visual_detector.py b/modules/visual_detector.py new file mode 100644 index 0000000..d5d8441 --- /dev/null +++ b/modules/visual_detector.py @@ -0,0 +1,336 @@ +from __future__ import annotations +import math +from dataclasses import dataclass, field +from typing import Dict, List, Optional, Tuple +import cv2 +import numpy as np +from utils.logger import get_logger +import config as cfg + +log = get_logger(__name__) + +@dataclass +class FaceFrameScore: + """Facial action scores for a single video frame.""" + frame_no: int + time_sec: float + ear_score: float # normalised Eye Aspect Ratio delta + mar_score: float # normalised Mouth Aspect Ratio delta + brow_score: float # normalised Brow Raise + composite: float # weighted combination of the above + face_detected: bool + num_faces: int + + +@dataclass +class VisualScore: + """Aggregated visual reaction score for one audio event timestamp.""" + query_time_sec: float + reaction_score: float # [0, 1] + num_valid_frames: int + peak_frame_time: Optional[float] + frame_scores: List[FaceFrameScore] = field(default_factory=list) + confidence: str = "low" # low | medium | high + note: str = "" + +class VisualReactionDetector: + def __init__( + self, + min_detection_confidence: float = cfg.MEDIAPIPE_DETECTION_CONFIDENCE, + min_tracking_confidence: float = cfg.MEDIAPIPE_TRACKING_CONFIDENCE, + ): + self._det_conf = min_detection_confidence + self._trk_conf = min_tracking_confidence + self._face_mesh = None # lazy-loaded per-video + + # ── Public API 
────────────────────────────────────────────────────────── + + def analyse( + self, + video_path: str, + timestamps: List[float], + ) -> Dict[float, VisualScore]: + log.info("[M2] Opening video: %s", video_path) + cap = cv2.VideoCapture(str(video_path)) + if not cap.isOpened(): + log.error("[M2] Cannot open video: %s", video_path) + return {t: self._null_score(t, "video_open_failed") for t in timestamps} + + fps = cap.get(cv2.CAP_PROP_FPS) or 25.0 + total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) + duration_sec = total_frames / fps + + log.info("[M2] Video: fps=%.1f frames=%d duration=%.2f s", + fps, total_frames, duration_sec) + + # Load MediaPipe once for the whole video + self._load_face_mesh() + + results: Dict[float, VisualScore] = {} + + for ts in timestamps: + log.debug("[M2] Processing window around t=%.2f s", ts) + score = self._score_window(cap, ts, fps, total_frames, duration_sec) + results[ts] = score + log.debug( + "[M2] ts=%.2f reaction=%.3f valid_frames=%d conf=%s", + ts, score.reaction_score, score.num_valid_frames, score.confidence, + ) + + cap.release() + self._release_face_mesh() + + log.info("[M2] Visual analysis complete — %d timestamps processed", + len(results)) + return results + + def _score_window( + self, + cap: cv2.VideoCapture, + query_time: float, + fps: float, + total_frames: int, + duration_sec: float, + ) -> VisualScore: + """ + Sample frames in [query_time - BEFORE, query_time + AFTER], + score each, and aggregate. 
+ """ + t_start = max(0.0, query_time - cfg.VISUAL_WINDOW_BEFORE_SEC) + t_end = min(duration_sec, query_time + cfg.VISUAL_WINDOW_AFTER_SEC) + + # Build evenly-spaced sample timestamps + n = cfg.VISUAL_MAX_FRAMES_PER_WINDOW + sample_t = np.linspace(t_start, t_end, n) + + frame_scores: List[FaceFrameScore] = [] + for t in sample_t: + fn = int(t * fps) + fn = max(0, min(fn, total_frames - 1)) + frame = self._seek_frame(cap, fn) + if frame is None: + continue + fs = self._score_frame(frame, fn, t) + frame_scores.append(fs) + + # Aggregate + return self._aggregate(query_time, frame_scores) + + @staticmethod + def _seek_frame( + cap: cv2.VideoCapture, frame_no: int + ) -> Optional[np.ndarray]: + """ + Seek to *frame_no* and return the frame as a numpy BGR array, + or None on failure. + """ + cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no) + ret, frame = cap.read() + return frame if ret else None + + def _score_frame( + self, + frame: np.ndarray, + frame_no: int, + time_sec: float, + ) -> FaceFrameScore: + + # MediaPipe expects RGB + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + h, w = rgb.shape[:2] + + results = self._face_mesh.process(rgb) + + if not results.multi_face_landmarks: + return FaceFrameScore( + frame_no=frame_no, time_sec=time_sec, + ear_score=0.0, mar_score=0.0, brow_score=0.0, + composite=0.0, face_detected=False, num_faces=0, + ) + + # Evaluate all detected faces, take the highest-reacting one + best_composite = -1.0 + best_ear = best_mar = best_brow = 0.0 + + for face_landmarks in results.multi_face_landmarks: + lm = face_landmarks.landmark + + def pt(idx): + return np.array([lm[idx].x * w, lm[idx].y * h]) + + ear = self._compute_ear(pt) + mar = self._compute_mar(pt) + brow = self._compute_brow_raise(pt, h) + + ear_norm = max(0.0, (ear - cfg.EAR_BASELINE) / (1.0 - cfg.EAR_BASELINE)) + mar_norm = max(0.0, (mar - cfg.MAR_BASELINE) / (1.0 - cfg.MAR_BASELINE)) + brow_norm = max(0.0, brow / 0.10) # 0.10 is practical max raise + + # Clamp each to [0, 1] + ear_norm 
= min(ear_norm, 1.0) + mar_norm = min(mar_norm, 1.0) + brow_norm = min(brow_norm, 1.0) + + composite = ( + cfg.VISUAL_WEIGHT_EAR * ear_norm + + cfg.VISUAL_WEIGHT_MAR * mar_norm + + cfg.VISUAL_WEIGHT_BROW * brow_norm + ) + + if composite > best_composite: + best_composite = composite + best_ear = ear_norm + best_mar = mar_norm + best_brow = brow_norm + + return FaceFrameScore( + frame_no = frame_no, + time_sec = time_sec, + ear_score = round(best_ear, 4), + mar_score = round(best_mar, 4), + brow_score = round(best_brow, 4), + composite = round(best_composite, 4), + face_detected = True, + num_faces = len(results.multi_face_landmarks), + ) + + @staticmethod + def _compute_ear(pt) -> float: + """ + Eye Aspect Ratio (EAR), averaged across both eyes. + EAR = vertical_distance / horizontal_distance (one vertical landmark pair per eye) + A wide-open eye has EAR ≈ 0.35; closed eye ≈ 0. + """ + ears = [] + for side in ("left", "right"): + idx = cfg.EYE_LANDMARKS[side] + v = np.linalg.norm(pt(idx["top"]) - pt(idx["bottom"])) + h = np.linalg.norm(pt(idx["inner"]) - pt(idx["outer"])) + if h > 1e-6: + ears.append(v / h) + + return float(np.mean(ears)) if ears else cfg.EAR_BASELINE + + @staticmethod + def _compute_mar(pt) -> float: + """ + Mouth Aspect Ratio (MAR). + Larger values → more open mouth. + """ + idx = cfg.MOUTH_LANDMARKS + v1 = np.linalg.norm(pt(idx["top"]) - pt(idx["bottom"])) + v2 = np.linalg.norm(pt(idx["top2"]) - pt(idx["bottom2"])) + h = np.linalg.norm(pt(idx["left"]) - pt(idx["right"])) + if h < 1e-6: + return cfg.MAR_BASELINE + return float((v1 + v2) / (2.0 * h)) + + @staticmethod + def _compute_brow_raise(pt, face_height: int) -> float: + """ + Brow raise: average distance from brow landmarks to eye-top, + normalised by face_height (the caller passes the frame height in pixels). Higher → more raised brows. 
+ """ + def _side_raise(brow_idxs, eye_top_idx): + brow_y = np.mean([pt(i)[1] for i in brow_idxs]) + eye_y = pt(eye_top_idx)[1] + # brow is above eye → brow_y < eye_y in image coords + return max(0.0, float(eye_y - brow_y)) / max(face_height, 1) + + left_raise = _side_raise( + cfg.BROW_LANDMARKS["left_brow"], + cfg.BROW_LANDMARKS["left_eye_top"], + ) + right_raise = _side_raise( + cfg.BROW_LANDMARKS["right_brow"], + cfg.BROW_LANDMARKS["right_eye_top"], + ) + return (left_raise + right_raise) / 2.0 + + def _aggregate( + self, + query_time: float, + frame_scores: List[FaceFrameScore], + ) -> VisualScore: + valid = [f for f in frame_scores if f.face_detected] + + if len(valid) < cfg.VISUAL_MIN_VALID_FRAMES: + note = ( + "too_few_valid_frames" + if valid else "no_face_detected" + ) + return VisualScore( + query_time_sec = query_time, + reaction_score = 0.0, + num_valid_frames = len(valid), + peak_frame_time = None, + frame_scores = frame_scores, + confidence = "low", + note = note, + ) + + # Temporal weights + weights = np.array([ + math.exp(-abs(f.time_sec - query_time) / 0.5) + for f in valid + ]) + composites = np.array([f.composite for f in valid]) + w_sum = weights.sum() + if w_sum < 1e-9: + reaction = float(np.mean(composites)) + else: + reaction = float(np.dot(weights, composites) / w_sum) + + reaction = min(1.0, max(0.0, reaction)) + + peak = valid[int(np.argmax(composites))] + conf = ( + "high" if len(valid) >= 5 else + "medium" if len(valid) >= 3 else + "low" + ) + + return VisualScore( + query_time_sec = query_time, + reaction_score = round(reaction, 4), + num_valid_frames = len(valid), + peak_frame_time = peak.time_sec, + frame_scores = frame_scores, + confidence = conf, + note = "ok", + ) + + def _load_face_mesh(self) -> None: + if self._face_mesh is not None: + return + try: + import mediapipe as mp + self._face_mesh = mp.solutions.face_mesh.FaceMesh( + static_image_mode = False, # video mode → faster tracking + max_num_faces = 4, # handle group reactions 
+ refine_landmarks = True, # iris landmarks for better EAR + min_detection_confidence = self._det_conf, + min_tracking_confidence = self._trk_conf, + ) + log.info("[M2] MediaPipe Face Mesh loaded") + except ImportError as exc: + raise RuntimeError( + "mediapipe is required for Module 2. " + "Run: pip install mediapipe" + ) from exc + + def _release_face_mesh(self) -> None: + if self._face_mesh is not None: + self._face_mesh.close() + self._face_mesh = None + + @staticmethod + def _null_score(ts: float, note: str) -> VisualScore: + return VisualScore( + query_time_sec = ts, + reaction_score = 0.0, + num_valid_frames = 0, + peak_frame_time = None, + confidence = "low", + note = note, + ) \ No newline at end of file diff --git a/pann_test/test_panns.py b/pann_test/test_panns.py new file mode 100644 index 0000000..36e0856 --- /dev/null +++ b/pann_test/test_panns.py @@ -0,0 +1,110 @@ +import os +import subprocess +import tempfile +import shutil +from pathlib import Path +import numpy as np +import soundfile as sf +from panns_inference import AudioTagging + +# ====================== CONFIG ====================== +VIDEO_FOLDER = "data" +OUTPUT_FILE = "panns_results.txt" +SAMPLE_RATE = 32000 +# =================================================== + +print("Loading PANNs model...") +model = AudioTagging(checkpoint_path=None, device='cpu') + +results = [] + +def extract_audio_ffmpeg(video_path: str) -> np.ndarray: + """Robust audio extraction using ffmpeg""" + video_path = str(Path(video_path).resolve()) + + with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp: + tmp_path = tmp.name + + try: + ffmpeg_bin = shutil.which("ffmpeg") + if not ffmpeg_bin: + try: + import imageio_ffmpeg + ffmpeg_bin = imageio_ffmpeg.get_ffmpeg_exe() + except Exception as e: + raise RuntimeError("ffmpeg not found.") from e + + cmd = [ + ffmpeg_bin, "-y", "-loglevel", "error", + "-i", video_path, + "-ar", str(SAMPLE_RATE), + "-ac", "1", + "-f", "wav", + tmp_path, + ] + 
subprocess.run(cmd, check=True, capture_output=True, timeout=120) + data, _ = sf.read(tmp_path, dtype="float32") + return data + finally: + try: + os.unlink(tmp_path) + except OSError: + pass + + +for filename in sorted(os.listdir(VIDEO_FOLDER)): + if filename.lower().endswith(('.mp4', '.mkv', '.avi', '.mov')): + video_path = os.path.join(VIDEO_FOLDER, filename) + print(f"\nProcessing: {filename}") + + try: + # Step 1: Extract audio + waveform = extract_audio_ffmpeg(video_path) + + if waveform is None or len(waveform) == 0: + print(" ❌ Could not extract audio") + continue + + # Step 2: Fix shape for PANNs + if waveform.ndim == 1: + waveform = waveform.reshape(1, -1) + + # Step 3: Run PANNs (ROBUST version) + output = model.inference(waveform) + + # Handle both dict and tuple return types + if isinstance(output, dict): + clipwise_output = output['clipwise_output'] + elif isinstance(output, (list, tuple)): + clipwise_output = output[0] + else: + clipwise_output = output + + # Make sure we have a 1D array of scores + if hasattr(clipwise_output, 'ndim') and clipwise_output.ndim > 1: + clipwise_output = clipwise_output[0] + + # Get top 5 predictions + top5_idx = np.argsort(clipwise_output)[::-1][:5] + top5 = [(model.labels[i], float(clipwise_output[i])) for i in top5_idx] + + results.append({ + "video": filename, + "top_predictions": top5 + }) + + print(f"Top predictions for {filename}:") + for label, score in top5: + print(f" {label:<45} → {score:.4f}") + + except Exception as e: + print(f" ❌ Error processing {filename}: {e}") + +# Save results +with open(OUTPUT_FILE, "w", encoding="utf-8") as f: + for r in results: + f.write(f"\n=== {r['video']} ===\n") + for label, score in r['top_predictions']: + f.write(f"{label}: {score:.4f}\n") + +print(f"\n✅ Results saved to {OUTPUT_FILE}") \ No newline at end of file diff --git a/panns_results.txt b/panns_results.txt new file mode 100644 index 0000000..fcf5953 --- /dev/null +++ b/panns_results.txt @@ -0,0 +1,70 @@ + +=== ankletscare.mp4 === +Music: 
0.7518 +Jingle bell: 0.1245 +Jingle, tinkle: 0.0387 +Speech: 0.0277 +Animal: 0.0271 + +=== fight.mp4 === +Music: 0.9443 +Speech: 0.9156 +Shatter: 0.8145 +Inside, large room or hall: 0.3370 +Inside, small room: 0.2187 + +=== firecrackers.mp4 === +Speech: 0.6879 +Machine gun: 0.1686 +Fusillade: 0.1474 +Gunshot, gunfire: 0.0894 +Laughter: 0.0889 + +=== mridangam.mp4 === +Music: 0.7131 +Percussion: 0.1665 +Musical instrument: 0.1333 +Wood block: 0.0917 +Drum: 0.0884 + +=== rat+things.mp4 === +Music: 0.7510 +Speech: 0.5726 +Shatter: 0.1167 +Breaking: 0.1042 +Meow: 0.0927 + +=== rrr-forest.mp4 === +Speech: 0.6852 +Music: 0.6793 +Roar: 0.4513 +Roaring cats (lions, tigers): 0.0762 +Animal: 0.0719 + +=== sample_video.mp4 === +Silence: 0.3943 +Music: 0.1086 +Vehicle: 0.0728 +Speech: 0.0586 +Inside, small room: 0.0350 + +=== sound scare1.mp4 === +Music: 0.8226 +Shatter: 0.2257 +Chink, clink: 0.1326 +Breaking: 0.0824 +Glass: 0.0641 + +=== soundshock2.mp4 === +Shatter: 0.8903 +Music: 0.7428 +Breaking: 0.1848 +Scary music: 0.1504 +Speech: 0.1074 + +=== splash.mp4 === +Speech: 0.6401 +Music: 0.5673 +Splash, splatter: 0.1755 +Slosh: 0.1097 +Water: 0.0800 diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..d191a7f --- /dev/null +++ b/requirements.txt @@ -0,0 +1,7 @@ +tensorflow>=2.12.0 +tensorflow-hub>=0.13.0 +mediapipe>=0.10.0 +opencv-python>=4.8.0 +numpy>=1.24.0 +soundfile>=0.12.1 +librosa>=0.10.0 \ No newline at end of file diff --git a/utils/__init__.py b/utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/utils/logger.py b/utils/logger.py new file mode 100644 index 0000000..829a2d1 --- /dev/null +++ b/utils/logger.py @@ -0,0 +1,59 @@ +import logging +import sys +from pathlib import Path + + +def get_logger(name: str, level: str = "INFO") -> logging.Logger: + + logger = logging.getLogger(name) + if logger.handlers: + return logger + + logger.setLevel(getattr(logging, level.upper(), logging.INFO)) + + handler = 
logging.StreamHandler(sys.stdout) + handler.setLevel(logging.DEBUG) + + fmt = ( + "%(asctime)s %(levelname)-8s " + "%(name)-28s %(message)s" + ) + handler.setFormatter(_ColourFormatter(fmt)) + logger.addHandler(handler) + logger.propagate = False + return logger + + +class _ColourFormatter(logging.Formatter): + + COLOURS = { + "DEBUG": "\033[36m", # cyan + "INFO": "\033[32m", # green + "WARNING": "\033[33m", # yellow + "ERROR": "\033[31m", # red + "CRITICAL": "\033[35m", # magenta + } + RESET = "\033[0m" + + def format(self, record: logging.LogRecord) -> str: + colour = self.COLOURS.get(record.levelname, "") + record.levelname = f"{colour}{record.levelname}{self.RESET}" + return super().format(record) + + +def setup_file_logger(output_dir: str, level: str = "DEBUG") -> None: + root = logging.getLogger() + if any(isinstance(h, logging.FileHandler) for h in root.handlers): + return # already attached + + log_path = Path(output_dir) / "pipeline.log" + log_path.parent.mkdir(parents=True, exist_ok=True) + + fh = logging.FileHandler(str(log_path), mode="a", encoding="utf-8") + fh.setLevel(getattr(logging, level.upper(), logging.DEBUG)) + fh.setFormatter( + logging.Formatter( + "%(asctime)s %(levelname)-8s %(name)s %(message)s" + ) + ) + root.addHandler(fh) \ No newline at end of file diff --git a/utils/srt_writer.py b/utils/srt_writer.py new file mode 100644 index 0000000..98ff7dc --- /dev/null +++ b/utils/srt_writer.py @@ -0,0 +1,140 @@ +import json +from dataclasses import asdict, dataclass +from pathlib import Path +from typing import List, Optional + +from utils.logger import get_logger + +log = get_logger(__name__) + + +# ───────────────────────────────────────────────────────────────────────────── +# Data Structures +# ───────────────────────────────────────────────────────────────────────────── + +@dataclass +class CaptionEntry: + """One finalized CC subtitle entry ready for SRT output.""" + index: int + start_sec: float + end_sec: float + caption_text: str + 
category: str + priority: str + audio_score: float + visual_score: float + fusion_score: float + + +# ───────────────────────────────────────────────────────────────────────────── +# Time Helpers +# ───────────────────────────────────────────────────────────────────────────── + +def _sec_to_srt_time(seconds: float) -> str: + """Convert float seconds → SRT timestamp HH:MM:SS,mmm.""" + seconds = max(0.0, seconds) + # Round once to whole milliseconds so that a carry (e.g. 1.9996 s) + # propagates into the seconds field instead of being clamped to 999 ms. + total_ms = int(round(seconds * 1000)) + hours, rem = divmod(total_ms, 3_600_000) + minutes, rem = divmod(rem, 60_000) + secs, millis = divmod(rem, 1000) + # All four fields are now non-negative integers. + return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}" + + +# ───────────────────────────────────────────────────────────────────────────── +# SRT Writer +# ───────────────────────────────────────────────────────────────────────────── + +def write_srt(entries: List[CaptionEntry], output_path: str) -> None: + if not entries: + log.warning("No caption entries to write; SRT file will be empty.") + + # Sort by start time + sorted_entries = sorted(entries, key=lambda e: e.start_sec) + + # Reduce timeline overlaps: trim each end to just before the next start, keeping at least 0.1 s on screen + for i in range(len(sorted_entries) - 1): + curr = sorted_entries[i] + nxt = sorted_entries[i + 1] + if curr.end_sec > nxt.start_sec: + curr.end_sec = max(curr.start_sec + 0.1, nxt.start_sec - 0.05) + + out_path = Path(output_path) + out_path.parent.mkdir(parents=True, exist_ok=True) + + with open(out_path, "w", encoding="utf-8") as fh: + for idx, entry in enumerate(sorted_entries, start=1): + start_ts = _sec_to_srt_time(entry.start_sec) + end_ts = _sec_to_srt_time(entry.end_sec) + fh.write(f"{idx}\n") + fh.write(f"{start_ts} --> {end_ts}\n") + fh.write(f"{entry.caption_text}\n") + fh.write("\n") + + log.info("SRT written → %s (%d entries)", out_path, len(sorted_entries)) + + +# ───────────────────────────────────────────────────────────────────────────── +# JSON Report Writer +# 
───────────────────────────────────────────────────────────────────────────── + +def write_json_report( + entries: List[CaptionEntry], + audio_events: list, + visual_scores: dict, + output_path: str, + video_path: Optional[str] = None, +) -> None: + """ + Write a comprehensive JSON report containing: + - pipeline metadata + - raw audio events from Module 1 + - visual scores from Module 2 + - finalized CC entries from Module 3 + + Parameters + ---------- + entries : finalized CaptionEntry list + audio_events : raw AudioEvent list from Module 1 + visual_scores : dict {timestamp: VisualScore} from Module 2 + output_path : destination .json file path + video_path : optional path to the source video (for reference) + """ + import datetime + + report = { + "meta": { + "tool": "Intelligent CC Suggestion Tool", + "version": "2.0.0", + "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"), + "video_path": str(video_path) if video_path else None, + }, + "summary": { + "total_audio_events": len(audio_events), + "total_visual_windows": len(visual_scores), + "total_captions": len(entries), + }, + "audio_events": [_serialise(e) for e in audio_events], + "visual_scores": { + str(round(k, 3)): _serialise(v) + for k, v in visual_scores.items() + }, + "captions": [asdict(e) for e in entries], + } + + out_path = Path(output_path) + out_path.parent.mkdir(parents=True, exist_ok=True) + + with open(out_path, "w", encoding="utf-8") as fh: + json.dump(report, fh, indent=2, ensure_ascii=False) + + log.info("JSON report written → %s", out_path) + + +def _serialise(obj) -> dict: + try: + return asdict(obj) + except TypeError: + return vars(obj) if hasattr(obj, "__dict__") else str(obj) \ No newline at end of file
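
The overlap handling inside `write_srt` in `utils/srt_writer.py` can be illustrated in isolation. The sketch below is a hypothetical standalone helper: `trim_overlaps`, `MIN_DURATION_SEC`, and `GAP_SEC` are illustrative names that do not exist in the repository. It applies the same rule to plain `(start, end)` tuples, with the 0.1 s minimum duration and 0.05 s gap copied from `write_srt`. Unlike `write_srt`, it returns new tuples rather than modifying the caller's entries, which is one possible design choice and not the author's.

```python
from typing import List, Tuple

# Values copied from write_srt; the constant names themselves are assumptions.
MIN_DURATION_SEC = 0.1   # a caption is never trimmed below this duration
GAP_SEC = 0.05           # breathing room left before the next caption starts


def trim_overlaps(spans: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return spans sorted by start time, each end trimmed before the next start.

    The input list is left untouched; a new list of tuples is returned.
    """
    ordered = sorted(spans, key=lambda span: span[0])
    trimmed: List[Tuple[float, float]] = []

    for i, (start, end) in enumerate(ordered):
        if i + 1 < len(ordered):
            next_start = ordered[i + 1][0]
            if end > next_start:
                # Same rule as write_srt: stop just before the next caption,
                # but keep the caption visible for a minimum duration.
                end = max(start + MIN_DURATION_SEC, next_start - GAP_SEC)
        trimmed.append((start, end))

    return trimmed
```

Calling `trim_overlaps([(0.0, 2.0), (1.5, 3.0)])` shortens the first span so that it ends just before 1.5 s. When two captions start less than 0.15 s apart, the minimum duration takes precedence and a small overlap remains, so the rule reduces overlaps without guaranteeing their absence.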