PlanetRead · imtushar01 · May 11, 2026
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,13 @@
+# Virtual environment
+venv/
+.venv/
+
+# Python cache
+__pycache__/
+*.pyc
+
+# VS Code
+.vscode/
+
+# macOS
+.DS_Store
diff --git a/README.md b/README.md
@@ -0,0 +1,272 @@
+# Intelligent CC Suggestion Tool (DMP 2026)
+
+A Python pipeline that detects non-speech audio events in a video and checks whether there's a visible reaction on screen before flagging it as a CC candidate. The idea is to avoid dumping every background sound into the caption track — only events that actually affect what's happening on screen should get a CC.
+
+This submission covers:
+
+* Goal 1 — Sound Event Detection
+* Goal 2 — Speaker Reaction Detection
+
+---
+
+# File Structure
+
+```text
+.
+├── sound_event_detector.py   # Goal 1
+├── reaction_detector.py      # Goal 2
+├── requirements.txt
+└── README.md
+```
+
+---
+
+# Prerequisites
+
+You'll need `ffmpeg` installed on your machine.
+
+## macOS
+
+```bash
+brew install ffmpeg
+```
+
+## Ubuntu / Debian
+
+```bash
+sudo apt install ffmpeg
+```
+
+---
+
+# Setup & Installation
+
+## 1. Clone the repository
+
+```bash
+git clone <your-repo-url>
+cd Intelligent-cc-generation
+```
+
+## 2. Create a virtual environment
+
+### macOS / Linux
+
+```bash
+python3 -m venv venv
+```
+
+### Windows
+
+```bash
+python -m venv venv
+```
+
+---
+
+## 3. Activate the virtual environment
+
+### macOS / Linux
+
+```bash
+source venv/bin/activate
+```
+
+### Windows
+
+```bash
+venv\Scripts\activate
+```
+
+---
+
+## 4. Install dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+The first run downloads the YAMNet model (~25 MB) from TensorFlow Hub and caches it locally. Later runs reuse the cached copy and skip the download.
+
+---
+
+# Goal 1 — Sound Event Detection
+
+Run:
+
+```bash
+python3 sound_event_detector.py video.mp4
+```
+
+This extracts the audio track with ffmpeg (converting it to mono 16 kHz WAV, the input format YAMNet expects), runs it through YAMNet, and returns a list of non-speech events with timestamps and confidence scores.
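+
+A minimal sketch of the extraction step, assuming ffmpeg is invoked through `subprocess`. The function names and the exact flag set are illustrative and may differ from what `sound_event_detector.py` does:
+
+```python
+import subprocess
+from pathlib import Path
+
+
+def build_ffmpeg_command(video_path: str, wav_path: str) -> list[str]:
+    """Build an ffmpeg command that writes a mono 16 kHz 16-bit PCM WAV."""
+    return [
+        "ffmpeg",
+        "-y",                    # overwrite the output file if it exists
+        "-i", video_path,        # input video
+        "-vn",                   # drop the video stream
+        "-ac", "1",              # one audio channel (mono)
+        "-ar", "16000",          # 16 kHz sample rate
+        "-acodec", "pcm_s16le",  # 16-bit PCM samples
+        wav_path,
+    ]
+
+
+def extract_audio(video_path: str, wav_path: str) -> Path:
+    """Run ffmpeg and return the path of the extracted WAV file."""
+    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
+    return Path(wav_path)
+```
+
+Keeping the command builder separate from the call makes the flags easy to inspect without running ffmpeg.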
+
+## Implementation details
+
+### Speech filtering
+
+YAMNet class indices `0–6` are all speech variants:
+
+* Speech
+* Male speech
+* Female speech
+* Child speech
+* Conversation
+* Narration
+* Whispering
+
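+These are hard-dropped before anything else. For Hindi/regional content where dialogue is dense, using a blocklist rather than a low-confidence filter makes a noticeable difference. YAMNet's class ordering is defined by the `yamnet_class_map.csv` file shipped with the model, so the index range should be checked against that file.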
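+
+One way to express the blocklist, sketched under the assumption that speech columns are masked before the top class of each frame is picked. The function name and the `min_conf` value are hypothetical:
+
+```python
+import numpy as np
+
+# Blocklisted class indices, taken from the 0-6 range described above.
+SPEECH_CLASS_IDS = tuple(range(7))
+
+
+def top_non_speech(scores, min_conf=0.3):
+    """Return (frame_index, class_id, confidence) for each kept frame.
+
+    `scores` has shape (num_frames, num_classes), like YAMNet's score output.
+    """
+    masked = np.array(scores, dtype=float)        # copy, leave the input intact
+    masked[:, list(SPEECH_CLASS_IDS)] = -np.inf   # speech can never win argmax
+    detections = []
+    for frame_idx, frame_scores in enumerate(masked):
+        class_id = int(np.argmax(frame_scores))
+        conf = float(frame_scores[class_id])
+        if conf >= min_conf:                      # hypothetical confidence floor
+            detections.append((frame_idx, class_id, conf))
+    return detections
+```
+
+With masking, a frame dominated by dialogue can still surface a non-speech class if that class clears the confidence floor.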
+
+---
+
+### Event merging
+
+Consecutive YAMNet frames with the same label are merged into a single event (YAMNet emits one score frame every 0.48 s). Peak confidence is preserved.
+
+Without this, a 2-second gunshot would show up as four separate entries.
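+
+The merge can be sketched as follows, assuming each frame has already been reduced to a `(frame_index, label, confidence)` tuple. The dict layout is illustrative, not necessarily the one the script uses:
+
+```python
+HOP_SECONDS = 0.48  # spacing between consecutive YAMNet score frames
+
+
+def merge_events(frames):
+    """Merge runs of consecutive frames that share a label.
+
+    `frames` is a list of (frame_index, label, confidence), sorted by index.
+    Each merged event keeps the peak confidence of its run.
+    """
+    events = []
+    for idx, label, conf in frames:
+        last = events[-1] if events else None
+        if last and last["label"] == label and last["last_idx"] == idx - 1:
+            # Same label on the very next frame: extend the current event.
+            last["last_idx"] = idx
+            last["end"] = round((idx + 1) * HOP_SECONDS, 2)
+            last["conf"] = max(last["conf"], conf)
+        else:
+            events.append({
+                "label": label,
+                "start": round(idx * HOP_SECONDS, 2),
+                "end": round((idx + 1) * HOP_SECONDS, 2),
+                "last_idx": idx,
+                "conf": conf,
+            })
+    return events
+```
+
+Four consecutive `Music` frames starting at index 7 collapse into one event spanning 3.36 s to 5.28 s.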
+
+---
+
+### CC label mapping
+
+There's a lookup table that maps YAMNet class names to readable CC labels:
+
+* `[gunshot]`
+* `[glass breaking]`
+* `[applause]`
+* etc.
+
+Anything not in the table falls back to the first word of the YAMNet class name.
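+
+A small sketch of the lookup with its fallback. The table excerpt below is hypothetical, and lowercasing the fallback word is an assumption based on the sample output:
+
+```python
+import re
+
+# Hypothetical excerpt; the real table lives in sound_event_detector.py.
+CC_LABELS = {
+    "Gunshot, gunfire": "gunshot",
+    "Applause": "applause",
+}
+
+
+def cc_label(class_name: str) -> str:
+    """Map a YAMNet class name to a bracketed CC label."""
+    if class_name in CC_LABELS:
+        return f"[{CC_LABELS[class_name]}]"
+    # Fallback: first word of the class name, split on spaces and commas.
+    first_word = re.split(r"[\s,]+", class_name.strip())[0]
+    return f"[{first_word.lower()}]"
+```
+
+The fallback is coarse: `Vehicle horn, car horn, honking` becomes `[vehicle]`, so frequently seen classes are worth adding to the table.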
+
+---
+
+## Sample output
+
+```text
+3.36s – 5.28s    [music]    (conf: 0.87)  [Music]
+12.00s – 12.48s  [gunshot]  (conf: 0.74)  [Gunshot, gunfire]
+18.72s – 19.20s  [applause] (conf: 0.61)  [Applause]
+```
+
+---
+
+# Goal 2 — Speaker Reaction Detection
+
+Run:
+
+```bash
+python3 reaction_detector.py video.mp4
+```
+
+This runs Goal 1 first, then for each detected event it pulls three frames from the video:
+
+* one just before the event midpoint
+* one at the midpoint
+* one just after
+
+It then scores how much visible change happened around that moment.
+
+---
+
+## Reaction scoring pipeline
+
+### Motion score
+
+Computes the grayscale pixel difference for two frame pairs: before/mid and mid/after.
+
+The mean diff intensity is normalized to `0–1`.
+
+The higher of the two values is used so the system captures both anticipatory movement and delayed reactions.
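+
+A NumPy-only sketch of the motion score. It assumes frames are `H x W x 3` `uint8` arrays, uses a plain channel mean for grayscale, and normalizes by dividing by 255. The actual script may use OpenCV's conversion and a different scaling:
+
+```python
+import numpy as np
+
+
+def to_gray(frame: np.ndarray) -> np.ndarray:
+    """Convert an H x W x 3 uint8 frame to float grayscale (channel mean)."""
+    return frame.astype(np.float32).mean(axis=2)
+
+
+def pair_diff(a: np.ndarray, b: np.ndarray) -> float:
+    """Mean absolute grayscale difference between two frames, scaled to 0-1."""
+    return float(np.abs(to_gray(a) - to_gray(b)).mean() / 255.0)
+
+
+def motion_score(before: np.ndarray, mid: np.ndarray, after: np.ndarray) -> float:
+    """Take the larger of the before/mid and mid/after differences."""
+    return max(pair_diff(before, mid), pair_diff(mid, after))
+```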
+
+---
+
+### Face score
+
+Runs Haar cascade face detection on the midpoint frame.
+
+The largest detected face is expressed as a fraction of the total frame area, then scaled up.
+
+The intuition is that a close-up reacting face carries more signal than a tiny face in a wide shot.
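+
+The face score can be sketched as below, taking `(x, y, w, h)` boxes in the layout OpenCV's cascade detector returns. The scale factor of `4.0` and the clamp to `1.0` are assumptions, since the exact scaling is not stated here:
+
+```python
+FACE_SCALE = 4.0  # hypothetical scale factor
+
+
+def face_score(faces, frame_width, frame_height, scale=FACE_SCALE):
+    """Score the largest face by its share of the frame area.
+
+    `faces` is a sequence of (x, y, w, h) boxes; an empty sequence scores 0.
+    """
+    if len(faces) == 0:
+        return 0.0
+    largest_area = max(w * h for (_x, _y, w, h) in faces)
+    fraction = largest_area / float(frame_width * frame_height)
+    return min(1.0, fraction * scale)  # clamp so close-ups cap at 1.0
+```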
+
+---
+
+### Final reaction score
+
+```text
+reaction_score = (0.7 × motion_score) + (0.3 × face_score)
+```
+
+Motion receives the higher weight because movement is generally a stronger reaction signal than mere face visibility.
+
+Events below `0.25` are filtered out entirely.
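+
+The weighting and threshold above translate directly into code. The constant and function names are illustrative:
+
+```python
+MOTION_WEIGHT = 0.7
+FACE_WEIGHT = 0.3
+REACTION_THRESHOLD = 0.25
+
+
+def reaction_score(motion: float, face: float) -> float:
+    """Weighted blend of the motion and face scores."""
+    return MOTION_WEIGHT * motion + FACE_WEIGHT * face
+
+
+def is_cc_candidate(motion: float, face: float) -> bool:
+    """Keep an event only if its reaction score reaches the threshold."""
+    return reaction_score(motion, face) >= REACTION_THRESHOLD
+```
+
+For example, a motion score of `0.81` and a face score of `0.55` give `0.7 × 0.81 + 0.3 × 0.55 = 0.732`, which clears the threshold.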
+
+---
+
+## Sample output
+
+```text
+=== Important CC Candidate Events ===
+
+12.00s – 12.48s  [gunshot]  (audio:0.74)  (motion:0.81)  (face:0.55)  (reaction:0.73)
+
+18.72s – 19.20s  [applause] (audio:0.61)  (motion:0.52)  (face:0.44)  (reaction:0.50)
+
+3.36s – 5.28s    [music]    (audio:0.87)  (motion:0.03)  (face:0.18)  (reaction:0.08)  ← filtered out
+```
+
+The background music line is annotated to show what gets dropped: despite its high audio confidence, the event is filtered out because no visible reaction accompanied it.
+
+---
+
+# Limitations & Next Steps
+
+## YAMNet and Indian content
+
+YAMNet is trained on AudioSet, which is heavily English/Western biased.
+
+Sounds like:
+
+* dhol
+* firecrackers
+* devotional music
+
+may not map cleanly to existing classes.
+
+The fallback label helps, but benchmarking on actual PlanetRead clips would be valuable.
+
+PANNs may be worth evaluating for these cases, though they are also trained on AudioSet and may share the same gaps.
+
+---
+
+## Single-frame reaction scoring
+
+Currently the face score uses only the midpoint frame, and the motion score uses only the three frames around it.
+
+If someone reacts slightly after an event, the system can miss it.
+
+Sampling a short temporal window and taking the peak score would improve robustness.
+
+---
+
+## Face detection quality
+
+Haar cascades work reasonably well for frontal faces but struggle with:
+
+* profile views
+* partial occlusion
+* motion blur
+
+MediaPipe Pose + FaceMesh is the natural upgrade path, providing richer reaction signals such as:
+
+* head turns
+* shoulder movement
+* mouth opening
+
+---
+