13 changes: 13 additions & 0 deletions .gitignore
@@ -0,0 +1,13 @@
# Virtual environment
venv/
.venv/

# Python cache
__pycache__/
*.pyc

# VS Code
.vscode/

# macOS
.DS_Store
272 changes: 272 additions & 0 deletions README.md
@@ -0,0 +1,272 @@
# Intelligent CC Suggestion Tool (DMP 2026)

A Python pipeline that detects non-speech audio events in a video and checks for a visible on-screen reaction before flagging each event as a CC candidate. The idea is to avoid dumping every background sound into the caption track: only events that visibly affect what's happening on screen should get a CC.

This submission covers:

* Goal 1 — Sound Event Detection
* Goal 2 — Speaker Reaction Detection

---

# File Structure

```text
.
├── sound_event_detector.py # Goal 1
├── reaction_detector.py # Goal 2
├── requirements.txt
└── README.md
```

---

# Prerequisites

You'll need `ffmpeg` installed on your machine and available on your `PATH`.

## macOS

```bash
brew install ffmpeg
```

## Ubuntu / Debian

```bash
sudo apt install ffmpeg
```

---

# Setup & Installation

## 1. Clone the repository

```bash
git clone <your-repo-url>
cd Intelligent-cc-generation
```

## 2. Create a virtual environment

### macOS / Linux

```bash
python3 -m venv venv
```

### Windows

```bash
python -m venv venv
```

---

## 3. Activate the virtual environment

### macOS / Linux

```bash
source venv/bin/activate
```

### Windows

```bash
venv\Scripts\activate
```

---

## 4. Install dependencies

```bash
pip install -r requirements.txt
```

The first run will download the YAMNet model (~25 MB) from TensorFlow Hub and cache it locally. Subsequent runs load the cached copy and skip the download.

---

# Goal 1 — Sound Event Detection

Run:

```bash
python3 sound_event_detector.py video.mp4
```

This extracts the audio track with ffmpeg, converting it to mono 16 kHz WAV (the input format YAMNet expects), runs it through YAMNet, and returns a list of non-speech events with timestamps and confidence scores.
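
As a rough illustration of the extraction step, the sketch below builds an ffmpeg command for mono 16 kHz 16-bit PCM output. The function names `build_ffmpeg_cmd` and `extract_audio` are hypothetical and are not taken from `sound_event_detector.py`; the actual script may pass different flags.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video_path, wav_path):
    """Return an ffmpeg command that writes a mono 16 kHz 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video_path),
        "-vn",                   # drop the video stream
        "-ac", "1",              # one audio channel (mono)
        "-ar", "16000",          # 16 kHz sample rate
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        str(wav_path),
    ]


def extract_audio(video_path, wav_path=None):
    """Run ffmpeg and return the path of the extracted WAV file.

    Hypothetical helper: by default the WAV is written next to the video.
    """
    video_path = Path(video_path)
    wav_path = Path(wav_path) if wav_path else video_path.with_suffix(".wav")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(
        build_ffmpeg_cmd(video_path, wav_path),
        check=True,
        capture_output=True,
    )
    return wav_path
```

Keeping the command builder separate from the `subprocess` call makes the flags easy to inspect without having ffmpeg installed.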

## Implementation details

### Speech filtering

The detector treats YAMNet class indices `0–6` as speech variants:

* Speech
* Male speech
* Female speech
* Child speech
* Conversation
* Narration
* Whispering

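These classes are dropped unconditionally before any other processing. For Hindi and regional-language content where dialogue is dense, a blocklist works noticeably better than a confidence threshold, since confidently detected speech would pass a threshold and crowd out the non-speech events.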
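
A minimal sketch of the blocklist idea follows, assuming per-frame scores arrive as plain lists of floats. The index range comes from the text above and is worth checking against the class map shipped with the model; `top_non_speech` and `min_conf` are hypothetical names, not the script's own.

```python
SPEECH_INDICES = frozenset(range(0, 7))  # indices 0-6, as stated in this README


def top_non_speech(frame_scores, class_names, min_conf=0.3):
    """Return (label, confidence) per frame, or None when nothing qualifies.

    frame_scores: one list of per-class scores for each YAMNet frame.
    min_conf: hypothetical threshold, applied only after speech is removed.
    """
    results = []
    for scores in frame_scores:
        # Blocklist first: speech classes never compete, however confident.
        candidates = [
            (score, idx)
            for idx, score in enumerate(scores)
            if idx not in SPEECH_INDICES
        ]
        if not candidates:
            results.append(None)
            continue
        best_score, best_idx = max(candidates)
        if best_score >= min_conf:
            results.append((class_names[best_idx], best_score))
        else:
            results.append(None)
    return results
```

The point of the ordering is that a frame dominated by speech can still surface a quieter non-speech class, which a plain top-1 pick would hide.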

---

### Event merging

Consecutive YAMNet frames with the same label are merged into a single event, and the peak confidence is kept. YAMNet emits one score frame every ~0.48s, each covering a 0.96s window.

Without this, a 2-second gunshot would show up as roughly four separate entries.
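
The merging step can be sketched as a single pass over the per-frame labels. This is an illustrative version, assuming each frame is either `None` or a `(label, confidence)` pair and that frame `i` spans `i * 0.48` to `(i + 1) * 0.48` seconds; `merge_events` is a hypothetical name.

```python
HOP_SECONDS = 0.48  # assumed spacing between consecutive YAMNet score frames


def merge_events(frames, hop=HOP_SECONDS):
    """Merge runs of consecutive frames that share a label.

    frames: list where each item is (label, confidence) or None.
    Returns dicts with start/end times in seconds and the peak confidence.
    """
    events = []
    current = None
    for i, frame in enumerate(frames):
        label, conf = frame if frame is not None else (None, 0.0)
        if current is not None and label == current["label"]:
            # Same label as the previous frame: extend the open event.
            current["last"] = i
            current["conf"] = max(current["conf"], conf)
            continue
        if current is not None:
            events.append(current)
        current = None
        if label is not None:
            current = {"label": label, "first": i, "last": i, "conf": conf}
    if current is not None:
        events.append(current)
    return [
        {
            "label": e["label"],
            "start": round(e["first"] * hop, 2),
            "end": round((e["last"] + 1) * hop, 2),
            "conf": e["conf"],
        }
        for e in events
    ]
```

A gap of even one unlabeled frame closes the current event, so two bursts of the same sound separated by silence stay separate.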

---

### CC label mapping

There's a lookup table that maps YAMNet class names to readable CC labels:

* `[gunshot]`
* `[glass breaking]`
* `[applause]`
* etc.

Anything not in the table falls back to the first word of the YAMNet class name.
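
A small sketch of the lookup with its fallback is shown below. The table entries are illustrative only, and wrapping the fallback word in lowercase brackets is an assumption based on the sample output; `CC_LABELS` and `to_cc_label` are hypothetical names.

```python
CC_LABELS = {
    # Illustrative entries only; the real table lives in sound_event_detector.py.
    "Gunshot, gunfire": "[gunshot]",
    "Applause": "[applause]",
    "Music": "[music]",
}


def to_cc_label(class_name):
    """Map a non-empty YAMNet class name to a caption label."""
    if class_name in CC_LABELS:
        return CC_LABELS[class_name]
    # Fallback: first word of the class name, without trailing punctuation.
    first_word = class_name.split()[0].strip(",;:").lower()
    return f"[{first_word}]"
```

For example, a class such as `Vehicle horn, car horn, honking` is not in the table above, so it would fall back to `[vehicle]`.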

---

## Sample output

```text
3.36s – 5.28s [music] (conf: 0.87) [Music]
12.00s – 12.48s [gunshot] (conf: 0.74) [Gunshot, gunfire]
18.72s – 19.20s [applause] (conf: 0.61) [Applause]
```

---

# Goal 2 — Speaker Reaction Detection

Run:

```bash
python3 reaction_detector.py video.mp4
```

This runs Goal 1 first, then for each detected event it pulls three frames from the video:

* one just before the event midpoint
* one at the midpoint
* one just after

It then scores how much visible change happened around that moment.
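
The three sampling points can be derived from an event's start and end times. In this sketch the `offset` of 0.25 seconds is a placeholder, since the text only says "just before" and "just after"; `sample_times` is a hypothetical name.

```python
def sample_times(start, end, offset=0.25):
    """Return (before, mid, after) timestamps in seconds for one event.

    offset is a hypothetical gap around the midpoint, not a value
    taken from reaction_detector.py.
    """
    mid = (start + end) / 2.0
    before = max(0.0, mid - offset)  # never seek before the start of the video
    after = mid + offset
    return before, mid, after
```

With OpenCV, each timestamp could then be used to seek a `cv2.VideoCapture` via `CAP_PROP_POS_MSEC` before reading a frame, although the script may select frames differently.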

---

## Reaction scoring pipeline

### Motion score

Computes the grayscale pixel difference for two frame pairs:

* before/mid
* mid/after

The mean difference intensity of each pair is normalized to `0–1`.

The higher of the two values is used so the system captures both anticipatory movement and delayed reactions.
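
The motion score can be sketched in plain Python on small grayscale frames represented as nested lists of 0–255 values. Dividing by 255 is only one possible normalization and is an assumption here; the script most likely works on NumPy/OpenCV arrays and may scale differently. `mean_abs_diff` and `motion_score` are hypothetical names.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two grayscale frames, in 0-1.

    Frames are nested lists of 0-255 intensities with identical shapes.
    """
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    if count == 0:
        return 0.0
    return total / (255.0 * count)  # assumed normalization to 0-1


def motion_score(before, mid, after):
    """Take the larger of the before/mid and mid/after differences."""
    return max(mean_abs_diff(before, mid), mean_abs_diff(mid, after))
```

Taking the maximum rather than the average means a single abrupt change on either side of the midpoint is enough to register.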

---

### Face score

Runs Haar cascade face detection on the midpoint frame.

The largest detected face is expressed as a fraction of the total frame area, then scaled up.

The intuition is that a close-up reacting face carries more signal than a tiny face in a wide shot.
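
The area-based scoring might look like the sketch below. It takes face rectangles in the `(x, y, w, h)` form that OpenCV's `CascadeClassifier.detectMultiScale` returns. The `scale` factor of 5.0 and the cap at 1.0 are assumptions, since the text only says the fraction is "scaled up"; `face_score` is a hypothetical name.

```python
def face_score(faces, frame_width, frame_height, scale=5.0):
    """Score the largest face by its share of the frame area.

    faces: iterable of (x, y, w, h) rectangles.
    scale: hypothetical multiplier; the result is capped at 1.0.
    """
    faces = list(faces)
    if not faces:
        return 0.0
    largest_area = max(w * h for (_x, _y, w, h) in faces)
    fraction = largest_area / float(frame_width * frame_height)
    return min(1.0, scale * fraction)
```

Under these assumptions a face filling 10% of the frame scores 0.5, while a face filling 20% or more saturates at 1.0.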

---

### Final reaction score

```text
reaction_score = (0.7 × motion_score) + (0.3 × face_score)
```

Motion receives the higher weight because movement is generally a stronger reaction signal than mere face visibility.

Events with a reaction score below `0.25` are filtered out entirely.
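
The weighting and threshold translate directly into a few lines of Python. The weights and cutoff are the ones stated above; the dictionary field names and the function names are hypothetical.

```python
MOTION_WEIGHT = 0.7
FACE_WEIGHT = 0.3
REACTION_THRESHOLD = 0.25


def reaction_score(motion, face):
    """Weighted combination of the motion and face scores."""
    return MOTION_WEIGHT * motion + FACE_WEIGHT * face


def filter_candidates(events):
    """Keep events whose reaction score reaches the threshold.

    events: dicts with "motion" and "face" keys (hypothetical field names).
    """
    kept = []
    for event in events:
        score = reaction_score(event["motion"], event["face"])
        if score >= REACTION_THRESHOLD:
            kept.append({**event, "reaction": round(score, 2)})
    return kept
```

For instance, motion 0.81 and face 0.55 give 0.7 × 0.81 + 0.3 × 0.55 = 0.732, which passes the threshold, while motion 0.03 and face 0.18 give 0.075, which does not.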

---

## Sample output

```text
=== Important CC Candidate Events ===

12.00s – 12.48s [gunshot] (audio:0.74) (motion:0.81) (face:0.55) (reaction:0.73)

18.72s – 19.20s [applause] (audio:0.61) (motion:0.52) (face:0.44) (reaction:0.50)

3.36s – 5.28s [music] (audio:0.87) (motion:0.03) (face:0.18) (reaction:0.08) ← filtered out
```

The background music event gets dropped even though it had high audio confidence because there was no visible reaction associated with it.

---

# Limitations & Next Steps

## YAMNet and Indian content

YAMNet is trained on AudioSet, which is heavily biased toward English-language and Western content.

Sounds like:

* dhol
* firecrackers
* devotional music

may not map cleanly to existing classes.

The fallback label helps, but benchmarking on actual PlanetRead clips would be valuable.

PANNs may be worth comparing for these cases, although they are also trained on AudioSet and so share its label set.

---

## Single-frame reaction scoring

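Scoring is currently anchored to the event midpoint: face detection runs on the midpoint frame only, and motion is measured across the three frames around it.

If someone reacts noticeably later than the midpoint, the system can miss it.

Sampling a short temporal window and taking the peak score would improve robustness.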
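
One way this proposal could look is sketched below. This is not implemented in the current scripts. `score_at` stands in for any function that returns a reaction score for a timestamp, and the window size and step are placeholder values.

```python
def peak_reaction(score_at, mid, half_window=0.5, step=0.1):
    """Sample score_at(t) around mid and return (best_time, best_score).

    score_at: hypothetical callable mapping a timestamp in seconds to a score.
    half_window, step: placeholder values in seconds.
    """
    steps = int(round(2 * half_window / step))
    # Evenly spaced timestamps from mid - half_window to mid + half_window.
    times = [max(0.0, mid - half_window + i * step) for i in range(steps + 1)]
    best_time = max(times, key=score_at)
    return best_time, score_at(best_time)
```

The trade-off is cost: each extra sample means another frame decode and scoring pass, so the window and step would need tuning against runtime on real clips.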

---

## Face detection quality

Haar cascades work reasonably well for frontal faces but struggle with:

* profile views
* partial occlusion
* motion blur

MediaPipe Pose + FaceMesh would provide richer reaction signals such as:

* head turns
* shoulder movement
* mouth opening

and are the natural upgrade path.

---
