70 changes: 70 additions & 0 deletions .gitignore
@@ -0,0 +1,70 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
.env/
.venv/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# OS
.DS_Store
Thumbs.db

# Model cache
models/cache/
*.h5
*.tflite

# Media files (test inputs/outputs)
*.mp4
*.avi
*.mkv
*.mov
*.wav
*.mp3
*.srt
!tests/fixtures/*.srt

# Logs
*.log
logs/

# Jupyter
.ipynb_checkpoints/
150 changes: 150 additions & 0 deletions README.md
@@ -0,0 +1,150 @@
# Intelligent Closed Caption (CC) Suggestion Tool

An AI-powered tool that identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary, such as when a non-speech audio event meaningfully affects the speakers or the scene. It then suggests contextually relevant CC text without over-captioning routine or low-impact sounds.

## Architecture

```
┌────────────┐    ┌──────────────────┐    ┌──────────────────────┐
│ Video File │───▶│ Audio Extractor  │───▶│ Sound Event Detector │
│  (input)   │    │ (ffmpeg/moviepy) │    │       (YAMNet)       │
└─────┬──────┘    └──────────────────┘    └──────────┬───────────┘
      │                                              │
      │           ┌──────────────────┐               │
      └──────────▶│ Frame Extractor  │               │
                  │     (OpenCV)     │               │
                  └────────┬─────────┘               │
                           │                         │
                  ┌────────▼─────────┐               │
                  │ Reaction Detector│               │
                  │   (MediaPipe)    │               │
                  └────────┬─────────┘               │
                           │                         │
                  ┌────────▼─────────────────────────▼───┐
                  │          CC Decision Engine          │
                  │   Combines audio + visual signals    │
                  └──────────────────┬───────────────────┘
                            ┌────────▼────────┐
                            │  SRT Generator  │
                            └────────┬────────┘
                            ┌────────▼────────┐
                            │   output.srt    │
                            └─────────────────┘
```
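
The final merge step in the diagram can be sketched roughly as follows. This is only an illustration: the weights, threshold, and minimum gap are the defaults from `config/settings.py`, but the weighted-sum rule, the `Candidate` class, and both function names are assumptions, since the decision engine itself is not part of this change.

```python
from dataclasses import dataclass
from typing import List

# Defaults copied from config/settings.py.
AUDIO_WEIGHT = 0.6
VISUAL_WEIGHT = 0.4
CC_DECISION_THRESHOLD = 0.5
MIN_CC_GAP = 2.0


@dataclass
class Candidate:
    """Hypothetical container, not the project's SoundEvent/ReactionEvent."""
    start: float        # event start time in seconds
    label: str          # sound class, e.g. "Explosion"
    audio_conf: float   # sound event confidence in [0, 1]
    visual_conf: float  # visible reaction confidence in [0, 1]


def combined_score(c: Candidate) -> float:
    """Assumed rule: weighted sum of audio and visual confidence."""
    return AUDIO_WEIGHT * c.audio_conf + VISUAL_WEIGHT * c.visual_conf


def select_captions(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates above the threshold, at least MIN_CC_GAP apart."""
    kept: List[Candidate] = []
    for c in sorted(candidates, key=lambda item: item.start):
        if combined_score(c) < CC_DECISION_THRESHOLD:
            continue  # too weak once both signals are weighed
        if kept and c.start - kept[-1].start < MIN_CC_GAP:
            continue  # too close to the previously kept caption
        kept.append(c)
    return kept
```

Under this rule, a loud sound with no visible reaction scores at most 0.6, so ambient noise with moderate audio confidence stays below the 0.5 threshold unless someone on screen reacts to it.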

## Features

- **Sound Event Detection** — Automatically detects and classifies non-speech audio events (honking, explosions, laughter, music, alarms, applause, etc.) with confidence scores and timestamps using YAMNet.
- **Speaker Reaction Detection** — Analyzes video frames at detected event timestamps using MediaPipe to identify visible reactions (head turns, startled body language, facial expressions).
- **Intelligent CC Decisions** — Combines audio and visual signals to determine whether a CC annotation is truly warranted, avoiding over-captioning of ambient sounds.
- **SRT Output** — Generates standard SRT subtitle files with properly formatted timestamps and descriptive CC labels like `[honking]`, `[crowd cheering]`, `[gunshot]`.
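
For reference, an SRT cue consists of a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the caption text, followed by a blank line. The sketch below shows one way to format such a cue; `format_timestamp` and `format_entry` are hypothetical names for illustration and are not taken from `srt_generator.py`.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_entry(index: int, start: float, end: float, label: str) -> str:
    """Build one SRT cue: index, time range, bracketed label, blank line."""
    return (
        f"{index}\n"
        f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
        f"[{label}]\n\n"
    )


if __name__ == "__main__":
    print(format_entry(1, 12.0, 14.0, "honking"), end="")
```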

## Prerequisites

- **Python 3.9+**
- **FFmpeg**, installed and available on your system PATH
  - Windows: `choco install ffmpeg` or download from [ffmpeg.org](https://ffmpeg.org/download.html)
  - macOS: `brew install ffmpeg`
  - Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
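
To confirm that FFmpeg is visible on the PATH before running the tool, a quick check along these lines may help. It uses only the standard library, and `find_ffmpeg` is a hypothetical helper name, not part of this project.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if absent."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("FFmpeg not found on PATH; install it before running the tool.")
    else:
        print(f"FFmpeg found at: {path}")
```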

## Installation

1. **Clone the repository**
   ```bash
   git clone https://github.com/PlanetRead/Intelligent-cc-generation.git
   cd Intelligent-cc-generation
   ```

2. **Create a virtual environment**
   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/macOS
   venv\Scripts\activate      # Windows
   ```

3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Install in development mode** (optional)
   ```bash
   pip install -e .
   ```

## 🎯 Usage

### Extract audio from a video file
```python
from src.utils.audio_extractor import AudioExtractor

extractor = AudioExtractor()
audio_path = extractor.extract("input_video.mp4")
print(f"Audio saved to: {audio_path}")
```
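
The internals of `AudioExtractor` are not shown here. As a rough idea of what extraction with FFmpeg could involve, the sketch below builds a command line that writes 16 kHz audio in the configured format. `build_ffmpeg_command` is a hypothetical helper, and the mono downmix is an assumption based on YAMNet expecting mono 16 kHz input.

```python
from pathlib import Path
from typing import List

AUDIO_SAMPLE_RATE = 16000  # default from config/settings.py
AUDIO_FORMAT = "wav"       # default from config/settings.py


def build_ffmpeg_command(video_path: str, output_dir: str) -> List[str]:
    """Hypothetical helper: argv for extracting mono audio with FFmpeg."""
    out = Path(output_dir) / f"{Path(video_path).stem}.{AUDIO_FORMAT}"
    return [
        "ffmpeg",
        "-y",                            # overwrite an existing output file
        "-i", video_path,                # input video
        "-vn",                           # drop the video stream
        "-ac", "1",                      # downmix to mono (assumption)
        "-ar", str(AUDIO_SAMPLE_RATE),   # resample to 16 kHz
        str(out),
    ]


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_command("input_video.mp4", "output/audio")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists.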

### Full pipeline (coming soon)
```bash
python -m src.cli --input video.mp4 --output captions.srt
```

## Running Tests

```bash
pytest tests/ -v
```

## Project Structure

```
Intelligent-cc-generation/
├── src/
│   ├── __init__.py
│   ├── cli.py                      # CLI entry point
│   ├── utils/
│   │   ├── __init__.py
│   │   └── audio_extractor.py      # Video → Audio extraction
│   ├── detectors/
│   │   ├── __init__.py
│   │   ├── sound_event_detector.py # YAMNet-based audio analysis
│   │   └── reaction_detector.py    # MediaPipe-based visual analysis
│   ├── models/
│   │   ├── __init__.py
│   │   ├── event.py                # SoundEvent dataclass
│   │   ├── reaction.py             # ReactionEvent dataclass
│   │   └── cc_suggestion.py        # CCSuggestion dataclass
│   ├── engine/
│   │   ├── __init__.py
│   │   └── decision_engine.py      # CC decision combiner
│   └── output/
│       ├── __init__.py
│       └── srt_generator.py        # SRT file writer
├── config/
│   ├── __init__.py
│   └── settings.py                 # Configuration defaults
├── tests/
│   ├── __init__.py
│   ├── test_audio_extractor.py
│   └── fixtures/
├── requirements.txt
├── setup.py
├── .gitignore
└── README.md
```

## Tech Stack

| Component | Technology |
|-----------|-----------|
| Language | Python 3.9+ |
| Audio Event Detection | [YAMNet](https://tfhub.dev/google/yamnet/1) (TensorFlow Hub) |
| Frame Extraction | [OpenCV](https://opencv.org/) |
| Pose & Expression Analysis | [MediaPipe](https://mediapipe.dev/) |
| Audio Extraction | [FFmpeg](https://ffmpeg.org/) via moviepy |
| Output Format | SRT (SubRip Subtitle) |
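
YAMNet scores audio in 0.96 s frames with a 0.48 s hop, which matches the window and hop defaults in `config/settings.py`. The sketch below shows one possible way to turn per-frame scores for a single class into timed events by merging consecutive frames above the confidence threshold. The function names and the merging rule are assumptions for illustration, not the project's detector.

```python
from typing import List, Optional, Tuple

# Defaults copied from config/settings.py.
ANALYSIS_WINDOW_SIZE = 0.96
ANALYSIS_HOP_LENGTH = 0.48
SOUND_CONFIDENCE_THRESHOLD = 0.3

Event = Tuple[float, float, float]  # (start_s, end_s, peak_score)


def _close_run(first: int, last: int, peak: float) -> Event:
    """Convert a run of frame indices into start/end times in seconds."""
    start = first * ANALYSIS_HOP_LENGTH
    end = last * ANALYSIS_HOP_LENGTH + ANALYSIS_WINDOW_SIZE
    return (round(start, 2), round(end, 2), peak)


def frames_to_events(scores: List[float]) -> List[Event]:
    """Merge consecutive above-threshold frames into timed events."""
    events: List[Event] = []
    run_start: Optional[int] = None
    peak = 0.0
    for i, score in enumerate(scores):
        if score >= SOUND_CONFIDENCE_THRESHOLD:
            if run_start is None:
                run_start, peak = i, score   # a new run begins
            else:
                peak = max(peak, score)      # extend the current run
        elif run_start is not None:
            events.append(_close_run(run_start, i - 1, peak))
            run_start = None
    if run_start is not None:                # run reaches the last frame
        events.append(_close_run(run_start, len(scores) - 1, peak))
    return events
```

Each event ends one full window after the start of its last frame, so a single isolated frame yields a 0.96 s event.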


## License

This project is part of the [PlanetRead](https://www.planetread.org/) initiative under the DMP 2026 program.
1 change: 1 addition & 0 deletions config/__init__.py
@@ -0,0 +1 @@
"""Configuration package."""
115 changes: 115 additions & 0 deletions config/settings.py
@@ -0,0 +1,115 @@
"""Configuration settings for the Intelligent CC Suggestion Tool."""

import os


# =============================================================================
# Audio Extraction Settings
# =============================================================================

# Default audio sample rate for extracted audio (Hz)
AUDIO_SAMPLE_RATE = 16000

# Default audio format for extracted files
AUDIO_FORMAT = "wav"

# Default output directory for extracted audio files
AUDIO_OUTPUT_DIR = os.path.join(os.getcwd(), "output", "audio")


# =============================================================================
# Sound Event Detection Settings
# =============================================================================

# Minimum confidence threshold for a sound event to be considered
SOUND_CONFIDENCE_THRESHOLD = 0.3

# Analysis window size in seconds for the sound event detector
ANALYSIS_WINDOW_SIZE = 0.96 # YAMNet default patch size

# Hop length between analysis windows in seconds
ANALYSIS_HOP_LENGTH = 0.48

# Non-speech event categories to detect (YAMNet class names)
# Full list: https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv
TARGET_SOUND_EVENTS = [
    "Gunshot, gunfire",
    "Explosion",
    "Glass",
    "Breaking",
    "Siren",
    "Car alarm",
    "Vehicle horn, car horn, honking",
    "Screaming",
    "Crying, sobbing",
    "Laughter",
    "Applause",
    "Cheering",
    "Crowd",
    "Dog",
    "Thunder",
    "Alarm",
    "Bell",
    "Door",
    "Knock",
    "Telephone",
    "Music",
    "Singing",
    "Drum",
    "Fire",
    "Water",
    "Rain",
    "Wind",
]


# =============================================================================
# Reaction Detection Settings
# =============================================================================

# Number of frames to extract around each event timestamp
REACTION_FRAME_COUNT = 10

# Time window (seconds) before and after event to look for reactions
REACTION_TIME_WINDOW = 1.5

# Minimum confidence for a reaction to be considered significant
REACTION_CONFIDENCE_THRESHOLD = 0.4

# Head turn angle threshold (degrees) to consider as a reaction
HEAD_TURN_THRESHOLD = 15.0

# Pose change threshold (normalized) for startled body language
POSE_CHANGE_THRESHOLD = 0.1


# =============================================================================
# CC Decision Engine Settings
# =============================================================================

# Weight for audio event confidence in the final decision
AUDIO_WEIGHT = 0.6

# Weight for visual reaction confidence in the final decision
VISUAL_WEIGHT = 0.4

# Combined confidence threshold for generating a CC annotation
CC_DECISION_THRESHOLD = 0.5

# Minimum duration (seconds) between consecutive CC annotations
# to avoid overwhelming the viewer
MIN_CC_GAP = 2.0


# =============================================================================
# Output Settings
# =============================================================================

# Default output format
OUTPUT_FORMAT = "srt"

# Default output directory for generated subtitle files
OUTPUT_DIR = os.path.join(os.getcwd(), "output")

# Default CC display duration (seconds) if not determined by event duration
DEFAULT_CC_DURATION = 2.0
20 changes: 20 additions & 0 deletions requirements.txt
@@ -0,0 +1,20 @@
# Core dependencies
moviepy>=1.0.3
numpy>=1.24.0

# Audio/Video processing
librosa>=0.10.0
soundfile>=0.12.0
opencv-python>=4.8.0

# ML Models
tensorflow>=2.13.0
tensorflow-hub>=0.14.0
mediapipe>=0.10.0

# Testing
pytest>=7.4.0
pytest-cov>=4.1.0

# Utilities
pydub>=0.25.1