Course: Machine Learning Engineer in the Generative AI Era Week: 3 of 10 Topics: Web Scraping, OCR, ASR, Data Cleaning, TTS, Voice Agents
This week you'll build a complete pretraining data pipeline -- from web scraping and OCR to data cleaning and deduplication -- mirroring what companies like Meta (LLaMA 4) and HuggingFace (FineWeb) use to prepare training data for frontier LLMs. You'll also build a voice agent that combines speech recognition, LLM reasoning, and text-to-speech into a conversational pipeline.
The homework is updated for 2025-2026 with modern tools: Crawl4AI for LLM-ready web extraction, Marker/Docling for PDF processing, faster-whisper for ASR, Kokoro for TTS, and Pipecat for voice agents.
- Extract clean text from web pages using trafilatura and Crawl4AI
- Perform OCR on images and PDFs with Tesseract, EasyOCR, Marker, and Docling
- Transcribe audio using faster-whisper with model size trade-offs
- Build a data cleaning pipeline with MinHash deduplication and PII removal
- Synthesize speech with edge-tts and Kokoro TTS
- Design a voice agent pipeline (ASR -> LLM -> TTS)
- Default model:
claude-sonnet-4-6 - Cost: ~$1-3 for the full assignment
- Requires:
ANTHROPIC_API_KEYin.env
- Default model:
qwen3.5:27b - Requires: ~20GB RAM,
ollama pull qwen3.5:27b
- Use both Claude + Ollama
- Python 3.11+
- ffmpeg (for audio processing)
- tesseract (for OCR)
# macOS
brew install tesseract ffmpeg
# Ubuntu/Debian
sudo apt install tesseract-ocr ffmpeg# 1. Clone the repository
git clone <repo-url>
cd Homework3-Submission
# 2. Create virtual environment (recommended)
python3.11 -m venv .venv
source .venv/bin/activate # macOS/Linux
# 3. Install Python packages
pip install -r requirements.txt
# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
# 5. (Path B/C only) Start Ollama
ollama serve
ollama pull qwen3.5:27bHomework3-Submission/
├── README.md
├── requirements.txt
���── .env.example
├── .gitignore
├── notebooks/
│ ├── 00_setup_verification.ipynb # 5 min
�� ├── 01_environment_setup.ipynb # 20 min
│ ├── 02_web_scraping.ipynb # 30 min
│ ├── 03_document_ocr.ipynb # 30 min
│ ├─�� 04_speech_recognition.ipynb # 25 min
│ ├── 05_data_cleaning.ipynb # 30 min
│ ├── 06_text_to_speech.ipynb # 25 min
│ ├── 07_voice_agent.ipynb # 30 min
│ └── 08_project_integration.ipynb # 35 min
├── src/
│ ├── __init__.py
��� ├── config.py # PATH, model settings
│ ├── llm_client.py # Claude + Ollama unified client
│ ├── cost_tracker.py # API cost tracking
│ ├── utils.py # Formatting, reflection helpers
│ ├── prompt_templates.py # CO-STAR framework
│ ├── scraping_utils.py # Web scraping (trafilatura, Crawl4AI)
│ ├── ocr_utils.py # OCR (Tesseract, EasyOCR, Marker, Docling)
│ ├── audio_utils.py # ASR (faster-whisper) + TTS (Kokoro, edge-tts)
│ └── data_pipeline.py # Data cleaning (dedup, PII, quality)
├── outputs/ # Auto-created by notebooks
│ ├── homework_reflection.md # Primary deliverable
│ ├── my_project_update.md # Project integration
│ ├── arxiv_papers.json # Scraped papers
│ └── cleaned_data.json # Pipeline output
├── test_data/ # Sample data for experiments
│ ├── audio/
│ ├── image/
│ └── data/
└── docs/
| Notebook | Topic | Time | Key Deliverable |
|---|---|---|---|
| 00 | Setup Verification | 5 min | -- |
| 01 | Environment Setup | 20 min | path_selection.md |
| 02 | Web Scraping & Text Extraction | 30 min | arxiv_papers.json |
| 03 | Document OCR & PDF Extraction | 30 min | OCR comparison analysis |
| 04 | Speech Recognition (ASR) | 25 min | Transcription comparison |
| 05 | Data Cleaning Pipeline | 30 min | cleaned_data.json |
| 06 | Text-to-Speech & Voice Synthesis | 25 min | Audio files |
| 07 | Voice Agent with Pipecat | 30 min | Multi-turn conversation |
| 08 | Project Integration | 35 min | my_project_update.md |
Total estimated time: ~4 hours
outputs/homework_reflection.md(70%) -- Built incrementally across all notebooksoutputs/my_project_update.md(20%) -- Data pipeline strategy for your capstone- All notebooks executed with TODOs filled in (10%)
| Path | Model | Estimated Cost |
|---|---|---|
| A | claude-sonnet-4-6 | ~$1.50-3.00 |
| B | qwen3.5:27b (Ollama) | Free |
| C | Hybrid | ~$0.50-1.50 |
Tesseract not found:
brew install tesseract # macOS
sudo apt install tesseract-ocr # Linuxffmpeg not found:
brew install ffmpeg # macOS
sudo apt install ffmpeg # LinuxOllama connection refused:
ollama serve # Start the server
ollama pull qwen3.5:27b # Pull the default modeledge-tts fails in Jupyter:
pip install nest-asyncio
# Then add at top of notebook: import nest_asyncio; nest_asyncio.apply()Marker/Docling import errors: These are optional packages that require PyTorch. Install separately:
pip install marker-pdf # For Marker
pip install docling # For Docling- Crawl4AI deep dive (+5%): Install Crawl4AI and compare with trafilatura on 10+ URLs
- Kokoro TTS (+5%): Install Kokoro and compare audio quality with edge-tts
- Full voice pipeline (+10%): Build a working FastAPI voice server with ASR->LLM->TTS
- Pipecat integration (+10%): Build a Pipecat voice agent with WebRTC transport
| Day | Notebooks | Focus |
|---|---|---|
| 1 | 00, 01 | Setup and path selection |
| 2 | 02 | Web scraping and text extraction |
| 3 | 03 | OCR and document extraction |
| 4 | 04, 05 | ASR and data cleaning pipeline |
| 5 | 06 | TTS and voice synthesis |
| 6 | 07 | Voice agent pipeline |
| 7 | 08 | Project integration and submission |
- Discord: #week-3 channel
- Office Hours: Check course calendar