Course: Machine Learning Engineer in the Generative AI Era
Week: 4 of 10
Topics: Document loading, chunking, embeddings, vector stores, hybrid retrieval, reranking, RAG evaluation
Capstone: Build a RAG-enabled "Resume AI Assistant" (the in-class project)
This week, you'll graduate from "asking an LLM and hoping" to grounded question answering over your own documents. By the end you will have built a complete production-style RAG pipeline that answers questions about your resume and portfolio (or any corpus you choose) with citations and an automated faithfulness score, so unsupported claims are flagged rather than passed off as answers.
You'll touch the entire stack: PDF extraction (PyMuPDF), four chunking strategies including 2024's contextual retrieval (Anthropic), three embedding models from the MTEB leaderboard, three vector stores (FAISS, Chroma, Qdrant), advanced retrieval (hybrid search + RRF, HyDE, MMR, multi-query rewriting), cross-encoder + FlashRank reranking, and four RAGAS-style evaluation metrics. The final notebook wires everything into a working assistant and (bonus) a FastAPI service.
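The "hybrid search + RRF" step merges a keyword ranking (e.g. BM25) with an embedding ranking using only rank positions. A minimal sketch, assuming each retriever returns chunk IDs ordered best first. The function name, the chunk IDs, and `k=60` (a commonly used constant) are illustrative and are not taken from `src/retrieval.py`:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it returns.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]   # hypothetical keyword ranking
dense_hits = ["c1", "c5", "c3"]  # hypothetical embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because RRF looks only at ranks, it needs no score normalization between BM25 scores and cosine similarities.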
- Extract clean text from PDFs, web pages, and the arXiv API
- Compare chunking strategies (recursive, semantic, contextual) and feel why chunk size dominates RAG quality
- Pick the right embedding model for YOUR corpus by Recall@k — not by the leaderboard
- Build, persist, and query indexes in FAISS, Chroma, and Qdrant
- Implement and benchmark hybrid search, HyDE, MMR, and multi-query retrieval
- Add a reranker (cross-encoder / FlashRank) — the highest-leverage 100ms in RAG
- Evaluate end-to-end with reference-free RAGAS-style metrics (faithfulness, context precision, answer relevancy)
- Build a working Resume RAG Assistant that refuses out-of-corpus questions
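Picking an embedding model "by Recall@k" means scoring each candidate on your own questions. The sketch below shows one way to compute it, assuming you have hand-labelled which chunk IDs are relevant to each question. The helper names, model labels, and data are hypothetical and are not the course's `embeddings.py` API:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average Recall@k over (retrieved, relevant) pairs, one pair per question."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical retrieval results for two models on the same two questions.
model_a = [(["c1", "c4", "c9"], ["c1", "c2"]), (["c7", "c3", "c8"], ["c3"])]
model_b = [(["c2", "c1", "c9"], ["c1", "c2"]), (["c5", "c6", "c3"], ["c3"])]

scoreboard = {
    "model_a": mean_recall_at_k(model_a, k=2),
    "model_b": mean_recall_at_k(model_b, k=2),
}
print(scoreboard)
```

The same labelled questions can be reused for every model and retrieval strategy, which keeps the comparisons like for like.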
- Default model: claude-sonnet-4-6 (with claude-haiku-4-5-20251001 for cheap eval calls)
- Cost: ~$0.50–$2.00 for the entire week
- Requires: ANTHROPIC_API_KEY in .env
- Default model: qwen3.5:27b (or llama3.1:8b if you have less RAM)
- Cost: $0
- Requires: ~20GB RAM, ollama pull qwen3.5:27b
- Heavy generation + eval on Claude, light/cheap calls on Ollama
- Best for cost-conscious learning if you have a local GPU
- Python 3.9+ (3.10 or 3.11 recommended for sentence-transformers)
- ~3 GB free disk (sentence-transformer + cross-encoder weights, FAISS, Chroma)
- No system packages required this week 🎉
# 1. Clone / cd into the homework folder
cd Homework4-Submission
# 2. Create + activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up your API key (Path A)
cp .env.example .env
# edit .env -- paste your ANTHROPIC_API_KEY
# 5. (Path B only) install + start Ollama, then pull a model
# https://ollama.com/download
ollama serve &
ollama pull qwen3.5:27b
# 6. Verify everything in nb00
jupyter notebook notebooks/00_setup_verification.ipynb
Homework4-Submission/
├── README.md # This file
├── requirements.txt # All Python deps
├── .env.example # Copy to .env
├── .gitignore
├── LICENSE
├── notebooks/
│ ├── 00_setup_verification.ipynb # ~5 min -- env check
│ ├── 01_environment_setup.ipynb # ~20 min -- pick path A/B/C
│ ├── 02_document_loading.ipynb # ~30 min -- PDF, arXiv, web
│ ├── 03_chunking_strategies.ipynb # ~30 min -- recursive, semantic, contextual
│ ├── 04_embeddings_deep_dive.ipynb # ~30 min -- MiniLM, BGE, OpenAI
│ ├── 05_vector_stores.ipynb # ~30 min -- FAISS, Chroma, Qdrant
│ ├── 06_retrieval_strategies.ipynb # ~35 min -- hybrid, HyDE, MMR, multi-query
│ ├── 07_reranking_evaluation.ipynb # ~35 min -- cross-encoder + RAGAS metrics
│ └── 08_project_integration.ipynb # ~40 min -- Resume RAG agent + FastAPI
├── src/
│ ├── __init__.py
│ ├── config.py # PATH, CLAUDE_MODEL, OLLAMA_MODEL
│ ├── llm_client.py # Unified Claude + Ollama client (reused)
│ ├── cost_tracker.py # Reused
│ ├── utils.py # Reused
│ ├── prompt_templates.py # CO-STAR templates
│ ├── document_loader.py # PDF, arXiv, web, directory loaders [NEW]
│ ├── chunking.py # Recursive, semantic, contextual [NEW]
│ ├── embeddings.py # Multi-backend embedding wrapper [NEW]
│ ├── vector_store.py # FAISS / Chroma / Qdrant unified [NEW]
│ ├── retrieval.py # BM25, hybrid+RRF, HyDE, MMR [NEW]
│ ├── reranker.py # CrossEncoder, FlashRank, Cohere [NEW]
│ ├── rag_evaluation.py # Faithfulness, recall, precision [NEW]
│ └── rag_pipeline.py # End-to-end pipeline glue [NEW]
├── outputs/ # Auto-created by notebooks
│ ├── homework_reflection.md # 70% — built incrementally
│ ├── my_project_update.md # 20% — generated by nb08
│ ├── path_selection.md # nb01
│ ├── setup_summary.txt # nb01
│ ├── corpus_stats.json # nb02
│ ├── embedding_scoreboard.json # nb04
│ ├── retrieval_scoreboard.json # nb06
│ ├── eval_result_07.json # nb07
│ ├── ab_reranker.json # nb07
│ ├── demo_qa.json # nb08
│ ├── project_index/ # nb08 -- persisted FAISS index
│ ├── chroma_db/ # nb05 -- persisted Chroma DB
│ └── main.py # nb08 bonus -- FastAPI server
├── test_data/
│ ├── sample_resume.pdf # Shipped fixture
│ ├── portfolio_notes.txt # Shipped fixture
│ ├── arxiv/ # nb02 downloads
│ └── my_corpus/ # YOUR docs (curated in nb02)
└── docs/
| Notebook | Topic | Time | Key Deliverable |
|---|---|---|---|
| 00 | Setup Verification | 5 min | — |
| 01 | Environment Setup | 20 min | path_selection.md |
| 02 | Document Loading & Extraction | 30 min | corpus_stats.json + curated my_corpus/ |
| 03 | Chunking Strategies | 30 min | Reflection + chunk-length plots |
| 04 | Embeddings Deep Dive | 30 min | embedding_scoreboard.json |
| 05 | Vector Stores | 30 min | Persisted Chroma + FAISS indexes |
| 06 | Retrieval Strategies | 35 min | retrieval_scoreboard.json |
| 07 | Reranking & Evaluation | 35 min | ab_reranker.json + RAGAS scores |
| 08 | Project Integration | 40 min | my_project_update.md + outputs/main.py |
Total estimated time: ~4–5 hours
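To make notebook 03 concrete before you open it, here is a simplified sketch of recursive chunking: split on the coarsest separator first and fall back to finer ones only for pieces that are still too long. It assumes character-based sizes and no overlap. The function name and sample text are illustrative, and `src/chunking.py` may work differently:

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too long, so retry with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


resume_text = (
    "Intro paragraph.\n\n"
    "Skills: Python, SQL.\n\n"
    "Projects: built a RAG assistant over my resume."
)
chunks_40 = recursive_split(resume_text, max_chars=40)
chunks_20 = recursive_split(resume_text, max_chars=20)
print(len(chunks_40), len(chunks_20))
```

Halving `max_chars` on the same text yields more, smaller chunks, which is the knob the chunk-size experiments in nb03 turn.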
| File | % | Notes |
|---|---|---|
| outputs/homework_reflection.md | 70% | Built incrementally by append_to_reflection() in nb01-08. Depth, evidence of experimentation, and clear reasoning. |
| outputs/my_project_update.md | 20% | Generated by nb08. Architecture, demo Q&A with sources + faithfulness scores, adversarial test, FastAPI write-up, next-step. |
| All 9 notebooks executed | 10% | TODOs filled in, no runtime errors. |
- Working FastAPI server (outputs/main.py) with a curl transcript in your reflection (+5%)
- Try a third embedding model not in the default set (Voyage / OpenAI / BGE-M3) and compare (+3%)
- Add a knowledge-graph layer (e.g. networkx over entity mentions) and show one query where it beats vanilla hybrid (+5%)
- Replace our evaluate_rag with the real RAGAS library on the same questions and compare scores (+3%)
| Path | Model | Approx. cost (full week) |
|---|---|---|
| A | claude-sonnet-4-6 + claude-haiku-4-5 (eval) | $0.50 – $2.00 |
| B | qwen3.5:27b (Ollama, local) | $0.00 |
| C | mostly haiku for eval, sonnet for nb08 | $0.20 – $0.80 |
Major cost drivers: nb03 contextual chunking (~10 LLM calls), nb06 multi-query rewriting + HyDE (~30 calls), nb07 RAGAS eval (~50 LLM-judge calls), nb08 demo + adversarial (~20 calls). The eval metrics dominate — that's why we default the judge to Haiku.
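The approximate call counts above can be tallied to see where the budget goes. This sketch uses only the numbers quoted in the paragraph. It deliberately leaves out per-call prices, since those depend on the model and your prompt lengths:

```python
# Approximate LLM calls per notebook, as quoted in the cost-driver summary.
calls = {"nb03": 10, "nb06": 30, "nb07": 50, "nb08": 20}

total_calls = sum(calls.values())
share = {nb: n / total_calls for nb, n in calls.items()}
largest = max(calls, key=calls.get)

for nb, n in calls.items():
    print(f"{nb}: {n:>3} calls ({share[nb]:.0%})")
print(f"total: {total_calls} calls, largest driver: {largest}")
```

By this tally the nb07 judge calls are the largest single share, so routing them to a cheaper model has the most effect on the total.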
Downloads ~80MB on first call. Cached after that. If behind a firewall, set HF_HUB_OFFLINE=1 once cached.
pip install faiss-cpu (the GPU build doesn't ship for macOS — that's fine, CPU is plenty for this homework).
Some Linux distros ship a too-old SQLite. Quick fix: pip install pysqlite3-binary then add at the top of any failing notebook:
__import__('pysqlite3'); import sys; sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
Safe to ignore; the chunked output is still valid.
qwen3.5:27b is ~17GB and won't fit in unified memory on most M1 / 8GB devices. Use llama3.1:8b instead, or switch to Path A.
You're rate-limited. Wait 60 seconds, or set download_pdfs=False in fetch_arxiv_papers and use the metadata only.
Big models (BGE-base) cache to ~/.cache/huggingface. Either clean it or set HF_HOME=/some/large/disk.
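If you prefer to relocate the cache from inside a notebook rather than from your shell, a minimal sketch follows. The directory is a hypothetical placeholder, so point it at any disk with a few GB free. The variable has to be set before any Hugging Face library is imported, because the cache location is read at import time:

```python
import os
from pathlib import Path

# Hypothetical location; replace with a path on a disk that has enough space.
cache_dir = Path.home() / "hf_cache"

# setdefault keeps any HF_HOME you already exported in your shell.
os.environ.setdefault("HF_HOME", str(cache_dir))

# Import sentence_transformers / transformers only after this point.
print("Hugging Face cache root:", os.environ["HF_HOME"])
```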
- Anthropic — Contextual Retrieval (2024)
- HyDE: Precise Zero-Shot Dense Retrieval
- Reciprocal Rank Fusion (RRF)
- LangChain RAG docs
- LlamaIndex RAG docs
- MTEB Leaderboard — current SOTA
- BAAI/bge-m3 — versatile open-source
- Cohere Rerank docs
| Day | Notebooks | Focus |
|---|---|---|
| Mon | 00, 01 | Get the env right, pick your path |
| Tue | 02 | Curate your corpus (this matters a lot — pick docs you actually want to query) |
| Wed | 03 | Try at least 2 chunking strategies on your corpus |
| Thu | 04, 05 | Pick your embedding model + vector store |
| Fri | 06 | Retrieval bake-off — find what works on YOUR data |
| Sat | 07 | Add the reranker, run RAGAS metrics |
| Sun | 08 | Build the Resume RAG Agent + (bonus) FastAPI; write my_project_update.md |
Week 5: Supervised Fine-Tuning (SFT) — you'll move from retrieval to teaching the model your style with custom Q&A pairs. Many real-world systems combine the two: RAG for knowledge, SFT for tone.
- Discord: #week-4 channel
- Office hours: see course calendar
- Stuck? Open an issue or message the instructor with: (a) which notebook, (b) the cell, (c) the full traceback