This week you take HW5's working SFT pipeline (Qwen2.5-0.5B + MLX/HF-TRL+QLoRA) and turn it into a data-quality lab for an interview-prep agent. You will quantify diversity, generate synthetic data via three different strategies, filter it with multi-criteria rejection sampling, train three ablation variants, and validate with a pairwise Elo tournament plus a frontier-model comparison. Each TODO cell auto-saves to outputs/homework_reflection.md when you run it, so the reflection writes itself as you work.
- Quantify data diversity with TF-IDF cosine, vocabulary richness (TTR/MTLD), and category balance — not just eyeballing the JSON.
- Generate synthetic data via Self-Instruct, Distillation, and Evolutionary strategies, and know when each is the right tool.
- Apply multi-criteria LLM-as-judge rejection sampling with chain-of-thought scoring and iterative refinement of the bottom quartile.
- Run real ablation experiments — three SFT variants on different data mixtures (`max_steps=30`) and compare loss + downstream quality.
- Implement a pairwise Elo tournament over anonymous A/B labels with position-swap to neutralize judge bias.
- Compare your model against Claude (required) and GPT (optional) on quality and per-token cost — frontier is the honest yardstick.
- Update HW5's gbrain skill with the improved adapter so the interview-prep skill picks up the new weights.
- Write a production model card documenting dataset, training run, eval results, known failure modes, and license.
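The rejection-sampling objective above can be sketched in a few lines. This is a minimal illustration, not the code in `src/rejection_sampler.py`: the axis names, the 1-5 score scale, the thresholds, and the `scores` field on each record are all assumptions made for the example.

```python
# Hypothetical rubric axes and record layout; the real judge may differ.
AXES = ("instruction_following", "factuality", "format", "depth")


def mean_score(record):
    """Average the per-axis judge scores of one record."""
    return sum(record["scores"][axis] for axis in AXES) / len(AXES)


def passes(record, min_per_axis=3, min_mean=3.5):
    """A record is kept only if every axis clears a floor AND the mean clears a bar."""
    lowest = min(record["scores"][axis] for axis in AXES)
    return lowest >= min_per_axis and mean_score(record) >= min_mean


def rejection_sample(records, min_per_axis=3, min_mean=3.5):
    """Split records into (kept, rejected) using the multi-criteria gate."""
    kept, rejected = [], []
    for record in records:
        target = kept if passes(record, min_per_axis, min_mean) else rejected
        target.append(record)
    return kept, rejected


def bottom_quartile(records):
    """Lowest-scoring quarter by mean score: the candidates for re-generation."""
    ranked = sorted(records, key=mean_score)
    return ranked[: len(ranked) // 4]
```

Gating on the per-axis minimum as well as the mean stops one strong axis from hiding a weak one, for example a well-formatted answer that is factually wrong.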
- Install: `pip install -r requirements.txt`
- Ablation runtime: ~15–25 min total for 3 variants on M2/M3
- Smoke time per cell: ~2 min for synthetic generation, ~5 min per SFT variant
- No CUDA, no quantization libraries needed; MLX handles everything natively.
- Install: `pip install -r requirements.txt` + ensure `nvidia-smi` shows your GPU
- 8–16 GB VRAM target; `max_seq_length=512` fits comfortably with QLoRA 4-bit
- Ablation runtime: ~30–60 min total for 3 variants depending on GPU
- See `docs/unsloth_bonus.md` for a 2× speedup option on CUDA.
See HW5's docs/stack_choice_guide.md for Modal / Lambda Labs / Runpod walkthroughs. For HW6 specifically: the AblationRunner works unchanged in cloud — point it at a mounted volume for outputs/.
- Completed HW5. HW6 reuses HW5's adapter and dataset as the baseline in the Elo tournament and the frontier comparison. NB01 will create stubs if HW5 artifacts are missing, but real HW5 outputs make the deltas meaningful.
- Python 3.10+ recommended (3.9 mostly works with deprecation warnings).
- `ANTHROPIC_API_KEY` in `.env` (required — every notebook needs Claude as generator or judge).
- One of: Apple Silicon Mac (Path A), GPU with 8+ GB VRAM (Path B), or CPU only (extremely slow, smoke tests only).
- OPTIONAL: `OPENAI_API_KEY` for the NB07 bonus GPT battle.
cd Homework6-Submission
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # edit to add your keys
python -c "import sys; sys.path.insert(0,'.'); from src import LLMClient, append_to_reflection; print('OK')"
If the final `python -c` line prints `OK`, your environment is ready.
Homework6-Submission/
├── README.md # this file
├── LICENSE # MIT
├── requirements.txt # extends HW5 with seaborn + openai
├── .env.example # copy to .env and fill in keys
├── Class 6 Homework.ipynb # source assignment notebook (instructor copy)
├── notebooks/
│ ├── 00_setup_and_smoke.ipynb # env check + API smoke + HW5 artifact discovery
│ ├── 01_baseline_recap.ipynb # load HW5 adapter; baseline Elo + sanity gen
│ ├── 02_diversity_audit.ipynb # TF-IDF, MTLD, category balance on HW5 data
│ ├── 03_synthetic_strategies.ipynb # Self-Instruct + Distillation + Evolutionary
│ ├── 04_rejection_sampling.ipynb # multi-criteria judge + iterative refinement
│ ├── 05_ablation_training.ipynb # 3 SFT variants on different mixtures
│ ├── 06_elo_tournament.ipynb # pairwise A/B with position swap + ratings
│ ├── 07_real_model_battle.ipynb # vs Claude (req'd) + GPT (optional)
│ └── 08_skill_integration.ipynb # merge adapter, update gbrain skill, model card
├── src/
│ ├── __init__.py # public exports (named exports only)
│ ├── llm_client.py # Claude wrapper from HW5, +OpenAI optional
│ ├── reflection.py # append_to_reflection() for auto-saved answers
│ ├── dataset_io.py # JSONL/JSON load/save with schema check
│ ├── diversity_metrics.py # TF-IDF, TTR, MTLD, category balance
│ ├── diversity_report.py # plot + json export of all metrics
│ ├── self_instruct.py # Self-Instruct synthetic strategy
│ ├── distillation.py # Strong-teacher distillation strategy
│ ├── evolutionary_gen.py # Evol-Instruct / WizardLM-style mutator
│ ├── synthetic_pipeline.py # orchestrator: blend strategies, dedupe
│ ├── multi_criteria_judge.py # 4-axis CoT scorer
│ ├── rejection_sampler.py # filter + iterative re-gen
│ ├── ablation_runner.py # train N variants, log to scoreboard
│ ├── sft_runner_mlx.py # Path A backend
│ ├── sft_runner_hf.py # Path B backend
│ ├── elo_tournament.py # K-factor Elo with position swap
│ ├── pairwise_judge.py # anonymous A/B grader with CoT
│ ├── real_model_battle.py # frontier comparison harness
│ ├── pii_scrub.py # regex + optional spacy NER
│ ├── model_card.py # render outputs/final_model_card.md
│ ├── skill_wrapper.py # gbrain interview-prep skill update
│ └── cost_tracker.py # token + USD accounting per notebook
├── docs/
│ ├── data_quality_playbook.md # SFT data-quality field guide
│ └── unsloth_bonus.md # Linux+CUDA 2× speedup
├── outputs/ # generated by notebooks (gitignored content)
└── test_data/ # tiny fixtures for smoke tests
| NB | Title | Topic | Time | Key Deliverable |
|---|---|---|---|---|
| 00 | Setup & Smoke | Env check, API ping, HW5 artifact discovery | 10 min | outputs/env_report.json |
| 01 | Baseline Recap | Load HW5 adapter, generate sanity completions | 15 min | outputs/baseline_completions.json |
| 02 | Diversity Audit | TF-IDF, MTLD, balance on HW5 dataset | 30 min | outputs/diversity_baseline.json |
| 03 | Synthetic Strategies | Self-Instruct + Distillation + Evolutionary blend | 35 min | outputs/synthetic_v2_dataset.json |
| 04 | Rejection Sampling | Multi-criteria judge, iterative re-gen | 30 min | outputs/rejection_sampled_dataset.json |
| 05 | Ablation Training | 3 SFT variants, scoreboard | 35 min | outputs/ablation_scoreboard.json |
| 06 | Elo Tournament | Pairwise A/B, position swap, ratings | 30 min | outputs/elo_tournament_results.json |
| 07 | Real-Model Battle | vs Claude required, GPT optional | 30 min | outputs/real_model_battle.json |
| 08 | Skill Integration | Merge adapter, gbrain update, model card | 30 min | outputs/final_model_card.md |
Every TODO cell auto-saves to outputs/homework_reflection.md when you run the cell.
In each notebook:
- TODO cells contain a `todo1_reflection = """..."""` placeholder string with hints.
- You edit the placeholder to your answer (the hints stay as scaffolding).
- The summary cell at the end of the notebook automatically calls `append_to_reflection()` with your answers.
You don't need to copy/paste anything manually. Just edit the placeholder, run the cells, and the reflection builds itself.
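To make the auto-save mechanism concrete, here is a rough sketch of what an append helper of this kind might look like. The function name `append_reflection_sketch`, its parameters, and the Markdown layout it writes are hypothetical; the actual `append_to_reflection()` in `src/reflection.py` may take different arguments and format its output differently.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_reflection_sketch(notebook, answers, path="outputs/homework_reflection.md"):
    """Append one notebook's answers to the reflection file as a Markdown section.

    `answers` maps a placeholder name (e.g. "todo1_reflection") to its text.
    The file is opened in append mode, so earlier notebooks are never overwritten.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [f"## {notebook} ({stamp})", ""]
    for name, text in answers.items():
        lines += [f"### {name}", "", text.strip(), ""]
    with out.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return out
```

Because the sketch only appends, re-running a summary cell adds a second timestamped section instead of replacing the first; whether the real helper deduplicates is not stated here.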
- `outputs/homework_reflection.md` — auto-built across NB01–NB08 via `append_to_reflection()`
- `outputs/synthetic_v2_dataset.json` — ≥ 80 records, ≥ 3 categories [NB03]
- `outputs/rejection_sampled_dataset.json` — kept set after multi-criteria filter [NB04]
- `outputs/diversity_baseline.json` + `outputs/diversity_after_synth.json` — before/after metrics [NB02–03]
- `outputs/ablation_scoreboard.json` — ≥ 3 variants with loss + qualitative scores [NB05]
- `outputs/elo_tournament_results.json` — pairwise wins, position-swap counts, final ratings [NB06]
- `outputs/real_model_battle.json` — Claude required, GPT optional [NB07]
- A merged adapter + Ollama Modelfile [NB08]
- `outputs/final_model_card.md` — dataset, training, eval, limitations, license [NB08]
- `outputs/my_project_update.md` — short narrative on what changed since HW5 [NB08]
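Before submitting, the size and category thresholds on the synthetic dataset can be checked mechanically. The sketch below assumes the file is a JSON list of objects and that each object carries a `category` field; both are guesses about the schema, so adjust the field name to whatever `src/dataset_io.py` actually enforces.

```python
import json
from pathlib import Path


def check_synthetic_dataset(path, min_records=80, min_categories=3):
    """Return a list of human-readable problems; an empty list means the file passes."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        return ["top-level JSON value is not a list of records"]
    # "category" is an assumed field name, not a confirmed part of the schema.
    categories = {r.get("category") for r in records if isinstance(r, dict)}
    categories.discard(None)
    problems = []
    if len(records) < min_records:
        problems.append(f"only {len(records)} records, need >= {min_records}")
    if len(categories) < min_categories:
        problems.append(f"only {len(categories)} categories, need >= {min_categories}")
    return problems
```

A typical call would be `check_synthetic_dataset("outputs/synthetic_v2_dataset.json")`, run once NB03 has written the file.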
- 4th synthetic strategy — GAN-style adversarial or co-teaching with two LLMs.
- Online Elo K-update — K=32 for the first 50 matches (burn-in), K=16 after.
- Use Unsloth on Linux for 2× faster ablation (see `docs/unsloth_bonus.md`).
- `crowd_judge` with 3+ different judges and disagreement analysis.
- Open-source the dataset on HuggingFace Hub with a model card.
- Automated PII scrubbing combining regex + spaCy NER.
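The K-schedule bonus can be sketched together with the position swap it sits on top of. This is an illustrative sketch under stated assumptions, not the code in `src/elo_tournament.py`: the `judge` callable and its `"first"`/`"second"`/`"tie"` return values are hypothetical, the starting rating of 1000 is arbitrary, and the burn-in is counted over all matches in the tournament, not per model.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo win expectation for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def k_factor(matches_played, burn_in=50):
    """K=32 during burn-in, K=16 afterwards (assumed: a global match counter)."""
    return 32.0 if matches_played < burn_in else 16.0


def update(ratings, a, b, score_a, matches_played):
    """Apply one zero-sum Elo update in place; score_a is 1, 0.5 or 0."""
    delta = k_factor(matches_played) * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta


def run_tournament(models, pairs, judge, start=1000.0):
    """Judge every pair twice, once per ordering, and rate each judgement."""
    ratings = {model: start for model in models}
    played = 0
    outcome_to_score = {"first": 1.0, "second": 0.0, "tie": 0.5}
    for a, b in pairs:
        for first, second in ((a, b), (b, a)):  # position swap
            # `judge` is a hypothetical callable returning "first", "second" or "tie".
            score_first = outcome_to_score[judge(first, second)]
            update(ratings, first, second, score_first, played)
            played += 1
    return ratings
```

With the swap in place, a judge that always prefers whichever response is shown first awards one win to each side of every pair, so its bias largely cancels out of the final ratings.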
| Task | Notebook | Cost (typical) |
|---|---|---|
| Claude synthetic generation (3 strategies) | NB03 | $1.00 – $2.00 |
| Claude multi-criteria judging + re-gen | NB04 | $0.50 – $1.00 |
| Claude pairwise + ablation eval | NB05–NB06 | $0.50 – $1.00 |
| Claude real-model battle | NB07 | $0.50 – $1.00 |
| Optional GPT (if `OPENAI_API_KEY` set) | NB07 | $0.10 – $0.30 |
| Total | | ~$2.50 – $5 |
| Worst case (heavy iteration) | | ~$7 |
src/cost_tracker.py tallies tokens and USD per notebook so you can see the running total live.
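As a rough picture of what per-notebook accounting involves, the sketch below converts token counts to USD and keeps a running total. The class name, method names, and the price table are all hypothetical; in particular the prices are placeholders, not real vendor pricing, and the real `src/cost_tracker.py` may be structured differently.

```python
from dataclasses import dataclass, field

# Placeholder prices in USD per million tokens. NOT real vendor pricing;
# "example-model" is a made-up model name for illustration.
PRICES_PER_MTOK = {"example-model": {"input": 1.00, "output": 5.00}}


@dataclass
class CostTrackerSketch:
    totals: dict = field(default_factory=dict)  # notebook name -> USD spent

    def record(self, notebook, model, input_tokens, output_tokens):
        """Add one API call's cost to the notebook's tally and return that cost."""
        price = PRICES_PER_MTOK[model]
        usd = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.totals[notebook] = self.totals.get(notebook, 0.0) + usd
        return usd

    def total(self):
        """Running total across all notebooks."""
        return sum(self.totals.values())

    def over_budget(self, max_budget_usd):
        """True once the running total reaches the cap."""
        return self.total() >= max_budget_usd
```

Checking `over_budget()` before each call is one simple way to make a generation loop stop cleanly once a spending cap is reached.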
- HW5 artifacts missing — NB01 creates stubs so HW6 still runs, but the baseline becomes a generic chat model and your Elo deltas will be meaningless. Re-run HW5's NB07 (`merge_adapter.ipynb`) before starting HW6 if at all possible.
- `scikit-learn` install fails — try `pip install scikit-learn --no-build-isolation` (common on Python 3.13 + macOS).
- `seaborn` import error in NB02 — `pip install seaborn`; the diversity report falls back to matplotlib-only plots if seaborn is unavailable.
- Ollama timeout — confirm `ollama serve` is running; raise the `OLLAMA_HOST` timeout via the `OLLAMA_TIMEOUT=120` env var.
- OpenAI not installed — NB07 bonus cells skip cleanly with a printed note. Install with `pip install openai` if you want the GPT battle.
- `mlx-lm` not found — Apple Silicon only; non-Mac users go Path B (HF+TRL+QLoRA).
- `peft` version mismatch with TRL — pin `peft==0.14.0` and `trl==0.13.0`; newer pre-releases break `SFTTrainer` constructor signatures.
- `llama.cpp` missing for GGUF export — NB08 prints install instructions; the deploy step is skipped gracefully if `llama.cpp` isn't on `$PATH`.
- Judge LLM rate limits — reduce concurrent calls (`max_workers=2`); Haiku has higher RPM than Sonnet, swap if you're being throttled.
- Ablation OOM on small GPU — drop `max_seq_length` to 256 and set `per_device_train_batch_size=1`; LoRA `r=8` instead of `r=16` saves another ~10% VRAM.
- Cell hangs on synthetic data gen — set a lower `max_budget_usd` in `synthetic_pipeline`; the cell aborts cleanly when the budget is hit.
- Pairwise judge timeout — a CoT pairwise call can take 30s+; let it finish or reduce `n_rounds` in NB06.
- `bitsandbytes` import error on Mac — expected; it's lazy-imported in `sft_runner_hf.py` and the code falls back to fp16/bf16 when missing.
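Several of the items above (seaborn, OpenAI, `bitsandbytes`) rely on the same pattern: probe for an optional package and fall back instead of crashing. A minimal sketch of that pattern follows; the function names and the returned labels are illustrative and are not taken from the repository.

```python
import importlib.util


def has_module(name):
    """True if a top-level package can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_precision():
    """Use 4-bit QLoRA only when bitsandbytes is installed; otherwise fall back."""
    return "4bit" if has_module("bitsandbytes") else "fp16/bf16"


def pick_plot_backend():
    """Prefer seaborn for the diversity plots, else plain matplotlib."""
    return "seaborn" if has_module("seaborn") else "matplotlib"


def gpt_battle_enabled():
    """The optional GPT cells only make sense when the openai package is present."""
    return has_module("openai")
```

Probing with `find_spec` keeps the check cheap, since the heavy import happens only on the code path that actually needs the package.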
- Self-Instruct — bootstrap synthetic prompts by asking an LLM to expand a small seed set.
- Distillation (synthetic data) — use a stronger teacher model (e.g. Claude Sonnet) to produce gold-standard responses you train your smaller model to imitate.
- Evolutionary generation — iteratively mutate prompts (deepen, broaden, add constraints) and keep the winners by judge score (WizardLM Evol-Instruct).
- Rejection Sampling — generate many candidates, filter to the ones that pass quality gates, optionally re-generate the rejected pile.
- Multi-Criteria Judge — score each example on multiple axes (instruction-following, factuality, format, depth) instead of a single rating.
- Chain-of-Thought scoring — judge explains its reasoning before emitting the score, which raises consistency.
- Pairwise comparison — judge sees two anonymous responses (A vs B) and picks a winner; more reliable than absolute scoring.
- Elo rating — 1v1 ranking system (chess-origin) where each rating moves in proportion to the gap between the actual result and the expected win probability.
- Position bias — judges systematically prefer the first or second response; mitigate with position swap (run each pair twice).
- Crowd judging — aggregate multiple judge LLMs to reduce per-judge bias.
- Data ablation — train N variants on different data mixtures to isolate which slice of the data drives quality gains.
- MTLD — Measure of Textual Lexical Diversity, a vocabulary richness score designed to be far less sensitive to text length than TTR.
- TTR — Type-Token Ratio, vocabulary diversity = unique tokens / total tokens.
- Lexical diversity (TF-IDF) — `1 - mean_pairwise_cosine_similarity` on TF-IDF vectors; higher = more linguistically varied.
- Anonymous A/B labels — pairwise responses are labeled "Model X / Model Y" so the judge can't see which fine-tune produced which response.
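The TTR and TF-IDF definitions in this list translate directly into code. The sketch below is a dependency-free illustration of the two formulas, not the implementation in `src/diversity_metrics.py` (which may well use scikit-learn); the regex tokenizer and the smoothed IDF weighting `ln((1 + n) / (1 + df)) + 1` are choices made for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase word tokenizer; a deliberately simple assumption."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ttr(text):
    """Type-Token Ratio: unique tokens / total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tfidf_vectors(docs):
    """One sparse {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def lexical_diversity(docs):
    """1 - mean pairwise cosine similarity over all document pairs."""
    sims = [cosine(u, v) for u, v in combinations(tfidf_vectors(docs), 2)]
    return 1.0 - sum(sims) / len(sims) if sims else 0.0
```

The pairwise loop is quadratic in the number of documents, which is fine for a dataset of a few hundred records but would need a vectorized approach for much larger ones.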
- Karpathy nanochat — https://github.com/karpathy/nanochat (read the data section)
- Self-Instruct paper (Wang et al., 2022) — https://arxiv.org/abs/2212.10560
- LMSYS Chatbot Arena — https://chat.lmsys.org (the canonical pairwise-Elo deployment)
- HuggingFace TRL — https://huggingface.co/docs/trl
- gbrain agent architecture — https://github.com/garrytan/gbrain
- HW5 README — for the upstream SFT pipeline this builds on
- `docs/data_quality_playbook.md` — distilled lecture notes
- `docs/unsloth_bonus.md` — Linux+CUDA 2× accelerator path
- Day 1 — NB00 setup + NB01 baseline (25 min). Verify HW5 artifacts load.
- Day 2 — NB02 diversity measurement (30 min). Read `data_quality_playbook.md` §2.
- Day 3 — NB03 synthetic strategies (35 min, longest cell is the synthesis loop).
- Day 4 — NB04 rejection sampling (30 min). Tune your judge rubric.
- Day 5 — NB05 ablation training (35 min, longest compute step of the week).
- Day 6 — NB06 Elo tournament + NB07 real-model comparison (60 min combined).
- Day 7 — NB08 integration + final reflection sweep (30 min).
Week 7 likely covers serving optimization (vLLM, continuous batching, AWQ/GPTQ quantization beyond GGUF) or agents/tool-use building on the gbrain skill. This README will be updated when the Class 7 syllabus is finalized.
- Course GitHub — file an issue with the notebook number and the cell that failed; include the last 20 lines of stderr.
- Office hours — see syllabus.