Week 6: SFT Part II — Data Quality, Diversity & Elo Evaluation

Overview

This week you take HW5's working SFT pipeline (Qwen2.5-0.5B + MLX/HF-TRL+QLoRA) and turn it into a data-quality lab for an interview-prep agent. You will quantify diversity, generate synthetic data via three different strategies, filter it with multi-criteria rejection sampling, train three ablation variants, and validate with a pairwise Elo tournament plus a frontier-model comparison. Each TODO cell auto-saves to outputs/homework_reflection.md when you run it, so the reflection writes itself as you work.

Learning Objectives

Quantify data diversity with TF-IDF cosine, vocabulary richness (TTR/MTLD), and category balance — not just eyeballing the JSON.
Generate synthetic data via Self-Instruct, Distillation, and Evolutionary strategies, and know when each is the right tool.
Apply multi-criteria LLM-as-judge rejection sampling with chain-of-thought scoring and iterative refinement of the bottom quartile.
Run real ablation experiments — three SFT variants on different data mixtures (max_steps=30) and compare loss + downstream quality.
Implement a pairwise Elo tournament over anonymous A/B labels, with position swap to neutralize judge bias.
Compare your model against Claude (required) and GPT (optional) on quality and per-token cost — frontier is the honest yardstick.
Update HW5's gbrain skill with the improved adapter so the interview-prep skill picks up the new weights.
Write a production model card documenting dataset, training run, eval results, known failure modes, and license.
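
As a rough illustration of the Elo objective above, here is a minimal sketch of one position-swapped match followed by a standard Elo update. The `judge` callable and its `"first"`/`"second"` return values are assumptions made for this example; the repo's `pairwise_judge.py` and `elo_tournament.py` may be organized quite differently.

```python
from typing import Callable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Move both ratings by k times (actual - expected); total rating is conserved."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def swapped_match(resp_a: str, resp_b: str, judge: Callable[[str, str], str]) -> float:
    """Judge the pair in both orders and average the two outcomes.

    `judge(first, second)` is a hypothetical callable returning "first" or
    "second". A judge that always favors one slot nets out to 0.5 (a draw).
    """
    a_shown_first = 1.0 if judge(resp_a, resp_b) == "first" else 0.0
    a_shown_second = 1.0 if judge(resp_b, resp_a) == "second" else 0.0
    return (a_shown_first + a_shown_second) / 2.0


if __name__ == "__main__":
    # Toy judge that simply prefers the longer response, regardless of slot.
    prefers_longer = lambda first, second: "first" if len(first) >= len(second) else "second"
    score = swapped_match("a detailed answer", "short", prefers_longer)
    print(update_elo(1000.0, 1000.0, score))
```

Averaging the two orderings means a purely positional preference contributes nothing to the rating change, which is the point of running each pair twice.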

Setup Options

Path A: MLX-LM (Recommended on Mac Apple Silicon)

Install: pip install -r requirements.txt
Ablation runtime: ~15-25 min total for 3 variants on M2/M3
Smoke time per cell: ~2 min for synthetic generation, ~5 min per SFT variant
No CUDA, no quantization libraries needed; MLX handles everything natively.

Path B: HF + PEFT + TRL + QLoRA (Recommended on Linux GPU)

Install: pip install -r requirements.txt, then confirm that nvidia-smi lists your GPU
8–16 GB VRAM target; max_seq_length=512 fits comfortably with QLoRA 4-bit
Ablation runtime: ~30–60 min total for 3 variants depending on GPU
See docs/unsloth_bonus.md for a 2× speedup option on CUDA.

Path C: Cloud GPU (Bonus)

See HW5's docs/stack_choice_guide.md for Modal / Lambda Labs / Runpod walkthroughs. For HW6 specifically: the AblationRunner works unchanged in cloud — point it at a mounted volume for outputs/.

Prerequisites

Completed HW5. HW6 reuses HW5's adapter and dataset as the baseline in the Elo tournament and the frontier comparison. NB01 will create stubs if HW5 artifacts are missing, but real HW5 outputs make the deltas meaningful.
Python 3.10+ recommended (3.9 mostly works with deprecation warnings).
ANTHROPIC_API_KEY in .env (required — every notebook needs Claude as generator or judge).
One of: Apple Silicon Mac (Path A), GPU with 8+ GB VRAM (Path B), or CPU only (extremely slow, smoke tests only).
OPTIONAL: OPENAI_API_KEY for the NB07 bonus GPT battle.

Installation

cd Homework6-Submission
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # edit to add your keys
python -c "import sys; sys.path.insert(0,'.'); from src import LLMClient, append_to_reflection; print('OK')"

If the final python -c line prints OK, your environment is ready.

Repository Structure

Homework6-Submission/
├── README.md                          # this file
├── LICENSE                            # MIT
├── requirements.txt                   # extends HW5 with seaborn + openai
├── .env.example                       # copy to .env and fill in keys
├── Class 6 Homework.ipynb             # source assignment notebook (instructor copy)
├── notebooks/
│   ├── 00_setup_and_smoke.ipynb       # env check + API smoke + HW5 artifact discovery
│   ├── 01_baseline_recap.ipynb        # load HW5 adapter; baseline Elo + sanity gen
│   ├── 02_diversity_audit.ipynb       # TF-IDF, MTLD, category balance on HW5 data
│   ├── 03_synthetic_strategies.ipynb  # Self-Instruct + Distillation + Evolutionary
│   ├── 04_rejection_sampling.ipynb    # multi-criteria judge + iterative refinement
│   ├── 05_ablation_training.ipynb     # 3 SFT variants on different mixtures
│   ├── 06_elo_tournament.ipynb        # pairwise A/B with position swap + ratings
│   ├── 07_real_model_battle.ipynb     # vs Claude (req'd) + GPT (optional)
│   └── 08_skill_integration.ipynb     # merge adapter, update gbrain skill, model card
├── src/
│   ├── __init__.py                    # public exports (named exports only)
│   ├── llm_client.py                  # Claude wrapper from HW5, +OpenAI optional
│   ├── reflection.py                  # append_to_reflection() for auto-saved answers
│   ├── dataset_io.py                  # JSONL/JSON load/save with schema check
│   ├── diversity_metrics.py           # TF-IDF, TTR, MTLD, category balance
│   ├── diversity_report.py            # plot + json export of all metrics
│   ├── self_instruct.py               # Self-Instruct synthetic strategy
│   ├── distillation.py                # Strong-teacher distillation strategy
│   ├── evolutionary_gen.py            # Evol-Instruct / WizardLM-style mutator
│   ├── synthetic_pipeline.py          # orchestrator: blend strategies, dedupe
│   ├── multi_criteria_judge.py        # 4-axis CoT scorer
│   ├── rejection_sampler.py           # filter + iterative re-gen
│   ├── ablation_runner.py             # train N variants, log to scoreboard
│   ├── sft_runner_mlx.py              # Path A backend
│   ├── sft_runner_hf.py               # Path B backend
│   ├── elo_tournament.py              # K-factor Elo with position swap
│   ├── pairwise_judge.py              # anonymous A/B grader with CoT
│   ├── real_model_battle.py           # frontier comparison harness
│   ├── pii_scrub.py                   # regex + optional spacy NER
│   ├── model_card.py                  # render outputs/final_model_card.md
│   ├── skill_wrapper.py               # gbrain interview-prep skill update
│   └── cost_tracker.py                # token + USD accounting per notebook
├── docs/
│   ├── data_quality_playbook.md       # SFT data-quality field guide
│   └── unsloth_bonus.md               # Linux+CUDA 2× speedup
├── outputs/                           # generated by notebooks (gitignored content)
└── test_data/                         # tiny fixtures for smoke tests

Assignment Structure

| NB | Title | Topic | Time | Key Deliverable |
| --- | --- | --- | --- | --- |
| 00 | Setup & Smoke | Env check, API ping, HW5 artifact discovery | 10 min | `outputs/env_report.json` |
| 01 | Baseline Recap | Load HW5 adapter, generate sanity completions | 15 min | `outputs/baseline_completions.json` |
| 02 | Diversity Audit | TF-IDF, MTLD, balance on HW5 dataset | 30 min | `outputs/diversity_baseline.json` |
| 03 | Synthetic Strategies | Self-Instruct + Distillation + Evolutionary blend | 35 min | `outputs/synthetic_v2_dataset.json` |
| 04 | Rejection Sampling | Multi-criteria judge, iterative re-gen | 30 min | `outputs/rejection_sampled_dataset.json` |
| 05 | Ablation Training | 3 SFT variants, scoreboard | 35 min | `outputs/ablation_scoreboard.json` |
| 06 | Elo Tournament | Pairwise A/B, position swap, ratings | 30 min | `outputs/elo_tournament_results.json` |
| 07 | Real-Model Battle | vs Claude required, GPT optional | 30 min | `outputs/real_model_battle.json` |
| 08 | Skill Integration | Merge adapter, gbrain update, model card | 30 min | `outputs/final_model_card.md` |
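
The NB04 step (multi-criteria judge plus iterative re-generation) might look roughly like the sketch below. The four axis names follow the glossary; the 1 to 5 score scale, the `min_axis` gate, and the `judge` / `regenerate` callables are assumptions for illustration, not the interface of `rejection_sampler.py`.

```python
from statistics import mean
from typing import Callable, Dict, List

AXES = ("instruction_following", "factuality", "format", "depth")
Scores = Dict[str, float]


def overall(scores: Scores) -> float:
    """Mean across the four axes, used only to rank records."""
    return mean(scores[a] for a in AXES)


def passes(scores: Scores, min_axis: float) -> bool:
    """A record is kept only if every axis clears the gate."""
    return all(scores[a] >= min_axis for a in AXES)


def rejection_sample(
    records: List[dict],
    judge: Callable[[dict], Scores],         # hypothetical LLM-as-judge call
    regenerate: Callable[[dict], dict],      # hypothetical re-generation call
    min_axis: float = 3.0,                   # assumed gate on an assumed 1-5 scale
    rounds: int = 1,
) -> List[dict]:
    scored = [(rec, judge(rec)) for rec in records]
    for _ in range(rounds):
        # Rank worst-first, then re-generate and re-score the bottom quartile.
        scored.sort(key=lambda pair: overall(pair[1]))
        cutoff = len(scored) // 4
        refreshed = []
        for rec, _old_scores in scored[:cutoff]:
            new_rec = regenerate(rec)
            refreshed.append((new_rec, judge(new_rec)))
        scored = refreshed + scored[cutoff:]
    return [rec for rec, scores in scored if passes(scores, min_axis)]
```

Gating on every axis rather than on the mean keeps a record with one very weak axis from being rescued by strong scores elsewhere; whether that is the right trade-off depends on your rubric.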

Auto-Reflection

Every TODO cell auto-saves to outputs/homework_reflection.md when you run the cell.

In each notebook:

TODO cells contain a todo1_reflection = """...""" placeholder string with hints.
You edit the placeholder to your answer (the hints stay as scaffolding).
The summary cell at the end of the notebook automatically calls append_to_reflection() with your answers.

You don't need to copy/paste anything manually. Just edit the placeholder, run the cells, and the reflection builds itself.
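
For intuition, an append-style reflection helper could be as small as the sketch below. The function name `append_reflection_entry`, its parameters, and the entry layout are hypothetical; the real `append_to_reflection()` in `src/reflection.py` may use a different signature and format.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_reflection_entry(
    notebook: str,
    todo_id: str,
    answer: str,
    path: str = "outputs/homework_reflection.md",
) -> Path:
    """Append one answer to the reflection file, creating folders as needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"\n## {notebook} / {todo_id} ({stamp})\n\n{answer.strip()}\n"
    # Open in append mode so earlier notebooks' answers are preserved.
    with out.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return out
```

Because the file is opened in append mode, re-running a summary cell in this sketch adds a second entry rather than replacing the first; a real helper may well deduplicate.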

Deliverables

Required (graded ~70%)

outputs/homework_reflection.md — auto-built across NB01–NB08 via append_to_reflection()
outputs/synthetic_v2_dataset.json — ≥ 80 records, ≥ 3 categories [NB03]
outputs/rejection_sampled_dataset.json — kept set after multi-criteria filter [NB04]
outputs/diversity_baseline.json + outputs/diversity_after_synth.json — before/after metrics [NB02–03]
outputs/ablation_scoreboard.json — ≥ 3 variants with loss + qualitative scores [NB05]
outputs/elo_tournament_results.json — pairwise wins, position-swap counts, final ratings [NB06]
outputs/real_model_battle.json — Claude required, GPT optional [NB07]
A merged adapter + Ollama Modelfile [NB08]
outputs/final_model_card.md — dataset, training, eval, limitations, license [NB08]
outputs/my_project_update.md — short narrative on what changed since HW5 [NB08]
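
Before submitting, a quick self-check of the dataset thresholds listed above (at least 80 records across at least 3 categories) can catch surprises. This sketch assumes the file is a JSON array of objects carrying a `category` field; that schema is a guess, so adjust `category_key` to whatever `dataset_io.py` actually enforces.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def load_records(path: str) -> List[dict]:
    """Assumes the file holds a single JSON array of record objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_dataset(
    records: List[dict],
    min_records: int = 80,
    min_categories: int = 3,
    category_key: str = "category",   # assumed field name
) -> Dict[str, object]:
    counts = Counter(r.get(category_key, "<missing>") for r in records)
    real_categories = [c for c in counts if c != "<missing>"]
    problems = []
    if len(records) < min_records:
        problems.append(f"only {len(records)} records, need {min_records}")
    if len(real_categories) < min_categories:
        problems.append(f"only {len(real_categories)} categories, need {min_categories}")
    return {"n_records": len(records), "category_counts": dict(counts), "problems": problems}


if __name__ == "__main__":
    target = Path("outputs/synthetic_v2_dataset.json")
    if target.exists():
        print(check_dataset(load_records(str(target))))
```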

Bonus (extra credit)

4th synthetic strategy — GAN-style adversarial or co-teaching with two LLMs.
Online Elo K-update — K=32 for the first 50 matches (burn-in), K=16 after.
Use Unsloth on Linux for 2× faster ablation (see docs/unsloth_bonus.md).
crowd_judge with 3+ different judges and disagreement analysis.
Open-source the dataset on HuggingFace Hub with a model card.
Automated PII scrubbing combining regex + spaCy NER.
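
The staged K-factor bonus could be sketched as follows. One assumption to flag: "the first 50 matches" is read here as the first 50 matches of the whole tournament, not per model; a per-model counter would be an equally defensible reading.

```python
from typing import Dict, Iterable, Optional, Tuple


def k_factor(match_index: int, burn_in: int = 50, k_early: float = 32.0, k_late: float = 16.0) -> float:
    """K=32 during burn-in (match indices 0..49), K=16 afterwards."""
    return k_early if match_index < burn_in else k_late


def run_matches(
    results: Iterable[Tuple[str, str, float]],
    ratings: Optional[Dict[str, float]] = None,
    start: float = 1000.0,
) -> Dict[str, float]:
    """Apply Elo updates in order; each result is (model_a, model_b, score_a).

    score_a is 1.0 for an A win, 0.0 for a B win, 0.5 for a draw.
    """
    ratings = dict(ratings or {})
    for i, (a, b, score_a) in enumerate(results):
        r_a = ratings.setdefault(a, start)
        r_b = ratings.setdefault(b, start)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        k = k_factor(i)
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return ratings
```

A larger K early on lets ratings move quickly away from the shared starting value; the smaller K afterwards damps noise from individual judge calls.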

Cost Estimates

| Task | Notebook | Cost (typical) |
| --- | --- | --- |
| Claude synthetic generation (3 strategies) | NB03 | $1.00 – $2.00 |
| Claude multi-criteria judging + re-gen | NB04 | $0.50 – $1.00 |
| Claude pairwise + ablation eval | NB05–NB06 | $0.50 – $1.00 |
| Claude real-model battle | NB07 | $0.50 – $1.00 |
| Optional GPT (if `OPENAI_API_KEY` set) | NB07 | $0.10 – $0.30 |
| Total | | ~$2.50 – $5 |
| Worst case (heavy iteration) | | ~$7 |

src/cost_tracker.py tallies tokens and USD per notebook so you can see the running total live.
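
To show the kind of bookkeeping involved, here is a minimal per-notebook ledger. The class name, method names, and the per-million-token prices passed in are placeholders for illustration; they are not the `cost_tracker.py` API and not real provider prices.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict


@dataclass
class CostLedger:
    usd_per_mtok_in: float    # placeholder price per 1M input tokens
    usd_per_mtok_out: float   # placeholder price per 1M output tokens
    totals: DefaultDict[str, float] = field(default_factory=lambda: defaultdict(float))

    def record(self, notebook: str, input_tokens: int, output_tokens: int) -> float:
        """Add one API call's cost to the notebook's running total and return it."""
        cost = (
            input_tokens / 1_000_000 * self.usd_per_mtok_in
            + output_tokens / 1_000_000 * self.usd_per_mtok_out
        )
        self.totals[notebook] += cost
        return cost

    def total(self) -> float:
        return sum(self.totals.values())

    def over_budget(self, max_budget_usd: float) -> bool:
        """True once spending has reached the cap, so a loop can stop cleanly."""
        return self.total() >= max_budget_usd
```

A generation loop would call `record()` after every response and break out once `over_budget()` returns True.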

Troubleshooting

HW5 artifacts missing — NB01 creates stubs so HW6 still runs, but the baseline becomes a generic chat model and your Elo deltas will be meaningless. Re-run HW5's NB07 (merge_adapter.ipynb) before starting HW6 if at all possible.
scikit-learn install fails — try pip install scikit-learn --no-build-isolation (common on Python 3.13 + macOS).
seaborn import error in NB02 — pip install seaborn; the diversity report falls back to matplotlib-only plots if seaborn is unavailable.
Ollama timeout — confirm ollama serve is running, then raise the timeout for requests to OLLAMA_HOST by setting the OLLAMA_TIMEOUT=120 environment variable.
OpenAI not installed — NB07 bonus cells skip cleanly with a printed note. Install with pip install openai if you want the GPT battle.
mlx-lm not found — Apple Silicon only; non-Mac users go Path B (HF+TRL+QLoRA).
peft version mismatch with TRL — pin peft==0.14.0 and trl==0.13.0; newer pre-releases break SFTTrainer constructor signatures.
llama.cpp missing for GGUF export — NB08 prints install instructions; the deploy step is skipped gracefully if llama.cpp isn't on $PATH.
Judge LLM rate limits — reduce concurrent calls (max_workers=2); Haiku has higher RPM than Sonnet, swap if you're being throttled.
Ablation OOM on small GPU — drop max_seq_length to 256 and set per_device_train_batch_size=1; using LoRA r=8 instead of r=16 saves another ~10% VRAM.
Cell hangs on synthetic data gen — set a lower max_budget_usd in synthetic_pipeline; the cell aborts cleanly when the budget is hit.
Pairwise judge timeout — a CoT pairwise call can take 30s+; let it finish or reduce n_rounds in NB06.
bitsandbytes import error on Mac — expected; it's lazy-imported in sft_runner_hf.py and the code falls back to fp16/bf16 when missing.
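
For the rate-limit item above, one way to cap concurrency at `max_workers=2` and retry throttled calls is sketched below. `RateLimitError` is a stand-in exception defined only for this example (swap in whatever your API client raises), and `judge_fn` is any callable you supply.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's throttling error."""


def judge_with_backoff(item: T, judge_fn: Callable[[T], R], retries: int = 3, base_delay: float = 1.0) -> R:
    """Call judge_fn, sleeping base_delay * 2**attempt between throttled tries.

    Expects retries >= 1; the last failure is re-raised to the caller.
    """
    for attempt in range(retries):
        try:
            return judge_fn(item)
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("retries must be >= 1")


def judge_all(
    items: Iterable[T],
    judge_fn: Callable[[T], R],
    max_workers: int = 2,
    retries: int = 3,
    base_delay: float = 1.0,
) -> List[R]:
    """Judge items with at most max_workers calls in flight; order is preserved."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda it: judge_with_backoff(it, judge_fn, retries, base_delay), items))
```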

Key Concepts (Glossary)

Self-Instruct — bootstrap synthetic prompts by asking an LLM to expand a small seed set.
Distillation (synthetic data) — use a stronger teacher model (e.g. Claude Sonnet) to produce gold-standard responses you train your smaller model to imitate.
Evolutionary generation — iteratively mutate prompts (deepen, broaden, add constraints) and keep the winners by judge score (WizardLM Evol-Instruct).
Rejection Sampling — generate many candidates, filter to the ones that pass quality gates, optionally re-generate the rejected pile.
Multi-Criteria Judge — score each example on multiple axes (instruction-following, factuality, format, depth) instead of a single rating.
Chain-of-Thought scoring — judge explains its reasoning before emitting the score, which raises consistency.
Pairwise comparison — judge sees two anonymous responses (A vs B) and picks a winner; more reliable than absolute scoring.
Elo rating — a 1v1 ranking system (originally from chess) in which each rating moves in proportion to the gap between the actual result and the expected score.
Position bias — judges systematically prefer the first or second response; mitigate with position swap (run each pair twice).
Crowd judging — aggregate multiple judge LLMs to reduce per-judge bias.
Data ablation — train N variants on different data mixtures to isolate which slice of the data drives quality gains.
MTLD — Measure of Textual Lexical Diversity, a length-invariant vocabulary richness score.
TTR — Type-Token Ratio, vocabulary diversity = unique tokens / total tokens.
Lexical diversity (TF-IDF) — 1 - mean_pairwise_cosine_similarity on TF-IDF vectors; higher = more linguistically varied.
Anonymous A/B labels — pairwise responses are labeled "Model X / Model Y" so the judge can't see which fine-tune produced which response.
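
The TTR and TF-IDF diversity definitions above can be made concrete with a small standard-library sketch. The tokenizer and the smoothed-IDF weighting are choices made for this example; `diversity_metrics.py` may tokenize differently or rely on scikit-learn, and MTLD is not covered here.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer; a simplifying assumption for this sketch."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text: str) -> float:
    """TTR = unique tokens / total tokens (0.0 for empty text)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """L2-normalized TF-IDF vectors using idf = ln((1 + n) / (1 + df)) + 1."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()} if norm else {})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Dot product of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def tfidf_diversity(docs: List[str]) -> float:
    """1 - mean pairwise cosine similarity; higher means more varied text."""
    pairs = list(combinations(tfidf_vectors(docs), 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Note that raw TTR falls as texts get longer, which is why the glossary pairs it with the length-invariant MTLD.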

Resources

Karpathy nanochat — https://github.com/karpathy/nanochat (read the data section)
Self-Instruct paper (Wang et al., 2022) — https://arxiv.org/abs/2212.10560
LMSYS Chatbot Arena — https://chat.lmsys.org (the canonical pairwise-Elo deployment)
HuggingFace TRL — https://huggingface.co/docs/trl
gbrain agent architecture — https://github.com/garrytan/gbrain
HW5 README — for the upstream SFT pipeline this builds on
docs/data_quality_playbook.md — distilled lecture notes
docs/unsloth_bonus.md — Linux+CUDA 2× accelerator path

Timeline (suggested 7-day schedule)

Day 1 — NB00 setup + NB01 baseline (25 min). Verify HW5 artifacts load.
Day 2 — NB02 diversity measurement (30 min). Read data_quality_playbook.md §2.
Day 3 — NB03 synthetic strategies (35 min, longest cell is the synthesis loop).
Day 4 — NB04 rejection sampling (30 min). Tune your judge rubric.
Day 5 — NB05 ablation training (35 min, longest compute step of the week).
Day 6 — NB06 Elo tournament + NB07 real-model comparison (60 min combined).
Day 7 — NB08 integration + final reflection sweep (30 min).

What's Next (Week 7 preview)

Week 7 likely covers serving optimization (vLLM, continuous batching, AWQ/GPTQ quantization beyond GGUF) or agents/tool-use building on the gbrain skill. This README will be updated when the Class 7 syllabus is finalized.

Support

Course GitHub — file an issue with the notebook number and the cell that failed; include the last 20 lines of stderr.
Office hours — see syllabus.
