inference-ai-course/Homework6-Submission

Week 6: SFT Part II — Data Quality, Diversity & Elo Evaluation

Overview

This week you take HW5's working SFT pipeline (Qwen2.5-0.5B + MLX/HF-TRL+QLoRA) and turn it into a data-quality lab for an interview-prep agent. You will quantify diversity, generate synthetic data via three different strategies, filter it with multi-criteria rejection sampling, train three ablation variants, and validate with a pairwise Elo tournament plus a frontier-model comparison. Each TODO cell auto-saves to outputs/homework_reflection.md when you run it, so the reflection writes itself as you work.

Learning Objectives

  • Quantify data diversity with TF-IDF cosine, vocabulary richness (TTR/MTLD), and category balance — not just eyeballing the JSON.
  • Generate synthetic data via Self-Instruct, Distillation, and Evolutionary strategies, and know when each is the right tool.
  • Apply multi-criteria LLM-as-judge rejection sampling with chain-of-thought scoring and iterative refinement of the bottom quartile.
  • Run real ablation experiments — three SFT variants on different data mixtures (max_steps=30) and compare loss + downstream quality.
  • Implement pairwise Elo tournament over anonymous A/B labels with position-swap to neutralize judge bias.
  • Compare your model against Claude (required) and GPT (optional) on quality and per-token cost — frontier is the honest yardstick.
  • Update HW5's gbrain skill with the improved adapter so the interview-prep skill picks up the new weights.
  • Write a production model card documenting dataset, training run, eval results, known failure modes, and license.
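
The rejection-sampling objective above can be sketched roughly as follows. This is an illustration only, not the code in src/rejection_sampler.py: the function names, the 1-5 score scale, and both thresholds are assumptions. The four axis names come from the glossary entry for the multi-criteria judge.

```python
from statistics import mean

# Axis names follow the glossary; the 1-5 scale and thresholds are assumptions.
AXES = ("instruction_following", "factuality", "format", "depth")


def overall(scores):
    """Mean score across all judge axes."""
    return mean(scores[axis] for axis in AXES)


def rejection_sample(records, min_axis=3, min_overall=3.5):
    """Split judged records into (kept, rejected).

    A record is kept only if every axis clears min_axis AND the mean
    clears min_overall, so one weak axis cannot hide behind a high average.
    """
    kept, rejected = [], []
    for rec in records:
        scores = rec["scores"]
        passes = (
            min(scores[axis] for axis in AXES) >= min_axis
            and overall(scores) >= min_overall
        )
        (kept if passes else rejected).append(rec)
    return kept, rejected


def bottom_quartile(records):
    """Return the lowest-scoring 25% (at least one record) for re-generation."""
    ranked = sorted(records, key=lambda rec: overall(rec["scores"]))
    return ranked[: max(1, len(ranked) // 4)]
```

Records in the bottom quartile would then be sent back to the generator with the judge's critique and re-scored, which is the iterative-refinement half of the objective.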

Setup Options

Path A: MLX-LM (Recommended on Mac Apple Silicon)

  • Install: pip install -r requirements.txt
  • Ablation runtime: ~15–25 min total for 3 variants on M2/M3
  • Smoke time per cell: ~2 min for synthetic generation, ~5 min per SFT variant
  • No CUDA, no quantization libraries needed; MLX handles everything natively.

Path B: HF + PEFT + TRL + QLoRA (Recommended on Linux GPU)

  • Install: pip install -r requirements.txt, then confirm that nvidia-smi shows your GPU
  • 8–16 GB VRAM target; max_seq_length=512 fits comfortably with QLoRA 4-bit
  • Ablation runtime: ~30–60 min total for 3 variants depending on GPU
  • See docs/unsloth_bonus.md for a 2× speedup option on CUDA.

Path C: Cloud GPU (Bonus)

See HW5's docs/stack_choice_guide.md for Modal / Lambda Labs / Runpod walkthroughs. For HW6 specifically: the AblationRunner works unchanged in cloud — point it at a mounted volume for outputs/.

Prerequisites

  • Completed HW5. HW6 reuses HW5's adapter and dataset as the baseline in the Elo tournament and the frontier comparison. NB01 will create stubs if HW5 artifacts are missing, but real HW5 outputs make the deltas meaningful.
  • Python 3.10+ recommended (3.9 mostly works with deprecation warnings).
  • ANTHROPIC_API_KEY in .env (required — every notebook needs Claude as generator or judge).
  • One of: Apple Silicon Mac (Path A), GPU with 8+ GB VRAM (Path B), or CPU only (extremely slow, smoke tests only).
  • OPTIONAL: OPENAI_API_KEY for the NB07 bonus GPT battle.

Installation

cd Homework6-Submission
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # edit to add your keys
python -c "import sys; sys.path.insert(0,'.'); from src import LLMClient, append_to_reflection; print('OK')"

If the final python -c line prints OK, your environment is ready.

Repository Structure

Homework6-Submission/
├── README.md                          # this file
├── LICENSE                            # MIT
├── requirements.txt                   # extends HW5 with seaborn + openai
├── .env.example                       # copy to .env and fill in keys
├── Class 6 Homework.ipynb             # source assignment notebook (instructor copy)
├── notebooks/
│   ├── 00_setup_and_smoke.ipynb       # env check + API smoke + HW5 artifact discovery
│   ├── 01_baseline_recap.ipynb        # load HW5 adapter; baseline Elo + sanity gen
│   ├── 02_diversity_audit.ipynb       # TF-IDF, MTLD, category balance on HW5 data
│   ├── 03_synthetic_strategies.ipynb  # Self-Instruct + Distillation + Evolutionary
│   ├── 04_rejection_sampling.ipynb    # multi-criteria judge + iterative refinement
│   ├── 05_ablation_training.ipynb     # 3 SFT variants on different mixtures
│   ├── 06_elo_tournament.ipynb        # pairwise A/B with position swap + ratings
│   ├── 07_real_model_battle.ipynb     # vs Claude (req'd) + GPT (optional)
│   └── 08_skill_integration.ipynb     # merge adapter, update gbrain skill, model card
├── src/
│   ├── __init__.py                    # public exports (named exports only)
│   ├── llm_client.py                  # Claude wrapper from HW5, +OpenAI optional
│   ├── reflection.py                  # append_to_reflection() for auto-saved answers
│   ├── dataset_io.py                  # JSONL/JSON load/save with schema check
│   ├── diversity_metrics.py           # TF-IDF, TTR, MTLD, category balance
│   ├── diversity_report.py            # plot + json export of all metrics
│   ├── self_instruct.py               # Self-Instruct synthetic strategy
│   ├── distillation.py                # Strong-teacher distillation strategy
│   ├── evolutionary_gen.py            # Evol-Instruct / WizardLM-style mutator
│   ├── synthetic_pipeline.py          # orchestrator: blend strategies, dedupe
│   ├── multi_criteria_judge.py        # 4-axis CoT scorer
│   ├── rejection_sampler.py           # filter + iterative re-gen
│   ├── ablation_runner.py             # train N variants, log to scoreboard
│   ├── sft_runner_mlx.py              # Path A backend
│   ├── sft_runner_hf.py               # Path B backend
│   ├── elo_tournament.py              # K-factor Elo with position swap
│   ├── pairwise_judge.py              # anonymous A/B grader with CoT
│   ├── real_model_battle.py           # frontier comparison harness
│   ├── pii_scrub.py                   # regex + optional spacy NER
│   ├── model_card.py                  # render outputs/final_model_card.md
│   ├── skill_wrapper.py               # gbrain interview-prep skill update
│   └── cost_tracker.py                # token + USD accounting per notebook
├── docs/
│   ├── data_quality_playbook.md       # SFT data-quality field guide
│   └── unsloth_bonus.md               # Linux+CUDA 2× speedup
├── outputs/                           # generated by notebooks (gitignored content)
└── test_data/                         # tiny fixtures for smoke tests

Assignment Structure

| NB | Title | Topic | Time | Key Deliverable |
|----|-------|-------|------|-----------------|
| 00 | Setup & Smoke | Env check, API ping, HW5 artifact discovery | 10 min | outputs/env_report.json |
| 01 | Baseline Recap | Load HW5 adapter, generate sanity completions | 15 min | outputs/baseline_completions.json |
| 02 | Diversity Audit | TF-IDF, MTLD, balance on HW5 dataset | 30 min | outputs/diversity_baseline.json |
| 03 | Synthetic Strategies | Self-Instruct + Distillation + Evolutionary blend | 35 min | outputs/synthetic_v2_dataset.json |
| 04 | Rejection Sampling | Multi-criteria judge, iterative re-gen | 30 min | outputs/rejection_sampled_dataset.json |
| 05 | Ablation Training | 3 SFT variants, scoreboard | 35 min | outputs/ablation_scoreboard.json |
| 06 | Elo Tournament | Pairwise A/B, position swap, ratings | 30 min | outputs/elo_tournament_results.json |
| 07 | Real-Model Battle | vs Claude required, GPT optional | 30 min | outputs/real_model_battle.json |
| 08 | Skill Integration | Merge adapter, gbrain update, model card | 30 min | outputs/final_model_card.md |

Auto-Reflection

Every TODO cell auto-saves to outputs/homework_reflection.md when you run the cell.

In each notebook:

  • TODO cells contain a todo1_reflection = """...""" placeholder string with hints.
  • You edit the placeholder to your answer (the hints stay as scaffolding).
  • The summary cell at the end of the notebook automatically calls append_to_reflection() with your answers.

You don't need to copy/paste anything manually. Just edit the placeholder, run the cells, and the reflection builds itself.
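
To make the mechanism concrete, here is a minimal sketch of what an append-style reflection helper could look like. It is not the real append_to_reflection() from src/reflection.py, whose signature is not shown in this README; the name append_reflection_entry, its parameters, and the heading layout are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_reflection_entry(path, notebook, todo_id, answer):
    """Append one answer under its own heading and return the text written.

    Hypothetical helper: opening the file in append mode means re-running a
    notebook adds a new timestamped entry instead of overwriting old ones.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create outputs/ if missing
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"\n## {notebook} / {todo_id} ({stamp})\n\n{answer.strip()}\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

A summary cell would call something like this once per TODO answer, passing the edited placeholder string as the answer.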

Deliverables

Required (graded ~70%)

  • outputs/homework_reflection.md — auto-built across NB01–NB08 via append_to_reflection()
  • outputs/synthetic_v2_dataset.json — ≥ 80 records, ≥ 3 categories [NB03]
  • outputs/rejection_sampled_dataset.json — kept set after multi-criteria filter [NB04]
  • outputs/diversity_baseline.json + outputs/diversity_after_synth.json — before/after metrics [NB02–03]
  • outputs/ablation_scoreboard.json — ≥ 3 variants with loss + qualitative scores [NB05]
  • outputs/elo_tournament_results.json — pairwise wins, position-swap counts, final ratings [NB06]
  • outputs/real_model_battle.json — Claude required, GPT optional [NB07]
  • A merged adapter + Ollama Modelfile [NB08]
  • outputs/final_model_card.md — dataset, training, eval, limitations, license [NB08]
  • outputs/my_project_update.md — short narrative on what changed since HW5 [NB08]
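
As a rough illustration of where the position-swap counts in outputs/elo_tournament_results.json might come from, the sketch below runs each pair twice with the order flipped. The judge callable and its "A"/"B" return values are assumptions for illustration, not the interface of src/pairwise_judge.py, and treating a split verdict as a tie is one possible policy among several.

```python
def judge_with_position_swap(judge, prompt, resp_x, resp_y):
    """Judge one pair twice with the order flipped.

    `judge(prompt, first, second)` is a hypothetical callable that returns
    "A" if it prefers the first response shown and "B" for the second.
    Returns "x", "y", or "tie"; a split verdict is treated as a tie because
    it suggests the judge followed position rather than content.
    """
    first_pass = judge(prompt, resp_x, resp_y)   # x is shown in slot A
    second_pass = judge(prompt, resp_y, resp_x)  # x is shown in slot B
    x_votes = (first_pass == "A") + (second_pass == "B")
    y_votes = (first_pass == "B") + (second_pass == "A")
    if x_votes == 2:
        return "x"
    if y_votes == 2:
        return "y"
    return "tie"


def tally(judge, prompt_pairs):
    """Count outcomes over (prompt, resp_x, resp_y) triples."""
    counts = {"x": 0, "y": 0, "tie": 0}
    for prompt, resp_x, resp_y in prompt_pairs:
        counts[judge_with_position_swap(judge, prompt, resp_x, resp_y)] += 1
    return counts
```

A high tie count from this loop is itself a useful diagnostic: it indicates how often the judge's choice depended on slot order.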

Bonus (extra credit)

  • 4th synthetic strategy — GAN-style adversarial or co-teaching with two LLMs.
  • Online Elo K-update — K=32 for the first 50 matches (burn-in), K=16 after.
  • Use Unsloth on Linux for 2× faster ablation (see docs/unsloth_bonus.md).
  • crowd_judge with 3+ different judges and disagreement analysis.
  • Open-source the dataset on HuggingFace Hub with a model card.
  • Automated PII scrubbing combining regex + spaCy NER.
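
For the online Elo K-update bonus, the schedule can be sketched as below using the standard Elo expected-score formula. The starting rating of 1000, the function names, and the (player_a, player_b, score_a) match format are assumptions for illustration and may differ from src/elo_tournament.py.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def k_factor(match_index, burn_in=50, k_early=32.0, k_late=16.0):
    """K=32 for the first `burn_in` matches (0-indexed), K=16 afterwards."""
    return k_early if match_index < burn_in else k_late


def run_elo(matches, start=1000.0):
    """Update ratings online over (player_a, player_b, score_a) tuples.

    score_a is 1.0 for an A win, 0.0 for a B win, and 0.5 for a tie.
    The starting rating is an assumed default.
    """
    ratings = {}
    for index, (a, b, score_a) in enumerate(matches):
        rating_a = ratings.setdefault(a, start)
        rating_b = ratings.setdefault(b, start)
        delta = k_factor(index) * (score_a - expected_score(rating_a, rating_b))
        ratings[a] = rating_a + delta  # zero-sum: A gains what B loses
        ratings[b] = rating_b - delta
    return ratings
```

The larger early K lets ratings move quickly while little is known about each variant; the smaller late K keeps a single noisy judgment from reshuffling the final ranking.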

Cost Estimates

| Task | Notebook | Cost (typical) |
|------|----------|----------------|
| Claude synthetic generation (3 strategies) | NB03 | $1.00 – $2.00 |
| Claude multi-criteria judging + re-gen | NB04 | $0.50 – $1.00 |
| Claude pairwise + ablation eval | NB05–NB06 | $0.50 – $1.00 |
| Claude real-model battle | NB07 | $0.50 – $1.00 |
| Optional GPT (if OPENAI_API_KEY set) | NB07 | $0.10 – $0.30 |
| Total | | ~$2.50 – $5 |
| Worst case (heavy iteration) | | ~$7 |

src/cost_tracker.py tallies tokens and USD per notebook so you can see the running total live.
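
The per-notebook accounting can be pictured with a sketch like the one below. It is not the actual src/cost_tracker.py: the class shape and method names are assumptions, and the prices are placeholders, not real provider rates, so substitute current pricing before relying on the numbers.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Placeholder prices in USD per million tokens. These are NOT real rates.
PRICES = {"example-model": {"input": 1.00, "output": 5.00}}


@dataclass
class CostTracker:
    """Hypothetical tracker: accumulates USD per notebook from token counts."""

    prices: dict
    totals: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, notebook, model, input_tokens, output_tokens):
        """Add one API call's cost to the notebook's running total."""
        price = self.prices[model]
        usd = (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000
        self.totals[notebook] += usd
        return usd

    def total(self):
        """Grand total across all notebooks."""
        return sum(self.totals.values())
```

Input and output tokens are priced separately because providers typically charge more for generated tokens than for prompt tokens.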

Troubleshooting

  1. HW5 artifacts missing — NB01 creates stubs so HW6 still runs, but the baseline becomes a generic chat model and your Elo deltas will be meaningless. Re-run HW5's NB07 (merge_adapter.ipynb) before starting HW6 if at all possible.
  2. scikit-learn install fails — try pip install scikit-learn --no-build-isolation (common on Python 3.13 + macOS).
  3. seaborn import error in NB02: run pip install seaborn; the diversity report falls back to matplotlib-only plots if seaborn is unavailable.
  4. Ollama timeout — confirm ollama serve is running; raise OLLAMA_HOST timeout via OLLAMA_TIMEOUT=120 env var.
  5. OpenAI not installed — NB07 bonus cells skip cleanly with a printed note. Install with pip install openai if you want the GPT battle.
  6. mlx-lm not found — Apple Silicon only; non-Mac users go Path B (HF+TRL+QLoRA).
  7. peft version mismatch with TRL — pin peft==0.14.0 and trl==0.13.0; newer pre-releases break SFTTrainer constructor signatures.
  8. llama.cpp missing for GGUF export — NB08 prints install instructions; the deploy step is skipped gracefully if llama.cpp isn't on $PATH.
  9. Judge LLM rate limits — reduce concurrent calls (max_workers=2); Haiku has higher RPM than Sonnet, swap if you're being throttled.
  10. Ablation OOM on small GPU — drop max_seq_length to 256 and per_device_train_batch_size=1; LoRA r=8 instead of r=16 saves another ~10% VRAM.
  11. Cell hangs on synthetic data gen — set a lower max_budget_usd in synthetic_pipeline; the cell aborts cleanly when the budget is hit.
  12. Pairwise judge timeout — a CoT pairwise call can take 30s+; let it finish or reduce n_rounds in NB06.
  13. bitsandbytes import error on Mac — expected; it's lazy-imported in sft_runner_hf.py and the code falls back to fp16/bf16 when missing.
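
For item 9, capping concurrency at max_workers=2 and retrying throttled calls could look roughly like this. RateLimited is a stand-in exception defined here for the sketch; in practice you would catch whatever rate-limit error your API client raises. The function names and retry defaults are also assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Stand-in for the rate-limit error raised by a real API client."""


def call_with_backoff(fn, arg, retries=4, base_delay=1.0):
    """Call fn(arg), sleeping base_delay * 2**attempt after each throttle."""
    for attempt in range(retries + 1):
        try:
            return fn(arg)
        except RateLimited:
            if attempt == retries:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)


def judge_all(fn, items, max_workers=2, base_delay=1.0):
    """Run fn over items with at most max_workers calls in flight.

    Results come back in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda item: call_with_backoff(fn, item, base_delay=base_delay), items)
        )
```

Lowering max_workers reduces the request rate directly, while the backoff absorbs the occasional throttle without failing the whole cell.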

Key Concepts (Glossary)

  • Self-Instruct — bootstrap synthetic prompts by asking an LLM to expand a small seed set.
  • Distillation (synthetic data) — use a stronger teacher model (e.g. Claude Sonnet) to produce gold-standard responses you train your smaller model to imitate.
  • Evolutionary generation — iteratively mutate prompts (deepen, broaden, add constraints) and keep the winners by judge score (WizardLM Evol-Instruct).
  • Rejection Sampling — generate many candidates, filter to the ones that pass quality gates, optionally re-generate the rejected pile.
  • Multi-Criteria Judge — score each example on multiple axes (instruction-following, factuality, format, depth) instead of a single rating.
  • Chain-of-Thought scoring — judge explains its reasoning before emitting the score, which raises consistency.
  • Pairwise comparison — judge sees two anonymous responses (A vs B) and picks a winner; more reliable than absolute scoring.
  • Elo rating — 1v1 ranking system (chess-origin) where ratings update by expected vs actual win rate.
  • Position bias — judges systematically prefer the first or second response; mitigate with position swap (run each pair twice).
  • Crowd judging — aggregate multiple judge LLMs to reduce per-judge bias.
  • Data ablation — train N variants on different data mixtures to isolate which slice of the data drives quality gains.
  • MTLD — Measure of Textual Lexical Diversity, a length-invariant vocabulary richness score.
  • TTR — Type-Token Ratio, vocabulary diversity = unique tokens / total tokens.
  • Lexical diversity (TF-IDF): 1 - mean_pairwise_cosine_similarity on TF-IDF vectors; higher = more linguistically varied.
  • Anonymous A/B labels — pairwise responses are labeled "Model X / Model Y" so the judge can't see which fine-tune produced which response.
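
The TTR and TF-IDF diversity definitions above can be worked through in a small dependency-free sketch. The tokenizer, the smoothed IDF formula, and the function names are choices made for this example and need not match src/diversity_metrics.py, which may rely on a library vectorizer instead.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text):
    """TTR = unique tokens / total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors with smoothed idf = ln((1 + n) / (1 + df)) + 1."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    return [
        {
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in Counter(toks).items()
        }
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_diversity(docs):
    """1 - mean pairwise cosine similarity; higher means more varied."""
    pairs = list(combinations(tfidf_vectors(docs), 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Note that plain TTR falls as texts get longer, which is why the glossary lists the length-invariant MTLD alongside it.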

Timeline (suggested 7-day schedule)

  • Day 1 — NB00 setup + NB01 baseline (25 min). Verify HW5 artifacts load.
  • Day 2 — NB02 diversity measurement (30 min). Read data_quality_playbook.md §2.
  • Day 3 — NB03 synthetic strategies (35 min, longest cell is the synthesis loop).
  • Day 4 — NB04 rejection sampling (30 min). Tune your judge rubric.
  • Day 5 — NB05 ablation training (35 min, longest compute step of the week).
  • Day 6 — NB06 Elo tournament + NB07 real-model comparison (60 min combined).
  • Day 7 — NB08 integration + final reflection sweep (30 min).

What's Next (Week 7 preview)

Week 7 likely covers serving optimization (vLLM, continuous batching, AWQ/GPTQ quantization beyond GGUF) or agents/tool-use building on the gbrain skill. This README will be updated when the Class 7 syllabus is finalized.

Support

  • Course GitHub — file an issue with the notebook number and the cell that failed; include the last 20 lines of stderr.
  • Office hours — see syllabus.
