Hands-on alignment of Qwen2.5-0.5B-Instruct using the 2026 production stack: HuggingFace TRL v1.0+, Direct Preference Optimization, and the latest variants (IPO, KTO, ORPO, GRPO, Iterative DPO).
By the end of this homework you'll have built every piece of a real alignment pipeline (preference data collection, a Bradley-Terry reward model, a DPO-trained adapter, and a complete evaluation suite) at a scale small enough that the training itself runs on a laptop in under an hour.
After completing the 9 notebooks you will be able to:
- Explain what alignment does on top of pretraining + SFT and why every modern LLM (Claude, GPT, Llama, Qwen) ships with it.
- Collect preference pairs synthetically (NB02) and from public datasets (UltraFeedback, HH-RLHF), with position-bias control for LLM judges.
- Train a Bradley-Terry reward model and measure pairwise accuracy on held-out data (NB03).
- Fine-tune Qwen2.5-0.5B with DPO (NB04) and the modern variants — IPO, KTO, ORPO (NB05).
- Demonstrate GRPO mathematically (NB06 Part 1) and run a real iterative-DPO loop modeled on the multi-round recipe used for Llama-3-Instruct (NB06 Part 3).
- Evaluate alignment with pairwise win-rate, safety/refusal rate, length-bias check, and reward-hacking signals (NB07).
- Plan how to apply alignment to your Week-8 capstone (NB08).
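One common way to control position bias in an LLM judge is to ask for a verdict in both presentation orders and keep the pair only when the two verdicts agree. The sketch below illustrates that idea with stub judges. The `Judge` signature and the function names are hypothetical and are not taken from `src/preference_data.py`.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (prompt, first_response, second_response)
# returns "first" or "second". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, prompt: str, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" if the judge picks the same response in both orders, else None."""
    forward = judge(prompt, a, b)    # response a is shown first
    backward = judge(prompt, b, a)   # response b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: it always picks whatever is shown first."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge that is order-independent: it picks the longer response."""
    return "first" if len(first) >= len(second) else "second"


biased_verdict = judge_both_orders(always_first, "Explain DPO.", "short", "a longer answer")
stable_verdict = judge_both_orders(prefers_longer, "Explain DPO.", "short", "a longer answer")
print(biased_verdict, stable_verdict)
```

Pairs that come back as `None` can be dropped or sent to a human annotator. The cost is two judge calls per pair instead of one.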
Homework7-Submission/
├── README.md (this file)
├── requirements.txt full dependency list
├── .env.example copy to .env and add your API keys
├── notebooks/
│ ├── 00_setup_verification.ipynb Verify Python, packages, API keys (~5 min)
│ ├── 01_environment_setup.ipynb Pick path A/B/C, first model load (~10 min)
│ ├── 02_preference_data.ipynb Build pairs + judge + UltraFeedback + Gradio (~25 min)
│ ├── 03_reward_model.ipynb Train Bradley-Terry RM + accuracy plot (~25 min)
│ ├── 04_dpo.ipynb DPO training + before/after comparison (~30 min)
│ ├── 05_dpo_variants.ipynb IPO / KTO / ORPO loss-curve comparison (~30 min)
│ ├── 06_grpo_iterative.ipynb GRPO sim + 2-round Iterative DPO (~25 min)
│ ├── 07_alignment_eval.ipynb Win-rate + safety + length-bias eval (~25 min)
│ └── 08_project_integration.ipynb Connect alignment to your capstone (~15 min)
├── src/
│ ├── __init__.py Light core only (LLMClient, CostTracker, utils)
│ ├── config.py PATH, model IDs, dataset names, output paths
│ ├── llm_client.py Unified Claude + Ollama client (chat API)
│ ├── cost_tracker.py Token + cost tracking across all paths
│ ├── utils.py save_task_output(), append_to_reflection()
│ ├── prompt_templates.py CO-STAR templates (carryover from HW3+)
│ ├── preference_data.py Build/load/judge/convert preference data
│ ├── reward_model.py Bradley-Terry RM via trl.RewardTrainer
│ ├── dpo_runner.py DPO/IPO/KTO/ORPO unified wrapper
│ ├── grpo_demo.py Pure-Python GRPO + real GRPOConfig builder
│ ├── iterative_dpo.py Llama-3-style sample → judge → DPO loop
│ ├── alignment_eval.py Win-rate, refusal, length-bias detectors
│ ├── safety_prompts.py HARMFUL / BENIGN curated prompt sets
│ └── gradio_annotator.py Build a human pairwise-annotation UI
├── outputs/ Auto-populated by notebooks
│ ├── homework_reflection.md Single rolling reflection across all 9 notebooks
│ ├── preference_data.jsonl NB02
│ ├── ultrafeedback_tiny.jsonl NB02
│ ├── kto_format.jsonl NB02
│ ├── reward_format.jsonl NB02
│ ├── reward_model/ NB03 (RM weights)
│ ├── reward_model_scores.png NB03
│ ├── dpo_adapter/ NB04 (DPO LoRA)
│ ├── dpo_curves.png NB04
│ ├── dpo_before_after.json NB04 (read by NB07)
│ ├── variants_dpo/, variants_ipo/, variants_kto/, variants_orpo/ NB05
│ ├── variants_loss.png NB05
│ ├── grpo_config.json NB06
│ ├── iterative_dpo/round_0/, round_1/ NB06
│ └── safety_refusal.png NB07
└── docs/ (placeholder for any extra writeups)
cd Homework7-Submission/
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # edit and fill ANTHROPIC_API_KEY (or use Ollama only)
jupyter lab notebooks/
Then run the notebooks in order: 00 → 08. Total time end-to-end: ~3 hours. The compute-heavy steps are NB04 (DPO training, ~5 min) and NB06 (iterative DPO, ~10 min); the rest is reading, short cells, and writing reflections.
Pick one in src/config.py (PATH = "A" | "B" | "C"):
| Path | What you run | Hardware needed | Best for |
|---|---|---|---|
| A | Cloud-only (Claude judging, conceptual training) | Any laptop + Claude key | Limited compute, focus on ideas |
| B | Real DPO/RM/Iterative-DPO on Qwen2.5-0.5B (HuggingFace) | 8 GB RAM, ~10 min total compute | The default — recommended |
| C | A + B combined (cloud judges, local training) | Both | Best educational value (slowest) |
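As a rough illustration of what the path choice implies, the sketch below maps each path to the two capabilities named in the table. It is an assumption about how downstream code could branch on the setting, not the contents of `src/config.py`; `ACTIVE_PATH` and `path_capabilities` are hypothetical names.

```python
VALID_PATHS = {"A", "B", "C"}


def path_capabilities(path: str) -> dict:
    """Map a path letter to the capabilities described in the table above."""
    if path not in VALID_PATHS:
        raise ValueError(f"PATH must be one of {sorted(VALID_PATHS)}, got {path!r}")
    return {
        "cloud_judge": path in {"A", "C"},     # Claude-based judging
        "local_training": path in {"B", "C"},  # real training on Qwen2.5-0.5B
    }


ACTIVE_PATH = "B"  # hypothetical stand-in for the PATH value in src/config.py
caps = path_capabilities(ACTIVE_PATH)
print(caps)
```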
The class slides cover RLHF, DPO, GRPO, and "modern open-source tools." This homework brings everything to current 2026 best practice:
| Topic in slides | What we actually use in the homework |
|---|---|
| RLHF / PPO | Conceptual coverage in NB03+NB06 (no PPO training — too expensive) |
| Reward Model | Bradley-Terry RM via trl.RewardTrainer (TRL v1.0 stable API) |
| DPO | trl.DPOTrainer + LoRA on Qwen2.5-0.5B — the default choice |
| GRPO | Pure-Python simulation + ready-to-paste GRPOConfig for Colab GPU |
| Iterative DPO | Real 2-round loop (Llama-3 / RLHFlow recipe) — src/iterative_dpo.py |
| TRL v1.0 (Apr 2026) | Unified post-training stack — we use the stable trainers + flag the experimental ones |
| IPO (AISTATS 2024) | One config flag (loss_type="ipo") inside DPOTrainer, covered in NB05 |
| KTO (Feb 2024) | trl.experimental.KTOTrainer: binary signals, no pairs needed (NB05) |
| ORPO (Mar 2024) | trl.experimental.ORPOTrainer: single-stage SFT+alignment (NB05) |
| SimPO (NeurIPS 2024) | Conceptual coverage in NB04 + NB05 (reference-free implicit reward) |
| UltraFeedback / HH-RLHF | Tiny slices loaded via datasets.load_dataset in NB02 |
| Gradio annotation | Working build_annotator() Blocks app in src/gradio_annotator.py |
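To make the DPO and IPO rows concrete, here is a minimal pure-Python sketch of the two per-pair losses as they are usually written in the papers. Both start from the same log-ratio margin between the chosen and rejected responses. DPO applies a logistic loss to it, while IPO regresses it toward the fixed target 1/(2β). The function names are illustrative, the inputs are plain sequence log-probabilities, and any length normalization a library applies is ignored.

```python
import math


def log_ratio_margin(pi_w: float, ref_w: float, pi_l: float, ref_l: float) -> float:
    """(log pi(y_w) - log ref(y_w)) - (log pi(y_l) - log ref(y_l))."""
    return (pi_w - ref_w) - (pi_l - ref_l)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))


def ipo_loss(margin: float, beta: float = 0.1) -> float:
    """Squared distance between the margin and the IPO target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


# Toy log-probabilities: the policy raised the chosen response and lowered
# the rejected one relative to the reference model.
margin = log_ratio_margin(pi_w=-12.0, ref_w=-14.0, pi_l=-15.0, ref_l=-13.0)
print("margin:", margin)
print("DPO loss:", dpo_loss(margin))
print("IPO loss:", ipo_loss(margin))
```

The DPO loss keeps shrinking as the margin grows, whereas the IPO loss is smallest at a finite margin and grows again beyond it. This is the property IPO relies on to limit over-optimization of the preference data.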
┌────────────────────────────┐
│ src/preference_data.py │ build/load/judge/convert
│ (used by every notebook) │
└─────────────┬──────────────┘
│
┌─────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ src/reward_model │ │ src/dpo_runner │ │ src/iterative_ │
│ (NB03) │ │ DPO/IPO/KTO/ORPO│ │ dpo (NB06) │
│ │ │ (NB04, NB05) │ │ │
└──────────┬───────┘ └─────────┬────────┘ └────────┬─────────┘
│ │ │
└─────────┬─────────┴────────────────────┘
▼
┌──────────────────────────┐
│ src/alignment_eval.py │ winrate + safety + length bias
│ (NB07) │
└──────────────────────────┘
src/grpo_demo.py lives off to the side — it's a teaching module (pure Python step + config dict generator), not a real trainer.
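The core of a GRPO step is the group-relative advantage: sample several completions for one prompt, score them, and normalize each reward against the group's mean and standard deviation, so no learned value function is needed. The sketch below shows only that normalization. It is not the code in `src/grpo_demo.py`; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every completion gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
print(flat)
```

Completions above the group mean get a positive advantage and are reinforced, while those below it are pushed down. A group with identical rewards yields all-zero advantages, which is why reward functions with some variance across samples matter in practice.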
Every notebook calls append_to_reflection(notebook=..., section_title=..., reflection_content=...) exactly once at the end. The result is a single rolling document at outputs/homework_reflection.md:
# Week 7: Alignment — RLHF, DPO, GRPO -- Homework Reflection
**Student Name:** ...
**Path Selected:** ...
## Notebook 00: Setup Verification
...
## Notebook 01: Environment & First Model Call
...
... (one section per notebook)
## Notebook 08: Project Integration & Final Reflection
...
You will not have to copy/paste reflections by hand — each notebook has a TODO cell where you write your reflection text into a string variable, and the summary cell at the bottom auto-appends it.
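For orientation, a helper with this behavior could look roughly like the sketch below. Only the three keyword names come from the call shown above. The function name, the `path` parameter, and the exact heading format are assumptions, and the real `append_to_reflection` in `src/utils.py` may differ.

```python
import tempfile
from pathlib import Path


def append_to_reflection_sketch(notebook: str, section_title: str,
                                reflection_content: str, path: Path) -> Path:
    """Append one '## Notebook NN: Title' section to the rolling reflection file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    heading = f"## Notebook {notebook}: {section_title}"
    # Append mode keeps sections written by earlier notebooks intact.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"\n{heading}\n\n{reflection_content.strip()}\n")
    return path


# Demo in a temporary directory so nothing under outputs/ is touched.
demo_file = Path(tempfile.mkdtemp()) / "homework_reflection.md"
append_to_reflection_sketch("00", "Setup Verification", "All checks passed.", demo_file)
append_to_reflection_sketch("01", "Environment & First Model Call", "Model loaded.", demo_file)
reflection_text = demo_file.read_text(encoding="utf-8")
print(reflection_text)
```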
| Notebook | Points | Required artifacts |
|---|---|---|
| 00 | 2 | All 5 verification cells pass; reflection initialized |
| 01 | 4 | First model call succeeds; reflection on alignment behaviors |
| 02 | 8 | preference_data.jsonl ≥ 4 pairs; position-bias demo ran |
| 03 | 8 | reward_model_scores.png saved; eval accuracy reported |
| 04 | 10 | dpo_curves.png saved; before/after JSON saved |
| 05 | 8 | At least DPO + IPO trained (KTO/ORPO partial credit if skipped) |
| 06 | 8 | GRPO simulation step + at least 1 iterative DPO round complete |
| 07 | 8 | Win-rate, safety bar chart, and length-bias dict all printed |
| 08 | 4 | Capstone plan template filled in; final reflection written |
Reflection quality counts toward each notebook's score: a low-effort reflection (e.g., the unmodified TODO placeholder) loses 50% of that notebook's points.
TRL v1.0 (April 2026) moved KTO and ORPO to trl.experimental. Our src/dpo_runner.py already tries both import paths. If both fail: pip install -U 'trl>=0.16'.
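The "try both import paths" pattern can be sketched with a small generic helper. This is an illustration, not the code in `src/dpo_runner.py`: `import_first` is a hypothetical name, and the demo uses standard-library modules so it runs without TRL installed.

```python
import importlib


def import_first(name: str, module_paths: list):
    """Return attribute `name` from the first listed module that provides it."""
    errors = []
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, name)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_path}: {exc}")
    raise ImportError(f"could not import {name}; tried " + "; ".join(errors))


# Demo: the first module does not exist, so the helper falls through to `math`.
sqrt = import_first("sqrt", ["no_such_module_xyz", "math"])
print(sqrt(9.0))

# With TRL installed, the same helper could be used like this (module paths
# follow the note above and are an assumption about your installed version):
# KTOTrainer = import_first("KTOTrainer", ["trl", "trl.experimental"])
```

Collecting every failure into the final `ImportError` makes it easier to see whether the package is missing entirely or only the class has moved.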
Cut max_steps from 20 → 5 in the run_preference_training calls. Or set PATH = "A" to skip real training (you'll lose the loss curves but the rest still works).
Either run ollama pull qwen3.5:27b (~16 GB download), or set PATH = "A" to use Claude only.
Check internet, or comment out the UltraFeedback lines in NB02 / NB03. The notebooks fall back to your synthetic NB02 data.
Gradio is a soft dependency. The cell catches ImportError and prints a skip message. Install with pip install 'gradio>=4.0.0' if you want to run the annotator (the quotes stop the shell from treating > as an output redirect).
Harmless if you're using LoRA — adapter weights save separately. The run still succeeds.
- Hugging Face TRL v1.0 release — April 2026
- DPO paper — Rafailov et al., NeurIPS 2023
- DeepSeek-Math (GRPO)
- SimPO — NeurIPS 2024
- KTO
- ORPO
- IPO
- RLHFlow / Iterative DPO recipe
- UltraFeedback dataset
- Anthropic HH-RLHF dataset
- TRL DPOTrainer docs