Week 7 — Alignment: RLHF, DPO, GRPO

Hands-on alignment of Qwen2.5-0.5B-Instruct using the 2026 production stack: HuggingFace TRL v1.0+, Direct Preference Optimization, and the latest variants (IPO, KTO, ORPO, GRPO, Iterative DPO).

By the end of this homework you'll have built every piece of a real alignment pipeline — preference data collection, a Bradley-Terry reward model, a DPO-trained adapter, and a complete evaluation suite — at small enough scale to run on a laptop in under an hour.

Learning objectives

After completing the 9 notebooks you will be able to:

Explain what alignment does on top of pretraining + SFT and why every modern LLM (Claude, GPT, Llama, Qwen) ships with it.
Collect preference pairs synthetically (NB02) and from public datasets (UltraFeedback, HH-RLHF), with position-bias control for LLM judges.
Train a Bradley-Terry reward model and measure pairwise accuracy on held-out data (NB03).
Fine-tune Qwen2.5-0.5B with DPO (NB04) and the modern variants — IPO, KTO, ORPO (NB05).
Demonstrate GRPO mathematically (NB06 Part 1) and run a real iterative-DPO loop the same way Llama-3-Instruct was built (NB06 Part 3).
Evaluate alignment with pairwise win-rate, safety/refusal rate, length-bias check, and reward-hacking signals (NB07).
Plan how to apply alignment to your Week-8 capstone (NB08).
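
The position-bias control and pairwise win-rate named in the objectives above can be sketched in a few lines. This is a minimal illustration, not the code in src/preference_data.py or src/alignment_eval.py: the judge signature, the "first"/"second" verdict strings, and the toy judges are all assumptions made for the example. The idea is to ask the judge twice with the two responses swapped and to trust only verdicts that survive the swap.

```python
from typing import Callable, List, Tuple

# Assumed judge interface: (prompt, first_response, second_response) -> "first" | "second".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(prompt, a, b)    # a is shown in the first slot
    backward = judge(prompt, b, a)   # b is shown in the first slot
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with position, so treat it as unreliable


def win_rate(verdicts: List[str]) -> float:
    """Win rate for "a", counting each tie as half a win."""
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == "a" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


# Toy judge with pure position bias: it always prefers the first slot.
def first_slot_judge(prompt: str, first: str, second: str) -> str:
    return "first"


# Toy judge that prefers the longer response regardless of slot.
def longer_judge(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


pairs: List[Tuple[str, str, str]] = [
    ("q1", "a long detailed answer", "short"),
    ("q2", "ok", "a much longer answer"),
]
biased = [debiased_verdict(first_slot_judge, p, a, b) for p, a, b in pairs]
by_length = [debiased_verdict(longer_judge, p, a, b) for p, a, b in pairs]
print(biased, by_length, win_rate(by_length))
```

A judge that only looks at position ends up with nothing but ties, which is the signal that its raw verdicts should not be trusted. The second toy judge is consistent across orders but rewards length, which is why a separate length-bias check is still needed.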

Folder structure

Homework7-Submission/
├── README.md                       (this file)
├── requirements.txt                full dependency list
├── .env.example                    copy to .env and add your API keys
├── notebooks/
│   ├── 00_setup_verification.ipynb       Verify Python, packages, API keys (~5 min)
│   ├── 01_environment_setup.ipynb        Pick path A/B/C, first model load (~10 min)
│   ├── 02_preference_data.ipynb          Build pairs + judge + UltraFeedback + Gradio (~25 min)
│   ├── 03_reward_model.ipynb             Train Bradley-Terry RM + accuracy plot (~25 min)
│   ├── 04_dpo.ipynb                      DPO training + before/after comparison (~30 min)
│   ├── 05_dpo_variants.ipynb             IPO / KTO / ORPO loss-curve comparison (~30 min)
│   ├── 06_grpo_iterative.ipynb           GRPO sim + 2-round Iterative DPO (~25 min)
│   ├── 07_alignment_eval.ipynb           Win-rate + safety + length-bias eval (~25 min)
│   └── 08_project_integration.ipynb      Connect alignment to your capstone (~15 min)
├── src/
│   ├── __init__.py                       Light core only (LLMClient, CostTracker, utils)
│   ├── config.py                         PATH, model IDs, dataset names, output paths
│   ├── llm_client.py                     Unified Claude + Ollama client (chat API)
│   ├── cost_tracker.py                   Token + cost tracking across all paths
│   ├── utils.py                          save_task_output(), append_to_reflection()
│   ├── prompt_templates.py               CO-STAR templates (carryover from HW3+)
│   ├── preference_data.py                Build/load/judge/convert preference data
│   ├── reward_model.py                   Bradley-Terry RM via trl.RewardTrainer
│   ├── dpo_runner.py                     DPO/IPO/KTO/ORPO unified wrapper
│   ├── grpo_demo.py                      Pure-Python GRPO + real GRPOConfig builder
│   ├── iterative_dpo.py                  Llama-3-style sample → judge → DPO loop
│   ├── alignment_eval.py                 Win-rate, refusal, length-bias detectors
│   ├── safety_prompts.py                 HARMFUL / BENIGN curated prompt sets
│   └── gradio_annotator.py               Build a human pairwise-annotation UI
├── outputs/                              Auto-populated by notebooks
│   ├── homework_reflection.md            Single rolling reflection across all 9 notebooks
│   ├── preference_data.jsonl             NB02
│   ├── ultrafeedback_tiny.jsonl          NB02
│   ├── kto_format.jsonl                  NB02
│   ├── reward_format.jsonl               NB02
│   ├── reward_model/                     NB03 (RM weights)
│   ├── reward_model_scores.png           NB03
│   ├── dpo_adapter/                      NB04 (DPO LoRA)
│   ├── dpo_curves.png                    NB04
│   ├── dpo_before_after.json             NB04 (read by NB07)
│   ├── variants_dpo/, variants_ipo/, variants_kto/, variants_orpo/   NB05
│   ├── variants_loss.png                 NB05
│   ├── grpo_config.json                  NB06
│   ├── iterative_dpo/round_0/, round_1/  NB06
│   └── safety_refusal.png                NB07
└── docs/                                 (placeholder for any extra writeups)
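
The outputs/ listing above contains both a paired file (preference_data.jsonl) and an unpaired one (kto_format.jsonl). As a rough sketch of how one could be derived from the other, the snippet below splits each pair into two labeled rows. The field names (prompt, chosen, rejected, completion, label) are assumed to follow TRL's documented preference and unpaired-preference dataset formats; the homework's actual file schema may differ, and the helper names are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def pair_to_kto_rows(pair: Dict[str, str]) -> List[Dict[str, object]]:
    """Split one {prompt, chosen, rejected} pair into two unpaired rows."""
    return [
        {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True},
        {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False},
    ]


def to_jsonl(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Illustrative pair, not taken from the homework data.
pairs = [
    {
        "prompt": "Explain overfitting in one sentence.",
        "chosen": "Overfitting is when a model memorizes training noise and generalizes poorly.",
        "rejected": "It is bad.",
    }
]
kto_rows = [row for pair in pairs for row in pair_to_kto_rows(pair)]
print(to_jsonl(kto_rows))
```

Because KTO only needs a binary signal per completion, the same rows could also come from thumbs-up/thumbs-down feedback that never had a matching pair.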

Quick start

cd Homework7-Submission/
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env       # edit and fill ANTHROPIC_API_KEY (or use Ollama only)
jupyter lab notebooks/

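Then run the notebooks in order: 00 → 08. Expect about 3 hours of working time end to end (the per-notebook estimates above sum to roughly 190 minutes). Actual training compute is much shorter and is dominated by NB04 (DPO training, ~5 min) and NB06 (iterative DPO, ~10 min).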
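
If you want a quick sanity check before opening NB00, something like the following sketch can report missing packages or keys. It is not the notebook's own verification code; the package list, the minimum Python version, and the function name are assumptions for illustration, and requirements.txt remains the authoritative dependency list.

```python
import importlib.util
import os
import sys
from typing import Dict, Iterable, Tuple


def check_environment(
    packages: Iterable[str],
    env_vars: Iterable[str],
    min_python: Tuple[int, int] = (3, 10),  # assumed minimum, not stated in this README
) -> Dict[str, bool]:
    """Report which top-level packages are importable and which env vars are set."""
    report = {"python": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[f"pkg:{name}"] = importlib.util.find_spec(name) is not None
    for var in env_vars:
        report[f"env:{var}"] = bool(os.environ.get(var))
    return report


# Package names here are a guess at the core stack; adjust to requirements.txt.
report = check_environment(["trl", "datasets"], ["ANTHROPIC_API_KEY"])
for check, ok in report.items():
    print(f"{'OK     ' if ok else 'MISSING'} {check}")
```

A missing ANTHROPIC_API_KEY is only a problem on paths that call Claude.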

Three paths through Week 7

Pick one in src/config.py (PATH = "A" | "B" | "C"):

Path	What you run	Hardware needed	Best for
A	Cloud-only (Claude judging, conceptual training)	Any laptop + Claude key	Limited compute, focus on ideas
B	Real DPO/RM/Iterative-DPO on Qwen2.5-0.5B (HuggingFace)	8 GB RAM, ~10 min total compute	The default — recommended
C	A + B combined (cloud judges, local training)	Both	Best educational value (slowest)

What's new in 2026 (vs the original lecture deck)

The class slides cover RLHF, DPO, GRPO, and "modern open-source tools." This homework brings everything to current 2026 best practice:

Topic in slides	What we actually use in the homework
RLHF / PPO	Conceptual coverage in NB03+NB06 (no PPO training — too expensive)
Reward Model	Bradley-Terry RM via `trl.RewardTrainer` (TRL v1.0 stable API)
DPO	`trl.DPOTrainer` + LoRA on Qwen2.5-0.5B — the default choice
GRPO	Pure-Python simulation + ready-to-paste `GRPOConfig` for Colab GPU
Iterative DPO	Real 2-round loop (Llama-3 / RLHFlow recipe) — `src/iterative_dpo.py`
TRL v1.0 (Apr 2026)	Unified post-training stack — we use the stable trainers + flag the experimental ones
IPO (AISTATS 2024)	One config flag (`loss_type="ipo"`) inside `DPOTrainer`; covered in NB05
KTO (Mar 2024)	`trl.experimental.KTOTrainer` — binary signals, no pairs needed (NB05)
ORPO (Apr 2024)	`trl.experimental.ORPOTrainer` — single-stage SFT+alignment (NB05)
SimPO (NeurIPS 2024)	Conceptual coverage in NB04 + NB05 (reference-free implicit reward)
UltraFeedback / HH-RLHF	Tiny slices loaded via `datasets.load_dataset` in NB02
Gradio annotation	Working `build_annotator()` Blocks app in `src/gradio_annotator.py`
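
To make the relationship between the reward-model, DPO, and IPO rows above concrete, here is a small numeric sketch of the three per-example losses on scalar inputs. It is a teaching illustration under stated assumptions, not what `trl.RewardTrainer` or `trl.DPOTrainer` execute internally: the inputs stand for summed sequence log-probabilities, `beta=0.1` is an arbitrary choice, and all function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def log_ratio_margin(pi_c: float, pi_r: float, ref_c: float, ref_r: float) -> float:
    """log(pi/ref) of the chosen response minus log(pi/ref) of the rejected one."""
    return (pi_c - ref_c) - (pi_r - ref_r)


def dpo_loss(pi_c: float, pi_r: float, ref_c: float, ref_r: float, beta: float = 0.1) -> float:
    """Bradley-Terry loss applied to the implicit rewards beta * log(pi/ref)."""
    return -log_sigmoid(beta * log_ratio_margin(pi_c, pi_r, ref_c, ref_r))


def ipo_loss(pi_c: float, pi_r: float, ref_c: float, ref_r: float, beta: float = 0.1) -> float:
    """Squared error that pulls the margin toward the fixed target 1 / (2 * beta)."""
    return (log_ratio_margin(pi_c, pi_r, ref_c, ref_r) - 1.0 / (2.0 * beta)) ** 2


# Policy identical to the reference: the margin is 0.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))   # log(2), the "no preference" value
print(ipo_loss(-10.0, -12.0, -10.0, -12.0))   # (0 - 5)^2 with beta = 0.1
# Policy has moved toward the chosen response: the DPO loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The DPO loss keeps shrinking as the margin grows, whereas the IPO loss is minimized at a finite margin; that bounded target is the practical difference the `loss_type="ipo"` flag selects.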

How the modules fit together

              ┌────────────────────────────┐
              │   src/preference_data.py   │   build/load/judge/convert
              │  (used by every notebook)  │
              └─────────────┬──────────────┘
                            │
          ┌─────────────────┼───────────────────┐
          ▼                 ▼                   ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ src/reward_model │ │  src/dpo_runner  │ │  src/iterative_  │
│   (NB03)         │ │  DPO/IPO/KTO/ORPO│ │   dpo (NB06)     │
│                  │ │  (NB04, NB05)    │ │                  │
└──────────┬───────┘ └─────────┬────────┘ └────────┬─────────┘
           │                   │                    │
           └─────────┬─────────┴────────────────────┘
                     ▼
              ┌──────────────────────────┐
              │   src/alignment_eval.py  │   winrate + safety + length bias
              │          (NB07)          │
              └──────────────────────────┘

src/grpo_demo.py lives off to the side — it's a teaching module (pure Python step + config dict generator), not a real trainer.
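
For orientation, the core of a GRPO step is the group-relative advantage: sample several completions for one prompt, score them, and normalize each reward against the group's mean and standard deviation, with no learned value model. The sketch below shows only that normalization. It is not the code in src/grpo_demo.py; the function name, the epsilon, and the use of the population standard deviation are assumptions (library implementations may use the sample standard deviation instead).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every completion gets the same reward, the group carries no learning signal.
print(group_relative_advantages([2.0, 2.0, 2.0]))
```

Completions that beat their group average get a positive advantage and are reinforced; the rest are pushed down. A group with identical rewards yields all-zero advantages, which is one reason reward design and group size matter in practice.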

Reflection file

Every notebook calls append_to_reflection(notebook=..., section_title=..., reflection_content=...) exactly once at the end. The result is a single rolling document at outputs/homework_reflection.md:

# Week 7: Alignment — RLHF, DPO, GRPO -- Homework Reflection

**Student Name:** ...
**Path Selected:** ...

## Notebook 00: Setup Verification
...
## Notebook 01: Environment & First Model Call
...
... (one section per notebook)
## Notebook 08: Project Integration & Final Reflection
...

You will not have to copy/paste reflections by hand — each notebook has a TODO cell where you write your reflection text into a string variable, and the summary cell at the bottom auto-appends it.
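
If you are curious what an append-style helper of this kind might look like, here is a hypothetical sketch. It is not the implementation in src/utils.py: the function name, the explicit `path` argument, and the exact section formatting are assumptions modeled on the sample reflection file above.

```python
import tempfile
from pathlib import Path


def append_to_reflection_sketch(
    notebook: str, section_title: str, reflection_content: str, path: Path
) -> None:
    """Append one '## Notebook NN: Title' section to a rolling Markdown file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    section = f"## Notebook {notebook}: {section_title}\n{reflection_content.strip()}\n\n"
    # Mode "a" creates the file if needed and never overwrites earlier sections.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(section)


# Demonstration in a temporary directory so no real outputs/ folder is touched.
demo_path = Path(tempfile.mkdtemp()) / "outputs" / "homework_reflection.md"
append_to_reflection_sketch("00", "Setup Verification", "All checks passed.", demo_path)
append_to_reflection_sketch("01", "Environment & First Model Call", "Model loaded.", demo_path)
text = demo_path.read_text(encoding="utf-8")
print(text)
```

Opening the file in append mode is what makes the document "rolling": re-running a notebook would add a second section rather than replace the first, so a real helper may also need to detect and update existing sections.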

Grading rubric (60 points total)

Notebook	Points	Required artifacts
00	2	All 5 verification cells pass; reflection initialized
01	4	First model call succeeds; reflection on alignment behaviors
02	8	`preference_data.jsonl` ≥ 4 pairs; position-bias demo ran
03	8	`reward_model_scores.png` saved; eval accuracy reported
04	10	`dpo_curves.png` saved; before/after JSON saved
05	8	At least DPO + IPO trained (KTO/ORPO partial credit if skipped)
06	8	GRPO simulation step + at least 1 iterative DPO round complete
07	8	Win-rate, safety bar chart, and length-bias dict all printed
08	4	Capstone plan template filled in; final reflection written

Reflection quality is weighted into each notebook's score: a missing or low-effort reflection (e.g., the unmodified TODO placeholder) loses 50% of that notebook's points.

Troubleshooting

"ImportError: cannot import name 'KTOTrainer'"

TRL v1.0 (April 2026) moved KTO and ORPO to trl.experimental. Our src/dpo_runner.py already tries both import paths. If both fail: pip install -U 'trl>=0.16'.
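
The "try both import paths" pattern mentioned above can be written generically. The sketch below is an illustration, not the code in src/dpo_runner.py; the helper name is hypothetical, and it is demonstrated with the standard library so that it runs without TRL installed.

```python
import importlib
from typing import Any, Sequence


def import_first(attr: str, module_paths: Sequence[str]) -> Any:
    """Return `attr` from the first module in `module_paths` that provides it."""
    errors = []
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_path}: {exc}")
    raise ImportError(f"could not import {attr!r}; tried: " + "; ".join(errors))


# Stand-in demonstration: the first module does not exist, the second one does.
sqrt = import_first("sqrt", ["a_module_that_does_not_exist", "math"])
print(sqrt(9))

# For this homework the call would presumably look like the line below, with both
# module paths taken from this README rather than verified against a TRL install:
#   KTOTrainer = import_first("KTOTrainer", ["trl.experimental", "trl"])
```

Collecting the individual error messages makes the final ImportError more useful than a bare failure, since it shows which locations were tried and why each one was rejected.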

NB04 / NB05 / NB06 take forever

Cut max_steps from 20 → 5 in the run_preference_training calls. Or set PATH = "A" to skip real training (you'll lose the loss curves but the rest still works).

"qwen3.5:27b not found"

Either run ollama pull qwen3.5:27b (~16GB download), or set PATH = "A" to use Claude only.

UltraFeedback download hangs

Check your internet connection, or comment out the UltraFeedback lines in NB02 / NB03; the notebooks then fall back to your synthetic NB02 data.

Gradio import error in NB02

Gradio is a soft dependency. The cell catches ImportError and prints a skip message. Install it with pip install 'gradio>=4.0.0' if you want to run the annotator (the quotes matter: without them the shell treats > as an output redirect).

"save_model failed" warning during DPO

Harmless if you're using LoRA — adapter weights save separately. The run still succeeds.

References (used while building this homework)

Hugging Face TRL v1.0 release — April 2026
DPO paper — Rafailov et al., NeurIPS 2023
DeepSeek-Math (GRPO)
SimPO — NeurIPS 2024
KTO
ORPO
IPO
RLHFlow / Iterative DPO recipe
UltraFeedback dataset
Anthropic HH-RLHF dataset
TRL DPOTrainer docs
