inference-ai-course/Homework7-Submission

Week 7 — Alignment: RLHF, DPO, GRPO

Hands-on alignment of Qwen2.5-0.5B-Instruct using the 2026 production stack: HuggingFace TRL v1.0+, Direct Preference Optimization, and the latest variants (IPO, KTO, ORPO, GRPO, Iterative DPO).

By the end of this homework you'll have built every piece of a real alignment pipeline — preference data collection, a Bradley-Terry reward model, a DPO-trained adapter, and a complete evaluation suite — at small enough scale to run on a laptop in under an hour.


Learning objectives

After completing the 9 notebooks you will be able to:

  1. Explain what alignment does on top of pretraining + SFT and why every modern LLM (Claude, GPT, Llama, Qwen) ships with it.
  2. Collect preference pairs synthetically (NB02) and from public datasets (UltraFeedback, HH-RLHF), with position-bias control for LLM judges.
  3. Train a Bradley-Terry reward model and measure pairwise accuracy on held-out data (NB03).
  4. Fine-tune Qwen2.5-0.5B with DPO (NB04) and the modern variants — IPO, KTO, ORPO (NB05).
  5. Demonstrate GRPO mathematically (NB06 Part 1) and run a real iterative-DPO loop the same way Llama-3-Instruct was built (NB06 Part 3).
  6. Evaluate alignment with pairwise win-rate, safety/refusal rate, length-bias check, and reward-hacking signals (NB07).
  7. Plan how to apply alignment to your Week-8 capstone (NB08).
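
As a small illustration of objective 3, the Bradley-Terry loss and the pairwise-accuracy metric can be sketched in plain Python. This is a minimal sketch, not the code in src/reward_model.py: the function names and the toy scores are assumptions made for illustration.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def pairwise_accuracy(score_pairs) -> float:
    """Fraction of (r_chosen, r_rejected) pairs where the chosen score is higher."""
    return sum(rc > rr for rc, rr in score_pairs) / len(score_pairs)

# Toy reward-model scores for four held-out pairs (made-up numbers)
scores = [(1.2, -0.3), (0.1, 0.4), (2.0, 0.5), (-0.2, -1.0)]
print(round(bradley_terry_loss(0.0, 0.0), 4))  # 0.6931, i.e. log(2) when the model is undecided
print(pairwise_accuracy(scores))               # 0.75
```

A loss of log(2) means the reward model cannot tell the two responses apart; training pushes the margin up, which drives the loss toward zero.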

Folder structure

Homework7-Submission/
├── README.md                       (this file)
├── requirements.txt                full dependency list
├── .env.example                    copy to .env and add your API keys
├── notebooks/
│   ├── 00_setup_verification.ipynb       Verify Python, packages, API keys (~5 min)
│   ├── 01_environment_setup.ipynb        Pick path A/B/C, first model load (~10 min)
│   ├── 02_preference_data.ipynb          Build pairs + judge + UltraFeedback + Gradio (~25 min)
│   ├── 03_reward_model.ipynb             Train Bradley-Terry RM + accuracy plot (~25 min)
│   ├── 04_dpo.ipynb                      DPO training + before/after comparison (~30 min)
│   ├── 05_dpo_variants.ipynb             IPO / KTO / ORPO loss-curve comparison (~30 min)
│   ├── 06_grpo_iterative.ipynb           GRPO sim + 2-round Iterative DPO (~25 min)
│   ├── 07_alignment_eval.ipynb           Win-rate + safety + length-bias eval (~25 min)
│   └── 08_project_integration.ipynb      Connect alignment to your capstone (~15 min)
├── src/
│   ├── __init__.py                       Light core only (LLMClient, CostTracker, utils)
│   ├── config.py                         PATH, model IDs, dataset names, output paths
│   ├── llm_client.py                     Unified Claude + Ollama client (chat API)
│   ├── cost_tracker.py                   Token + cost tracking across all paths
│   ├── utils.py                          save_task_output(), append_to_reflection()
│   ├── prompt_templates.py               CO-STAR templates (carryover from HW3+)
│   ├── preference_data.py                Build/load/judge/convert preference data
│   ├── reward_model.py                   Bradley-Terry RM via trl.RewardTrainer
│   ├── dpo_runner.py                     DPO/IPO/KTO/ORPO unified wrapper
│   ├── grpo_demo.py                      Pure-Python GRPO + real GRPOConfig builder
│   ├── iterative_dpo.py                  Llama-3-style sample → judge → DPO loop
│   ├── alignment_eval.py                 Win-rate, refusal, length-bias detectors
│   ├── safety_prompts.py                 HARMFUL / BENIGN curated prompt sets
│   └── gradio_annotator.py               Build a human pairwise-annotation UI
├── outputs/                              Auto-populated by notebooks
│   ├── homework_reflection.md            Single rolling reflection across all 9 notebooks
│   ├── preference_data.jsonl             NB02
│   ├── ultrafeedback_tiny.jsonl          NB02
│   ├── kto_format.jsonl                  NB02
│   ├── reward_format.jsonl               NB02
│   ├── reward_model/                     NB03 (RM weights)
│   ├── reward_model_scores.png           NB03
│   ├── dpo_adapter/                      NB04 (DPO LoRA)
│   ├── dpo_curves.png                    NB04
│   ├── dpo_before_after.json             NB04 (read by NB07)
│   ├── variants_dpo/, variants_ipo/, variants_kto/, variants_orpo/   NB05
│   ├── variants_loss.png                 NB05
│   ├── grpo_config.json                  NB06
│   ├── iterative_dpo/round_0/, round_1/  NB06
│   └── safety_refusal.png                NB07
└── docs/                                 (placeholder for any extra writeups)
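
The outputs folder holds the same preference data in several shapes (pairs for DPO, unpaired rows for KTO). The conversion can be sketched as below. This is a hedged sketch: the field names prompt / chosen / rejected and prompt / completion / label follow the common TRL dataset conventions and are assumptions here, so the real schema written by src/preference_data.py may differ.

```python
import json

def pairs_to_kto(pairs):
    """Split each (prompt, chosen, rejected) pair into two unpaired, labeled rows."""
    rows = []
    for pair in pairs:
        # The chosen answer becomes a desirable example ...
        rows.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        # ... and the rejected answer becomes an undesirable one.
        rows.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return rows

def to_jsonl(rows) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

# Hypothetical example pair
pairs = [{"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}]
kto_rows = pairs_to_kto(pairs)
print(to_jsonl(kto_rows))
```

Each pair yields two rows, which is why a KTO file built this way has twice as many lines as the pairwise file it came from.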

Quick start

cd Homework7-Submission/
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env       # edit and fill ANTHROPIC_API_KEY (or use Ollama only)
jupyter lab notebooks/

Then run the notebooks in order, 00 → 08. Working through all nine takes about 3 hours; the longest training steps are NB04 (DPO training, ~5 min) and NB06 (iterative DPO, ~10 min).


Three paths through Week 7

Pick one in src/config.py (PATH = "A" | "B" | "C"):

| Path | What you run | Hardware needed | Best for |
|------|--------------|-----------------|----------|
| A | Cloud-only (Claude judging, conceptual training) | Any laptop + Claude key | Limited compute, focus on ideas |
| B | Real DPO/RM/Iterative-DPO on Qwen2.5-0.5B (HuggingFace) | 8 GB RAM, ~10 min total compute | The default (recommended) |
| C | A + B combined (cloud judges, local training) | Both | Best educational value (slowest) |

What's new in 2026 (vs the original lecture deck)

The class slides cover RLHF, DPO, GRPO, and "modern open-source tools." This homework brings everything to current 2026 best practice:

| Topic in slides | What we actually use in the homework |
|-----------------|--------------------------------------|
| RLHF / PPO | Conceptual coverage in NB03 + NB06 (no PPO training; too expensive) |
| Reward Model | Bradley-Terry RM via trl.RewardTrainer (TRL v1.0 stable API) |
| DPO | trl.DPOTrainer + LoRA on Qwen2.5-0.5B, the default choice |
| GRPO | Pure-Python simulation + ready-to-paste GRPOConfig for Colab GPU |
| Iterative DPO | Real 2-round loop (Llama-3 / RLHFlow recipe) in src/iterative_dpo.py |
| TRL v1.0 (Apr 2026) | Unified post-training stack; we use the stable trainers and flag the experimental ones |
| IPO (AISTATS 2024) | One config flag (loss_type="ipo") inside DPOTrainer, covered in NB05 |
| KTO (Mar 2024) | trl.experimental.KTOTrainer: binary signals, no pairs needed (NB05) |
| ORPO (Apr 2024) | trl.experimental.ORPOTrainer: single-stage SFT + alignment (NB05) |
| SimPO (NeurIPS 2024) | Conceptual coverage in NB04 + NB05 (reference-free implicit reward) |
| UltraFeedback / HH-RLHF | Tiny slices loaded via datasets.load_dataset in NB02 |
| Gradio annotation | Working build_annotator() Blocks app in src/gradio_annotator.py |
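
To make the DPO and IPO rows concrete, here is a minimal per-pair sketch of the two losses in plain Python. It follows the published formulas rather than TRL's internals, and the function names, the beta value, and the log-probabilities are assumptions chosen for illustration.

```python
import math

def log_ratio_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """h = log-ratio of chosen minus log-ratio of rejected (policy vs. reference)."""
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)

def dpo_loss(h: float, beta: float = 0.1) -> float:
    """DPO: -log(sigmoid(beta * h)), in a numerically stable softplus form."""
    z = beta * h
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def ipo_loss(h: float, beta: float = 0.1) -> float:
    """IPO: squared distance between the margin h and the target 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

# Made-up sequence log-probabilities for one preference pair
h = log_ratio_margin(pol_chosen=-12.0, pol_rejected=-15.0,
                     ref_chosen=-13.0, ref_rejected=-14.0)
print(h)                         # 2.0
print(round(dpo_loss(h), 4))     # keeps shrinking as h grows
print(round(ipo_loss(h), 4))     # minimized when h reaches 1 / (2 * beta) = 5
```

The design difference shows up in the shapes: the DPO loss keeps rewarding a larger margin, while the IPO loss penalizes overshooting its target, which is the regularization IPO was proposed for.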

How the modules fit together

              ┌────────────────────────────┐
              │   src/preference_data.py   │   build/load/judge/convert
              │  (used by every notebook)  │
              └─────────────┬──────────────┘
                            │
          ┌─────────────────┼───────────────────┐
          ▼                 ▼                   ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ src/reward_model │ │  src/dpo_runner  │ │  src/iterative_  │
│   (NB03)         │ │  DPO/IPO/KTO/ORPO│ │   dpo (NB06)     │
│                  │ │  (NB04, NB05)    │ │                  │
└──────────┬───────┘ └─────────┬────────┘ └────────┬─────────┘
           │                   │                    │
           └─────────┬─────────┴────────────────────┘
                     ▼
              ┌──────────────────────────┐
              │   src/alignment_eval.py  │   winrate + safety + length bias
              │          (NB07)          │
              └──────────────────────────┘
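
The evaluation stage at the bottom of the diagram can be pictured with two small metrics. This is a hedged sketch and not the code in src/alignment_eval.py: the function names, the verdict labels, the half-credit rule for ties, and the word-count length measure are all assumptions.

```python
def win_rate(verdicts) -> float:
    """Share of comparisons the aligned model wins; a tie counts as half a win."""
    credit = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(credit[v] for v in verdicts) / len(verdicts)

def length_bias(chosen, rejected) -> dict:
    """Check whether preferred answers are systematically longer (in words)."""
    chosen_lens = [len(text.split()) for text in chosen]
    rejected_lens = [len(text.split()) for text in rejected]
    longer = sum(c > r for c, r in zip(chosen_lens, rejected_lens))
    return {
        "chosen_longer_frac": longer / len(chosen_lens),
        "mean_len_chosen": sum(chosen_lens) / len(chosen_lens),
        "mean_len_rejected": sum(rejected_lens) / len(rejected_lens),
    }

# Toy judge verdicts and toy response pairs
print(win_rate(["win", "win", "tie", "loss"]))  # 0.625
report = length_bias(
    chosen=["a b c d", "short"],
    rejected=["a b", "much longer answer here"],
)
print(report)
```

A chosen_longer_frac far above 0.5 suggests the judge or the preference data may be rewarding length rather than quality.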

src/grpo_demo.py lives off to the side — it's a teaching module (pure Python step + config dict generator), not a real trainer.
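
The core of a GRPO step is the group-relative advantage: sample several completions for one prompt, score them, and normalize each reward against its own group. The sketch below shows that idea only. It is not the code in src/grpo_demo.py; the function name, the use of the population standard deviation, and the eps value are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is the group mean, no separate value model is needed, and a group where every completion scores the same contributes zero advantage.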


Reflection file

Every notebook calls append_to_reflection(notebook=..., section_title=..., reflection_content=...) exactly once at the end. The result is a single rolling document at outputs/homework_reflection.md:

# Week 7: Alignment — RLHF, DPO, GRPO -- Homework Reflection

**Student Name:** ...
**Path Selected:** ...

## Notebook 00: Setup Verification
...
## Notebook 01: Environment & First Model Call
...
... (one section per notebook)
## Notebook 08: Project Integration & Final Reflection
...

You will not have to copy/paste reflections by hand — each notebook has a TODO cell where you write your reflection text into a string variable, and the summary cell at the bottom auto-appends it.
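
For orientation, the append step can be pictured roughly as follows. This is a hypothetical re-implementation, not the function in src/utils.py: only the three keyword names come from the call shown above, while the path parameter, the heading format, and the return value are assumptions.

```python
import tempfile
from pathlib import Path

def append_to_reflection_sketch(notebook, section_title, reflection_content,
                                path="outputs/homework_reflection.md"):
    """Append one '## Notebook NN: Title' section to the rolling reflection file."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create outputs/ on first use
    heading = f"## Notebook {notebook}: {section_title}"
    with out.open("a", encoding="utf-8") as handle:  # "a" keeps earlier sections
        handle.write(f"\n{heading}\n\n{reflection_content.strip()}\n")
    return heading

# Demo in a temporary folder so nothing in the real outputs/ is touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "homework_reflection.md"
    append_to_reflection_sketch("00", "Setup Verification", "All checks passed.", path=target)
    append_to_reflection_sketch("01", "Environment & First Model Call", "Model loaded.", path=target)
    demo_text = target.read_text(encoding="utf-8")
print(demo_text)
```

Opening the file in append mode is what makes the document "rolling": each notebook adds its section without rewriting the ones before it.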


Grading rubric (60 points total)

| Notebook | Points | Required artifacts |
|----------|--------|--------------------|
| 00 | 2 | All 5 verification cells pass; reflection initialized |
| 01 | 4 | First model call succeeds; reflection on alignment behaviors |
| 02 | 8 | preference_data.jsonl ≥ 4 pairs; position-bias demo ran |
| 03 | 8 | reward_model_scores.png saved; eval accuracy reported |
| 04 | 10 | dpo_curves.png saved; before/after JSON saved |
| 05 | 8 | At least DPO + IPO trained (KTO/ORPO partial credit if skipped) |
| 06 | 8 | GRPO simulation step + at least 1 iterative DPO round complete |
| 07 | 8 | Win-rate, safety bar chart, and length-bias dict all printed |
| 08 | 4 | Capstone plan template filled in; final reflection written |

Reflection quality counts toward each notebook's score: a placeholder reflection (e.g., the unmodified TODO text) costs 50% of that notebook's points.


Troubleshooting

"ImportError: cannot import name 'KTOTrainer'"

TRL v1.0 (April 2026) moved KTO and ORPO to trl.experimental. Our src/dpo_runner.py already tries both import paths. If both fail: pip install -U 'trl>=0.16'.
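
The "try both import paths" idea can be sketched generically as below. This is an illustrative helper, not the code in src/dpo_runner.py: the name import_first is hypothetical, and the module paths in the usage comment are taken from this README rather than verified against a specific TRL release.

```python
import importlib

def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it."""
    errors = []
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, name)
        except (ImportError, AttributeError) as exc:
            # Remember why this location failed, then try the next one
            errors.append(f"{module_path}: {exc}")
    raise ImportError(f"cannot import {name!r}; tried: " + "; ".join(errors))

# Intended use in this homework (paths are assumptions based on the note above):
#   KTOTrainer = import_first("KTOTrainer", ["trl", "trl.experimental"])

# Runnable demo with the standard library only
OrderedDict = import_first("OrderedDict", ["not_a_real_module", "collections"])
print(OrderedDict.__name__)  # OrderedDict
```

Collecting the individual failures in the final error message makes it easier to tell a missing package from a class that has moved.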

NB04 / NB05 / NB06 take forever

Cut max_steps from 20 → 5 in the run_preference_training calls. Or set PATH = "A" to skip real training (you'll lose the loss curves but the rest still works).

"qwen3.5:27b not found"

Either run ollama pull qwen3.5:27b (~16 GB download), or set PATH = "A" to use Claude only.

UltraFeedback download hangs

Check internet, or comment out the UltraFeedback lines in NB02 / NB03. The notebooks fall back to your synthetic NB02 data.

Gradio import error in NB02

This is a soft dependency. The cell catches ImportError and prints a skip message. Install with pip install 'gradio>=4.0.0' if you want to run the annotator (the quotes stop the shell from treating > as an output redirect).

"save_model failed" warning during DPO

Harmless if you're using LoRA — adapter weights save separately. The run still succeeds.

