Skip to content

inference-ai-course/Homework8-Submission

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Week 8: Hallucinations, Jailbreaks & Ethical Safeguards in LLMs

Course: Machine Learning Engineer in the Generative AI Era Week: 8 of 10 Topic: Factuality, automated red-teaming, prompt injection, production guardrails, bias, MLCommons AILuminate


Overview

By 2026 the "AI safety" stack has split into five concrete engineering disciplines: factuality / hallucination control, jailbreak resistance, prompt-injection defense (including indirect injection inside retrieved content and tool outputs — the dominant 2026 vector), production guardrails (Llama Prompt-Guard 2, Llama Guard 3/4, ShieldGemma 2, NeMo Guardrails, Anthropic Constitutional Classifiers), and bias / hazard evaluation (BBQ, BOLD, MLCommons AILuminate v1.1). This week treats every one of those as an engineering problem with measurable metrics, not a slide deck of generic principles.

You will hand-roll mini implementations of the most influential 2025–2026 techniques (SelfCheckGPT, FActScore-lite, PAIR, Crescendo, StrongREJECT, a layered guardrail chain, AILuminate-12), run them against your chosen model, then write a safety plan for one feature of your capstone project.


Learning Objectives

  1. Trigger and measure hallucinations with SelfCheckGPT, FActScore-lite, and HHEM-style NLI grounding.
  2. Run automated single-turn red-teaming (PAIR) and multi-turn red-teaming (Crescendo) on a benign target.
  3. Score red-team responses with the StrongREJECT autograder.
  4. Build a minimal ReAct agent and observe direct, indirect, and tool-result prompt injection.
  5. Compose a multi-layer production guardrail chain (regex → Prompt-Guard → Llama-Guard → Constitutional Classifier).
  6. Run BBQ-mini, BOLD-mini, and MLCommons AILuminate-mini on your chosen model.
  7. Translate all of the above into a production safety plan for your capstone feature.

Setup Options

Path A: Claude API (Cloud) — Recommended

  • Default model: claude-sonnet-4-6
  • Cost: ~$2–4 for the full assignment
  • Requires: ANTHROPIC_API_KEY in .env

Path B: Ollama (Local / Free)

  • Default model: qwen3.5:27b
  • Requires: ~20 GB RAM, ollama pull qwen3.5:27b
  • Pedagogical note: open-weights local models often have weaker safety training than frontier APIs, which makes them a more honest target for studying jailbreaks. The contrast is part of the lesson.

Path C: Hybrid (Recommended for the bonus)

  • Use Claude for the attacker / judge roles and Ollama as the target, so the cross-model dynamic mirrors how academic red-team papers actually run their experiments.

Prerequisites

  • Python 3.8+
  • ~3 GB disk (transformers + torch)
  • Optional but useful: Ollama (brew install ollama / curl -fsSL https://ollama.ai/install.sh | sh)

System Dependencies

# macOS
brew install ollama

# Ubuntu
sudo apt install build-essential
curl -fsSL https://ollama.ai/install.sh | sh

Installation

# 1. Clone & enter
cd Homework8-Submission

# 2. (Recommended) virtualenv
python3 -m venv .venv
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure secrets
cp .env.example .env
$EDITOR .env   # paste your ANTHROPIC_API_KEY

# 5. (Optional) Pull a local model for Path B
ollama pull qwen3.5:27b

# 6. Launch
jupyter lab notebooks/

Repository Structure

Homework8-Submission/
├── README.md
├── requirements.txt
├── .env.example
├── notebooks/
│   ├── 00_setup_verification.ipynb
│   ├── 01_environment_setup.ipynb
│   ├── 02_hallucination_detection.ipynb
│   ├── 03_hallucination_mitigation.ipynb
│   ├── 04_jailbreaks.ipynb
│   ├── 05_prompt_injection.ipynb
│   ├── 06_guardrails.ipynb
│   ├── 07_bias_safety_benchmarks.ipynb
│   └── 08_project_integration.ipynb
├── src/
│   ├── __init__.py
│   ├── config.py                   # PATH, CLAUDE_MODEL, OLLAMA_MODEL, hazard taxonomy
│   ├── llm_client.py               # unified Claude / Ollama client
│   ├── cost_tracker.py             # token + cost accounting
│   ├── utils.py                    # save_task_output, append_to_reflection
│   ├── prompt_templates.py         # CO-STAR templates (inherited)
│   ├── hallucination.py            # SelfCheckGPT, FActScore-lite, HHEM / NLI, SimpleQA
│   ├── jailbreaks.py               # canonical tactics, PAIR, Crescendo, StrongREJECT
│   ├── prompt_injection.py         # toy ReAct agent, IPI helpers, Prompt-Guard wrapper
│   ├── guardrails.py               # regex + Llama Guard + OpenAI Mod + Constitutional + chain
│   └── bias_safety.py              # BBQ-mini, BOLD-mini, AILuminate-mini
└── outputs/                        # auto-created by notebooks

Assignment Structure

Notebook Topic Time Key Output
00 Setup verification 5 min
01 Environment setup 20 min path_selection.md, setup_summary.txt
02 Hallucinations I — Detection 30 min hallucination_trigger_results.json, selfcheckgpt_results.json, refusal_calibration.json
03 Hallucinations II — Mitigation (RAG / FActScore / HHEM) 30 min factscore_*.json, grounding_scores.json
04 Jailbreaks — PAIR, Crescendo, StrongREJECT 35 min pair_trace.json, crescendo_turns.json, strongreject_scores.json
05 Prompt Injection — Direct / Indirect / Agentic 35 min ipi_agent_traces.json, tool_result_injection.json
06 Production Guardrails — Layered Chain 30 min guardrail_per_layer.json, guardrail_chain_results.json
07 Bias + AILuminate-12 30 min bbq_mini.json, bold_mini.json, ailuminate_mini.json, ailuminate_per_category.png
08 Project integration 35 min my_project_update.md

Total estimated time: ~3.5 hours.


Deliverables

  1. outputs/homework_reflection.md (70%) — Built incrementally by each notebook's summary cell. Depth of analysis is graded, not length. See outputs/homework_reflection.example.md for a fully-worked example of what an A-grade submission looks like (~380 lines, ~20KB, drawn from a real run of NB01-NB08).
  2. outputs/my_project_update.md (20%) — A short safety plan for one feature of your capstone, written in NB08. See outputs/my_project_update.example.md for a worked example covering an internal HR Q&A bot.
  3. All 9 notebooks executed end-to-end with at least one TODO per topic notebook filled in (10%).

The *.example.md files are reference exemplars — do not modify them. When you run the notebooks, they will populate the non-example files with the template structure; your job is to fill in the TODO reflection blocks and re-run.


Cost Estimates

Path Model Estimated cost
A claude-sonnet-4-6 $2–4 (≈ 50–80 LLM calls × ~300 tokens each)
A (with thinking) claude-sonnet-4-6 + extended thinking $3–6
B qwen3.5:27b (Ollama, local) $0
C Hybrid $1–2 (only judge calls on Claude)

Troubleshooting

Symptom Fix
transformers import is slow / errors on M-series Macs pip install 'torch>=2.1' --index-url https://download.pytorch.org/whl/cpu to get a CPU-only build.
HHEM model fails to download Falls back automatically to MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli. The result interpretation is similar.
Llama-Guard / Prompt-Guard 403 from HuggingFace These are gated models — accept the license on the model page and set HF_TOKEN. NB06 falls back to regex if absent.
Ollama qwen3.5:27b empty response Increase max_tokens; qwen3 spends tokens internally on <think> blocks. Our LLMClient already disables think mode.
Notebook 07 plot doesn't render Confirm %matplotlib inline and rerun the cell.

Bonus Challenges (Optional)

  1. NB04 +5%: Run garak (NVIDIA's LLM vulnerability scanner) against your local Ollama model: pip install garak && garak --model_type ollama --model_name qwen3.5:27b --probes promptinject,encoding. Compare its findings to your PAIR-lite results.
  2. NB05 +5%: Implement an output-side IPI defender: when the agent is about to call send_payment, run a Constitutional classifier on the FULL tool-result trace and block if it contains policy-violating instructions.
  3. NB06 +5%: Add a 5th layer to your GuardrailChain — Anthropic's Constitutional Classifier approach but trained on a domain-specific constitution for your project.
  4. NB07 +10%: Reproduce one row of the MLCommons AILuminate v1.1 official benchmark by running ≥100 prompts per hazard category and computing a per-category grade A–E using AILuminate's scoring rubric.

Resources

Hallucinations

Jailbreaks

Prompt Injection / Agentic

Guardrails

Bias / Fairness

Safety Benchmarks

Red-team scanners


Timeline (7-Day Schedule)

Day Notebooks Focus
1 00, 01 Setup + path selection
2 02 Hallucination detection
3 03 Hallucination mitigation (RAG + FActScore + HHEM)
4 04 Jailbreaks (PAIR / Crescendo / StrongREJECT)
5 05 Prompt injection (direct / indirect / agentic)
6 06, 07 Guardrails + bias / AILuminate
7 08 Project integration + polish reflection

Support

  • Course Discord: #week-8 channel
  • Office Hours: see course calendar

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors