Week 8: Hallucinations, Jailbreaks & Ethical Safeguards in LLMs

Course: Machine Learning Engineer in the Generative AI Era Week: 8 of 10 Topic: Factuality, automated red-teaming, prompt injection, production guardrails, bias, MLCommons AILuminate

Overview

By 2026 the "AI safety" stack has split into five concrete engineering disciplines: factuality / hallucination control, jailbreak resistance, prompt-injection defense (including indirect injection inside retrieved content and tool outputs — the dominant 2026 vector), production guardrails (Llama Prompt-Guard 2, Llama Guard 3/4, ShieldGemma 2, NeMo Guardrails, Anthropic Constitutional Classifiers), and bias / hazard evaluation (BBQ, BOLD, MLCommons AILuminate v1.1). This week treats every one of those as an engineering problem with measurable metrics, not a slide deck of generic principles.

You will hand-roll mini implementations of the most influential 2025–2026 techniques (SelfCheckGPT, FActScore-lite, PAIR, Crescendo, StrongREJECT, a layered guardrail chain, AILuminate-12), run them against your chosen model, then write a safety plan for one feature of your capstone project.

Learning Objectives

Trigger and measure hallucinations with SelfCheckGPT, FActScore-lite, and HHEM-style NLI grounding.
Run automated single-turn red-teaming (PAIR) and multi-turn red-teaming (Crescendo) on a benign target.
Score red-team responses with the StrongREJECT autograder.
Build a minimal ReAct agent and observe direct, indirect, and tool-result prompt injection.
Compose a multi-layer production guardrail chain (regex → Prompt-Guard → Llama-Guard → Constitutional Classifier).
Run BBQ-mini, BOLD-mini, and MLCommons AILuminate-mini on your chosen model.
Translate all of the above into a production safety plan for your capstone feature.

Setup Options

Path A: Claude API (Cloud) — Recommended

Default model: claude-sonnet-4-6
Cost: ~$2–4 for the full assignment
Requires: ANTHROPIC_API_KEY in .env

Path B: Ollama (Local / Free)

Default model: qwen3.5:27b
Requires: ~20 GB RAM, ollama pull qwen3.5:27b
Pedagogical note: open-weights local models often have weaker safety training than frontier APIs, which makes them a more honest target for studying jailbreaks. The contrast is part of the lesson.

Path C: Hybrid (Recommended for the bonus)

Use Claude for the attacker / judge roles and Ollama as the target, so the cross-model dynamic mirrors how academic red-team papers actually run their experiments.

Prerequisites

Python 3.8+
~3 GB disk (transformers + torch)
Optional but useful: Ollama (brew install ollama / curl -fsSL https://ollama.ai/install.sh | sh)

System Dependencies

# macOS
brew install ollama

# Ubuntu
sudo apt install build-essential
curl -fsSL https://ollama.ai/install.sh | sh

Installation

# 1. Clone & enter
cd Homework8-Submission

# 2. (Recommended) virtualenv
python3 -m venv .venv
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure secrets
cp .env.example .env
$EDITOR .env   # paste your ANTHROPIC_API_KEY

# 5. (Optional) Pull a local model for Path B
ollama pull qwen3.5:27b

# 6. Launch
jupyter lab notebooks/

Repository Structure

Homework8-Submission/
├── README.md
├── requirements.txt
├── .env.example
├── notebooks/
│   ├── 00_setup_verification.ipynb
│   ├── 01_environment_setup.ipynb
│   ├── 02_hallucination_detection.ipynb
│   ├── 03_hallucination_mitigation.ipynb
│   ├── 04_jailbreaks.ipynb
│   ├── 05_prompt_injection.ipynb
│   ├── 06_guardrails.ipynb
│   ├── 07_bias_safety_benchmarks.ipynb
│   └── 08_project_integration.ipynb
├── src/
│   ├── __init__.py
│   ├── config.py                   # PATH, CLAUDE_MODEL, OLLAMA_MODEL, hazard taxonomy
│   ├── llm_client.py               # unified Claude / Ollama client
│   ├── cost_tracker.py             # token + cost accounting
│   ├── utils.py                    # save_task_output, append_to_reflection
│   ├── prompt_templates.py         # CO-STAR templates (inherited)
│   ├── hallucination.py            # SelfCheckGPT, FActScore-lite, HHEM / NLI, SimpleQA
│   ├── jailbreaks.py               # canonical tactics, PAIR, Crescendo, StrongREJECT
│   ├── prompt_injection.py         # toy ReAct agent, IPI helpers, Prompt-Guard wrapper
│   ├── guardrails.py               # regex + Llama Guard + OpenAI Mod + Constitutional + chain
│   └── bias_safety.py              # BBQ-mini, BOLD-mini, AILuminate-mini
└── outputs/                        # auto-created by notebooks

Assignment Structure

Notebook	Topic	Time	Key Output
00	Setup verification	5 min	—
01	Environment setup	20 min	`path_selection.md`, `setup_summary.txt`
02	Hallucinations I — Detection	30 min	`hallucination_trigger_results.json`, `selfcheckgpt_results.json`, `refusal_calibration.json`
03	Hallucinations II — Mitigation (RAG / FActScore / HHEM)	30 min	`factscore_*.json`, `grounding_scores.json`
04	Jailbreaks — PAIR, Crescendo, StrongREJECT	35 min	`pair_trace.json`, `crescendo_turns.json`, `strongreject_scores.json`
05	Prompt Injection — Direct / Indirect / Agentic	35 min	`ipi_agent_traces.json`, `tool_result_injection.json`
06	Production Guardrails — Layered Chain	30 min	`guardrail_per_layer.json`, `guardrail_chain_results.json`
07	Bias + AILuminate-12	30 min	`bbq_mini.json`, `bold_mini.json`, `ailuminate_mini.json`, `ailuminate_per_category.png`
08	Project integration	35 min	`my_project_update.md`

Total estimated time: ~3.5 hours.

Deliverables

outputs/homework_reflection.md (70%) — Built incrementally by each notebook's summary cell. Depth of analysis is graded, not length. See outputs/homework_reflection.example.md for a fully-worked example of what an A-grade submission looks like (~380 lines, ~20KB, drawn from a real run of NB01-NB08).
outputs/my_project_update.md (20%) — A short safety plan for one feature of your capstone, written in NB08. See outputs/my_project_update.example.md for a worked example covering an internal HR Q&A bot.
All 9 notebooks executed end-to-end with at least one TODO per topic notebook filled in (10%).

The *.example.md files are reference exemplars — do not modify them. When you run the notebooks, they will populate the non-example files with the template structure; your job is to fill in the TODO reflection blocks and re-run.

Cost Estimates

Path	Model	Estimated cost
A	`claude-sonnet-4-6`	$2–4 (≈ 50–80 LLM calls × ~300 tokens each)
A (with thinking)	`claude-sonnet-4-6` + extended thinking	$3–6
B	`qwen3.5:27b` (Ollama, local)	$0
C	Hybrid	$1–2 (only judge calls on Claude)

Troubleshooting

Symptom	Fix
`transformers` import is slow / errors on M-series Macs	`pip install 'torch>=2.1' --index-url https://download.pytorch.org/whl/cpu` to get a CPU-only build.
HHEM model fails to download	Falls back automatically to `MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli`. The result interpretation is similar.
Llama-Guard / Prompt-Guard 403 from HuggingFace	These are gated models — accept the license on the model page and set `HF_TOKEN`. NB06 falls back to regex if absent.
Ollama `qwen3.5:27b` empty response	Increase `max_tokens`; qwen3 spends tokens internally on `<think>` blocks. Our LLMClient already disables `think` mode.
Notebook 07 plot doesn't render	Confirm `%matplotlib inline` and rerun the cell.

Bonus Challenges (Optional)

NB04 +5%: Run garak (NVIDIA's LLM vulnerability scanner) against your local Ollama model: pip install garak && garak --model_type ollama --model_name qwen3.5:27b --probes promptinject,encoding. Compare its findings to your PAIR-lite results.
NB05 +5%: Implement an output-side IPI defender: when the agent is about to call send_payment, run a Constitutional classifier on the FULL tool-result trace and block if it contains policy-violating instructions.
NB06 +5%: Add a 5th layer to your GuardrailChain — Anthropic's Constitutional Classifier approach but trained on a domain-specific constitution for your project.
NB07 +10%: Reproduce one row of the MLCommons AILuminate v1.1 official benchmark by running ≥100 prompts per hazard category and computing a per-category grade A–E using AILuminate's scoring rubric.

Resources

Hallucinations

Vectara Hallucination Leaderboard — HHEM-2.3 with April 2026 update on a 7,700-doc harder corpus.
SelfCheckGPT (Manakul et al., EMNLP 2023)
FActScore (Min et al., 2023)
Semantic Entropy for Hallucination Detection (Farquhar et al., Nature 2024)
OpenAI SimpleQA

Jailbreaks

Prompt Injection / Agentic

OWASP Top 10 for LLM Apps 2025 — LLM-01 = prompt injection.
Google Security Blog: AI threats in the wild (Apr 2026)
Palo Alto Unit 42 — MCP tool poisoning
Microsoft PyRIT

Guardrails

Bias / Fairness

Safety Benchmarks

Red-team scanners

NVIDIA garak — pip install garak
Microsoft PyRIT — pip install pyrit

Timeline (7-Day Schedule)

Day	Notebooks	Focus
1	00, 01	Setup + path selection
2	02	Hallucination detection
3	03	Hallucination mitigation (RAG + FActScore + HHEM)
4	04	Jailbreaks (PAIR / Crescendo / StrongREJECT)
5	05	Prompt injection (direct / indirect / agentic)
6	06, 07	Guardrails + bias / AILuminate
7	08	Project integration + polish reflection

Support

Course Discord: #week-8 channel
Office Hours: see course calendar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Week 8: Hallucinations, Jailbreaks & Ethical Safeguards in LLMs

Overview

Learning Objectives

Setup Options

Path A: Claude API (Cloud) — Recommended

Path B: Ollama (Local / Free)

Path C: Hybrid (Recommended for the bonus)

Prerequisites

System Dependencies

Installation

Repository Structure

Assignment Structure

Deliverables

Cost Estimates

Troubleshooting

Bonus Challenges (Optional)

Resources

Hallucinations

Jailbreaks

Prompt Injection / Agentic

Guardrails

Bias / Fairness

Safety Benchmarks

Red-team scanners

Timeline (7-Day Schedule)

Support

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
notebooks		notebooks
outputs		outputs
src		src
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Week 8: Hallucinations, Jailbreaks & Ethical Safeguards in LLMs

Overview

Learning Objectives

Setup Options

Path A: Claude API (Cloud) — Recommended

Path B: Ollama (Local / Free)

Path C: Hybrid (Recommended for the bonus)

Prerequisites

System Dependencies

Installation

Repository Structure

Assignment Structure

Deliverables

Cost Estimates

Troubleshooting

Bonus Challenges (Optional)

Resources

Hallucinations

Jailbreaks

Prompt Injection / Agentic

Guardrails

Bias / Fairness

Safety Benchmarks

Red-team scanners

Timeline (7-Day Schedule)

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages