Course: Machine Learning Engineer in the Generative AI Era Week: 8 of 10 Topic: Factuality, automated red-teaming, prompt injection, production guardrails, bias, MLCommons AILuminate
By 2026 the "AI safety" stack has split into five concrete engineering disciplines: factuality / hallucination control, jailbreak resistance, prompt-injection defense (including indirect injection inside retrieved content and tool outputs — the dominant 2026 vector), production guardrails (Llama Prompt-Guard 2, Llama Guard 3/4, ShieldGemma 2, NeMo Guardrails, Anthropic Constitutional Classifiers), and bias / hazard evaluation (BBQ, BOLD, MLCommons AILuminate v1.1). This week treats every one of those as an engineering problem with measurable metrics, not a slide deck of generic principles.
You will hand-roll mini implementations of the most influential 2025–2026 techniques (SelfCheckGPT, FActScore-lite, PAIR, Crescendo, StrongREJECT, a layered guardrail chain, AILuminate-12), run them against your chosen model, then write a safety plan for one feature of your capstone project.
- Trigger and measure hallucinations with SelfCheckGPT, FActScore-lite, and HHEM-style NLI grounding.
- Run automated single-turn red-teaming (PAIR) and multi-turn red-teaming (Crescendo) on a benign target.
- Score red-team responses with the StrongREJECT autograder.
- Build a minimal ReAct agent and observe direct, indirect, and tool-result prompt injection.
- Compose a multi-layer production guardrail chain (regex → Prompt-Guard → Llama-Guard → Constitutional Classifier).
- Run BBQ-mini, BOLD-mini, and MLCommons AILuminate-mini on your chosen model.
- Translate all of the above into a production safety plan for your capstone feature.
- Default model:
claude-sonnet-4-6 - Cost: ~$2–4 for the full assignment
- Requires:
ANTHROPIC_API_KEYin.env
- Default model:
qwen3.5:27b - Requires: ~20 GB RAM,
ollama pull qwen3.5:27b - Pedagogical note: open-weights local models often have weaker safety training than frontier APIs, which makes them a more honest target for studying jailbreaks. The contrast is part of the lesson.
- Use Claude for the attacker / judge roles and Ollama as the target, so the cross-model dynamic mirrors how academic red-team papers actually run their experiments.
- Python 3.8+
- ~3 GB disk (transformers + torch)
- Optional but useful: Ollama (
brew install ollama/curl -fsSL https://ollama.ai/install.sh | sh)
# macOS
brew install ollama
# Ubuntu
sudo apt install build-essential
curl -fsSL https://ollama.ai/install.sh | sh# 1. Clone & enter
cd Homework8-Submission
# 2. (Recommended) virtualenv
python3 -m venv .venv
source .venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure secrets
cp .env.example .env
$EDITOR .env # paste your ANTHROPIC_API_KEY
# 5. (Optional) Pull a local model for Path B
ollama pull qwen3.5:27b
# 6. Launch
jupyter lab notebooks/Homework8-Submission/
├── README.md
├── requirements.txt
├── .env.example
├── notebooks/
│ ├── 00_setup_verification.ipynb
│ ├── 01_environment_setup.ipynb
│ ├── 02_hallucination_detection.ipynb
│ ├── 03_hallucination_mitigation.ipynb
│ ├── 04_jailbreaks.ipynb
│ ├── 05_prompt_injection.ipynb
│ ├── 06_guardrails.ipynb
│ ├── 07_bias_safety_benchmarks.ipynb
│ └── 08_project_integration.ipynb
├── src/
│ ├── __init__.py
│ ├── config.py # PATH, CLAUDE_MODEL, OLLAMA_MODEL, hazard taxonomy
│ ├── llm_client.py # unified Claude / Ollama client
│ ├── cost_tracker.py # token + cost accounting
│ ├── utils.py # save_task_output, append_to_reflection
│ ├── prompt_templates.py # CO-STAR templates (inherited)
│ ├── hallucination.py # SelfCheckGPT, FActScore-lite, HHEM / NLI, SimpleQA
│ ├── jailbreaks.py # canonical tactics, PAIR, Crescendo, StrongREJECT
│ ├── prompt_injection.py # toy ReAct agent, IPI helpers, Prompt-Guard wrapper
│ ├── guardrails.py # regex + Llama Guard + OpenAI Mod + Constitutional + chain
│ └── bias_safety.py # BBQ-mini, BOLD-mini, AILuminate-mini
└── outputs/ # auto-created by notebooks
| Notebook | Topic | Time | Key Output |
|---|---|---|---|
| 00 | Setup verification | 5 min | — |
| 01 | Environment setup | 20 min | path_selection.md, setup_summary.txt |
| 02 | Hallucinations I — Detection | 30 min | hallucination_trigger_results.json, selfcheckgpt_results.json, refusal_calibration.json |
| 03 | Hallucinations II — Mitigation (RAG / FActScore / HHEM) | 30 min | factscore_*.json, grounding_scores.json |
| 04 | Jailbreaks — PAIR, Crescendo, StrongREJECT | 35 min | pair_trace.json, crescendo_turns.json, strongreject_scores.json |
| 05 | Prompt Injection — Direct / Indirect / Agentic | 35 min | ipi_agent_traces.json, tool_result_injection.json |
| 06 | Production Guardrails — Layered Chain | 30 min | guardrail_per_layer.json, guardrail_chain_results.json |
| 07 | Bias + AILuminate-12 | 30 min | bbq_mini.json, bold_mini.json, ailuminate_mini.json, ailuminate_per_category.png |
| 08 | Project integration | 35 min | my_project_update.md |
Total estimated time: ~3.5 hours.
outputs/homework_reflection.md(70%) — Built incrementally by each notebook's summary cell. Depth of analysis is graded, not length. Seeoutputs/homework_reflection.example.mdfor a fully-worked example of what an A-grade submission looks like (~380 lines, ~20KB, drawn from a real run of NB01-NB08).outputs/my_project_update.md(20%) — A short safety plan for one feature of your capstone, written in NB08. Seeoutputs/my_project_update.example.mdfor a worked example covering an internal HR Q&A bot.- All 9 notebooks executed end-to-end with at least one TODO per topic notebook filled in (10%).
The *.example.md files are reference exemplars — do not modify them. When you run the notebooks, they will populate the non-example files with the template structure; your job is to fill in the TODO reflection blocks and re-run.
| Path | Model | Estimated cost |
|---|---|---|
| A | claude-sonnet-4-6 |
$2–4 (≈ 50–80 LLM calls × ~300 tokens each) |
| A (with thinking) | claude-sonnet-4-6 + extended thinking |
$3–6 |
| B | qwen3.5:27b (Ollama, local) |
$0 |
| C | Hybrid | $1–2 (only judge calls on Claude) |
| Symptom | Fix |
|---|---|
transformers import is slow / errors on M-series Macs |
pip install 'torch>=2.1' --index-url https://download.pytorch.org/whl/cpu to get a CPU-only build. |
| HHEM model fails to download | Falls back automatically to MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli. The result interpretation is similar. |
| Llama-Guard / Prompt-Guard 403 from HuggingFace | These are gated models — accept the license on the model page and set HF_TOKEN. NB06 falls back to regex if absent. |
Ollama qwen3.5:27b empty response |
Increase max_tokens; qwen3 spends tokens internally on <think> blocks. Our LLMClient already disables think mode. |
| Notebook 07 plot doesn't render | Confirm %matplotlib inline and rerun the cell. |
- NB04 +5%: Run
garak(NVIDIA's LLM vulnerability scanner) against your local Ollama model:pip install garak && garak --model_type ollama --model_name qwen3.5:27b --probes promptinject,encoding. Compare its findings to your PAIR-lite results. - NB05 +5%: Implement an output-side IPI defender: when the agent is about to call
send_payment, run a Constitutional classifier on the FULL tool-result trace and block if it contains policy-violating instructions. - NB06 +5%: Add a 5th layer to your GuardrailChain — Anthropic's Constitutional Classifier approach but trained on a domain-specific constitution for your project.
- NB07 +10%: Reproduce one row of the MLCommons AILuminate v1.1 official benchmark by running ≥100 prompts per hazard category and computing a per-category grade A–E using AILuminate's scoring rubric.
- Vectara Hallucination Leaderboard — HHEM-2.3 with April 2026 update on a 7,700-doc harder corpus.
- SelfCheckGPT (Manakul et al., EMNLP 2023)
- FActScore (Min et al., 2023)
- Semantic Entropy for Hallucination Detection (Farquhar et al., Nature 2024)
- OpenAI SimpleQA
- PAIR (Chao et al., 2023)
- Crescendo (Russinovich et al., MSFT 2024)
- StrongREJECT (Souly et al., 2024)
- JailbreakBench (Chao et al., NeurIPS 2024)
- HarmBench (Mazeika et al., 2024)
- Anthropic Constitutional Classifiers
- Anthropic Constitutional Classifiers++ (Apr 2025)
- OWASP Top 10 for LLM Apps 2025 — LLM-01 = prompt injection.
- Google Security Blog: AI threats in the wild (Apr 2026)
- Palo Alto Unit 42 — MCP tool poisoning
- Microsoft PyRIT
- Llama Guard 3
- Llama Prompt-Guard 2
- ShieldGemma 2
- NVIDIA NeMo Guardrails
- Meta LlamaFirewall (arXiv:2505.03574)
- NVIDIA garak —
pip install garak - Microsoft PyRIT —
pip install pyrit
| Day | Notebooks | Focus |
|---|---|---|
| 1 | 00, 01 | Setup + path selection |
| 2 | 02 | Hallucination detection |
| 3 | 03 | Hallucination mitigation (RAG + FActScore + HHEM) |
| 4 | 04 | Jailbreaks (PAIR / Crescendo / StrongREJECT) |
| 5 | 05 | Prompt injection (direct / indirect / agentic) |
| 6 | 06, 07 | Guardrails + bias / AILuminate |
| 7 | 08 | Project integration + polish reflection |
- Course Discord:
#week-8channel - Office Hours: see course calendar