A self-taught engineer's structured map of the AI evaluation field.
Updated May 22, 2026
🧪 Evaluation framework for testing Claude Code skills at scale. Run regression suites across model versions.
Daily puzzle for AI agents.
Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evaluates.
Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.
Eval toolkit for LLM-as-judge calibration: Cohen's kappa, Kendall's tau, regression gates.
The Eval Codex — Claude-tutored AI-eval learning engine. Build eval expertise via guided practice.
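For readers unfamiliar with the calibration statistics named in the LLM-as-judge toolkit entry above, here is a minimal sketch of Cohen's kappa used as a regression gate. It is not taken from any of the listed repositories: the function name `cohens_kappa`, the sample labels, and the `MIN_KAPPA` threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each rater's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical data: human reference labels vs. LLM-judge labels.
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human, judge)

# Hypothetical gate: block a judge whose agreement with humans is too low.
MIN_KAPPA = 0.6
gate_passed = kappa >= MIN_KAPPA
print(f"kappa={kappa:.3f} gate_passed={gate_passed}")
```

In this sample the two raters agree on 4 of 6 items, but half of that agreement is expected by chance, so kappa comes out at roughly 0.33 and the assumed gate fails. A real toolkit would likely also report confidence intervals and handle ordinal scores, for which a rank statistic such as Kendall's tau is the more natural choice.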