Why
The eval/ directory ships reproducible scenarios for grading investigation quality. Publishing model-by-model pass rates in the README helps users (and operators picking a provider) see which LLM actually finds root causes on log data, rather than choosing by brand.
Scope
- Define a pass/fail criterion per eval scenario (e.g. `root_cause` substring match plus `affected_services` superset check).
- Add `eval/run_all.py` that runs each scenario against each configured provider (OpenAI, Anthropic, Mistral, Gemini, optionally Ollama) and emits a JSON report.
- Render the latest report as a markdown table in the README.
- Manual run is acceptable for v1; automating in CI can come later.
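
One possible shape for the pass/fail criterion, as a minimal sketch rather than a settled design. The `Expected` class and `passes` function are hypothetical names. The sketch also assumes that the substring match is case-insensitive and that "superset" means the services reported by the model must include every expected service; both choices are open for discussion.

```python
from dataclasses import dataclass, field


@dataclass
class Expected:
    # Hypothetical per-scenario expectations; not an existing type in eval/.
    root_cause: str
    affected_services: set[str] = field(default_factory=set)


def passes(expected: Expected, result: dict) -> bool:
    """Return True when a model's result satisfies both checks."""
    # Substring match (assumed case-insensitive): the expected phrase
    # must appear somewhere in the reported root cause.
    reported_cause = str(result.get("root_cause", "")).lower()
    cause_ok = expected.root_cause.lower() in reported_cause

    # Superset check (assumed direction): the reported services must
    # include every expected service; extra services are tolerated.
    reported_services = set(result.get("affected_services", []))
    services_ok = reported_services >= expected.affected_services

    return cause_ok and services_ok
```

A missing `root_cause` or `affected_services` key counts as a fail here, so a provider that returns malformed output is graded the same as one that returns a wrong answer.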
Acceptance
- README has a leaderboard table with at least 3 providers x 2 scenarios.
- `eval/run_all.py` is documented in the README so contributors can reproduce the numbers.
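
For the README table, a rough sketch of the rendering step follows. It assumes the JSON report is a flat list of objects with `provider`, `scenario`, and `passed` keys; that schema and the `render_leaderboard` name are illustrative assumptions, not something the repository defines yet.

```python
def render_leaderboard(results: list[dict]) -> str:
    """Render per-provider, per-scenario outcomes as a markdown table."""
    providers = sorted({r["provider"] for r in results})
    scenarios = sorted({r["scenario"] for r in results})
    # Index outcomes by (provider, scenario) for direct lookup.
    outcome = {(r["provider"], r["scenario"]): r["passed"] for r in results}

    lines = [
        "| Provider | " + " | ".join(scenarios) + " |",
        "| --- | " + " | ".join("---" for _ in scenarios) + " |",
    ]
    for provider in providers:
        cells = []
        for scenario in scenarios:
            passed = outcome.get((provider, scenario))
            # A missing entry means the pair was not run, which is
            # reported separately from a fail.
            if passed is None:
                cells.append("n/a")
            else:
                cells.append("pass" if passed else "fail")
        lines.append(f"| {provider} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Keeping "n/a" distinct from "fail" avoids penalizing a provider for a scenario that was skipped, for example when Ollama is not configured.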