# E1: Publish per-model eval leaderboard in README

## Why

The `eval/` directory ships reproducible scenarios for grading investigation quality. Publishing per-model pass rates in the README lets users, and operators picking a provider, see which LLM actually finds root causes in log data instead of choosing by brand.

## Scope

- Define a pass/fail criterion per eval scenario (e.g. `root_cause` substring match plus `affected_services` superset check).
- Add `eval/run_all.py` that runs each scenario against each configured provider (OpenAI, Anthropic, Mistral, Gemini, optionally Ollama) and emits a JSON report.
- Render the latest report as a markdown table in the README.
- Manual run is acceptable for v1; automating in CI can come later.
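
The first two bullets could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: only the `root_cause` and `affected_services` field names come from the text above. The function names `scenario_passes` and `run_all`, the `logs`/`expected` scenario layout, the case-insensitive match, and the stub providers are all assumptions, and the stubs stand in for real provider clients.

```python
import json
from typing import Any, Callable

Answer = dict[str, Any]


def scenario_passes(expected: Answer, actual: Answer) -> bool:
    """Hypothetical pass/fail criterion for one scenario.

    Passes when the expected root-cause text appears (case-insensitively)
    in the model's root_cause, and the model's affected_services is a
    superset of the expected services.
    """
    expected_cause = expected["root_cause"].lower()
    actual_cause = str(actual.get("root_cause", "")).lower()
    cause_ok = expected_cause in actual_cause

    expected_services = set(expected["affected_services"])
    actual_services = set(actual.get("affected_services") or [])
    services_ok = actual_services >= expected_services

    return cause_ok and services_ok


def run_all(
    scenarios: dict[str, dict[str, Any]],
    providers: dict[str, Callable[[str], Answer]],
) -> list[dict[str, Any]]:
    """Run every scenario against every provider and collect pass/fail records."""
    results = []
    for provider_name, investigate in providers.items():
        for scenario_name, scenario in scenarios.items():
            answer = investigate(scenario["logs"])
            results.append({
                "provider": provider_name,
                "scenario": scenario_name,
                "passed": scenario_passes(scenario["expected"], answer),
            })
    return results


# Made-up scenario and stub providers, for illustration only.
scenarios = {
    "scenario-1": {
        "logs": "...",
        "expected": {
            "root_cause": "connection pool exhausted",
            "affected_services": ["checkout", "payments"],
        },
    },
}
providers = {
    "stub-good": lambda logs: {
        "root_cause": "The DB connection pool exhausted under load.",
        "affected_services": ["checkout", "payments", "cart"],
    },
    "stub-bad": lambda logs: {
        "root_cause": "Disk full",
        "affected_services": ["checkout"],
    },
}

report = run_all(scenarios, providers)
print(json.dumps(report, indent=2))
```

Passing providers in as plain callables keeps the grading logic testable without network access; a real runner would wrap each provider's client behind the same call shape.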

## Acceptance

- README has a leaderboard table with at least 3 providers x 2 scenarios.
- `eval/run_all.py` is documented in the README so contributors can reproduce the numbers.
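
As a rough sketch of how the acceptance criteria could be checked, the snippet below renders a report as a markdown table and tests the "3 providers x 2 scenarios" threshold. The report shape (a list of `provider` / `scenario` / `passed` records), the function names `render_leaderboard` and `meets_acceptance`, and the placeholder provider and scenario names are all assumptions; the sample values are not real eval results.

```python
from typing import Any


def render_leaderboard(results: list[dict[str, Any]]) -> str:
    """Render pass/fail records as a markdown table, one row per provider."""
    scenarios = sorted({r["scenario"] for r in results})
    providers = sorted({r["provider"] for r in results})
    outcome = {(r["provider"], r["scenario"]): r["passed"] for r in results}

    header = "| Provider | " + " | ".join(scenarios) + " | Pass rate |"
    divider = "|" + " --- |" * (len(scenarios) + 2)
    rows = [header, divider]
    for provider in providers:
        cells = []
        passed = 0
        total = 0
        for scenario in scenarios:
            result = outcome.get((provider, scenario))
            if result is None:
                # Scenario was not run for this provider.
                cells.append("n/a")
                continue
            total += 1
            passed += 1 if result else 0
            cells.append("pass" if result else "fail")
        rows.append(f"| {provider} | " + " | ".join(cells) + f" | {passed}/{total} |")
    return "\n".join(rows)


def meets_acceptance(results: list[dict[str, Any]]) -> bool:
    """Check the report covers at least 3 providers and 2 scenarios."""
    providers = {r["provider"] for r in results}
    scenarios = {r["scenario"] for r in results}
    return len(providers) >= 3 and len(scenarios) >= 2


# Placeholder records, not real eval results.
sample = [
    {"provider": "provider-a", "scenario": "scenario-1", "passed": True},
    {"provider": "provider-a", "scenario": "scenario-2", "passed": False},
    {"provider": "provider-b", "scenario": "scenario-1", "passed": True},
    {"provider": "provider-b", "scenario": "scenario-2", "passed": True},
    {"provider": "provider-c", "scenario": "scenario-1", "passed": False},
]

table = render_leaderboard(sample)
print(table)
print("meets acceptance:", meets_acceptance(sample))
```

Showing a per-provider `passed/total` column alongside the per-scenario cells keeps the table honest when a provider skipped a scenario, since the denominator makes the gap visible.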
