E1: Publish per-model eval leaderboard in README #16

@VarunGitGood

Why

The eval/ directory ships reproducible scenarios for grading investigation quality. Publishing per-model pass rates in the README lets users, and operators choosing a provider, see which LLM actually finds root causes in log data instead of choosing by brand.

Scope

  • Define a pass/fail criterion per eval scenario (e.g. root_cause substring match plus affected_services superset check).
  • Add eval/run_all.py that runs each scenario against each configured provider (OpenAI, Anthropic, Mistral, Gemini, optionally Ollama) and emits a JSON report.
  • Render the latest report as a markdown table in the README.
  • Manual run is acceptable for v1; automating in CI can come later.
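
A minimal sketch of how the pass/fail criterion and the JSON report might fit together. Everything here is an assumption rather than existing code: the names `passes`, `run_all`, and `investigate` are hypothetical, the expected/actual dict layout is invented for illustration, and the superset check is read as "the model's reported services must include every expected service".

```python
import json
from typing import Callable


def passes(expected: dict, actual: dict) -> bool:
    """Hypothetical criterion: root_cause substring match plus affected_services superset."""
    # Case-insensitive substring match on the root cause.
    expected_cause = expected["root_cause"].strip().lower()
    actual_cause = str(actual.get("root_cause", "")).lower()
    cause_ok = expected_cause in actual_cause
    # Assumed direction: reported services must cover all expected services.
    reported = set(actual.get("affected_services", []))
    services_ok = reported >= set(expected["affected_services"])
    return cause_ok and services_ok


def run_all(
    scenarios: dict[str, dict],
    providers: list[str],
    investigate: Callable[[str, str], dict],
) -> dict:
    """Run every scenario against every provider and collect pass/fail results.

    `investigate(provider, scenario_name)` is a placeholder for whatever
    actually calls the LLM and returns its findings as a dict.
    """
    results = []
    for provider in providers:
        for name, expected in scenarios.items():
            actual = investigate(provider, name)
            results.append(
                {"provider": provider, "scenario": name, "passed": passes(expected, actual)}
            )
    return {"results": results}


def to_json(report: dict) -> str:
    """Serialize the report so it can be saved and rendered later."""
    return json.dumps(report, indent=2, sort_keys=True)
```

Passing `investigate` in as a callable keeps the sketch independent of any provider SDK; a real eval/run_all.py would supply the project's own provider clients there.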

Acceptance

  • README has a leaderboard table with at least 3 providers x 2 scenarios.
  • eval/run_all.py is documented in the README so contributors can reproduce the numbers.
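
The leaderboard table could be generated from the JSON report along these lines. This is only a sketch: `render_leaderboard` is a hypothetical name, and the report shape (a `results` list of `provider` / `scenario` / `passed` entries) is an assumed format, not one the repository defines.

```python
def render_leaderboard(report: dict) -> str:
    """Render an assumed report dict as a markdown table, one row per provider."""
    results = report["results"]
    scenarios = sorted({r["scenario"] for r in results})
    providers = sorted({r["provider"] for r in results})
    # Look up pass/fail by (provider, scenario); missing runs count as failures.
    passed = {(r["provider"], r["scenario"]): r["passed"] for r in results}

    header = "| Provider | " + " | ".join(scenarios) + " | Pass rate |"
    separator = "|" + " --- |" * (len(scenarios) + 2)
    rows = []
    for provider in providers:
        cells = ["pass" if passed.get((provider, s)) else "fail" for s in scenarios]
        n_passed = sum(bool(passed.get((provider, s))) for s in scenarios)
        rows.append(
            f"| {provider} | " + " | ".join(cells) + f" | {n_passed}/{len(scenarios)} |"
        )
    return "\n".join([header, separator, *rows])
```

Sorting providers and scenarios keeps the table stable between runs, so README diffs show only changed results.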

Metadata

Assignees

No one assigned

Labels

eval (Evaluation harness and benchmarks), feature (New feature)
