linny006/agent-eval-harness

Agent Eval Harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues


โญ Star this repo to bookmark โ€” fresh data every 15 minutes

English · 中文 · 日本語 · 한국어 · Español · Português


💡 What is this?

A standardized benchmark suite that runs coding agents against live, real-world GitHub issues with reproduction steps. Unlike static academic benchmarks, it outputs a weekly-updated public leaderboard, enabling developers to compare agents like OpenCode, Codex, and Claude Code in realistic scenarios.

This list is auto-updated every 15 minutes by a GitHub Actions cron. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so you can rely on what you see being current.


📋 Current Items

⏰ Last updated: 2026-05-26 16:30 UTC

Data source: GitHub Search API

The table below is rewritten on every cron tick. Star the repo to bookmark.

| # | Name | ⭐ | Lang | Updated | Description |
|---|------|----|------|---------|-------------|
| 1 | Kondwani10/Origin-Continuum | 0 | - | 2026-05-26 | Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation |
| 2 | Arize-ai/phoenix | 9847 | Python | 2026-05-26 | AI Observability & Evaluation |
| 3 | Kamixon131/claude-config | 0 | - | 2026-05-26 | ⚙️ Enhance Claude Code with a powerful configuration framework that features specialized agents and workflows for effici |
| 4 | Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer- | 0 | - | 2026-05-26 | ♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou |
| 5 | bhavya7995/AI_governance | 1 | PowerShell | 2026-05-26 | 🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a |
| 6 | promptfoo/promptfoo | 21618 | TypeScript | 2026-05-26 | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C |
| 7 | Phinchanbora/llm-evaluation | 0 | Python | 2026-05-26 | 🎯 Benchmark LLMs effectively with over 10 tests and 108,000 real questions to assess model performance and enhance AI ev |
| 8 | penpoen/llm-SugarScape | 1 | Python | 2026-05-26 | Explore AI behaviors in a Sugarscape simulation, revealing insights into cooperation and survival instincts using Grok |
| 9 | saddled-panicattack529/idea-evaluation-pipeline | 0 | - | 2026-05-26 | Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist |
| 10 | NoesisVision/nasde-toolkit | 10 | Python | 2026-05-26 | CLI for benchmarks & evals of AI coding agents - on tasks you already understand, using your Claude / Codex / Gemini ind |
| 11 | verifywise-ai/verifywise | 291 | TypeScript | 2026-05-26 | Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo |
| 12 | rogue-socket/focusgroup | 0 | Python | 2026-05-26 | Persona-driven dynamic testing for conversational AI products. Focus groups for your agents. |
| 13 | MyForgeLabs/myforge-vault-1111 | 0 | Python | 2026-05-26 | An open-source 8-axis methodology + working tooling for evolving a personal Obsidian-vault into a self-improving knowled |
| 14 | isoc-il-labs/agent-reliability-eval | 0 | HTML | 2026-05-26 | Framework for evaluating the accuracy & reliability of news-credibility agents - QA'd EN/HE test units + harness |
| 15 | jeremylongshore/j-rig-skill-binary-eval | 0 | TypeScript | 2026-05-26 | Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e |
| 16 | reaatech/agent-eval-harness | 0 | TypeScript | 2026-05-25 | End-to-end agent evaluation - trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w |
| 17 | truera/trulens | 3345 | Python | 2026-05-25 | Evaluation and Tracking for LLM Experiments and AI Agents |
| 18 | homemade-software-inc/completion-kit | 1 | Ruby | 2026-05-25 | Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c |
| 19 | Giskard-AI/giskard-oss | 5384 | Python | 2026-05-25 | 🐢 Open-Source Evaluation & Testing library for LLM Agents |
| 20 | matt-rachlin/agent-eval-harness | 0 | Python | 2026-05-24 | Online evals on live agent traces. Open-source, self-hostable, OpenTelemetry-native eval harness with regression detecti |
| 21 | harnexa/nexa-gauge | 36 | Python | 2026-05-26 | An graph-eval framework for LLM's |
| 22 | reaatech/rag-eval-pack | 0 | TypeScript | 2026-05-25 | RAG evaluation toolkit - faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with |
| 23 | Mike-E-Log/learn-ai-eval | 0 | HTML | 2026-05-22 | The Eval Codex - Claude-tutored AI-eval learning engine. Build eval expertise via guided practice. |
| 24 | chquandogong/mission-spec | 0 | TypeScript | 2026-05-25 | Mission Spec - a task contract layer for AI agent workflows |
| 25 | sanya2025/edututor-eval | 0 | Python | 2026-05-21 | A lightweight evaluation framework for AI tutoring responses, built for education-focused LLM systems |
| 26 | reaatech/hybrid-rag-qdrant | 1 | TypeScript | 2026-05-20 | Serious hybrid RAG reference - vector + BM25 + reranker over Qdrant, chunking strategies benchmarked, eval set included, |
| 27 | reaatech/classifier-evals | 0 | TypeScript | 2026-05-20 | Offline classifier evaluation harness - dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio |
| 28 | Alexanderk30/context-override-resistance | 0 | Python | 2026-05-19 | RL-style eval measuring intent/action divergence in frontier agents: model acknowledges a correction, then acts on the s |
| 29 | melody-ling-L/eval-resume | 0 | HTML | 2026-05-19 | The first Chinese-language LLM benchmark focused on "honesty in résumé rewriting": 20 real anonymized résumés × 3 models × 4 scoring dimensions |
| 30 | melody-ling-L/judgebuddy | 0 | HTML | 2026-05-19 | Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment. |
| 31 | GiuseppeSp/n8n-customer-interview-synthesizer | 0 | - | 2026-05-19 | Multi-agent customer-interview synthesis pipeline in n8n with LLM-as-judge eval, Slack human-in-the-loop approval, and d |
| 32 | ajmeese7/local-llms | 1 | Python | 2026-05-18 | Use local Large Language Models for production use cases, and perform benchmarking for task-specific performance evaluat |
| 33 | monkeyin92/voice-agent-testops | 0 | TypeScript | 2026-05-18 | Regression testing for voice agents: scripted conversations, safety assertions, CI-ready reports. |
| 34 | gmitt98/fieldtest | 0 | Python | 2026-05-16 | LLM evaluation framework - define what correct, well-formed, and safe means before you measure |
| 35 | verifywise-ai/plugin-marketplace | 3 | TypeScript | 2026-05-15 | VerifyWise AI Governance Plugin Marketplace |
| 36 | AI-QL/tuui | 1146 | TypeScript | 2026-05-14 | A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context |
| 37 | prompt-foundry/typescript-sdk | 6 | TypeScript | 2026-05-13 | The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS. |
| 38 | prompt-foundry/python-sdk | 8 | Python | 2026-05-13 | The prompt engineering, prompt management, and prompt evaluation tool for Python |
| 39 | mizcausevic-dev/agent-eval-arena | 0 | TypeScript | 2026-05-12 | Agent and LLM evaluation harness - golden datasets, multi-scorer execution, regression detection across model versions, |
| 40 | fastxyz/skill-optimizer | 57 | TypeScript | 2026-05-26 | Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs |
| 41 | Ruthwik-Data/mechanictrust | 0 | - | 2026-05-11 | AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair. |
| 42 | SAY-5/eval-observability | 0 | Python | 2026-05-10 | Python LLM eval framework with full OTel tracing, structured logs, and daily Welch's-t-test regression detection persist |
| 43 | Ruthwik-Data/finrag-eval | 0 | Python | 2026-05-10 | RAG eval pipeline on Apple's FY 2024 10-K - found confident hallucinations, filed a metric-level bug in DeepEval, and bu |
| 44 | Ruthwik-Data/self-improving-prompt-agent | 0 | Python | 2026-05-10 | Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 → |
| 45 | SAY-5/genai-eval | 0 | Python | 2026-05-07 | Multilingual GenAI evaluation service across 5 task types and 3 languages, with regression-trend dashboard |
| 46 | HumphreySun98/repoagentbench | 31 | Python | 2026-04-30 | SWE-bench for your codebase - mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: cla |
| 47 | YagneshKhamar/phasio | 0 | TypeScript | 2026-04-29 | Jest-style testing for LLM prompts. Version prompts, run evals across OpenAI and Anthropic, catch regressions in CI. |
| 48 | lehigh-university-libraries/htr | 2 | Go | 2026-05-22 | Handwritten Text Recognition llm eval tool |
| 49 | JSLEEKR/evaltrack | 0 | TypeScript | 2026-04-24 | Local-first regression and trend CLI for promptfoo eval histories - the git log + git diff for LLM eval outputs. |
| 50 | izam-mohammed/ragrank | 47 | Python | 2026-04-21 | 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, an |
| 51 | arthursoares/openclaw-llm-bench | 2 | Python | 2026-04-11 | A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-j |
| 52 | YuanyangLiNEU/mini-claude | 0 | TypeScript | 2026-04-11 | A minimal Claude Code built from scratch - agent loop, tool calling, web search, permissions, and a black-box LLM eval h |
| 53 | webrenew/models-dilemma | 4 | TypeScript | 2026-04-08 | The Prisoner's Dilemma played by LLMs |
| 54 | AdirAmsalem/openclaw-eval | 0 | Python | 2026-03-31 | Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, la |
| 55 | Data-ScienceTech/forcefield | 1 | Python | 2026-03-30 | ForceField Python SDK -- AI security in 3 lines of code. Prompt injection detection, PII redaction, security evals, tool |
| 56 | alyssadata/continuity-keys | 1 | - | 2026-03-29 | Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen) |
| 57 | klausners/prompt-optimizer | 0 | TypeScript | 2026-03-26 | Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evalua |
| 58 | Aysnc-Labs/llm-eval | 1 | PHP | 2026-03-20 | A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correc |
| 59 | asarnaout/veritail | 6 | Python | 2026-03-15 | LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues |
| 60 | vola-trebla/llm-infrastructure | 0 | - | 2026-03-14 | Full-stack AI infrastructure - 5 projects from data ingestion to autonomous agents |
| 61 | whitecircle/circle-guard-bench | 69 | Python | 2026-03-07 | First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (g |
| 62 | tpertner/squeeze | 5 | Python | 2026-03-01 | Squeeze your model with pressure prompts to see if its behavior leaks. |
| 63 | grigio/llm-eval-simple | 68 | Python | 2026-02-28 | llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection |
| 64 | QuesmaOrg/BinaryAudit | 91 | Shell | 2026-02-27 | An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries. |
| 65 | paradime-io/dbt-llm-evals | 27 | Python | 2026-02-10 | The warehouse-native LLM evaluation package for dbt™ - monitor AI quality without data egress |
| 66 | Striveworks/valor | 41 | Python | 2026-02-09 | Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models. |
| 67 | TADSTech/llm-output-grader | 0 | Python | 2026-01-24 | systematic llm grading |
| 68 | 3ahmood/Agentic-Author-CrewAI | 1 | Jupyter Notebook | 2026-01-15 | On device autonomous research and content writing using open-sourced LLMs and Crew AI. |
| 69 | Supahands/llm-comparison-backend | 22 | Python | 2026-01-13 | This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be r |
| 70 | thedataquarry/structured-outputs | 28 | Python | 2025-12-23 | Structured output benchmarks comparing DSPy and BAML with different LLMs |
| 71 | higuseonhye/worldsim-eval | 0 | - | 2025-12-20 | Evaluate AI agents by simulating world-level consequences. |
| 72 | yukincom/llm-SugarScape | 6 | Python | 2025-11-28 | Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in |
| 73 | IAAR-Shanghai/GuessArena | 10 | Python | 2025-11-15 | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Re |
| 74 | iltutishrak/eval-metrics-lab | 0 | Python | 2025-11-10 | Text-only playground for evaluating reasoning model outputs with mock accuracy, hallucination, and trust metrics - runs |
| 75 | artefactop/promptdev | 2 | Python | 2025-09-22 | A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers. |
| 76 | multinear/multinear | 45 | Python | 2025-09-02 | Develop reliable AI apps |
| 77 | attogram/ollama-multirun | 16 | Shell | 2025-08-30 | Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance stat |
| 78 | khoj-ai/llm-coup | 12 | TypeScript | 2025-08-18 | Let LLMs play coup with each other and see who's the best at deception & strategy |
| 79 | jaaack-wang/multi-problem-eval-llm | 3 | Jupyter Notebook | 2025-08-08 | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities |
| 80 | alan-turing-institute/prompto | 37 | Python | 2025-07-18 | An open source library for asynchronous querying of LLM endpoints |
| 81 | athina-ai/athina-evals | 300 | Python | 2025-06-06 | Python SDK for running evaluations on LLM generated responses |
| 82 | amplifying-ai/ai-product-bench | 22 | HTML | 2025-05-27 | |
| 83 | regankight/mirror-model-eval-tests | 0 | - | 2025-05-17 | LLM behavior QA: tone collapse, false consent, and reroute logic scoring. |
| 84 | pyladiesams/eval-llm-based-apps-jan2025 | 8 | Jupyter Notebook | 2025-05-06 | Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundatio |
| 85 | daqh/llm-eval | 0 | Python | 2025-03-24 | This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational co |
| 86 | parea-ai/parea-sdk-py | 82 | Python | 2025-02-13 | Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) |
| 87 | parea-ai/parea-sdk-ts | 4 | TypeScript | 2025-01-17 | TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) |
| 88 | yukinagae/genkitx-promptfoo | 7 | TypeScript | 2025-01-03 | Community Plugin for Genkit to use Promptfoo |
| 89 | honeyhiveai/realign | 19 | Python | 2024-12-04 | Realign is a testing and simulation framework for AI applications. |
| 90 | harlev/eva-l | 5 | Python | 2024-11-27 | LLM Evaluation Framework |
| 91 | genia-dev/vibraniumdome | 27 | Python | 2024-10-28 | LLM Security Platform. |
| 92 | Human-Centric-Machine-Learning/prediction-powered-ranking | 9 | Jupyter Notebook | 2024-10-28 | Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024. |
| 93 | yuzu-ai/ShinRakuda | 3 | Python | 2024-09-17 | Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering resear |
| 94 | yukinagae/genkit-promptfoo-sample | 0 | TypeScript | 2024-09-11 | Sample implementation demonstrating how to use Firebase Genkit with Promptfoo |
| 95 | yukinagae/promptfoo-sample | 2 | - | 2024-09-10 | Sample project demonstrates how to use Promptfoo, a test framework for evaluating the output of generative AI models |
| 96 | uptrain-ai/uptrain | 2350 | Python | 2024-08-18 | UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ |
| 97 | prompt-foundry/dotnet-sdk | 0 | - | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET |
| 98 | prompt-foundry/ruby-sdk | 1 | - | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for Ruby. |
| 99 | prompt-foundry/kotlin-sdk | 0 | - | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for Kotlin. |
| 100 | prompt-foundry/go-sdk | 1 | - | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for Go. |

How it works

Every 15 minutes, a GitHub Action runs tracker.py. That script:

  1. Fetches the latest state from the GitHub Search API.
  2. Diffs against data/items.json (the previous snapshot).
  3. Rewrites the table above between the <!-- TRACKER_TABLE_* --> markers.
  4. Commits feat: +N added, -M removed (timestamp) if anything changed.
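Steps 2 to 4 can be sketched in a few pure functions. This is a minimal illustration, not the contents of tracker.py: the marker names `TRACKER_TABLE_START` and `TRACKER_TABLE_END` are assumed (the README only shows the pattern `<!-- TRACKER_TABLE_* -->`), and the snapshot fields `full_name`, `stars`, `language`, `updated`, and `description` are a hypothetical schema for data/items.json.

```python
import re

# Assumed marker names; the README only documents <!-- TRACKER_TABLE_* -->.
START = "<!-- TRACKER_TABLE_START -->"
END = "<!-- TRACKER_TABLE_END -->"


def diff_items(previous, current):
    """Return (added, removed) repo names between two snapshots."""
    prev_names = {item["full_name"] for item in previous}
    curr_names = {item["full_name"] for item in current}
    return sorted(curr_names - prev_names), sorted(prev_names - curr_names)


def render_table(items):
    """Render items as a Markdown table (hypothetical snapshot schema)."""
    lines = [
        "| # | Name | ⭐ | Lang | Updated | Description |",
        "|---|------|----|------|---------|-------------|",
    ]
    for i, item in enumerate(items, start=1):
        lines.append(
            f"| {i} | {item['full_name']} | {item['stars']} | "
            f"{item.get('language') or '-'} | {item['updated']} | "
            f"{item.get('description') or ''} |"
        )
    return "\n".join(lines)


def rewrite_readme(readme, table):
    """Replace whatever sits between the two markers with the new table."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    replacement = f"{START}\n{table}\n{END}"
    # A function replacement keeps backslashes in the table text literal.
    new_readme, count = pattern.subn(lambda _match: replacement, readme, count=1)
    if count == 0:
        raise ValueError("table markers not found in README")
    return new_readme


def commit_message(added, removed, timestamp):
    """Build the commit subject, or None when nothing changed."""
    if not added and not removed:
        return None
    return f"feat: +{len(added)} added, -{len(removed)} removed ({timestamp})"
```

Returning `None` from `commit_message` when both lists are empty mirrors the "if anything changed" condition in step 4: the caller can skip the commit entirely on a quiet tick.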

No external services. No paid APIs. Just a public data source and a free GitHub Action.
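For step 1, a request to the public repository search endpoint might be built as below. This is a sketch under assumptions: the query string `topic:llm-eval`, the `sort`/`order` choice, and the `tracker-sketch` User-Agent are illustrative, since the README does not state the query tracker.py actually sends. The function only constructs the request, so it runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_request(query, per_page=100, token=None):
    """Build (but do not send) a GitHub repository search request."""
    params = urlencode(
        {"q": query, "sort": "updated", "order": "desc", "per_page": per_page}
    )
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "tracker-sketch",  # hypothetical client name
    }
    if token:
        # Optional: an authenticated request gets a higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return Request(f"{SEARCH_URL}?{params}", headers=headers)


# Hypothetical query; the real one is not documented in this README.
request = build_search_request("topic:llm-eval")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` and parsing the body as JSON yields an object whose `items` array holds the repositories; 100 is the largest `per_page` the endpoint accepts, which matches the 100-row table.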


🤝 Contributing

See CONTRIBUTING.md, though you usually won't need to: the tracker keeps itself current. If you spot a data-source bug or want to suggest a new column for the table, open an issue.


🔗 Related live trackers

If you find this useful, you might also like these other auto-updated trackers from the same maintainer (same mechanism, different upstream):

  • trending-claude-skills: What's shipping in Claude Skills this week (topic:claude-skills)
  • mcp-servers-live: Live index of newest MCP servers (topic:mcp-server)
  • cursor-rules-live: Newest Cursor rules and .cursorrules patterns (topic:cursor-rules)
  • claude-code-plugin-tracker: Claude Code plugins and hook configs (topic:claude-code)
  • llm-agents-radar: Newest LLM agent frameworks (topic:llm-agent)
  • rag-radar: Newest RAG implementations and tools (topic:rag)
  • llm-eval-tracker: Newest LLM evaluation tools and benchmarks (topic:llm-eval)
  • agent-framework-radar: Newest agent frameworks shipping on GitHub (topic:agent-framework)
  • vector-db-live: Newest vector DB projects and integrations (topic:vector-database)
  • llmops-radar: Newest LLMOps tooling (observability, deployment) (topic:llmops)
  • prompt-tools-live: Newest prompt-engineering tools and prompt repos (topic:prompt-engineering)
  • skills-tracker: Tracking new GitHub 'skills' repos (topic:agent-skills)
  • awesome-agent-skills: Curated auto-updated awesome-list of AI agent skills (topic:agent-skills)

📜 License

MIT. See LICENSE.

More from linny006

  • Awesome Agent Skills: Curated, auto-updated awesome-list of vetted AI agent skills with quality ratings for Claude, GPT, and open-source agents (⭐ 0)

  • Agent Skills Daily Tracker: Real-time tracking of every new GitHub 'skills' repo to capture the AI agent skill ecosystem trend (⭐ 0)

  • Agent Eval Harness: Live, open-source benchmark for comparing AI coding agents on real GitHub issues (⭐ 0)

  • Prompt Tools Live: Live-updating tracker of prompt engineering tools, libraries, and techniques, refreshed every 15 minutes (⭐ 0)

  • LLMOps Radar: Live index of the newest LLMOps tooling; track what's shipping in LLM observability and deployment (⭐ 0)