llm-evals

Here are 35 public repositories matching this topic...

ALucek / evaluizer

Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.

llm-optimizer llm-evals prompt-annotation prompt-optimizer

Updated Nov 22, 2025
TypeScript

The-Swarm-Corporation / StatisticalModelEvaluator

An implementation of Anthropic's paper and essay on "A statistical approach to model evaluations"

ai ml multiagent agents llms evals llm-evals agent-evals multi-agent-eval

Updated Oct 6, 2025
Python

pyladiesams / eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

workshop llm llms llmops llm-eval llm-test llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-testing llm-evals

Updated May 6, 2025
Jupyter Notebook

LLMSystems / llm-evals

A framework for evaluating large language models (LLMs) across a variety of tasks.

nlp benchmark ai evaluation-framework ai-evaluation llm llm-evaluation llm-as-a-judge g-eval llm-evals

Updated Mar 18, 2026
Python

tpertner / squeeze

Squeeze your model with pressure prompts to see if its behavior leaks.

reliability evaluation calibration alignment quality-assurance metamorphic-testing ai-safety trustworthiness hallucinations prompt-engineering llm-eval llm-evals

Updated Mar 1, 2026
Python

kevinschaul / llm-evals

Because we should all have our own set of LLM evals.

llm llm-evals

Updated Apr 17, 2026
Python

aelaguiz / codex-autoresearcher

Codex-native autoresearch harness with structured worker/judge turns for optimizing anything you can measure.

python research optimization codex ai-agents llm-evals experiment-runner autoresearch

Updated Mar 21, 2026
Python

tpertner / confess

Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.

python yaml calibration alignment metamorphic-testing model-evaluation ai-safety red-teaming prompt-injection hallucination-detection llm-evals evaluation-harness

Updated Feb 22, 2026
Python

dhirajxai / llm-evals-and-anti-hallucination

Evaluation patterns, release gates, and anti-hallucination techniques for developer-focused AI workflows.

evaluation llmops prompt-testing promptfoo llm-evals ai-reliability anti-hallucination groundedness

Updated Mar 27, 2026
Python

Pavansomisetty21 / GEval-Metrics-Analyzing-the-Reliability-of-LLM-Responses

Evaluates LLM responses and measures their accuracy.

llm-evaluation-metrics llm-evals geval

Updated Jul 8, 2025
Python

abhijeetnardele24-hash / dev-eval-innovator

Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.

python ci developer-tools prompt-engineering llm-testing llm-evals openai-compatible eval-framework

Updated Apr 13, 2026
Python

SiddhantaShrestha / autonomous-research-eval-agent

Agentic research pipeline with local retrieval, structured evaluation, conditional revision, and traceable outputs using Groq.

python cli retrieval evaluation ai-agents groq llm prompt-engineering research-agent agentic-workflow llm-evals

Updated Mar 23, 2026
TypeScript

dicnunz / dicnunz.github.io

Portfolio site for AI-assisted software experiments, prototypes, and polished demos.

github-pages portfolio developer-tools local-first llm-evals

Updated Apr 19, 2026
HTML

peterderdak / evalgate

CLI release gate for structured AI changes.

cli json-schema regression-testing prompt-testing llm-evals ai-evals

Updated Mar 12, 2026
TypeScript

adrianrossts / adrianrossts

Germany-based AI/backend engineer. Shipping production quality loops, build logs, and practical engineering notes.

backend software-engineering observability ai-engineering build-in-public llm-evals

Updated Apr 1, 2026

BuildWithAbid / ai-stability

Measure LLM output consistency from the command line.

python cli openai developer-tools ai-evaluation llm prompt-engineering prompt-testing reliability-testing llm-evals

Updated Apr 6, 2026
Python

KIM3310 / stage-pilot

Tool-calling reliability runtime — 25% to 90% benchmark lift. Published as @ai-sdk-tool/parser on npm.

benchmarking typescript ai-sdk tool-calling llm-evals agentic-systems review-pack reliability-runtime review-first

Updated Apr 22, 2026
TypeScript

codernate92 / Horizon-Eval

Horizon-Eval: evaluation-integrity framework for portable long-horizon agent benchmarks, with QA gates, trajectory auditing, replayable run bundles, and safety-gap analysis.

evaluation alignment reproducibility trajectory-analysis ai-safety safety-engineering llm-evals agent-evals benchmark-integrity

Updated Mar 21, 2026
Python

Course-Correct-Labs / ai-agency-evals

Reproducible evaluation suite for LLM behavior research: epistemic pathology, delegated introspection, and temporal consciousness diagnostics

python alignment language-models reproducibility ai-safety interpretability research-engineering ai-alignment research-tools rlhf llm-evaluation behavioral-research llm-evals epistemic-alignment temporal-consciousness epistemic-pathology

Updated Nov 16, 2025
Python

scienceaditya / agentic-research-kit

Evaluation-first research kit for biology. Structured logs, reproducible specs, and lightweight validation for agentic workflows.

benchmarking reliability computational-biology crispr llm-evals membrane-trafficking

Updated Dec 25, 2025
Python
