Limestone

LLM-as-a-judge evaluators are fundamentally broken. We're building ones you can trust.

The LLM Judge Reliability Crisis

Every AI team using LLM-as-a-judge hits the same nightmare: evaluators that can't be trusted.

Your "helpful response" judge gives different scores to identical inputs. Your "accuracy" evaluator drifts over time as models update. Your criteria are so vague ("is this good?") that even humans disagree on what they mean.

The brutal truth: You're making critical AI decisions based on evaluators whose consistency you have never measured.

You've built evaluation pipelines on quicksand. Your judges work fine in demos, fail spectacularly in production, and nobody knows why. The worst part? You only discover this after shipping to users.

Why LLM Judges Fail

Vague criteria create chaos. "Is this response helpful?" means something different to every reviewer and every model run.

Zero consistency testing. Your judge passes an eval on Monday and fails the same examples on Tuesday. Did your AI get worse, or did your judge?

No expert alignment. Your LLM judge thinks the output is great; your domain experts disagree. Who do you trust?

Drift is invisible. Model updates break your judges in subtle ways you won't catch until it's too late.
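
Two of these failure modes can be measured with very little code. The sketch below is illustrative only and is not Limestone's API: `self_consistency` and `cohens_kappa` are hypothetical helper names, and the judge is assumed to be any callable that maps an output string to a verdict label. The first function re-runs the judge on one input and reports how often it returns its most common verdict; the second computes Cohen's kappa, a chance-corrected agreement score, between judge verdicts and expert labels.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(judge: Callable[[str], str], example: str, runs: int = 5) -> float:
    """Fraction of repeated runs that return the most common verdict.

    `judge` is a hypothetical callable (output text -> verdict label);
    1.0 means the judge gave the same verdict on every run.
    """
    verdicts = [judge(example) for _ in range(runs)]
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / runs


def cohens_kappa(judge_labels: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge verdicts and expert labels."""
    if len(judge_labels) != len(expert_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: share of examples where both raters match.
    observed = sum(j == e for j, e in zip(judge_labels, expert_labels)) / n
    # Expected agreement if both raters labeled at random with their own label rates.
    judge_freq = Counter(judge_labels)
    expert_freq = Counter(expert_labels)
    expected = sum(judge_freq[label] * expert_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Running the same two measurements on a frozen example set before and after a model update is one simple way to make drift visible: if the scores move while the examples have not changed, the judge changed.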

Introducing Limestone

Evaluators that earn your trust before they touch production. Limestone brings rigorous methodology to LLM-as-a-judge, so you stop guessing and start measuring.

Built from expert feedback. Run structured sessions where domain experts critique real outputs in natural language. No more inventing criteria in a vacuum.

Auto-extracted error categories. Limestone identifies recurring failure patterns from expert feedback and converts them into precise, structured evaluation criteria.

Stress-tested for reliability. Every evaluator is validated against alignment datasets before deployment. You know exactly how consistent your judge is — and where it breaks.

Measurable confidence. Target 100% reliability before deploying an evaluator. If your judge can't pass its own eval, it doesn't ship.
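
As a rough illustration of the "doesn't ship" rule, here is a minimal sketch of a deployment gate. It is an assumption about how such a check could look, not Limestone's implementation: `AlignmentExample`, `agreement_by_category`, and `ready_to_ship` are hypothetical names, and the alignment dataset is assumed to be a list of outputs paired with an expert verdict and the error category each one probes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class AlignmentExample:
    """One expert-labeled example (hypothetical structure)."""
    output: str          # model output to be judged
    expert_verdict: str  # label assigned by a domain expert
    category: str        # error category this example probes


def agreement_by_category(
    judge: Callable[[str], str], examples: Sequence[AlignmentExample]
) -> Dict[str, float]:
    """Share of examples per category where the judge matches the expert."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex.category] += 1
        if judge(ex.output) == ex.expert_verdict:
            hits[ex.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


def ready_to_ship(
    judge: Callable[[str], str],
    examples: Sequence[AlignmentExample],
    target: float = 1.0,
) -> bool:
    """True only if every category meets the target agreement.

    An empty dataset never passes: there is nothing to base trust on.
    """
    scores = agreement_by_category(judge, examples)
    return bool(scores) and all(score >= target for score in scores.values())
```

Reporting agreement per category rather than as one overall number shows where the judge breaks, not just whether it does. The default `target=1.0` mirrors the 100% goal above; a lower threshold is a deliberate choice the caller has to make.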

The Basalt Stack

Limestone is part of a complete AI evaluation ecosystem:

Cobalt — CI-native testing for AI agents
Diamond — Dataset engine for AI evals
Limestone — Build trustworthy LLM-as-a-judge evaluators ← You are here
Asphalt — Self-improving engine for production AI agents

⭐ Star this repo to follow progress

We're building Limestone in the open. Star this repo to get notified about major updates and releases.

💬 Join the discussion

Struggling with unreliable LLM judges? Join the conversation and help us build evaluation you can trust.

Built and maintained by Basalt. Open source forever under Apache 2.0.
