LLM-as-a-judge evaluators are fundamentally broken. We're building ones you can trust.
Every AI team using LLM-as-a-judge hits the same nightmare: evaluators that can't be trusted.
Your "helpful response" judge gives different scores to identical inputs. Your "accuracy" evaluator drifts over time as models update. Your criteria are so vague ("is this good?") that even humans disagree on what they mean.
The brutal truth: you're making critical AI decisions based on evaluators whose consistency you have never measured.
You've built evaluation pipelines on quicksand. Your judges work fine in demos, fail spectacularly in production, and nobody knows why. The worst part? You only discover this after shipping to users.
Vague criteria create chaos. "Is this response helpful?" means something different to every reviewer and every model run.
Zero consistency testing. Your judge passes eval on Monday, then fails the same examples on Tuesday. Did your AI get worse, or did your judge?
No expert alignment. Your LLM judge thinks it's great, your domain experts disagree. Who do you trust?
Drift is invisible. Model updates break your judges in subtle ways you won't catch until it's too late.
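The consistency problem above can be measured with very little code. The following is a minimal sketch, not Limestone's API: `self_consistency`, `stable_judge`, and `flaky_judge` are hypothetical names, and the judge is assumed to be any callable that maps an input string to a verdict label. It re-runs the judge on each input and reports the fraction of inputs that got the same verdict every time.

```python
import itertools
from typing import Callable, Sequence


def self_consistency(
    judge: Callable[[str], str], inputs: Sequence[str], runs: int = 5
) -> float:
    """Fraction of inputs on which the judge returns one verdict across all runs."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    if runs < 2:
        raise ValueError("need at least 2 runs to measure consistency")
    stable = 0
    for text in inputs:
        # Collect the distinct verdicts produced by repeated calls on the same input.
        verdicts = {judge(text) for _ in range(runs)}
        if len(verdicts) == 1:
            stable += 1
    return stable / len(inputs)


# Hypothetical stand-ins for a real LLM call, used only for illustration.
def stable_judge(text: str) -> str:
    return "pass" if "refund" in text else "fail"


_flip = itertools.cycle(["pass", "fail"])


def flaky_judge(text: str) -> str:
    # Alternates verdicts regardless of input, like an unstable judge.
    return next(_flip)


if __name__ == "__main__":
    samples = ["refund issued within 3 days", "no answer given"]
    print(self_consistency(stable_judge, samples))
    print(self_consistency(flaky_judge, samples))
```

Running the same check on a frozen set of examples after each model update is one simple way to make drift visible: if the score drops while the examples stay fixed, the judge changed, not your AI.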
Evaluators that earn your trust before they touch production. Limestone brings rigorous methodology to LLM-as-a-judge, so you stop guessing and start measuring.
Built from expert feedback. Run structured sessions where domain experts critique real outputs in natural language. No more inventing criteria in a vacuum.
Auto-extracted error categories. Limestone identifies recurring failure patterns from expert feedback and converts them into precise, structured evaluation criteria.
Stress-tested for reliability. Every evaluator is validated against alignment datasets before deployment. You know exactly how consistent your judge is — and where it breaks.
Measurable confidence. Target 100% reliability before deploying an evaluator. If your judge can't pass its own eval, it doesn't ship.
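To make the validation and ship-gate idea concrete, here is a minimal sketch under stated assumptions. It is not Limestone's implementation: the function names are hypothetical, an alignment dataset is assumed to be a list of expert labels paired with the judge's labels on the same examples, and "reliability" is assumed to mean raw agreement with those expert labels. Cohen's kappa is included as a common chance-corrected companion metric.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Share of examples where the judge's label matches the expert's label."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def cohens_kappa(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, expert)
    n = len(judge)
    judge_counts, expert_counts = Counter(judge), Counter(expert)
    labels = set(judge_counts) | set(expert_counts)
    expected = sum(judge_counts[c] * expert_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def ready_to_ship(
    judge: Sequence[str], expert: Sequence[str], threshold: float = 1.0
) -> bool:
    """Gate deployment on agreement; the 1.0 default mirrors the 100% target."""
    return agreement_rate(judge, expert) >= threshold


if __name__ == "__main__":
    judge_labels = ["pass", "fail", "pass", "fail"]
    expert_labels = ["pass", "fail", "fail", "fail"]
    print(agreement_rate(judge_labels, expert_labels))  # 0.75
    print(cohens_kappa(judge_labels, expert_labels))    # 0.5
    print(ready_to_ship(judge_labels, expert_labels))   # False
```

The disagreeing examples are as useful as the score: they show where the judge breaks and which criteria need to be tightened before the next validation run.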
Limestone is part of a complete AI evaluation ecosystem:
- Cobalt — CI-native testing for AI agents
- Diamond — Dataset engine for AI evals
- Limestone — Build trustworthy LLM-as-a-judge evaluators ← You are here
- Asphalt — Self-improving engine for production AI agents
We're building Limestone in the open. Star this repo to get notified about major updates and releases.
Struggling with unreliable LLM judges? Join the conversation and help us build evaluation you can trust.
Built and maintained by Basalt. Open source forever under Apache 2.0.