basalt-ai/limestone

Limestone

🌐 Visit the website →

LLM-as-a-judge evaluators are fundamentally broken. We're building ones you can trust.


The LLM Judge Reliability Crisis

Every AI team using LLM-as-a-judge hits the same nightmare: evaluators that can't be trusted.

Your "helpful response" judge gives different scores to identical inputs. Your "accuracy" evaluator drifts over time as models update. Your criteria are so vague ("is this good?") that even humans disagree on what they mean.

The brutal truth: you're making critical AI decisions based on evaluators whose consistency you have never measured.

You've built evaluation pipelines on quicksand. Your judges work fine in demos, fail spectacularly in production, and nobody knows why. The worst part? You only discover this after shipping to users.


Why LLM Judges Fail

Vague criteria create chaos. "Is this response helpful?" means something different to every reviewer and every model run.

Zero consistency testing. Your judge passes eval on Monday, fails the same examples on Tuesday. Did your AI get worse, or did your judge?

No expert alignment. Your LLM judge rates an output as great; your domain experts disagree. Which one do you trust?

Drift is invisible. Model updates break your judges in subtle ways you won't catch until it's too late.
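To make the consistency and drift problems concrete, here is a minimal sketch of a repeat-run check. It is not Limestone's API (no release exists yet); `consistency_report` and `toy_judge` are hypothetical names, and the judge is assumed to be any callable that maps an output string to a verdict label. Running the same check on a frozen set of examples before and after a model update is one simple way to surface drift.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


def consistency_report(
    judge: Callable[[str], str],
    inputs: Sequence[str],
    runs: int = 5,
) -> dict:
    """Call the judge `runs` times per input and measure how often it agrees with itself."""
    if not inputs or runs < 1:
        raise ValueError("need at least one input and one run")
    shares = []
    for text in inputs:
        verdicts = Counter(judge(text) for _ in range(runs))
        # Share of runs that returned the most common verdict for this input.
        shares.append(verdicts.most_common(1)[0][1] / runs)
    return {
        "mean_self_agreement": sum(shares) / len(shares),
        "unstable_inputs": [t for t, s in zip(inputs, shares) if s < 1.0],
    }


# Hypothetical stand-in for a real LLM call: it alternates its verdict
# on anything containing "ambiguous" and is stable otherwise.
_flip = itertools.cycle(["pass", "fail"])


def toy_judge(text: str) -> str:
    return next(_flip) if "ambiguous" in text else "pass"


report = consistency_report(toy_judge, ["clear answer", "ambiguous answer"], runs=4)
print(report)
```

With the toy judge, the stable input agrees with itself on every run while the ambiguous one splits evenly, so it shows up in `unstable_inputs`. A real judge would replace `toy_judge` with a function that calls your model.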


Introducing Limestone

Evaluators that earn your trust before they touch production. Limestone brings rigorous methodology to LLM-as-a-judge, so you stop guessing and start measuring.

Built from expert feedback. Run structured sessions where domain experts critique real outputs in natural language. No more inventing criteria in a vacuum.

Auto-extracted error categories. Limestone identifies recurring failure patterns from expert feedback and converts them into precise, structured evaluation criteria.

Stress-tested for reliability. Every evaluator is validated against alignment datasets before deployment. You know exactly how consistent your judge is — and where it breaks.

Measurable confidence. Target 100% reliability on your alignment dataset before deploying an evaluator. If your judge can't pass its own eval, it doesn't ship.
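As a rough illustration of validating a judge against expert labels, the sketch below computes raw agreement and Cohen's kappa (a standard chance-corrected agreement statistic) and applies a ship gate. This is an assumption about how such a check could look, not Limestone's implementation: the function names, the label format, and reading "100% reliability" as raw agreement of 1.0 on the alignment dataset are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def raw_agreement(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of examples where the judge and the expert gave the same label."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def cohens_kappa(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    observed = raw_agreement(judge, expert)
    n = len(judge)
    judge_freq, expert_freq = Counter(judge), Counter(expert)
    # Expected agreement if each rater labelled at random with their own label frequencies.
    expected = sum(judge_freq[k] * expert_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def ready_to_ship(judge: Sequence[str], expert: Sequence[str], target: float = 1.0) -> bool:
    """Hypothetical gate: the judge ships only if it meets the agreement target."""
    return raw_agreement(judge, expert) >= target


# Toy alignment dataset: expert labels versus judge labels on the same four outputs.
expert_labels = ["pass", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "pass", "pass"]

print(raw_agreement(judge_labels, expert_labels))  # 0.75
print(cohens_kappa(judge_labels, expert_labels))   # 0.5
print(ready_to_ship(judge_labels, expert_labels))  # False
```

Reporting kappa alongside raw agreement matters when one label dominates: a judge that always answers "pass" can score high raw agreement on a skewed dataset while carrying no real signal.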


The Basalt Stack

Limestone is part of a complete AI evaluation ecosystem:

  • Cobalt — CI-native testing for AI agents
  • Diamond — Dataset engine for AI evals
  • Limestone — Build trustworthy LLM-as-a-judge evaluators ← You are here
  • Asphalt — Self-improving engine for production AI agents

⭐ Star this repo to follow progress

We're building Limestone in the open. Star this repo to get notified about major updates and releases.

💬 Join the discussion

Struggling with unreliable LLM judges? Join the conversation and help us build evaluation you can trust.

Built and maintained by Basalt. Open source forever under Apache 2.0.
