Build evaluation datasets that don't lie to you
Every AI team hits the same wall: flaky datasets make for bad evals.
Your datasets are full of poorly defined expected outputs (when they have any at all), and they are rarely built by domain experts. They are random samples from production logs, with no strategy behind what actually gets included. You have zero visibility into the scenarios you're missing. Millions of valuable interactions sit unused because no one knows which ones actually matter.
The painful reality: You're making critical AI decisions based on datasets that don't represent your users, don't cover your edge cases, and drift away from reality over time.
Your evals are only as good as your datasets, and this is the hard part.
Manual curation doesn't scale. Your dataset should constantly evolve with new scenarios, or you risk overfitting your AI to a fixed set of scenarios.
Random sampling misses what matters. Most production logs are routine — the interesting edge cases that break your AI get lost in the noise.
No systematic coverage. You have 1,000 examples but zero insight into what user scenarios you're actually testing.
Ground truth chaos. Three reviewers, three different "correct" answers. Your dataset quality depends on who had coffee that morning.
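To make the "ground truth chaos" problem concrete, here is a minimal sketch of how you might check reviewer agreement on expected outputs. It is plain Python and not Diamond's API; the function name `label_agreement`, the example ids, and the labels are all hypothetical, and it assumes each example has been labeled independently by several reviewers.

```python
from collections import Counter


def label_agreement(labels_by_example):
    """Return (unanimous_rate, disputed_ids).

    labels_by_example maps an example id to the list of labels
    that individual reviewers assigned to it (hypothetical format).
    """
    disputed = []
    for example_id, labels in labels_by_example.items():
        # Count of the most common label among the reviewers.
        top_count = Counter(labels).most_common(1)[0][1]
        # Anything short of unanimity is flagged for expert review.
        if top_count < len(labels):
            disputed.append(example_id)
    total = len(labels_by_example)
    rate = (total - len(disputed)) / total if total else 0.0
    return rate, disputed


# Hypothetical review data: three reviewers per example.
reviews = {
    "ex-1": ["refund", "refund", "refund"],
    "ex-2": ["refund", "escalate", "deny"],
    "ex-3": ["escalate", "escalate", "refund"],
}

rate, disputed = label_agreement(reviews)
print(f"unanimous: {rate:.0%}, disputed: {disputed}")
```

Unanimity is a deliberately strict criterion. A chance-corrected statistic such as Cohen's or Fleiss' kappa would be a reasonable next step once the label set is stable.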
Evaluation datasets that evolve with your product. Diamond turns the mess of production logs and expert knowledge into structured, high-quality datasets you can actually trust.
Domain experts at the center. Author structured scenarios with the people who know what good looks like. No more engineers guessing at ground truth.
Continuously evolving from production. Diamond connects to your production logs and surfaces the scenarios that matter — edge cases, failures, emerging patterns — so your datasets never go stale.
Coverage you can measure. See exactly which user scenarios you're testing and which ones you're missing. No more blind spots.
Heterogeneity built in. Ensure your datasets represent the full diversity of real-world behavior, not just the easy cases.
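As an illustration of what measurable coverage can mean, the sketch below compares a dataset against a list of scenarios you expect to test. It is not Diamond's API: `coverage_report`, the `scenario` field, and the scenario names are assumptions made for the example, and it presumes every dataset entry has already been tagged with a scenario.

```python
from collections import Counter


def coverage_report(expected_scenarios, dataset):
    """Summarize which expected scenarios appear in the dataset.

    dataset is a list of dicts with a "scenario" key (hypothetical schema).
    """
    counts = Counter(example["scenario"] for example in dataset)
    # Scenarios with at least one example, with their example counts.
    covered = {s: counts[s] for s in expected_scenarios if counts[s] > 0}
    # Scenarios with no examples at all: the blind spots.
    missing = [s for s in expected_scenarios if counts[s] == 0]
    ratio = len(covered) / len(expected_scenarios) if expected_scenarios else 0.0
    return {"coverage": ratio, "covered": covered, "missing": missing}


# Hypothetical scenario taxonomy and tagged dataset.
expected = ["refund_request", "password_reset", "billing_dispute", "account_deletion"]
examples = [
    {"scenario": "refund_request"},
    {"scenario": "refund_request"},
    {"scenario": "refund_request"},
    {"scenario": "password_reset"},
]

report = coverage_report(expected, examples)
print(report["coverage"], report["missing"])
```

The per-scenario counts in `covered` also hint at heterogeneity: here three of four examples are the same routine case, which is the kind of skew a diverse dataset should avoid.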
Diamond is part of a complete AI evaluation ecosystem:
- Cobalt — CI-native testing for AI agents
- Diamond — Dataset engine for AI evals ← You are here
- Limestone — Build trustworthy LLM-as-a-judge evaluators
- Asphalt — Self-improving engine for production AI agents
We're building Diamond in the open. Star this repo to get notified about major updates and releases.
Have thoughts on AI evaluation datasets? Join the conversation and help shape the future of eval tooling.
Built and maintained by Basalt. Open source forever under Apache 2.0.