Diamond

Build evaluation datasets that don't lie to you

The Evaluation Dataset Crisis

Every AI team hits the same wall: flaky datasets make bad evals.

Your datasets have poorly defined expected outputs, if they have any at all, and are rarely built by domain experts. They are random samples from production logs, with no strategy behind what gets included. You have zero visibility into the scenarios you're missing. Millions of valuable interactions sit unused because no one knows which ones actually matter.

The painful reality: You're making critical AI decisions based on datasets that don't represent your users, don't cover your edge cases, and drift away from reality over time.

Your evals are only as good as your datasets, and building good datasets is the hard part.

Why the Status Quo Fails

Manual curation doesn't scale. Your dataset should keep evolving as new scenarios appear, or you risk overfitting your AI to a fixed set of scenarios.

Random sampling misses what matters. Most production logs are routine — the interesting edge cases that break your AI get lost in the noise.

No systematic coverage. You have 1,000 examples but zero insight into what user scenarios you're actually testing.

Ground truth chaos. Three reviewers, three different "correct" answers. Your dataset quality depends on who had coffee that morning.
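
To make the "three reviewers, three answers" problem concrete, here is a minimal sketch of how reviewer disagreement could be measured. It is not Diamond's API: the `Reviews` shape, the function names, and the 0.5 threshold are all assumptions made for illustration. It computes the fraction of reviewer pairs that agree on each item and flags the items that fall below the threshold.

```typescript
// Hypothetical shape: for one dataset item, maps reviewer name to the label they chose.
type Reviews = Record<string, string>;

// Fraction of reviewer pairs that assigned the same label to one item.
function pairwiseAgreement(reviews: Reviews): number {
  const labels = Object.values(reviews);
  let pairs = 0;
  let matches = 0;
  for (let i = 0; i < labels.length; i++) {
    for (let j = i + 1; j < labels.length; j++) {
      pairs++;
      if (labels[i] === labels[j]) matches++;
    }
  }
  // With fewer than two reviewers there is nothing to compare.
  return pairs === 0 ? 1 : matches / pairs;
}

// Ids of items whose agreement falls below the (assumed) threshold.
function flagDisputed(
  items: Record<string, Reviews>,
  threshold = 0.5
): string[] {
  return Object.entries(items)
    .filter(([, reviews]) => pairwiseAgreement(reviews) < threshold)
    .map(([id]) => id);
}
```

Flagged items are candidates for a second look by a domain expert before they become ground truth. Raw pairwise agreement ignores agreement by chance, so a chance-corrected statistic may be a better fit for small label sets.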

Introducing Diamond

Evaluation datasets that evolve with your product. Diamond turns the mess of production logs and expert knowledge into structured, high-quality datasets you can actually trust.

Domain experts at the center. Author structured scenarios with the people who know what good looks like. No more engineers guessing at ground truth.

Continuously evolving from production. Diamond connects to your production logs and surfaces the scenarios that matter — edge cases, failures, emerging patterns — so your datasets never go stale.

Coverage you can measure. See exactly which user scenarios you're testing and which ones you're missing. No more blind spots.

Heterogeneity built in. Ensure your datasets represent the full diversity of real-world behavior, not just the easy cases.
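
As an illustration of what measurable coverage could look like, the sketch below counts dataset examples per scenario and lists the scenarios that have none. It is a hedged example, not Diamond's implementation: the `Example` and `CoverageReport` types, the `coverageReport` function, and the idea of a fixed list of expected scenarios are assumptions introduced here.

```typescript
// Hypothetical example shape: each dataset example is tagged with one scenario.
interface Example {
  id: string;
  scenario: string;
}

interface CoverageReport {
  counts: Map<string, number>; // examples per expected scenario
  missing: string[];           // expected scenarios with zero examples
  coverage: number;            // fraction of expected scenarios with at least one example
}

function coverageReport(expected: string[], examples: Example[]): CoverageReport {
  const counts = new Map<string, number>();
  for (const s of expected) counts.set(s, 0);

  for (const ex of examples) {
    // Tags outside the expected list are skipped here; in practice they
    // might point to a scenario the list does not know about yet.
    const current = counts.get(ex.scenario);
    if (current !== undefined) counts.set(ex.scenario, current + 1);
  }

  const missing = expected.filter((s) => counts.get(s) === 0);
  const coverage =
    expected.length === 0
      ? 1
      : (expected.length - missing.length) / expected.length;
  return { counts, missing, coverage };
}
```

The per-scenario counts also give a rough view of heterogeneity: a dataset where one scenario holds most of the examples is covering the easy cases and little else.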

The Basalt Stack

Diamond is part of a complete AI evaluation ecosystem:

Cobalt — CI-native testing for AI agents
Diamond — Dataset engine for AI evals ← You are here
Limestone — Build trustworthy LLM-as-a-judge evaluators
Asphalt — Self-improving engine for production AI agents

⭐ Star this repo to follow progress

We're building Diamond in the open. Star this repo to get notified about major updates and releases.

💬 Join the discussion

Have thoughts on AI evaluation datasets? Join the conversation and help shape the future of eval tooling.

Built and maintained by Basalt. Open source forever under Apache 2.0.
