A Benchmark For Vericoding: Formally Verified Program Synthesis

Paper currently under ICLR review.

Disclaimer: some of the scripts may not run because they are stripped down versions of our original scripts to anonymize them. We will provide the full scripts on GitHub after the paper is accepted.

Benchmarks

The benchmarks folder contains our three benchmarks in Dafny, Verus and Lean.

Each language folder contains a tasks folder with specs which compile, up to some sorry or assume false in the code. There is also an issues folder with specs which do not compile. We keep them in the benchmark for researchers who want to fix them or use them for other experiments.
There are two files lean_tasks.jsonl and lean_issues.jsonl which contain each task as a JSON line. The task is decomposed into different components, such as the preamble, spec and code. We also provide additional metadata from our quality analysis. This data is also available as CSV files.
The README.md file lists the original sources used in constructing these benchmarks, with download links. The vericoding_benchmark_v1.csv provides a list of Vericoding IDs and the source IDs for all 12,504 tasks. The information is also available in JSONL format: verucoding_benchmark_v1.jsonl.

Experiments

The experiments folder contains the outcomes of our vericoding experiments on the benchmarks.

The file vericoding_results_v1.csv which is a list of the outcomes of all 55397 experiments involving vericoding tasks across different models.
The folder inspection contains our manual inspection reports on the successful runs.

Source code

The src folder contains scripts that we used during the construction of this benchmark, such as translation, formatting and quality analysis. It also contains scripts for running the vericoding experiments.

Name		Name	Last commit message	Last commit date
Latest commit History 1,128 Commits
.cursor		.cursor
.github/workflows		.github/workflows
benchmarks		benchmarks
code2verus		code2verus
config		config
docs		docs
examples		examples
experiment_traces		experiment_traces
experiments		experiments
handcoded		handcoded
scripts		scripts
src		src
vericoded		vericoded
verus		verus
.coveragerc		.coveragerc
.gitignore		.gitignore
.mcp.json		.mcp.json
.python-version		.python-version
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
Justfile		Justfile
LICENSE		LICENSE
README.md		README.md
lake-manifest.json		lake-manifest.json
lakefile.lean		lakefile.lean
lean-toolchain		lean-toolchain
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Benchmark For Vericoding: Formally Verified Program Synthesis

Benchmarks

Experiments

Source code

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

A Benchmark For Vericoding: Formally Verified Program Synthesis

Benchmarks

Experiments

Source code

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages