Our paper (currently undergoing review) can be found on ArXiv.
Our scripts can be found in this GitHub repo.
The vericoding_benchmark_v1.csv provides a list of Vericoding IDs, sources and the source IDs for all 12,504 tasks. It also contains additional metadata from our quality analysis.
The file vericoding_results_v1.csv which is a list of the outcomes of all 55,397 experiments involving vericoding tasks across different models.
The folder specs contains all 12,504 tasks in Dafny, Lean and Verus, which includes the following.
-
Files which compile, up to some
sorryorassume falsein the code. The file is decomposed into different components, such as the preamble, the spec, the code and the postamble. -
Files which do not compile, especially after translation from another langauge. We keep them in the benchmark for researchers who want use them for other experiments, e.g. spec repair.
The folder jsonl contains the tasks and issues sorted into the different languages, parsed into different components, and stored as JSON lines for easier experimentation. It also contains natural language descriptions for some of the tasks, which are not contained in the files in the specs folder.
The folder vericoded will contain some of the tasks with solutions filled in by an AI vericoder.
The folder handcoded will contain some of the tasks with solutions filled in by human coders.
benchmarks/dafny/dafnybench- https://github.com/sun-wendy/DafnyBench
benchmarks/lean/numpy_triple- Derived from NumPy documentation
- Specs in new Hoare triple format
benchmarks/lean/numpy_simple- Derived from NumPy documentation
- Specs in classical Lean format
benchmarks/lean/verina- https://github.com/sunblaze-ucb/verina
benchmarks/dafny/apps- https://github.com/hendrycks/apps
benchmarks/lean/fvapps- https://huggingface.co/datasets/quinn-dougherty/fvapps
benchmarks/verus/verified_cogen- https://github.com/JetBrains-Research/verified-cogen
benchmarks/dafny/bignum- Written from scratch
benchmarks/dafny/humanevalbenchmarks/lean/clever- https://github.com/openai/human-eval
- https://github.com/JetBrains-Research/HumanEval-Dafny
- https://github.com/trishullab/clever
The Dafny files were mostly translated from the original Python source. Some of the Dafny files were taken from Jetbrains Research's HumanEval Dafny repo.
The Lean files were taken from the Clever benchmark which were originally derived from the HumanEval benchmark. HumanEval problems 22, 137, 162 are missing from Clever for reasons mentioned in the Clever paper.