Paper currently under ICLR review.
Disclaimer: some of the scripts may not run because they are stripped down versions of our original scripts to anonymize them. We will provide the full scripts on GitHub after the paper is accepted.
The benchmarks folder contains our three benchmarks in Dafny, Verus and Lean.
-
Each language folder contains a
tasksfolder with specs which compile, up to somesorryorassume falsein the code. There is also anissuesfolder with specs which do not compile. We keep them in the benchmark for researchers who want to fix them or use them for other experiments. -
There are two files
lean_tasks.jsonlandlean_issues.jsonlwhich contain each task as a JSON line. The task is decomposed into different components, such as the preamble, spec and code. We also provide additional metadata from our quality analysis. This data is also available as CSV files. -
The
README.mdfile lists the original sources used in constructing these benchmarks, with download links. Thevericoding_benchmark_v1.csvprovides a list of Vericoding IDs and the source IDs for all 12,504 tasks. The information is also available in JSONL format:verucoding_benchmark_v1.jsonl.
The experiments folder contains the outcomes of our vericoding experiments on the benchmarks.
-
The file
vericoding_results_v1.csvwhich is a list of the outcomes of all 55397 experiments involving vericoding tasks across different models. -
The folder
inspectioncontains our manual inspection reports on the successful runs.
The src folder contains scripts that we used during the construction of this benchmark, such as translation, formatting and quality analysis. It also contains scripts for running the vericoding experiments.