Skip to content

Beneficial-AI-Foundation/vericoding-benchmark

Repository files navigation

A Benchmark For Vericoding: Formally Verified Program Synthesis

Our paper (currently undergoing review) can be found on ArXiv.

Our scripts can be found in this GitHub repo.

The vericoding_benchmark_v1.csv provides a list of Vericoding IDs, sources and the source IDs for all 12,504 tasks. It also contains additional metadata from our quality analysis.

The file vericoding_results_v1.csv which is a list of the outcomes of all 55,397 experiments involving vericoding tasks across different models.

Specs

The folder specs contains all 12,504 tasks in Dafny, Lean and Verus, which includes the following.

  • Files which compile, up to some sorry or assume false in the code. The file is decomposed into different components, such as the preamble, the spec, the code and the postamble.

  • Files which do not compile, especially after translation from another langauge. We keep them in the benchmark for researchers who want use them for other experiments, e.g. spec repair.

JSONL

The folder jsonl contains the tasks and issues sorted into the different languages, parsed into different components, and stored as JSON lines for easier experimentation. It also contains natural language descriptions for some of the tasks, which are not contained in the files in the specs folder.

Vericoded

The folder vericoded will contain some of the tasks with solutions filled in by an AI vericoder.

Handcoded

The folder handcoded will contain some of the tasks with solutions filled in by human coders.

Original Sources

DafnyBench

NumPyTriple

  • benchmarks/lean/numpy_triple
  • Derived from NumPy documentation
  • Specs in new Hoare triple format

NumPySimple

  • benchmarks/lean/numpy_simple
  • Derived from NumPy documentation
  • Specs in classical Lean format

Verina

APPS

FVAPPS

VerifiedCogen

BigNum

  • benchmarks/dafny/bignum
  • Written from scratch

HumanEval, Clever

The Dafny files were mostly translated from the original Python source. Some of the Dafny files were taken from Jetbrains Research's HumanEval Dafny repo.

The Lean files were taken from the Clever benchmark which were originally derived from the HumanEval benchmark. HumanEval problems 22, 137, 162 are missing from Clever for reasons mentioned in the Clever paper.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors