Add `programbench submit` (package / verify / recombine) by john-b-yang · Pull Request #39 · facebookresearch/ProgramBench

john-b-yang · 2026-06-17T00:42:57Z

Adds a submit subcommand group for the leaderboard submission lifecycle:

Major changes:

programbench submit package <run-dir>
- Turns a programbench eval run directory into a submission, in place. Writes the following:
  - submission.yaml manifest
  - _stats/score.json (per-instance, per-test pass/fail)
  - a light eval.json plus a heavy eval.log.json (raw log + failure text), split from the original eval.json
- The --upload-to <HF org> flag automatically uploads the submission.tar.gz and eval.log.json artifacts to a per-submission HuggingFace dataset (resumable), replacing each with a .url + .sha256.

programbench submit verify <dir>
- Tier-0 (default) recomputes the score from the submission's own eval.json and checks that it matches the manifest (no Docker, no network).
- --tier1 resolves each solution and re-runs programbench eval to confirm that the artifacts reproduce the score.
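
To make the Tier-0 idea concrete, here is a minimal sketch of an offline score check. It is not the code in verify.py: the result layout (`{instance_id: {test_name: passed}}`), the aggregation rule (mean of per-instance pass fractions), and the function names are all assumptions made for illustration.

```python
import math


def instance_score(tests: dict[str, bool]) -> float:
    """Fraction of passing tests for one instance (0.0 when there are no tests)."""
    if not tests:
        return 0.0
    return sum(tests.values()) / len(tests)


def recompute_score(results: dict[str, dict[str, bool]]) -> float:
    """Mean of per-instance pass fractions. The aggregation rule is assumed."""
    if not results:
        return 0.0
    return sum(instance_score(t) for t in results.values()) / len(results)


def tier0_matches(
    results: dict[str, dict[str, bool]], reported: float, tol: float = 1e-4
) -> bool:
    """Compare the recomputed score with the value reported in the manifest."""
    return math.isclose(recompute_score(results), reported, abs_tol=tol)
```

A check of this shape needs only the files already in the submission directory, which is why it can run without Docker or network access.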

Minor changes:

programbench submit recombine <dir>: reassembles the original eval.json from the split pieces (downloading the heavy part from HF if needed).
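
The split and recombine steps are inverses of each other, which the sketch below tries to illustrate. The heavy field names (`raw_log`, `failure_text`) and the per-instance record layout are assumptions; the actual schema lives in submission.py and is not shown in this conversation.

```python
import copy

# Assumed names for the heavy fields; the real eval.json schema may differ.
HEAVY_KEYS = ("raw_log", "failure_text")


def split_eval(full: dict[str, dict]) -> tuple[dict[str, dict], dict[str, dict]]:
    """Split per-instance records into a light part and a heavy part."""
    light: dict[str, dict] = {}
    heavy: dict[str, dict] = {}
    for iid, record in full.items():
        light[iid] = {k: v for k, v in record.items() if k not in HEAVY_KEYS}
        heavy[iid] = {k: v for k, v in record.items() if k in HEAVY_KEYS}
    return light, heavy


def recombine_eval(light: dict[str, dict], heavy: dict[str, dict]) -> dict[str, dict]:
    """Merge the heavy fields back into a copy of the light records."""
    full = copy.deepcopy(light)
    for iid, extra in heavy.items():
        full.setdefault(iid, {}).update(copy.deepcopy(extra))
    return full
```

The useful property to test is the round trip: recombining the two halves should give back a structure equal to the original.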

New modules:

submission.py (shared scoring/aggregation, eval-split, HF helpers)
package.py
verify.py
cli/submit.py
submission.yaml / README.md templates

Scaffold-agnostic: cost/calls stats are out of scope (submitter-provided, derived from trajectories).

john-b-yang · 2026-06-17T00:55:48Z

Fixed a lint issue, should be ready for review!

john-b-yang · 2026-06-17T01:04:06Z

Workflow I'm imagining, tl;dr'ed, is:

programbench eval run_name
programbench submit package run_name --upload-to hf/dataset
(User fills out missing metadata)
programbench submit verify run_name
programbench submit push run_name github.com/owner/repo
programbench submit register run_name

Fully described:

The user runs the evaluation (1)
Packages the run, which creates metadata seeded with the eval results, then fills out the remaining info (2, 3)
Runs a sanity check that the reported numbers match the eval results (4)
Pushes the folder to a standalone GitHub repository (5)
Creates a PR at ProgramBench/submissions (6)

Copilot

Pull request overview

Adds a new programbench submit command group implementing the submission lifecycle: packaging evaluated runs into a standardized submission format, verifying submissions (offline and via re-eval), recombining split eval artifacts, and registering submissions into the leaderboard registry via an automated PR flow.

Changes:

Introduces shared submission helpers (submission.py) for scoring/aggregation, eval JSON split+recombine, and artifact resolution.
Adds submit package, submit verify (tier0/tier1), submit recombine, and submit register CLI commands plus supporting modules.
Wires the new submit Typer app into the top-level CLI and adds Jinja templates for submission.yaml and README.md.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.


| File | Description |
| --- | --- |
| src/programbench/verify.py | Implements Tier-0/Tier-1 verification logic for packaged submissions. |
| src/programbench/submission.py | Adds shared scoring/aggregation, eval split/recombine, and artifact resolution helpers. |
| src/programbench/register.py | Implements registry PR plan/build/write logic and optional `gh`-based automation. |
| src/programbench/package.py | Implements in-place packaging of eval runs into leaderboard submissions, with optional HF upload. |
| src/programbench/data/templates/submission.yaml.j2 | Adds the submission manifest template used by `package`. |
| src/programbench/data/templates/README.md.j2 | Adds a submission README template with reproduction/checklist guidance. |
| src/programbench/cli/submit.py | Adds the `submit` CLI group and subcommands (package/verify/register/recombine). |
| src/programbench/cli/main.py | Registers the `submit` CLI group at the top level. |


+def verify_tier1(submission_dir: Path, *, workers: int = 1, filter_spec: str = "") -> VerifyResult:
+    from programbench.eval.eval_batch import run_eval_batch
+
+    instances = benchmark_instances()
+    sub_root = submission_dir
+    submitted = score_run(sub_root, instances)
+
+    with tempfile.TemporaryDirectory() as tmp:
+        run = Path(tmp)
+        for iid in submitted:
+            (run / iid).mkdir(parents=True)
+            resolve_submission_tar(sub_root / iid, run / iid / "submission.tar.gz")
+        run_eval_batch(sources=[run], workers=workers, filter_spec=filter_spec, force=True)
+        fresh = score_run(run, instances)
+
+    checks = [
+        Check(
+            iid,
+            round(submitted[iid], 4),
+            round(fresh.get(iid, float("nan")), 4),
+            _close(submitted[iid], fresh.get(iid)),
+        )
+        for iid in submitted
+        if not filter_spec or iid in fresh
+    ]
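
The hunk above passes `fresh.get(iid)` to `_close`, so the second argument can be `None` when an instance is missing from the re-run. The body of `_close` is not shown here; the following is only a sketch of what a comparison with that contract might look like. The name `scores_close`, the tolerance, and the treatment of `None` and NaN as mismatches are assumptions.

```python
import math
from typing import Optional


def scores_close(a: Optional[float], b: Optional[float], tol: float = 1e-4) -> bool:
    """Return True only when both scores exist, are not NaN, and differ by at most tol."""
    if a is None or b is None:
        return False  # a missing score never counts as a match
    if math.isnan(a) or math.isnan(b):
        return False  # NaN never counts as a match
    return abs(a - b) <= tol
```

Treating a missing or NaN score as a mismatch keeps a failed re-run from silently passing verification.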


+    return scores
+
+
+def write_stat(run_dir: Path, stat: str, by_instance: dict[str, float]) -> None:


+    elif url_file.exists():
+        with urllib.request.urlopen(url_file.read_text().strip()) as r:  # noqa: S310
+            heavy = json.loads(r.read())
+    else:


+    Supports the artifact forms in SPEC.md: inline file, ``.url`` (downloaded), or
+    ``submission.ref.yaml`` (git checkout packed). The sha256 sidecar, when present, is
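
For readers unfamiliar with sha256 sidecars, a minimal sketch of checking an artifact against one follows. The sidecar naming (`<artifact>.sha256`), the file format (hex digest as the first whitespace-separated token), and the choice to accept an artifact that has no sidecar are assumptions for illustration, not the behavior of `resolve_submission_tar`.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large archives are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sidecar(artifact: Path) -> bool:
    """True if no sidecar exists or the digest matches; False on mismatch."""
    sidecar = artifact.with_name(artifact.name + ".sha256")  # assumed naming
    if not sidecar.exists():
        return True  # assumed policy: the sidecar is optional
    parts = sidecar.read_text().split()
    if not parts:
        return False  # an empty sidecar cannot confirm anything
    return sha256_of(artifact) == parts[0].lower()
```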


+# Generated by `programbench package` from: {{ run_dir }}
+# [auto] fields are recomputed on every `package`; all other fields are preserved.


+    plan = build_plan(submission_dir, registry)
+    if source:
+        plan.source = source
+    if commit:
+        plan.commit = commit


meta-cla Bot added the CLA Signed label Jun 17, 2026 (this label is managed by the Meta Open Source bot)

john-b-yang requested a review from klieret June 17, 2026 00:43

john-b-yang force-pushed the add-submit-commands branch from f3f9a68 to 4cd2f25 Compare June 17, 2026 00:53

john-b-yang marked this pull request as draft June 17, 2026 00:57

Add programbench submit (package / verify / register / recombine)

b1e9e94

john-b-yang force-pushed the add-submit-commands branch from 4cd2f25 to b1e9e94 Compare June 17, 2026 01:03

klieret requested a review from Copilot June 17, 2026 01:08

Copilot started reviewing on behalf of klieret June 17, 2026 01:08 View session

Copilot AI reviewed Jun 17, 2026

john-b-yang wants to merge 1 commit into main from add-submit-commands
