Add programbench submit (package / verify / recombine) #39

Draft
john-b-yang wants to merge 1 commit into main from add-submit-commands

john-b-yang commented Jun 17, 2026

Adds a submit subcommand group for the leaderboard submission lifecycle:

Major changes:

  • programbench submit package <run-dir>
    • Turns a programbench eval run directory into a submission, in place. It writes the following:
      • submission.yaml manifest
      • _stats/score.json (per-instance, per-test pass/fail)
      • a light eval.json plus a heavy eval.log.json (raw log + failure text), split from the original eval.json
    • The --upload-to <HF org> flag uploads the submission.tar.gz and eval.log.json artifacts to a per-submission Hugging Face dataset (resumable) and replaces each local file with a .url + .sha256 pair.
  • programbench submit verify <dir>
    • Tier-0 (default) recomputes the score from the submission's own eval.json and checks that it matches the manifest (no Docker, no network).
    • --tier1 resolves each solution and re-runs programbench eval to confirm that the artifacts reproduce the score.
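
To make the Tier-0 idea concrete, here is a minimal sketch of an offline check that recomputes each instance's score from its eval.json and compares it with a claimed value. The eval.json layout (a "tests" mapping of test name to pass/fail), the `<submission>/<instance>/eval.json` path, and the names `instance_score` and `tier0_matches` are all assumptions made for illustration; the actual logic in verify.py may differ.

```python
import json
import math
from pathlib import Path


def instance_score(eval_path: Path) -> float:
    """Fraction of passing tests in one eval.json.

    Assumed (hypothetical) layout: {"tests": {"test_name": true/false, ...}}
    """
    tests = json.loads(eval_path.read_text())["tests"]
    if not tests:
        return 0.0
    return sum(bool(passed) for passed in tests.values()) / len(tests)


def tier0_matches(
    submission_dir: Path, claimed: dict[str, float], tol: float = 1e-4
) -> dict[str, bool]:
    """Map each instance id to True if the recomputed score matches the claim.

    Needs neither Docker nor network: it only reads files already on disk.
    """
    result: dict[str, bool] = {}
    for iid, claimed_score in claimed.items():
        eval_path = submission_dir / iid / "eval.json"  # assumed location
        if not eval_path.exists():
            result[iid] = False  # a missing eval.json cannot confirm a claim
            continue
        recomputed = instance_score(eval_path)
        result[iid] = math.isclose(recomputed, claimed_score, rel_tol=0.0, abs_tol=tol)
    return result
```

A caller would pass the per-instance scores read from the manifest as `claimed` and treat any `False` entry as a verification failure.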

Minor changes:

  • programbench submit recombine <dir>: reassembles the original eval.json from the split pieces (downloading the heavy part from HF if needed).
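
The split and recombine steps are inverses of each other, which a short sketch can illustrate. The field names in `HEAVY_KEYS` and the function names `split_eval` and `recombine_eval` are hypothetical; the PR only states that the heavy file holds the raw log and failure text, so the real key names and file handling in submission.py may be different. This sketch also covers only the local case, not the download from HF.

```python
import json
from pathlib import Path

# Hypothetical names for the bulky fields that move to eval.log.json.
HEAVY_KEYS = ("raw_log", "failure_text")


def split_eval(eval_path: Path) -> None:
    """Split eval.json in place into a light eval.json and a heavy eval.log.json."""
    full = json.loads(eval_path.read_text())
    heavy = {key: full.pop(key) for key in HEAVY_KEYS if key in full}
    # Write the heavy part first so an interruption never loses data:
    # until the light file is rewritten, eval.json still holds everything.
    eval_path.with_name("eval.log.json").write_text(json.dumps(heavy))
    eval_path.write_text(json.dumps(full))


def recombine_eval(eval_path: Path) -> dict:
    """Return the original eval content by merging the light and heavy pieces."""
    light = json.loads(eval_path.read_text())
    heavy_path = eval_path.with_name("eval.log.json")
    heavy = json.loads(heavy_path.read_text()) if heavy_path.exists() else {}
    return {**light, **heavy}
```

Because the two key sets are disjoint, merging the dictionaries restores the original content regardless of key order.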

New modules:

  • submission.py (shared scoring/aggregation, eval-split, HF helpers)
  • package.py
  • verify.py
  • cli/submit.py
  • submission.yaml / README.md templates. Scaffold-agnostic: cost/calls stats are out of scope (submitter-provided, derived from trajectories).

meta-cla (bot) added the CLA Signed label Jun 17, 2026
john-b-yang requested a review from klieret June 17, 2026 00:43
john-b-yang force-pushed the add-submit-commands branch from f3f9a68 to 4cd2f25 June 17, 2026 00:53
@john-b-yang

Fixed a lint issue, should be ready for review!

john-b-yang marked this pull request as draft June 17, 2026 00:57
john-b-yang force-pushed the add-submit-commands branch from 4cd2f25 to b1e9e94 June 17, 2026 01:03
john-b-yang commented Jun 17, 2026

Workflow I'm imagining, tl;dr'ed, is:

  1. programbench eval run_name
  2. programbench submit package run_name --upload-to hf/dataset
  3. (User fills out missing metadata)
  4. programbench submit verify run_name
  5. programbench submit push run_name github.com/owner/repo
  6. programbench submit register run_name

Fully described:

  • The user runs the evaluation (1).
  • The user packages the run, which creates the metadata seeded with the eval results, then fills out the remaining info (2, 3).
  • The user runs a sanity check that the reported numbers match the eval results (4).
  • The user pushes the folder to a standalone GitHub repository (5).
  • The user creates a PR at ProgramBench/submissions (link) (6).

Copilot AI left a comment

Pull request overview

Adds a new programbench submit command group implementing the submission lifecycle: packaging evaluated runs into a standardized submission format, verifying submissions (offline and via re-eval), recombining split eval artifacts, and registering submissions into the leaderboard registry via an automated PR flow.

Changes:

  • Introduces shared submission helpers (submission.py) for scoring/aggregation, eval JSON split+recombine, and artifact resolution.
  • Adds submit package, submit verify (tier0/tier1), submit recombine, and submit register CLI commands plus supporting modules.
  • Wires the new submit Typer app into the top-level CLI and adds Jinja templates for submission.yaml and README.md.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/programbench/verify.py | Implements Tier-0/Tier-1 verification logic for packaged submissions. |
| src/programbench/submission.py | Adds shared scoring/aggregation, eval split/recombine, and artifact resolution helpers. |
| src/programbench/register.py | Implements registry PR plan/build/write logic and optional gh-based automation. |
| src/programbench/package.py | Implements in-place packaging of eval runs into leaderboard submissions, with optional HF upload. |
| src/programbench/data/templates/submission.yaml.j2 | Adds the submission manifest template used by package. |
| src/programbench/data/templates/README.md.j2 | Adds a submission README template with reproduction/checklist guidance. |
| src/programbench/cli/submit.py | Adds the submit CLI group and subcommands (package/verify/register/recombine). |
| src/programbench/cli/main.py | Registers the submit CLI group at the top level. |

Comment on lines +75 to +99
def verify_tier1(submission_dir: Path, *, workers: int = 1, filter_spec: str = "") -> VerifyResult:
    from programbench.eval.eval_batch import run_eval_batch

    instances = benchmark_instances()
    sub_root = submission_dir
    submitted = score_run(sub_root, instances)

    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp)
        for iid in submitted:
            (run / iid).mkdir(parents=True)
            resolve_submission_tar(sub_root / iid, run / iid / "submission.tar.gz")
        run_eval_batch(sources=[run], workers=workers, filter_spec=filter_spec, force=True)
        fresh = score_run(run, instances)

    checks = [
        Check(
            iid,
            round(submitted[iid], 4),
            round(fresh.get(iid, float("nan")), 4),
            _close(submitted[iid], fresh.get(iid)),
        )
        for iid in submitted
        if not filter_spec or iid in fresh
    ]
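
The excerpt calls `_close(submitted[iid], fresh.get(iid))`, where the second argument is `None` whenever an instance is missing from the fresh run. The body of `_close` is not shown on this page, so the following is only a sketch of how such a comparison could treat missing and NaN scores as mismatches. The name `scores_close` and the tolerance of 1e-4 (chosen to line up with the `round(..., 4)` calls above) are assumptions.

```python
import math
from typing import Optional


def scores_close(submitted: float, fresh: Optional[float], tol: float = 1e-4) -> bool:
    """Return True only if both scores exist, are real numbers, and agree within tol."""
    if fresh is None:
        return False  # instance was not re-evaluated, so nothing confirms the score
    if math.isnan(submitted) or math.isnan(fresh):
        return False  # NaN never counts as a match
    return math.isclose(submitted, fresh, rel_tol=0.0, abs_tol=tol)
```

Using an absolute tolerance keeps the check meaningful for scores at or near zero, where a relative tolerance would demand exact equality.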
Comment thread src/programbench/verify.py
Comment thread src/programbench/verify.py
    return scores


def write_stat(run_dir: Path, stat: str, by_instance: dict[str, float]) -> None:
Comment thread src/programbench/submission.py
Comment on lines +137 to +140
    elif url_file.exists():
        with urllib.request.urlopen(url_file.read_text().strip()) as r:  # noqa: S310
            heavy = json.loads(r.read())
    else:
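
The excerpt above downloads whatever the `.url` file points at, with the S310 lint silenced. As a hedged sketch of a more defensive variant, the function below restricts the URL scheme, sets a timeout, and checks the payload against a `.sha256` sidecar before parsing. The function name `load_heavy`, its parameters, and the sidecar format (a hex digest, optionally followed by a filename) are assumptions for illustration and are not taken from the PR.

```python
import hashlib
import json
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Optional, Sequence


def load_heavy(
    url_file: Path,
    sha_file: Optional[Path] = None,
    allowed_schemes: Sequence[str] = ("https",),
    timeout: float = 30.0,
) -> dict:
    """Download the JSON referenced by url_file and verify it against sha_file."""
    url = url_file.read_text().strip()
    scheme = urllib.parse.urlparse(url).scheme
    if scheme not in allowed_schemes:
        raise ValueError(f"refusing to fetch URL with scheme {scheme!r}")

    with urllib.request.urlopen(url, timeout=timeout) as response:  # noqa: S310
        payload = response.read()

    if sha_file is not None and sha_file.exists():
        # Assumed sidecar format: "<hex digest>" or "<hex digest>  <filename>".
        expected = sha_file.read_text().split()[0].lower()
        actual = hashlib.sha256(payload).hexdigest()
        if actual != expected:
            raise ValueError(f"sha256 mismatch: expected {expected}, got {actual}")

    return json.loads(payload)
```

Hashing the raw bytes before `json.loads` means a tampered or truncated download is rejected without ever being parsed.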
Comment on lines +189 to +190
Supports the artifact forms in SPEC.md: inline file, ``.url`` (downloaded), or
``submission.ref.yaml`` (git checkout packed). The sha256 sidecar, when present, is
Comment on lines +1 to +2
# Generated by `programbench package` from: {{ run_dir }}
# [auto] fields are recomputed on every `package`; all other fields are preserved.
Comment on lines +164 to +168
    plan = build_plan(submission_dir, registry)
    if source:
        plan.source = source
    if commit:
        plan.commit = commit
Comment thread src/programbench/cli/submit.py