Add programbench submit (package / verify / recombine) #39
Draft
john-b-yang wants to merge 1 commit into
Fixed a lint issue, should be ready for review!
Workflow I'm imagining, tl;dr'ed, is:
Fully described:
Pull request overview
Adds a new programbench submit command group implementing the submission lifecycle: packaging evaluated runs into a standardized submission format, verifying submissions (offline and via re-eval), recombining split eval artifacts, and registering submissions into the leaderboard registry via an automated PR flow.
Changes:
- Introduces shared submission helpers (`submission.py`) for scoring/aggregation, eval JSON split+recombine, and artifact resolution.
- Adds `submit package`, `submit verify` (tier0/tier1), `submit recombine`, and `submit register` CLI commands plus supporting modules.
- Wires the new `submit` Typer app into the top-level CLI and adds Jinja templates for `submission.yaml` and `README.md`.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/programbench/verify.py | Implements Tier-0/Tier-1 verification logic for packaged submissions. |
| src/programbench/submission.py | Adds shared scoring/aggregation, eval split/recombine, and artifact resolution helpers. |
| src/programbench/register.py | Implements registry PR plan/build/write logic and optional gh-based automation. |
| src/programbench/package.py | Implements in-place packaging of eval runs into leaderboard submissions, with optional HF upload. |
| src/programbench/data/templates/submission.yaml.j2 | Adds the submission manifest template used by package. |
| src/programbench/data/templates/README.md.j2 | Adds a submission README template with reproduction/checklist guidance. |
| src/programbench/cli/submit.py | Adds the submit CLI group and subcommands (package/verify/register/recombine). |
| src/programbench/cli/main.py | Registers the submit CLI group at the top level. |
Comment on lines +75 to +99

    def verify_tier1(submission_dir: Path, *, workers: int = 1, filter_spec: str = "") -> VerifyResult:
        from programbench.eval.eval_batch import run_eval_batch

        instances = benchmark_instances()
        sub_root = submission_dir
        submitted = score_run(sub_root, instances)

        with tempfile.TemporaryDirectory() as tmp:
            run = Path(tmp)
            for iid in submitted:
                (run / iid).mkdir(parents=True)
                resolve_submission_tar(sub_root / iid, run / iid / "submission.tar.gz")
            run_eval_batch(sources=[run], workers=workers, filter_spec=filter_spec, force=True)
            fresh = score_run(run, instances)

        checks = [
            Check(
                iid,
                round(submitted[iid], 4),
                round(fresh.get(iid, float("nan")), 4),
                _close(submitted[iid], fresh.get(iid)),
            )
            for iid in submitted
            if not filter_spec or iid in fresh
        ]
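The `_close` helper used in `verify_tier1` is not shown in this excerpt. A minimal sketch of what such a comparison might look like follows, assuming an absolute tolerance and treating a missing or NaN score as a mismatch. The name `scores_close` and the `1e-4` tolerance are assumptions for illustration, not values taken from the PR.

```python
from __future__ import annotations

import math


def scores_close(submitted: float, fresh: float | None, *, tol: float = 1e-4) -> bool:
    """Hypothetical stand-in for `_close`: True when two scores agree within `tol`."""
    # A missing or NaN score means the instance was not re-evaluated (or failed
    # to score), so it should never count as a match.
    if fresh is None or math.isnan(fresh) or math.isnan(submitted):
        return False
    # Absolute tolerance only: scores are assumed to be fractions in [0, 1].
    return math.isclose(submitted, fresh, rel_tol=0.0, abs_tol=tol)
```

Handling `None` explicitly matters here because the call site passes `fresh.get(iid)`, which returns `None` for an instance that was filtered out or failed to produce a score.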
        return scores


    def write_stat(run_dir: Path, stat: str, by_instance: dict[str, float]) -> None:
Comment on lines +137 to +140

    elif url_file.exists():
        with urllib.request.urlopen(url_file.read_text().strip()) as r:  # noqa: S310
            heavy = json.loads(r.read())
    else:
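The `# noqa: S310` marker suppresses the lint rule that audits `urlopen` calls for permitted URL schemes. One way to address the underlying concern, rather than only silence it, is to validate the scheme and set a timeout before opening the URL. The sketch below assumes that `.url` files should only ever point at `https` locations; `fetch_json`, `ALLOWED_SCHEMES`, and the 30-second timeout are hypothetical choices, not part of the PR.

```python
import json
import urllib.request
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}  # assumption: artifacts are only served over HTTPS


def fetch_json(url: str, *, timeout: float = 30.0):
    """Download and parse JSON, rejecting schemes such as file:// or ftp://."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"refusing to open URL with scheme {scheme!r}")
    # The timeout keeps a stalled download from hanging the command indefinitely.
    with urllib.request.urlopen(url, timeout=timeout) as r:
        return json.loads(r.read())
```

With this in place, a `.url` file containing something like `file:///etc/passwd` raises `ValueError` before any I/O happens.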
Comment on lines +189 to +190

    Supports the artifact forms in SPEC.md: inline file, ``.url`` (downloaded), or
    ``submission.ref.yaml`` (git checkout packed). The sha256 sidecar, when present, is
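A minimal sketch of checking an artifact against a sha256 sidecar follows. It assumes the sidecar stores the hex digest as its first whitespace-separated token (the layout that `sha256sum` emits); the PR's actual sidecar format is not shown in this excerpt, and `sha256_of` and `verify_sha256` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(artifact: Path, sidecar: Path) -> bool:
    """Return True when the artifact's digest matches the one recorded in the sidecar."""
    tokens = sidecar.read_text().split()
    if not tokens:  # an empty sidecar cannot vouch for anything
        return False
    return sha256_of(artifact) == tokens[0].lower()
```

Running the check on the downloaded bytes, before unpacking them, means a corrupted or substituted archive is rejected without being extracted.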
Comment on lines +1 to +2

    # Generated by `programbench package` from: {{ run_dir }}
    # [auto] fields are recomputed on every `package`; all other fields are preserved.
Comment on lines +164 to +168

    plan = build_plan(submission_dir, registry)
    if source:
        plan.source = source
    if commit:
        plan.commit = commit
Adds a `submit` subcommand group for the leaderboard submission lifecycle.

Major changes:
- `programbench submit package <run-dir>`: packages a `programbench eval` run directory into a submission, in place. Writes the following:
  - `submission.yaml` manifest
  - `_stats/score.json` (per-instance, per-test pass/fail)
  - `eval.json`, split into a light `eval.json` + heavy `eval.log.json` (raw log + failure text).
  - The `--upload-to <HF org>` flag automatically uploads the `submission.tar.gz` and `eval.log.json` artifacts to a per-submission HuggingFace dataset (resumable), replacing each with a `.url` + `.sha256`.
- `programbench submit verify <dir>`: reads `eval.json` and checks that it matches the manifest (no Docker, no network); `--tier1` resolves each solution and re-runs `programbench eval` to confirm the artifacts reproduce the score.

Minor changes:
- `programbench submit recombine <dir>`: reassembles the original `eval.json` from the split pieces (downloading the heavy part from HF if needed).

New modules: `submission.py` (shared scoring/aggregation, eval-split, HF helpers), `package.py`, `verify.py`, `cli/submit.py`, plus `submission.yaml` / `README.md` templates.

Scaffold-agnostic: cost/calls stats are out of scope (submitter-provided, derived from trajectories).
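The split/recombine round trip described above can be sketched as follows. This is an illustration only: the per-test field names (`log`, `failure_text`) and the `{instance: {test: record}}` layout are assumptions, not the actual `eval.json` schema, and `split_eval` / `recombine_eval` are hypothetical names.

```python
from __future__ import annotations

import copy

HEAVY_KEYS = ("log", "failure_text")  # assumption: which fields count as "heavy"


def split_eval(full: dict) -> tuple[dict, dict]:
    """Split {instance: {test: record}} into a light dict and a heavy dict."""
    light: dict = {}
    heavy: dict = {}
    for iid, tests in full.items():
        light[iid], heavy[iid] = {}, {}
        for name, record in tests.items():
            light[iid][name] = {k: v for k, v in record.items() if k not in HEAVY_KEYS}
            heavy[iid][name] = {k: v for k, v in record.items() if k in HEAVY_KEYS}
    return light, heavy


def recombine_eval(light: dict, heavy: dict) -> dict:
    """Merge the heavy fields back into a copy of the light dict."""
    full = copy.deepcopy(light)
    for iid, tests in heavy.items():
        for name, record in tests.items():
            full.setdefault(iid, {}).setdefault(name, {}).update(record)
    return full
```

The property worth testing is that `recombine_eval(*split_eval(full)) == full` for any input, since a tier-1 verifier or a reader of the leaderboard needs the recombined file to carry the same information as the original.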