Problem
Submissions share code but not trained weights. Anyone who wants to evaluate, compare, or build on a submission must re-run the full 10-minute 8xH100 training from scratch. This is expensive, wastes finite training credits, disadvantages contributors without H100 access, and blocks a whole category of downstream work.
Why this matters
- Compute: Every time someone wants to compare against a prior submission, they re-train it. Multiply that across contributors and seeds and we're burning significant H100-hours reproducing identical runs. Even with the compute grant, credits are finite. The current setup favors contributors who can afford to re-run freely over those who need to be deliberate with their budget.
- Reusing the research benchmark: This repo is accumulating dozens of small LMs trained on the same data with diverse architectures. With published weights this becomes a standardized collection of tiny LMs useful for interpretability research, architecture comparison, and compression studies. Without weights, every one of those studies has to start by paying the training cost again.
- Downstream tooling: Post-training quantization experiments, interpretability tools, distillation, and model merging are all bottlenecked by "step 0: re-train the model." Shipping weights makes these lines of work trivially accessible.
Implementation
- A shared HuggingFace repo (e.g. openai/parameter-golf-weights) where each record submission uploads its final_model.pt and/or the compressed artifact
- The eval harness could upload automatically after a successful run, or submitters could upload manually as part of the PR checklist
- Reasonable overhead: the compressed artifacts are already under 20 MB, and full-precision checkpoints are ~50-100 MB.
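As a rough sketch of the manual-upload step, the PR checklist could call something like the helper below. The repo id `openai/parameter-golf-weights` is the example from this proposal, and the one-folder-per-submission layout is an assumption, not a decided convention.

```python
# Sketch: push a submission's checkpoint to the shared weights repo.
# Assumes a per-submission folder layout inside the repo.
from pathlib import Path


def artifact_path_in_repo(submission: str, local_file: Path) -> str:
    """Destination path for a submission's artifact inside the shared repo."""
    return f"{submission}/{local_file.name}"


def upload_checkpoint(
    submission: str,
    local_file: Path,
    repo_id: str = "openai/parameter-golf-weights",
) -> str:
    # Imported lazily so the path helper is usable without huggingface_hub.
    from huggingface_hub import HfApi

    dest = artifact_path_in_repo(submission, local_file)
    HfApi().upload_file(
        path_or_fileobj=str(local_file),
        path_in_repo=dest,
        repo_id=repo_id,
        repo_type="model",  # weights live in a model repo
    )
    return dest
```

The same helper could be invoked by the eval harness after a successful run, so automatic and manual uploads land artifacts at identical paths.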
Open questions
- Required or opt-in?
- Where should it be hosted?
- Compressed artifact or full-precision final_model.pt, or both?