Benchmark Workflow Refactor Start #317
Merged
# Conflicts: # tabarena/tabarena/benchmark/experiment/experiment_constructor.py
Innixma
reviewed
May 31, 2026
Innixma
approved these changes
May 31, 2026
Collaborator
Innixma
left a comment
LGTM! Added some comments
Refactor: consolidate the TabArena benchmarking workflow
Motivation
Running a TabArena benchmark used to require stitching together several loosely
coupled pieces, and the definition of what to run was spread across the
codebase rather than living in one place.
preprocessing pipeline, the fold-fitting strategy, and the validation
protocol were filled in later — partly by a separate config post-processing
step at launch time, partly by the runner itself. The config that landed on
disk was therefore not self-describing: you could not just load it and run
it, you had to re-apply the same mutations, in the right order, in the right
place.
generation, path handling, resource budgeting, and the experiment definition
all together, with little reuse and no clear package boundary.
made it hard to produce, inspect, filter, or reuse the "which datasets and
splits do we run" information on its own.
The net effect was a workflow that was hard to reason about, hard to reproduce,
and hard to extend: a small change (a new preprocessing option, a per-model
constraint, a resource tweak) rippled across multiple files and multiple
execution stages.
What this PR does
This PR reorganizes the benchmarking workflow around a few self-contained,
serializable concepts, so the path from "define a benchmark" to "run it" is
linear and explicit.
Experiments are now fully self-describing. Everything needed to run an
experiment — model and hyperparameters, preprocessing pipeline, fold-fitting
strategy, compute resources, validation protocol, and per-model constraints —
is captured on the experiment itself and serialized with it. The workflow
becomes simply: populate → save to disk → load from disk → run, with no
post-load mutation. Where a setting genuinely depends on the runtime node
(e.g. CPU/memory auto-detection) or on the specific task (dynamic validation),
it is resolved lazily at run time inside the experiment, so the on-disk
artifact stays portable and environment-independent.
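The populate → save → load → run flow can be illustrated with a minimal sketch. Everything here is hypothetical: ExperimentSpec, its fields, and resolve_num_cpus are stand-ins invented for illustration, not the classes this PR adds. The sketch only shows the idea that a stored None is resolved lazily and never written back.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ExperimentSpec:
    """Hypothetical stand-in for a self-describing experiment."""

    name: str
    model: str
    hyperparameters: dict = field(default_factory=dict)
    preprocessing: str = "default"
    # None means "auto-detect on the node at run time".
    num_cpus: Optional[int] = None

    def to_json(self) -> str:
        # The serialized form contains every field, including the unresolved None.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ExperimentSpec":
        return cls(**json.loads(payload))

    def resolve_num_cpus(self) -> int:
        # Lazy resolution: the stored value is read but never overwritten,
        # so the serialized artifact stays environment-independent.
        if self.num_cpus is not None:
            return self.num_cpus
        return os.cpu_count() or 1


# populate -> save -> load, with no mutation after loading
spec = ExperimentSpec(
    name="demo",
    model="example_model",
    hyperparameters={"learning_rate": 0.05},
)
payload = spec.to_json()
loaded = ExperimentSpec.from_json(payload)
```

In this sketch the loaded object compares equal to the original, and num_cpus is still None after the round trip. Only the call to resolve_num_cpus() looks at the current machine.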
Bundles tie the pieces together. A single experiment bundle now describes
which models/configurations to run and how to build them, and emits the
ready-to-run, serialized experiments. The same idea applies to tasks: a task
bundle describes which datasets/splits to run and applies any filtering. These
bundles are the one obvious place to look for "what is this benchmark".
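As a rough illustration of the bundle idea, the sketch below uses a hypothetical ExperimentBundle that lists configurations and emits one serialized experiment per configuration. The class name, its fields, and the naming scheme are assumptions made for this example and may not match the PR's actual API.

```python
import json
from dataclasses import dataclass


@dataclass
class ExperimentBundle:
    """Hypothetical bundle: which configurations to run and how to build them."""

    model: str
    configs: list  # one hyperparameter dict per configuration
    preprocessing: str = "default"

    def build(self) -> list:
        """Emit one serialized (JSON) experiment per configuration."""
        experiments = []
        for index, hyperparameters in enumerate(self.configs):
            experiment = {
                "name": f"{self.model}_c{index + 1}",
                "model": self.model,
                "hyperparameters": hyperparameters,
                "preprocessing": self.preprocessing,
            }
            experiments.append(json.dumps(experiment, sort_keys=True))
        return experiments


bundle = ExperimentBundle(
    model="example_model",
    configs=[{"depth": 4}, {"depth": 8}],
)
serialized = bundle.build()
```

Each entry in serialized is a complete description on its own, so a later stage can load and run it without consulting the bundle again.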
Task metadata is a first-class, standalone concept. It has its own schema
and lives independently of the OpenML / user-task plumbing, so it can be
created, inspected, filtered, and reused on its own — including when it comes
from external sources.
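A standalone task-metadata record with filtering might look like the following sketch. The TaskMetadata fields and the filter_tasks helper are hypothetical and chosen only to illustrate the idea. The real schema is not shown in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskMetadata:
    """Hypothetical schema; it carries no OpenML-specific state."""

    dataset: str
    n_samples: int
    n_folds: int
    n_repeats: int
    problem_type: str

    def splits(self) -> list:
        # Every (repeat, fold) pair that would be run for this task.
        return [
            (repeat, fold)
            for repeat in range(self.n_repeats)
            for fold in range(self.n_folds)
        ]


def filter_tasks(tasks, problem_type=None, max_samples=None):
    """Keep tasks matching every criterion that is not None."""
    selected = []
    for task in tasks:
        if problem_type is not None and task.problem_type != problem_type:
            continue
        if max_samples is not None and task.n_samples > max_samples:
            continue
        selected.append(task)
    return selected


tasks = [
    TaskMetadata("toy_a", 1_000, 3, 1, "binary"),
    TaskMetadata("toy_b", 50_000, 3, 2, "regression"),
    TaskMetadata("toy_c", 5_000, 2, 1, "binary"),
]
small_binary = filter_tasks(tasks, problem_type="binary", max_samples=2_000)
```

Because the records are plain data, the same filter works whether they were derived from OpenML or supplied by an external source.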
tabflow_slurm is a proper, installable package built from focused,
composable setup components (resources, scheduler, paths, job candidates, and
the top-level benchmark setup) rather than monolithic scripts. It now consumes
the shared concepts above and is responsible only for orchestration: enumerating
jobs, filtering, batching, and submission.
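The orchestration steps named above (enumerating jobs, filtering, batching) can be sketched in a few lines. The function names are hypothetical, and the filter shown here, skipping jobs that already completed, is only an assumed example of what filtering could mean. Submission is left out because it depends on the cluster.

```python
from itertools import product


def enumerate_jobs(experiments, splits):
    """One job per (experiment, dataset, fold) combination."""
    return [
        (experiment, dataset, fold)
        for experiment, (dataset, fold) in product(experiments, splits)
    ]


def drop_completed(jobs, completed):
    """Assumed filter: skip jobs that are already in the completed set."""
    return [job for job in jobs if job not in completed]


def batch(jobs, batch_size):
    """Group jobs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]


experiments = ["exp_a", "exp_b"]
splits = [("toy_a", 0), ("toy_a", 1), ("toy_b", 0)]

jobs = enumerate_jobs(experiments, splits)
pending = drop_completed(jobs, completed={("exp_a", "toy_a", 0)})
batches = batch(pending, batch_size=2)
```

With two experiments and three splits this enumerates six jobs, leaves five pending, and groups them into batches of sizes 2, 2, and 1.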
Clear separation between core and orchestration. Reusable,
environment-independent logic (experiment and task definitions, resource
detection) lives in the core TabArena package; SLURM-specific orchestration
lives in tabflow_slurm. The two communicate through the serialized artifacts
and a small, explicit interface, instead of sharing mutable state across stages.
Notable behaviour changes
launch-time re-parsing/mutation step has been removed.
None still means "auto-detect on the node at run time", so per-node sizing is preserved.
properties of the experiment/bundle rather than global flags threaded through
the launcher.
Compatibility / migration notes
!old/ area for reference) and superseded by the new package. Run configs that
subclassed the old setup will need to migrate to the new components.
disposable (they are regenerated for each run), so no on-disk backward
compatibility is maintained.
Testing
Unit tests cover the new building blocks: experiment building with
serialize/load round-trips, task-metadata loading and filtering, resource
baking and lazy auto-detection, and the SLURM setup components. The
tabflow_slurm tests skip gracefully when the package is not installed, so the
core test suite runs without it.
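One way to get this kind of graceful skip, shown here with only the standard library, is to check whether the package can be found before the test class runs. This is a sketch of the general technique, not necessarily how the PR's tests do it. The test class and method names are hypothetical. With pytest, pytest.importorskip("tabflow_slurm") achieves a similar effect.

```python
import importlib.util
import unittest

# True only when the optional package can be located on this machine.
HAS_TABFLOW_SLURM = importlib.util.find_spec("tabflow_slurm") is not None


@unittest.skipUnless(HAS_TABFLOW_SLURM, "tabflow_slurm is not installed")
class SlurmSetupTests(unittest.TestCase):
    """Hypothetical test class for the optional package."""

    def test_package_imports(self):
        # Only reached when the package is available.
        import tabflow_slurm  # noqa: F401


# Run the class directly; a missing package is reported as a skip, not a failure.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlurmSetupTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

When the package is absent, the run reports one skipped test and still counts as successful.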