Benchmark Workflow Refactor Start #317

Merged. LennartPurucker merged 42 commits into main from benchmark_workflow_refactor on Jun 1, 2026.
@LennartPurucker LennartPurucker commented May 29, 2026

Refactor: consolidate the TabArena benchmarking workflow

Motivation

Running a TabArena benchmark used to require stitching together several loosely
coupled pieces, and the definition of what to run was spread across the
codebase rather than living in one place.

  • An experiment was only half-defined up front. Compute resources, the
    preprocessing pipeline, the fold-fitting strategy, and the validation
    protocol were filled in later — partly by a separate config post-processing
    step at launch time, partly by the runner itself. The config that landed on
    disk was therefore not self-describing: you could not just load it and run
    it, you had to re-apply the same mutations, in the right order, in the right
    place.
  • SLURM orchestration was a set of monolithic scripts that mixed job
    generation, path handling, resource budgeting, and the experiment definition
    all together, with little reuse and no clear package boundary.
  • Task metadata was entangled with the OpenML / user-task machinery, which
    made it hard to produce, inspect, filter, or reuse the "which datasets and
    splits do we run" information on its own.

The net effect was a workflow that was hard to reason about, hard to reproduce,
and hard to extend: a small change (a new preprocessing option, a per-model
constraint, a resource tweak) rippled across multiple files and multiple
execution stages.

What this PR does

This PR reorganizes the benchmarking workflow around a few self-contained,
serializable concepts, so the path from "define a benchmark" to "run it" is
linear and explicit.

Experiments are now fully self-describing. Everything needed to run an
experiment — model and hyperparameters, preprocessing pipeline, fold-fitting
strategy, compute resources, validation protocol, and per-model constraints —
is captured on the experiment itself and serialized with it. The workflow
becomes simply: populate → save to disk → load from disk → run, with no
post-load mutation. Where a setting genuinely depends on the runtime node
(e.g. CPU/memory auto-detection) or on the specific task (dynamic validation),
it is resolved lazily at run time inside the experiment, so the on-disk
artifact stays portable and environment-independent.
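
The populate → save → load → run flow described above can be sketched as follows. This is a minimal illustration, not the TabArena implementation: `ExperimentSpec`, its fields, and the JSON format are hypothetical names chosen for the example. It assumes that an unset CPU count is stored as `None` and only resolved when the experiment runs.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class ExperimentSpec:
    """Hypothetical stand-in for a self-describing experiment (not the TabArena class)."""

    model: str
    hyperparameters: dict = field(default_factory=dict)
    preprocessing: str = "default"
    # None means "detect on the node at run time"; it is stored as None on disk.
    num_cpus: Optional[int] = None

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "ExperimentSpec":
        return cls(**json.loads(path.read_text()))

    def resolved_cpus(self) -> int:
        # Lazy, node-dependent resolution that never writes back to the spec.
        if self.num_cpus is not None:
            return self.num_cpus
        return os.cpu_count() or 1

    def run(self) -> dict:
        # Placeholder for model fitting; it only reports what would be used.
        return {"model": self.model, "num_cpus": self.resolved_cpus()}


# populate -> save to disk -> load from disk -> run
spec = ExperimentSpec(model="example_model", hyperparameters={"max_depth": 6})
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "experiment.json"
    spec.save(path)
    loaded = ExperimentSpec.load(path)

result = loaded.run()
```

Because `run()` reads the resolved value instead of assigning it, the loaded object still compares equal to the one that was saved, which is the property that keeps the on-disk artifact independent of the node it was created on.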

Bundles tie the pieces together. A single experiment bundle now describes
which models/configurations to run and how to build them, and emits the
ready-to-run, serialized experiments. The same idea applies to tasks: a task
bundle describes which datasets/splits to run and applies any filtering. These
bundles are the one obvious place to look for "what is this benchmark".
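
As a rough sketch of the bundle idea, the example below expands a list of models into one serialized experiment per configuration. `ModelEntry`, `ExperimentBundle`, `build`, and the `max_samples` constraint are hypothetical names for illustration only; the real bundle classes in this PR may look quite different.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical description of one model and how many configs to run."""

    name: str
    n_configs: int
    max_samples: Optional[int] = None  # assumed per-model dataset constraint


@dataclass
class ExperimentBundle:
    """Hypothetical bundle: the single place that says what a benchmark runs."""

    models: list[ModelEntry]
    preprocessing: str = "default"

    def build(self) -> list[str]:
        # Emit one ready-to-run, serialized experiment per (model, config index).
        serialized = []
        for entry in self.models:
            for config_index in range(entry.n_configs):
                experiment = {
                    "model": entry.name,
                    "config_index": config_index,
                    "preprocessing": self.preprocessing,
                    "max_samples": entry.max_samples,
                }
                serialized.append(json.dumps(experiment, sort_keys=True))
        return serialized


bundle = ExperimentBundle(
    models=[
        ModelEntry("model_a", n_configs=2),
        ModelEntry("model_b", n_configs=1, max_samples=10_000),
    ]
)
experiments = bundle.build()
```

Each emitted string carries the constraint along with the model, so whatever consumes it later does not need access to the bundle.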

Task metadata is a first-class, standalone concept. It has its own schema
and lives independently of the OpenML / user-task plumbing, so it can be
created, inspected, filtered, and reused on its own — including when it comes
from external sources.
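
To illustrate what standalone task metadata could look like, here is a small sketch with a fixed schema, a constructor for external records, and a filter. The field names (`dataset`, `problem_type`, `n_samples`, `n_folds`) and the helper `filter_tasks` are assumptions for this example, not the schema introduced by the PR.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TaskMetadata:
    """Hypothetical task-metadata record with no OpenML dependency."""

    dataset: str
    problem_type: str
    n_samples: int
    n_folds: int

    @classmethod
    def from_record(cls, record: dict) -> "TaskMetadata":
        # Coerce types so records from external sources share one schema.
        return cls(
            dataset=str(record["dataset"]),
            problem_type=str(record["problem_type"]),
            n_samples=int(record["n_samples"]),
            n_folds=int(record["n_folds"]),
        )


def filter_tasks(
    tasks: Iterable[TaskMetadata],
    *,
    problem_type: Optional[str] = None,
    max_samples: Optional[int] = None,
) -> list[TaskMetadata]:
    # Keep a task only if it passes every filter that was actually given.
    selected = []
    for task in tasks:
        if problem_type is not None and task.problem_type != problem_type:
            continue
        if max_samples is not None and task.n_samples > max_samples:
            continue
        selected.append(task)
    return selected


records = [
    {"dataset": "dataset_a", "problem_type": "binary", "n_samples": 1_000, "n_folds": 3},
    {"dataset": "dataset_b", "problem_type": "regression", "n_samples": 50_000, "n_folds": 3},
    {"dataset": "dataset_c", "problem_type": "binary", "n_samples": 200_000, "n_folds": 1},
]
tasks = [TaskMetadata.from_record(r) for r in records]
small_binary = filter_tasks(tasks, problem_type="binary", max_samples=100_000)
```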

tabflow_slurm is a proper, installable package built from focused,
composable setup components (resources, scheduler, paths, job candidates, and
the top-level benchmark setup) rather than monolithic scripts. It now consumes
the shared concepts above and is responsible only for orchestration: enumerating
jobs, filtering, batching, and submission.
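
The orchestration steps named above (enumerate, filter, batch, submit) can be sketched with plain functions. All function names and the script name `run_batch.sh` are hypothetical and not taken from tabflow_slurm. The submission step only builds an `sbatch` job-array command and does not execute it.

```python
from itertools import product


def enumerate_jobs(experiments, tasks):
    # One job per (experiment, dataset, fold) combination.
    return [(exp, dataset, fold) for exp, (dataset, fold) in product(experiments, tasks)]


def filter_pending(jobs, completed):
    # Drop jobs whose results already exist.
    return [job for job in jobs if job not in completed]


def make_batches(jobs, batch_size):
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [jobs[i : i + batch_size] for i in range(0, len(jobs), batch_size)]


def sbatch_command(n_batches, script):
    # One SLURM array task per batch; the command is built, never run here.
    if n_batches < 1:
        raise ValueError("nothing to submit")
    return ["sbatch", f"--array=0-{n_batches - 1}", script]


experiments = ["model_a_c0", "model_b_c0"]
tasks = [("dataset_a", 0), ("dataset_a", 1), ("dataset_b", 0)]

jobs = enumerate_jobs(experiments, tasks)
pending = filter_pending(jobs, completed={("model_a_c0", "dataset_a", 0)})
batches = make_batches(pending, batch_size=2)
command = sbatch_command(len(batches), "run_batch.sh")
```

Keeping each step a pure function over lists makes the orchestration testable without a cluster; only the final command would need SLURM.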

Clear separation between core and orchestration. Reusable,
environment-independent logic (experiment and task definitions, resource
detection) lives in the core TabArena package; SLURM-specific orchestration
lives in tabflow_slurm. The two communicate through the serialized artifacts
and a small, explicit interface, instead of sharing mutable state across stages.

Notable behaviour changes

  • Generated experiment configs are now self-contained and ready to run; the
    launch-time re-parsing/mutation step has been removed.
  • Compute resources are recorded with each experiment, but None still means
    "auto-detect on the node at run time", so per-node sizing is preserved.
  • The dynamic validation protocol and per-model dataset constraints are now
    properties of the experiment/bundle rather than global flags threaded through
    the launcher.

Compatibility / migration notes

  • The pre-refactor monolithic SLURM scripts are retired (kept under an !old/
    area for reference) and superseded by the new package. Run configs that
    subclassed the old setup will need to migrate to the new components.
  • Serialized configs/artifacts from before the refactor are treated as
    disposable (they are regenerated for each run), so no on-disk backward
    compatibility is maintained.

Testing

Unit tests cover the new building blocks: experiment building with
serialize/load round-trips, task-metadata loading and filtering, resource
baking (recording resources on the experiment) and lazy auto-detection, and
the SLURM setup components. The tabflow_slurm tests are skipped when the
package is not installed, so the core test suite runs without it.
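
The PR description does not say how the skip is implemented. One way to get this behaviour with only the standard library is sketched below; the test class and method names are hypothetical. With pytest, `pytest.importorskip("tabflow_slurm")` is a common alternative.

```python
import importlib.util
import unittest

# True only when the optional package can be found on the import path.
HAS_TABFLOW_SLURM = importlib.util.find_spec("tabflow_slurm") is not None


@unittest.skipUnless(HAS_TABFLOW_SLURM, "tabflow_slurm is not installed")
class TestSlurmSetup(unittest.TestCase):
    """Hypothetical test case that needs the optional package."""

    def test_package_importable(self):
        import tabflow_slurm  # imported lazily so collection never fails

        self.assertIsNotNone(tabflow_slurm)


# Run the case programmatically; a skip counts as success, not failure.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSlurmSetup)
result = unittest.TestResult()
suite.run(result)
```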

@Innixma Innixma self-requested a review May 29, 2026 17:41
Comment thread tabarena/tabarena/benchmark/experiment/bundle.py
Comment thread tabarena/tabarena/benchmark/experiment/bundle.py
Comment thread tabarena/tabarena/benchmark/task/metadata/bundles/base.py Outdated
Comment thread tabarena/tabarena/benchmark/task/metadata/sources/tabarena_v0pt1.py Outdated
Comment thread tabarena/tabarena/benchmark/task/metadata/schema.py
@Innixma Innixma left a comment

LGTM! Added some comments

@LennartPurucker LennartPurucker merged commit 8951e1e into main Jun 1, 2026
6 checks passed
@LennartPurucker LennartPurucker deleted the benchmark_workflow_refactor branch June 1, 2026 17:25