Benchmark Workflow Refactor Start by LennartPurucker · Pull Request #317 · autogluon/tabarena

LennartPurucker · 2026-05-29T17:04:59Z

Refactor: consolidate the TabArena benchmarking workflow

Motivation

Running a TabArena benchmark used to require stitching together several loosely
coupled pieces, and the definition of what to run was spread across the
codebase rather than living in one place.

An experiment was only half-defined up front. Compute resources, the
preprocessing pipeline, the fold-fitting strategy, and the validation
protocol were filled in later, partly by a separate config post-processing
step at launch time and partly by the runner itself. The config that landed on
disk was therefore not self-describing: you could not just load it and run
it, you had to re-apply the same mutations, in the right order, in the right
place.

SLURM orchestration was a set of monolithic scripts that mixed job
generation, path handling, resource budgeting, and the experiment definition
all together, with little reuse and no clear package boundary.

Task metadata was entangled with the OpenML / user-task machinery, which
made it hard to produce, inspect, filter, or reuse the "which datasets and
splits do we run" information on its own.

The net effect was a workflow that was hard to reason about, hard to reproduce,
and hard to extend: a small change (a new preprocessing option, a per-model
constraint, a resource tweak) rippled across multiple files and multiple
execution stages.

What this PR does

This PR reorganizes the benchmarking workflow around a few self-contained,
serializable concepts, so the path from "define a benchmark" to "run it" is
linear and explicit.

Experiments are now fully self-describing. Everything needed to run an
experiment — model and hyperparameters, preprocessing pipeline, fold-fitting
strategy, compute resources, validation protocol, and per-model constraints —
is captured on the experiment itself and serialized with it. The workflow
becomes simply: populate → save to disk → load from disk → run, with no
post-load mutation. Where a setting genuinely depends on the runtime node
(e.g. CPU/memory auto-detection) or on the specific task (dynamic validation),
it is resolved lazily at run time inside the experiment, so the on-disk
artifact stays portable and environment-independent.
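
The populate → save → load → run flow can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ExperimentSpec`, its fields, and the JSON format are hypothetical stand-ins, not the classes or on-disk format this PR introduces.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ExperimentSpec:
    """Hypothetical self-describing experiment (not the TabArena class)."""

    name: str
    model: str
    hyperparameters: dict = field(default_factory=dict)
    preprocessing: list = field(default_factory=list)
    fold_fitting_strategy: str = "sequential"
    num_cpus: int | None = None  # None is stored as-is and resolved on the node
    validation_protocol: str = "dynamic"  # resolved per task at run time

    def save(self, path: Path) -> None:
        # Everything needed to run is written out; nothing is added later.
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> ExperimentSpec:
        return cls(**json.loads(path.read_text()))


# Populate
spec = ExperimentSpec(
    name="model_a_default",
    model="model_a",
    hyperparameters={"learning_rate": 0.05},
    preprocessing=["impute", "encode_categoricals"],
)

# Save to disk, then load from disk
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "experiment.json"
    spec.save(artifact)
    loaded = ExperimentSpec.load(artifact)

# No post-load mutation is needed: the loaded object equals the original.
assert loaded == spec
```

Keeping `num_cpus` as `None` in the artifact, rather than baking in the value of the machine that generated it, is what would let the same file run on differently sized nodes.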

Bundles tie the pieces together. A single experiment bundle now describes
which models/configurations to run and how to build them, and emits the
ready-to-run, serialized experiments. The same idea applies to tasks: a task
bundle describes which datasets/splits to run and applies any filtering. These
bundles are the one obvious place to look for "what is this benchmark".
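
As a rough sketch of the bundle idea, assuming hypothetical names throughout (`ModelEntry`, `ExperimentBundle`, and `build_experiments` are illustrative and not the API added in this PR), a bundle could expand a short declaration into one ready-to-run experiment record per configuration:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical declaration: one model plus how many random configs to add."""

    model: str
    n_random_configs: int = 0


@dataclass
class ExperimentBundle:
    """Hypothetical bundle: the single place that says what gets run."""

    entries: list
    shared_settings: dict = field(default_factory=dict)

    def build_experiments(self) -> list:
        experiments = []
        for entry in self.entries:
            # One default config, then the requested number of random configs.
            config_ids = ["default"] + [
                f"random_{i}" for i in range(1, entry.n_random_configs + 1)
            ]
            for config_id in config_ids:
                experiments.append(
                    {
                        "name": f"{entry.model}_{config_id}",
                        "model": entry.model,
                        **self.shared_settings,
                    }
                )
        return experiments


bundle = ExperimentBundle(
    entries=[ModelEntry("model_a", n_random_configs=2), ModelEntry("model_b")],
    shared_settings={"validation_protocol": "dynamic"},
)
experiment_names = [exp["name"] for exp in bundle.build_experiments()]
print(experiment_names)
```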

Task metadata is a first-class, standalone concept. It has its own schema
and lives independently of the OpenML / user-task plumbing, so it can be
created, inspected, filtered, and reused on its own — including when it comes
from external sources.
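
A standalone task-metadata record might look roughly like the following. The field names, `filter_tasks`, and `enumerate_splits` are assumptions for illustration, and the toy tasks are placeholders rather than real TabArena datasets; the actual schema in this PR may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskMetadata:
    """Hypothetical schema: enough to decide which datasets and splits to run."""

    dataset_name: str
    problem_type: str
    n_samples: int
    n_features: int
    n_folds: int
    n_repeats: int


def filter_tasks(tasks, *, max_samples=None, problem_types=None):
    """Keep only tasks that satisfy every constraint that was given."""
    selected = []
    for task in tasks:
        if max_samples is not None and task.n_samples > max_samples:
            continue
        if problem_types is not None and task.problem_type not in problem_types:
            continue
        selected.append(task)
    return selected


def enumerate_splits(task):
    """List every (repeat, fold) pair defined for a task."""
    return [(r, f) for r in range(task.n_repeats) for f in range(task.n_folds)]


tasks = [
    TaskMetadata("toy_small", "binary", 1_000, 10, n_folds=3, n_repeats=1),
    TaskMetadata("toy_large", "regression", 500_000, 50, n_folds=3, n_repeats=1),
    TaskMetadata("toy_multi", "multiclass", 5_000, 20, n_folds=3, n_repeats=2),
]

small_tasks = filter_tasks(tasks, max_samples=10_000)
multi_splits = enumerate_splits(tasks[2])
```

Because the record carries no OpenML or user-task object, it can be built from any source that provides these fields.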

tabflow_slurm is a proper, installable package built from focused,
composable setup components (resources, scheduler, paths, job candidates, and
the top-level benchmark setup) rather than monolithic scripts. It now consumes
the shared concepts above and is responsible only for orchestration: enumerating
jobs, filtering, batching, and submission.
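
The orchestration steps named above (enumerating, filtering, batching) can be sketched in a few lines. The function names and the job tuple layout are assumptions for illustration and are not taken from tabflow_slurm.

```python
from itertools import product


def enumerate_jobs(experiment_names, task_splits):
    """Cross every experiment with every (dataset, repeat, fold) split."""
    return [(exp, *split) for exp, split in product(experiment_names, task_splits)]


def drop_finished(jobs, finished):
    """Filter out jobs that are listed as already finished."""
    finished = set(finished)
    return [job for job in jobs if job not in finished]


def make_batches(jobs, batch_size):
    """Group jobs into fixed-size batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [jobs[i : i + batch_size] for i in range(0, len(jobs), batch_size)]


experiments = ["model_a_default", "model_b_default"]
splits = [("toy_dataset", 0, fold) for fold in range(3)]

jobs = enumerate_jobs(experiments, splits)
pending = drop_finished(jobs, finished=[("model_a_default", "toy_dataset", 0, 0)])
batches = make_batches(pending, batch_size=2)
```

Each batch would then correspond to one unit of submission, for example one task of a SLURM job array; how the package actually maps batches to submissions is not described here.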

Clear separation between core and orchestration. Reusable,
environment-independent logic (experiment and task definitions, resource
detection) lives in the core TabArena package; SLURM-specific orchestration
lives in tabflow_slurm. The two communicate through the serialized artifacts
and a small, explicit interface, instead of sharing mutable state across stages.

Notable behaviour changes

Generated experiment configs are now self-contained and ready to run; the
launch-time re-parsing/mutation step has been removed.

Compute resources are recorded with each experiment, but None still means
"auto-detect on the node at run time", so per-node sizing is preserved.

The dynamic validation protocol and per-model dataset constraints are now
properties of the experiment/bundle rather than global flags threaded through
the launcher.
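
The "None means auto-detect" rule for compute resources could be resolved along these lines. `resolve_num_cpus` is a hypothetical helper, not the PR's implementation, and it covers only CPUs; memory would follow the same pattern with a different detector.

```python
from __future__ import annotations

import os
from typing import Callable


def resolve_num_cpus(
    requested: int | None,
    detect: Callable[[], int | None] = os.cpu_count,
) -> int:
    """Return the recorded CPU count, or detect it on the current node."""
    if requested is not None:
        # An explicit value recorded with the experiment always wins.
        return requested
    # os.cpu_count() may return None if the count cannot be determined.
    return detect() or 1


explicit = resolve_num_cpus(4)
detected = resolve_num_cpus(None)  # depends on the node the code runs on
```

Passing the detector as an argument keeps the rule testable without depending on the hardware of the test machine.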

Compatibility / migration notes

The pre-refactor monolithic SLURM scripts are retired (kept under an !old/
area for reference) and superseded by the new package. Run configs that
subclassed the old setup will need to migrate to the new components.

Serialized configs/artifacts from before the refactor are treated as
disposable (they are regenerated for each run), so no on-disk backward
compatibility is maintained.

Testing

Unit tests cover the new building blocks: experiment building with
serialize/load round-trips, task-metadata loading and filtering, resource
baking and lazy auto-detection, and the SLURM setup components. The
tabflow_slurm tests skip gracefully when the package is not installed, so the
core test suite runs without it.
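
One way to get this skip behaviour, shown here with the standard library only, is to gate the SLURM tests on whether the package can be found. This is a sketch under assumptions: the test class names are hypothetical and the PR's tests may be organised differently. With pytest, `pytest.importorskip("tabflow_slurm")` is a common one-line alternative.

```python
import importlib.util
import unittest

# find_spec returns None for a top-level package that is not installed.
HAS_TABFLOW_SLURM = importlib.util.find_spec("tabflow_slurm") is not None


@unittest.skipUnless(HAS_TABFLOW_SLURM, "tabflow_slurm is not installed")
class TestSlurmSetup(unittest.TestCase):
    def test_package_imports(self):
        import tabflow_slurm  # noqa: F401


class TestCore(unittest.TestCase):
    def test_core_runs_without_slurm_package(self):
        # Placeholder for a core test that never imports tabflow_slurm.
        self.assertEqual(sum([1, 2]), 3)


def run_suite() -> unittest.TestResult:
    """Run both test cases and return the collected result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(TestSlurmSetup))
    suite.addTests(loader.loadTestsFromTestCase(TestCore))
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
```

When the package is missing, the SLURM test is reported as skipped rather than failed, so the run as a whole still counts as successful.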


Innixma

LGTM! Added some comments

LennartPurucker added 19 commits May 28, 2026 16:41

add: integrate data foundry into tabarena usage

ca8891b

maint: add note to toml

685ce5a

wip setup refactor commit

26285a2

factor out resources

d896433

wip changes

14d014d

wip refactor model constraints

db78208

wip refactor of job candidates

863d190

minor refactor

20be45d

add: refactor path setup simple

17709ba

maint: better docs

16314ee

maint: start making tabflow slurm a package

50ea635

refactor: make tabarena task metadata its own thing

a5db79f

refactor: add bundle to tabarena code

361e287

refactor: name change

3b0f751

refactor experiment constructor

1a8c126

update test for refactor

b2ff6d8

make dynamic val protocol part of experiment

cb1d216

fix: refactor step

dcb1345

Merge branch 'main' into benchmark_workflow_refactor

3fdc701

# Conflicts: # tabarena/tabarena/benchmark/experiment/experiment_constructor.py

Innixma self-requested a review May 29, 2026 17:41

LennartPurucker added 10 commits May 29, 2026 17:57

fix: refactor step

04b63f3

add: example for TabArena-v0.1 and some small fixes

da61dc5

fix: tests work again

34ed733

fix: tests work again

19a2ab9

fix: str check

d95a9d4

maint: improve logging

54c87ab

add: refactor eval

a97506f

add: eval example

99334b4

add: better support for setting the caches

13a136e

add: start adding beyond arena workflow

77712a4

LennartPurucker added 2 commits May 31, 2026 12:27

refactor: better workflow and large parts of beyond arena support

ff6f60c

refactor: change user task readable string and add migration script

a93ec51