Merged
Commits
42 commits
ca8891b
add: integrate data foundry into tabarena usage
LennartPurucker May 28, 2026
685ce5a
maint: add note to toml
LennartPurucker May 28, 2026
26285a2
wip setup refactor commit
LennartPurucker May 28, 2026
d896433
factor out resources
LennartPurucker May 28, 2026
14d014d
wip changes
LennartPurucker May 28, 2026
db78208
wip refactor model constraints
LennartPurucker May 28, 2026
863d190
wip refactor of job candidates
LennartPurucker May 28, 2026
20be45d
minor refactor
LennartPurucker May 28, 2026
17709ba
add: refactor path setup simple
LennartPurucker May 29, 2026
16314ee
maint: better docs
LennartPurucker May 29, 2026
50ea635
maint: start making tabflow slurm a package
LennartPurucker May 29, 2026
a5db79f
refactor: make tabarena task metadata its own thing
LennartPurucker May 29, 2026
361e287
refactor: add bundle to tabarena code
LennartPurucker May 29, 2026
3b0f751
refactor: name change
LennartPurucker May 29, 2026
1a8c126
refactor experiment constructor
LennartPurucker May 29, 2026
b2ff6d8
update test for refactor
LennartPurucker May 29, 2026
cb1d216
make dynamic val protocol part of experiment
LennartPurucker May 29, 2026
dcb1345
fix: refactor step
LennartPurucker May 29, 2026
3fdc701
Merge branch 'main' into benchmark_workflow_refactor
LennartPurucker May 29, 2026
04b63f3
fix: refactor step
LennartPurucker May 29, 2026
da61dc5
add: example for TabArena-v0.1 and some small fixes
LennartPurucker May 29, 2026
34ed733
fix: tests work again
LennartPurucker May 29, 2026
19a2ab9
fix: tests work again
LennartPurucker May 30, 2026
d95a9d4
fix: str check
LennartPurucker May 30, 2026
54c87ab
maint: improve logging
LennartPurucker May 30, 2026
a97506f
add: refactor eval
LennartPurucker May 30, 2026
99334b4
add: eval example
LennartPurucker May 30, 2026
13a136e
add: better support for setting the caches
LennartPurucker May 30, 2026
77712a4
add: start adding beyond arena workflow
LennartPurucker May 30, 2026
ff6f60c
refactor: better workflow and large parts of beyond arena support
LennartPurucker May 31, 2026
a93ec51
refactor: change user task readable string and add migration script
LennartPurucker May 31, 2026
c4f490f
fix: int only reading for task_id_str and circular import fix
LennartPurucker May 31, 2026
c6622e3
add: support for migrating old csv metadata
LennartPurucker May 31, 2026
19f1568
fix: switch to migrated beyond arena results
LennartPurucker May 31, 2026
ff150ec
remake fetching and eval
LennartPurucker Jun 1, 2026
3d01e27
fix: correct eval workflow
LennartPurucker Jun 1, 2026
25b0e36
fix: load per model cache
LennartPurucker Jun 1, 2026
648d090
add: text cache refactor
LennartPurucker Jun 1, 2026
e129c7f
remove old files and improve logging
LennartPurucker Jun 1, 2026
8f69f64
move: migration scripts out of main code base
LennartPurucker Jun 1, 2026
31a4619
add: make docu a bit better and add some for tabflow
LennartPurucker Jun 1, 2026
f06dadf
avoid backward tabarena name, add tests for state
LennartPurucker Jun 1, 2026
8 changes: 4 additions & 4 deletions AGENTS.md
@@ -8,12 +8,12 @@ TabArena is a living benchmark for tabular machine learning. It evaluates ML met

## Repository Layout

This repo is a **uv workspace** (root `pyproject.toml`) with three packages plus standalone tooling:
This repo is a **uv workspace** (root `pyproject.toml`; workspace members `tabarena` + `bencheval`) plus two more installable packages and supporting dirs:

- `tabarena/` — Core package. Repository pattern for benchmark data, model wrappers, simulation, evaluation, plotting. Depends on AutoGluon and `bencheval`.
- `bencheval/` — Standalone lightweight metrics/leaderboard package (ELO, win-rates, ranks, improvability). Computes leaderboards from results DataFrames. No dependency on `tabarena`.
- `tabflow/` — AWS SageMaker workflow orchestration. CLI entry points: `tabflow` (launch jobs) and `tabflow-download` (download results). Depends on `tabarena`.
- `tabflow_slurm/` — Standalone scripts (not a package) for running experiments on SLURM clusters.
- `tabflow_slurm/` — Package (own `pyproject.toml`, not a uv-workspace member) for running experiments on SLURM clusters. See `tabflow_slurm/README.md` and `tabflow_slurm/AGENTS.md`.
- `examples/` — Usage examples for benchmarking, plotting, meta-learning, custom models.
- `tst/` — Tests (note: `tst/`, **not** `tests/`).

@@ -64,7 +64,7 @@ Raw predictions → EvaluationRepository → Simulation/Portfolio → Results Da

- **`EvaluationRepository`** (`tabarena/tabarena/repository/evaluation_repository.py`) — Central class combining config metadata/rankings (`ZeroshotSimulatorContext`), cached val/test predictions (`TabularModelPredictions`), and `GroundTruth`. Supports subsetting by datasets/folds/configs/problem_types and ensemble selection via mixins.
- **`TabularModelPredictions`** (`tabarena/tabarena/predictions/`) — Abstract base for prediction storage. Implementations: `TabularPredictionsInMemory` (dict-based) and `TabularPredictionsMemmap` (disk-based memory-mapped for large benchmarks). Structure: `{dataset: {fold: {val/test: {config: predictions}}}}`.
- **`AbstractExecModel`** (`tabarena/tabarena/benchmark/models/wrapper/abstract_class.py`) — Base for benchmarked models. AutoGluon model wrappers live under `tabarena/tabarena/benchmark/models/ag/<model>/`; matching HPO search-space generators live under `tabarena/tabarena/models/<model>/generate.py`. Registry: `tabarena/tabarena/benchmark/models/model_registry.py`.
- **`AbstractExecModel`** (`tabarena/tabarena/benchmark/models/wrapper/abstract_class.py`) — Base for the benchmark *execution* wrappers. New benchmarked models live in one folder per model at `tabarena/tabarena/models/<model>/` (`model.py` = AutoGluon wrapper subclassing AG's `AbstractModel`/`AbstractTorchModel`, `hpo.py` = search-space generator, `info.py` = `ModelInfo`/`MethodMetadata` registry entry), auto-discovered by `tabarena/tabarena/models/_registry.py::discover_models()` (which `tabarena/tabarena/benchmark/models/model_registry.py` then derives the AG registry from). Use the **`add-model` skill** — there is no `benchmark/models/ag/<model>/` layout for new models.
- **`ExperimentRunner` / `ExperimentBatchRunner`** (`tabarena/tabarena/benchmark/experiment/`) — Execute model fitting across tasks. Configured via YAML (`experiment_constructor.py`).
- **`ZeroshotSimulatorContext`** (`tabarena/tabarena/simulation/`) — Manages config rankings for HPO simulation and portfolio generation.
- **`TabArena`** (`bencheval/bencheval/tabarena.py`) — Leaderboard computation from results DataFrames. Independent of the core `tabarena` package.
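
The nested prediction layout `{dataset: {fold: {val/test: {config: predictions}}}}` described above can be pictured with plain dictionaries. This is a minimal sketch, not the `TabularModelPredictions` implementation: the dataset and config names, the `get_predictions` and `average_ensemble` helpers, and the toy values are all hypothetical and only illustrate the shape of the structure.

```python
from __future__ import annotations

# Hypothetical toy data mirroring {dataset: {fold: {split: {config: predictions}}}}.
predictions: dict[str, dict[int, dict[str, dict[str, list[float]]]]] = {
    "toy_dataset": {
        0: {
            "val": {"config_a": [0.1, 0.9], "config_b": [0.2, 0.8]},
            "test": {"config_a": [0.3, 0.7], "config_b": [0.4, 0.6]},
        },
    },
}


def get_predictions(dataset: str, fold: int, split: str, config: str) -> list[float]:
    """Hypothetical helper: look up one config's predictions for a dataset/fold/split."""
    return predictions[dataset][fold][split][config]


def average_ensemble(dataset: str, fold: int, split: str, configs: list[str]) -> list[float]:
    """Uniformly average several configs' predictions, row by row."""
    per_config = [get_predictions(dataset, fold, split, c) for c in configs]
    return [sum(row) / len(row) for row in zip(*per_config)]


ensemble_test = average_ensemble("toy_dataset", 0, "test", ["config_a", "config_b"])
```

A uniform average stands in here for ensemble selection only to show why keeping both `val` and `test` predictions per config is convenient: weights can be chosen on `val` and applied to `test` without refitting any model.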
@@ -75,7 +75,7 @@ Artifacts download to `~/.cache/tabarena/` by default; override with `TABARENA_C

## Conventions

- **Add a new model**: touches ~7 locations (AG wrapper, search-space generator, two `__init__.py`s, `model_registry.py`, `models/utils.py` import map, `tabarena/pyproject.toml` extras, and a test). Use existing models in `tabarena/tabarena/benchmark/models/ag/` as templates — pick a structural neighbor (foundation/torch model vs. CPU/sklearn model).
- **Add a new model**: create one folder `tabarena/tabarena/models/<model>/` (`model.py`, `hpo.py`, `info.py`, `__init__.py`), then edit `models/__init__.py` (lazy class entry), `models/utils.py` (name→generator map), and `tabarena/pyproject.toml` (a per-model extra), plus a `tst/models/test_<model>.py`. The registry auto-discovers the model from its `info.py` — no manual registry edit. **Use the `add-model` skill**, which encodes this and points to reference implementations (foundation / torch / sklearn).
- **Imports**: `from __future__ import annotations` must be the first import in every `.py` file. Use absolute imports rooted at the package (e.g., `from tabarena.repository import EvaluationRepository`).
- **Optional dependencies**: each model has its own pyproject extra under `tabarena/pyproject.toml`; the `benchmark` extra is the union. Heavy/optional libs must never be imported at module top-level in core paths — import inside the model wrapper.
- **No new top-level docs files** unless the user asks. Edit existing files in place.
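
The optional-dependency convention above (import heavy libraries inside the model wrapper, never at module top level) could look roughly like the following. This is a sketch under stated assumptions: `LazyModelWrapper`, its arguments, and the error message are hypothetical and are not part of the `tabarena` API.

```python
from __future__ import annotations

import importlib
from types import ModuleType


class LazyModelWrapper:
    """Hypothetical wrapper that defers importing its heavy dependency until it is needed."""

    def __init__(self, module_name: str, extra_name: str) -> None:
        # Constructing the wrapper never triggers the optional import.
        self.module_name = module_name
        self.extra_name = extra_name
        self._backend: ModuleType | None = None

    def _load_backend(self) -> ModuleType:
        """Import the optional dependency on first use and cache the module."""
        if self._backend is None:
            try:
                self._backend = importlib.import_module(self.module_name)
            except ImportError as err:
                # Hypothetical hint pointing the user at the per-model extra.
                raise ImportError(
                    f"'{self.module_name}' is required for this model; "
                    f"install the matching extra, e.g. tabarena[{self.extra_name}]"
                ) from err
        return self._backend

    def fit(self) -> str:
        """Stand-in for real fitting: loads the backend and reports which module was used."""
        return self._load_backend().__name__
```

With this shape, importing the package stays cheap and a missing extra only fails when the specific model is actually used.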
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -8,7 +8,7 @@ This file only documents Claude-specific extensions.

### Skills (`.claude/skills/`)

- **`add-model`** — Use whenever the user asks to add/integrate/wrap a new tabular ML model. It encodes the full 7-location change required (AG wrapper, search-space generator, registry entries, `pyproject.toml` extra, test) and points to reference implementations for each model class (foundation, torch, sklearn-like).
- **`add-model`** — Use whenever the user asks to add/integrate/wrap a new tabular ML model. It encodes the full change (a per-model folder with `model.py` wrapper, `hpo.py` search space, and `info.py` registry entry, plus edits to `models/__init__.py`, `models/utils.py`, and the `pyproject.toml` extra, and a test) and points to reference implementations for each model class (foundation, torch, sklearn-like). The model is auto-discovered from its `info.py`.

When the user describes work that matches a skill's trigger criteria, invoke the skill via the Skill tool instead of recreating the steps manually.

2 changes: 1 addition & 1 deletion examples/!experimental/run_tabarena_preprocessing.py
@@ -39,7 +39,7 @@
TabArenaModelAgnosticPreprocessing,
TabArenaModelSpecificPreprocessing,
)
from tabarena.benchmark.task.user_task import GroupLabelTypes
from tabarena.benchmark.task.metadata import GroupLabelTypes

# ---------------------------------------------------------------------------
# 1. Synthetic dataset
25 changes: 25 additions & 0 deletions scripts/generate_beyond_arena_metadata.py
@@ -0,0 +1,25 @@
"""Regenerate the committed BeyondArena reference-metadata CSV (maintainer tool).

Run this once when the ``BeyondArena`` collection contents change. It downloads
and converts every container (large, one-off), then writes a portable CSV to the
package-data location. Commit the resulting file so users can filter datasets
before downloading anything.

python scripts/generate_beyond_arena_metadata.py

Requires the optional ``data-foundry`` dependency (``tabarena[data-foundry]``).
"""

from __future__ import annotations

from tabarena.benchmark.task.data_foundry import (
    generate_reference_metadata,
    get_beyond_arena_collection,
    reference_metadata_package_path,
)

if __name__ == "__main__":
    collection = get_beyond_arena_collection()
    out_path = reference_metadata_package_path(collection.name)
    generate_reference_metadata(collection=collection, out_path=out_path)
    print(f"Wrote reference metadata to {out_path}. Commit it to ship the fast filter path.")
57 changes: 57 additions & 0 deletions scripts/generate_beyond_arena_text_cache.py
@@ -0,0 +1,57 @@
"""Generate the semantic-text embedding caches for the BeyondArena text tasks (maintainer tool).

For every BeyondArena task that carries text, this downloads/converts the task (via the
``BeyondArena`` metadata bundle), computes its semantic embeddings, and writes the per-task cache to
the canonical, encoder-versioned location
(:func:`~tabarena.benchmark.preprocessing.text_cache.text_cache_path`). This is the *producer* side;
end users instead download these caches (see
:func:`~tabarena.benchmark.task.data_foundry.text_cache.download_text_cache`).

Heavy + GPU-bound (loads the sentence-transformer encoder) and requires the optional ``data-foundry``
extra. Run once when the BeyondArena text tasks or the encoder change.

python scripts/generate_beyond_arena_text_cache.py [--ignore-cache] [--dataset-names a b c]

In production these caches are *shipped inside each Data Foundry container* (as
``tabarena_text_cache.parquet``) and imported automatically when the dataset is materialized — see
:mod:`tabarena.benchmark.task.data_foundry.text_cache`. This script is the local (re)generation path:
it writes each cache to the canonical, encoder-versioned location
(:func:`~tabarena.benchmark.preprocessing.text_cache.text_cache_path`), useful for regenerating a
cache to upload into a container, or for purely-local use without re-downloading.
"""

from __future__ import annotations

import argparse


def generate_beyond_arena_text_caches(*, dataset_names: list[str] | None = None, ignore_cache: bool = False) -> int:
    """Generate caches for all (or the named) BeyondArena text tasks; returns the count generated."""
    from tabarena.benchmark.preprocessing.text_cache import generate_text_cache
    from tabarena.benchmark.task.metadata import BeyondArenaMetadataBundle
    from tabarena.benchmark.task.user_task import UserTask

    bundle = BeyondArenaMetadataBundle(dataset_names_to_run=dataset_names)
    task_metadata = bundle.load_task_metadata()  # downloads/converts the (filtered) tasks
    text_tasks = [ttm for ttm in task_metadata if ttm.has_text]
    print(f"Found {len(text_tasks)} BeyondArena text task(s) to cache (of {len(task_metadata)} total).")

    for ttm in text_tasks:
        user_task = UserTask.from_task_id_str(ttm.task_id_str)
        generate_text_cache(user_task, ignore_cache=ignore_cache)
    return len(text_tasks)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--ignore-cache", action="store_true", help="Regenerate even if a cache already exists.")
    parser.add_argument(
        "--dataset-names",
        nargs="+",
        default=None,
        help="Restrict to these dataset names (default: all BeyondArena text tasks).",
    )
    args = parser.parse_args()

    n = generate_beyond_arena_text_caches(dataset_names=args.dataset_names, ignore_cache=args.ignore_cache)
    print(f"Done. Generated/checked {n} text-task cache(s).")
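
As a reading aid for the command-line interface in the script above, the sketch below rebuilds only the two arguments with the standard-library `argparse` module and parses explicit argument lists, so it runs without `tabarena` or the `data-foundry` extra. The names `default_args` and `named_args` are illustrative and do not appear in the script.

```python
import argparse

# Same two options as the script, without help text or the cache generation itself.
parser = argparse.ArgumentParser()
parser.add_argument("--ignore-cache", action="store_true")
parser.add_argument("--dataset-names", nargs="+", default=None)

# No flags: every text task is selected and existing caches are reused.
default_args = parser.parse_args([])

# With flags: restrict to three datasets and force regeneration.
named_args = parser.parse_args(["--ignore-cache", "--dataset-names", "a", "b", "c"])
```

Because `--dataset-names` defaults to `None` rather than an empty list, the function can tell "no filter given" apart from "filter given"; `nargs="+"` additionally rejects the flag when it is passed without any name.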
18 changes: 18 additions & 0 deletions scripts/generate_tabarena_v0pt1_metadata.py
@@ -0,0 +1,18 @@
"""Regenerate the committed TabArena v0.1 reference-metadata CSV (maintainer tool).

Rebuilds the per-task x split metadata from the curated v0.1 metadata and writes it
to the package-data location read by ``TabArenaV0pt1TaskMetadataSource``. Run this
when the curated v0.1 metadata changes, then commit the resulting CSV.

python scripts/generate_tabarena_v0pt1_metadata.py
"""

from __future__ import annotations

from tabarena.benchmark.task.metadata.sources.tabarena_v0pt1 import (
    generate_tabarena_v0_1_reference_metadata,
)

if __name__ == "__main__":
    out_path = generate_tabarena_v0_1_reference_metadata()
    print(f"Wrote reference metadata to {out_path}. Commit it to skip the on-the-fly rebuild.")
8 changes: 8 additions & 0 deletions tabarena/pyproject.toml
@@ -58,6 +58,12 @@ dependencies = [
]

[project.optional-dependencies]

# Data Foundry integration: lets `tabarena.benchmark.task.data_foundry` convert
# curated containers into TabArena UserTasks. Included in the `benchmark` union.
# FIXME: To use `uv`, you need to do `uv pip install --prerelease=allow .` so it recognizes pre-release data-foundry
data-foundry = ["data-foundry>=0.0.3"]

# Core model extras — installed together via the `benchmark` union below.
tabpfn = [
"tabpfn>=8.0.3",
@@ -98,6 +104,7 @@ benchmark = [
"tabarena[realmlp]",
"tabarena[tabdpt]",
"tabarena[tabm]",
"tabarena[data-foundry]",
]

# Extended set: extra models layered on top of `benchmark` when needed.
@@ -144,6 +151,7 @@ tabarena = [
"metrics/_roc_auc_cpp/compile.sh",
"metrics/_roc_auc_cpp/cpp_auc.cpp",
"nips2025_utils/metadata/task_metadata_tabarena51.csv",
"benchmark/task/metadata/sources/data/*.csv",
"benchmark/models/ag/limix/_vendor/config/*.json",
"benchmark/models/ag/limix/_vendor/LICENSE.txt",
"benchmark/models/ag/limix/_vendor/README.md",
10 changes: 10 additions & 0 deletions tabarena/tabarena/benchmark/experiment/__init__.py
@@ -1,5 +1,11 @@
from __future__ import annotations # noqa: I001

from tabarena.benchmark.experiment.bundle import (
ModelConstraints,
TabArenaExperimentBundle,
TabArenaV0pt1ExperimentBundle,
BeyondArenaExperimentBundle,
)
from tabarena.benchmark.experiment.experiment_constructor import (
AGExperiment,
AGModelBagExperiment,
@@ -25,10 +31,14 @@
"AGModelBagExperiment",
"AGModelExperiment",
"AGModelOuterExperiment",
"BeyondArenaExperimentBundle",
"Experiment",
"ExperimentBatchRunner",
"ExperimentRunner",
"ModelConstraints",
"OOFExperimentRunner",
"TabArenaExperimentBundle",
"TabArenaV0pt1ExperimentBundle",
"YamlExperimentSerializer",
"run_experiments",
"run_experiments_new",