autogluon · LennartPurucker · Jun 1, 2026 · May 28, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -8,12 +8,12 @@ TabArena is a living benchmark for tabular machine learning. It evaluates ML met
 
 ## Repository Layout
 
-This repo is a **uv workspace** (root `pyproject.toml`) with three packages plus standalone tooling:
+This repo is a **uv workspace** (root `pyproject.toml`; workspace members `tabarena` and `bencheval`) plus two more installable packages (`tabflow`, `tabflow_slurm`) and supporting directories:
 
 - `tabarena/` — Core package. Repository pattern for benchmark data, model wrappers, simulation, evaluation, plotting. Depends on AutoGluon and `bencheval`.
 - `bencheval/` — Standalone lightweight metrics/leaderboard package (ELO, win-rates, ranks, improvability). Computes leaderboards from results DataFrames. No dependency on `tabarena`.
 - `tabflow/` — AWS SageMaker workflow orchestration. CLI entry points: `tabflow` (launch jobs) and `tabflow-download` (download results). Depends on `tabarena`.
-- `tabflow_slurm/` — Standalone scripts (not a package) for running experiments on SLURM clusters.
+- `tabflow_slurm/` — Package (own `pyproject.toml`, not a uv-workspace member) for running experiments on SLURM clusters. See `tabflow_slurm/README.md` and `tabflow_slurm/AGENTS.md`.
 - `examples/` — Usage examples for benchmarking, plotting, meta-learning, custom models.
 - `tst/` — Tests (note: `tst/`, **not** `tests/`).
 
@@ -64,7 +64,7 @@ Raw predictions → EvaluationRepository → Simulation/Portfolio → Results Da
 
 - **`EvaluationRepository`** (`tabarena/tabarena/repository/evaluation_repository.py`) — Central class combining config metadata/rankings (`ZeroshotSimulatorContext`), cached val/test predictions (`TabularModelPredictions`), and `GroundTruth`. Supports subsetting by datasets/folds/configs/problem_types and ensemble selection via mixins.
 - **`TabularModelPredictions`** (`tabarena/tabarena/predictions/`) — Abstract base for prediction storage. Implementations: `TabularPredictionsInMemory` (dict-based) and `TabularPredictionsMemmap` (disk-based memory-mapped for large benchmarks). Structure: `{dataset: {fold: {val/test: {config: predictions}}}}`.
-- **`AbstractExecModel`** (`tabarena/tabarena/benchmark/models/wrapper/abstract_class.py`) — Base for benchmarked models. AutoGluon model wrappers live under `tabarena/tabarena/benchmark/models/ag/<model>/`; matching HPO search-space generators live under `tabarena/tabarena/models/<model>/generate.py`. Registry: `tabarena/tabarena/benchmark/models/model_registry.py`.
+- **`AbstractExecModel`** (`tabarena/tabarena/benchmark/models/wrapper/abstract_class.py`) — Base for the benchmark *execution* wrappers. New benchmarked models live in one folder per model at `tabarena/tabarena/models/<model>/`: `model.py` holds the AutoGluon wrapper (subclassing AG's `AbstractModel`/`AbstractTorchModel`), `hpo.py` the search-space generator, and `info.py` the `ModelInfo`/`MethodMetadata` registry entry. Models are auto-discovered by `tabarena/tabarena/models/_registry.py::discover_models()`, and `tabarena/tabarena/benchmark/models/model_registry.py` derives the AG registry from that. Use the **`add-model` skill**; there is no `benchmark/models/ag/<model>/` layout for new models.
 - **`ExperimentRunner` / `ExperimentBatchRunner`** (`tabarena/tabarena/benchmark/experiment/`) — Execute model fitting across tasks. Configured via YAML (`experiment_constructor.py`).
 - **`ZeroshotSimulatorContext`** (`tabarena/tabarena/simulation/`) — Manages config rankings for HPO simulation and portfolio generation.
 - **`TabArena`** (`bencheval/bencheval/tabarena.py`) — Leaderboard computation from results DataFrames. Independent of the core `tabarena` package.
@@ -75,7 +75,7 @@ Artifacts download to `~/.cache/tabarena/` by default; override with `TABARENA_C
 
 ## Conventions
 
-- **Add a new model**: touches ~7 locations (AG wrapper, search-space generator, two `__init__.py`s, `model_registry.py`, `models/utils.py` import map, `tabarena/pyproject.toml` extras, and a test). Use existing models in `tabarena/tabarena/benchmark/models/ag/` as templates — pick a structural neighbor (foundation/torch model vs. CPU/sklearn model).
+- **Add a new model**: create one folder `tabarena/tabarena/models/<model>/` (`model.py`, `hpo.py`, `info.py`, `__init__.py`), then edit `models/__init__.py` (lazy class entry), `models/utils.py` (name→generator map), and `tabarena/pyproject.toml` (a per-model extra), plus a `tst/models/test_<model>.py`. The registry auto-discovers the model from its `info.py` — no manual registry edit. **Use the `add-model` skill**, which encodes this and points to reference implementations (foundation / torch / sklearn).
 - **Imports**: `from __future__ import annotations` must be the first import in every `.py` file. Use absolute imports rooted at the package (e.g., `from tabarena.repository import EvaluationRepository`).
 - **Optional dependencies**: each model has its own pyproject extra under `tabarena/pyproject.toml`; the `benchmark` extra is the union. Heavy/optional libs must never be imported at module top-level in core paths — import inside the model wrapper.
 - **No new top-level docs files** unless the user asks. Edit existing files in place.
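
Reviewer note on the auto-discovery convention described in the AGENTS.md changes above: the diff does not show how `discover_models()` is implemented, so the following is only a minimal sketch of the general pattern (walk the per-model folders and read a registry entry from each `info.py`). The function name `discover_models_sketch`, the attribute name `MODEL_INFO`, and the package `demo_models` are hypothetical and are not taken from the repository.

```python
from __future__ import annotations

import importlib
import importlib.util
import pkgutil
import sys
import tempfile
from pathlib import Path


def discover_models_sketch(package_name: str) -> dict[str, object]:
    """Collect a hypothetical `MODEL_INFO` object from each `<package>.<model>.info` module."""
    package = importlib.import_module(package_name)
    found: dict[str, object] = {}
    for module in pkgutil.iter_modules(package.__path__):
        # Only sub-packages count as model folders; skip private helpers such as `_registry`.
        if not module.ispkg or module.name.startswith("_"):
            continue
        info_name = f"{package_name}.{module.name}.info"
        if importlib.util.find_spec(info_name) is None:
            continue  # no info.py in this folder, so nothing to register
        info = importlib.import_module(info_name)
        if hasattr(info, "MODEL_INFO"):
            found[module.name] = info.MODEL_INFO
    return found


if __name__ == "__main__":
    # Build a throwaway package on disk so the sketch runs without the real repository.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "demo_models"
        for name in ("alpha", "beta"):
            (root / name).mkdir(parents=True)
            (root / name / "__init__.py").write_text("")
            (root / name / "info.py").write_text(f"MODEL_INFO = {{'name': {name!r}}}\n")
        (root / "__init__.py").write_text("")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        print(sorted(discover_models_sketch("demo_models")))
```

With a layout like this, adding a model means adding a folder; no central list has to be edited, which matches the "no manual registry edit" wording in the diff.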

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,7 +8,7 @@ This file only documents Claude-specific extensions.
 
 ### Skills (`.claude/skills/`)
 
-- **`add-model`** — Use whenever the user asks to add/integrate/wrap a new tabular ML model. It encodes the full 7-location change required (AG wrapper, search-space generator, registry entries, `pyproject.toml` extra, test) and points to reference implementations for each model class (foundation, torch, sklearn-like).
+- **`add-model`** — Use whenever the user asks to add/integrate/wrap a new tabular ML model. It encodes the full change: a per-model folder (`model.py` wrapper, `hpo.py` search space, `info.py` registry entry) plus edits to `models/__init__.py`, `models/utils.py`, and the `pyproject.toml` extra, and a test — and points to reference implementations for each model class (foundation, torch, sklearn-like). The model is auto-discovered from its `info.py`.
 
 When the user describes work that matches a skill's trigger criteria, invoke the skill via the Skill tool instead of recreating the steps manually.
 

diff --git a/examples/!experimental/run_tabarena_preprocessing.py b/examples/!experimental/run_tabarena_preprocessing.py
@@ -39,7 +39,7 @@
     TabArenaModelAgnosticPreprocessing,
     TabArenaModelSpecificPreprocessing,
 )
-from tabarena.benchmark.task.user_task import GroupLabelTypes
+from tabarena.benchmark.task.metadata import GroupLabelTypes
 
 # ---------------------------------------------------------------------------
 # 1. Synthetic dataset

diff --git a/scripts/generate_beyond_arena_metadata.py b/scripts/generate_beyond_arena_metadata.py
@@ -0,0 +1,25 @@
+"""Regenerate the committed BeyondArena reference-metadata CSV (maintainer tool).
+
+Run this once when the ``BeyondArena`` collection contents change. It downloads
+and converts every container (large, one-off), then writes a portable CSV to the
+package-data location. Commit the resulting file so users can filter datasets
+before downloading anything.
+
+    python scripts/generate_beyond_arena_metadata.py
+
+Requires the optional ``data-foundry`` dependency (``tabarena[data-foundry]``).
+"""
+
+from __future__ import annotations
+
+from tabarena.benchmark.task.data_foundry import (
+    generate_reference_metadata,
+    get_beyond_arena_collection,
+    reference_metadata_package_path,
+)
+
+if __name__ == "__main__":
+    collection = get_beyond_arena_collection()
+    out_path = reference_metadata_package_path(collection.name)
+    generate_reference_metadata(collection=collection, out_path=out_path)
+    print(f"Wrote reference metadata to {out_path}. Commit it to ship the fast filter path.")
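
Reviewer note: the script above needs the optional `data-foundry` extra, and AGENTS.md asks that heavy or optional libraries never be imported at module top level in core paths. A minimal sketch of a deferred import with an install hint follows. The helper `require_optional` is hypothetical (it is not part of `tabarena`), and the importable module name of `data-foundry` is not stated in this diff, so the demo uses a stdlib module instead.

```python
from __future__ import annotations

import importlib
from types import ModuleType


def require_optional(module_name: str, *, extra: str) -> ModuleType:
    """Import `module_name` at call time, or raise an error naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        raise ImportError(
            f"'{module_name}' is not installed; it is provided by the optional extra `tabarena[{extra}]`."
        ) from err


if __name__ == "__main__":
    # A stdlib module stands in for the optional dependency so the demo always runs.
    json_module = require_optional("json", extra="data-foundry")
    print(json_module.dumps({"ok": True}))
```

Calling such a helper inside the function that needs the dependency keeps `import tabarena` cheap for users who never install the extra.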
diff --git a/scripts/generate_beyond_arena_text_cache.py b/scripts/generate_beyond_arena_text_cache.py
@@ -0,0 +1,57 @@
+"""Generate the semantic-text embedding caches for the BeyondArena text tasks (maintainer tool).
+
+For every BeyondArena task that carries text, this downloads/converts the task (via the
+``BeyondArena`` metadata bundle), computes its semantic embeddings, and writes the per-task cache to
+the canonical, encoder-versioned location
+(:func:`~tabarena.benchmark.preprocessing.text_cache.text_cache_path`). This is the *producer* side;
+end users instead download these caches (see
+:func:`~tabarena.benchmark.task.data_foundry.text_cache.download_text_cache`).
+
+Heavy + GPU-bound (loads the sentence-transformer encoder) and requires the optional ``data-foundry``
+extra. Run once when the BeyondArena text tasks or the encoder change.
+
+    python scripts/generate_beyond_arena_text_cache.py [--ignore-cache] [--dataset-names a b c]
+
+In production these caches are *shipped inside each Data Foundry container* (as
+``tabarena_text_cache.parquet``) and imported automatically when the dataset is materialized — see
+:mod:`tabarena.benchmark.task.data_foundry.text_cache`. This script is the local (re)generation path:
+it writes each cache to the canonical, encoder-versioned location
+(:func:`~tabarena.benchmark.preprocessing.text_cache.text_cache_path`), useful for regenerating a
+cache to upload into a container, or for purely-local use without re-downloading.
+"""
+
+from __future__ import annotations
+
+import argparse
+
+
+def generate_beyond_arena_text_caches(*, dataset_names: list[str] | None = None, ignore_cache: bool = False) -> int:
+    """Generate caches for all (or the named) BeyondArena text tasks; returns the count generated."""
+    from tabarena.benchmark.preprocessing.text_cache import generate_text_cache
+    from tabarena.benchmark.task.metadata import BeyondArenaMetadataBundle
+    from tabarena.benchmark.task.user_task import UserTask
+
+    bundle = BeyondArenaMetadataBundle(dataset_names_to_run=dataset_names)
+    task_metadata = bundle.load_task_metadata()  # downloads/converts the (filtered) tasks
+    text_tasks = [ttm for ttm in task_metadata if ttm.has_text]
+    print(f"Found {len(text_tasks)} BeyondArena text task(s) to cache (of {len(task_metadata)} total).")
+
+    for ttm in text_tasks:
+        user_task = UserTask.from_task_id_str(ttm.task_id_str)
+        generate_text_cache(user_task, ignore_cache=ignore_cache)
+    return len(text_tasks)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument("--ignore-cache", action="store_true", help="Regenerate even if a cache already exists.")
+    parser.add_argument(
+        "--dataset-names",
+        nargs="+",
+        default=None,
+        help="Restrict to these dataset names (default: all BeyondArena text tasks).",
+    )
+    args = parser.parse_args()
+
+    n = generate_beyond_arena_text_caches(dataset_names=args.dataset_names, ignore_cache=args.ignore_cache)
+    print(f"Done. Generated/checked {n} text-task cache(s).")
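
Reviewer note: the two CLI flags in the script above can be checked without loading any encoder. The sketch below rebuilds only the argument parser (same flag names and options as in the diff) and parses explicit argument lists. `build_parser` is a hypothetical helper introduced for illustration, and the dataset names `a` and `b` are the placeholders from the script's usage line.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the same two flags as generate_beyond_arena_text_cache.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ignore-cache", action="store_true")
    # nargs="+" collects one or more space-separated names into a list.
    parser.add_argument("--dataset-names", nargs="+", default=None)
    return parser


if __name__ == "__main__":
    parser = build_parser()
    # No flags: all text tasks are selected (dataset_names stays None).
    print(parser.parse_args([]))
    # Restrict to two datasets and force regeneration.
    print(parser.parse_args(["--ignore-cache", "--dataset-names", "a", "b"]))
```

Because `--dataset-names` defaults to `None` rather than an empty list, the script can pass it straight through as "no restriction".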
diff --git a/scripts/generate_tabarena_v0pt1_metadata.py b/scripts/generate_tabarena_v0pt1_metadata.py
@@ -0,0 +1,18 @@
+"""Regenerate the committed TabArena v0.1 reference-metadata CSV (maintainer tool).
+
+Rebuilds the per-task x split metadata from the curated v0.1 metadata and writes it
+to the package-data location read by ``TabArenaV0pt1TaskMetadataSource``. Run this
+when the curated v0.1 metadata changes, then commit the resulting CSV.
+
+    python scripts/generate_tabarena_v0pt1_metadata.py
+"""
+
+from __future__ import annotations
+
+from tabarena.benchmark.task.metadata.sources.tabarena_v0pt1 import (
+    generate_tabarena_v0_1_reference_metadata,
+)
+
+if __name__ == "__main__":
+    out_path = generate_tabarena_v0_1_reference_metadata()
+    print(f"Wrote reference metadata to {out_path}. Commit it to skip the on-the-fly rebuild.")
diff --git a/tabarena/pyproject.toml b/tabarena/pyproject.toml
@@ -58,6 +58,12 @@ dependencies = [
 ]
 
 [project.optional-dependencies]
+
+# Data Foundry integration: lets `tabarena.benchmark.task.data_foundry` convert
+# curated containers into TabArena UserTasks. Included in the `benchmark` union.
+# FIXME: with `uv`, install via `uv pip install --prerelease=allow .` so that the pre-release `data-foundry` is resolved.
+data-foundry = ["data-foundry>=0.0.3"]
+
 # Core model extras — installed together via the `benchmark` union below.
 tabpfn = [
   "tabpfn>=8.0.3",
@@ -98,6 +104,7 @@ benchmark = [
   "tabarena[realmlp]",
   "tabarena[tabdpt]",
   "tabarena[tabm]",
+  "tabarena[data-foundry]",
 ]
 
 # Extended set: extra models layered on top of `benchmark` when needed.
@@ -144,6 +151,7 @@ tabarena = [
   "metrics/_roc_auc_cpp/compile.sh",
   "metrics/_roc_auc_cpp/cpp_auc.cpp",
   "nips2025_utils/metadata/task_metadata_tabarena51.csv",
+  "benchmark/task/metadata/sources/data/*.csv",
   "benchmark/models/ag/limix/_vendor/config/*.json",
   "benchmark/models/ag/limix/_vendor/LICENSE.txt",
   "benchmark/models/ag/limix/_vendor/README.md",

diff --git a/tabarena/tabarena/benchmark/experiment/__init__.py b/tabarena/tabarena/benchmark/experiment/__init__.py
@@ -1,5 +1,11 @@
 from __future__ import annotations  # noqa: I001
 
+from tabarena.benchmark.experiment.bundle import (
+    ModelConstraints,
+    TabArenaExperimentBundle,
+    TabArenaV0pt1ExperimentBundle,
+    BeyondArenaExperimentBundle,
+)
 from tabarena.benchmark.experiment.experiment_constructor import (
     AGExperiment,
     AGModelBagExperiment,
@@ -25,10 +31,14 @@
     "AGModelBagExperiment",
     "AGModelExperiment",
     "AGModelOuterExperiment",
+    "BeyondArenaExperimentBundle",
     "Experiment",
     "ExperimentBatchRunner",
     "ExperimentRunner",
+    "ModelConstraints",
     "OOFExperimentRunner",
+    "TabArenaExperimentBundle",
+    "TabArenaV0pt1ExperimentBundle",
     "YamlExperimentSerializer",
     "run_experiments",
     "run_experiments_new",