
Migrate eval system to cluster-agnostic git-syncable architecture#25

Open
richardzhuang0412 wants to merge 3 commits into penfever/working from rz/eval-migration

Conversation

@richardzhuang0412
Collaborator

Summary

  • Restructure eval infrastructure so git pull on any cluster brings everything up to date — no manual file copying
  • Move cluster-agnostic scripts (listener, sbatch, check_progress) from eval/MBZ/ to eval/
  • Consolidate harbor eval configs into eval/configs/ with no hardcoded jobs_dir (uses --jobs-dir CLI at runtime)
  • Remove all hardcoded Jupiter/MBZ fallback paths from sbatch — fails clearly if not launched via listener with --cluster-config
  • Add eval/clusters/ YAML configs (M2, MBZ, Jupiter) parameterized with ${USER} for multi-user support
  • Add native vLLM data-parallel support (--dp-size flag) with proper port planning for pack-jobs
  • Add comprehensive cluster onboarding guide (eval/docs/CLUSTER_ONBOARDING.md)
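
The port planning mentioned for `--dp-size` with pack-jobs has to guarantee that packed jobs sharing a node never collide on ports, since each vLLM data-parallel rank serves on its own port. A minimal sketch of the idea (the function name `plan_dp_ports` and the per-job stride are hypothetical, not the actual implementation):

```python
def plan_dp_ports(base_port: int, dp_size: int, job_index: int, stride: int = 100) -> list[int]:
    """Assign a non-overlapping block of ports to one packed job.

    Job N gets the block starting at base_port + N*stride, and each
    data-parallel rank r uses offset r within that block, so two packed
    jobs on the same node can never be handed the same port.
    """
    if dp_size > stride:
        raise ValueError("dp_size exceeds the per-job port stride")
    block_start = base_port + job_index * stride
    return [block_start + rank for rank in range(dp_size)]
```

For example, `plan_dp_ports(8000, 4, 0)` yields `[8000, 8001, 8002, 8003]` while job 1 gets the 8100 block, so the ranges stay disjoint.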

New directory structure

```
eval/
  unified_eval_harbor.sbatch       # cluster-agnostic sbatch
  unified_eval_listener.py         # cluster-agnostic listener
  build_vllm_cmd.sh                # shared vLLM builder
  baseline_model_configs.yaml      # shared model configs
  configs/                         # harbor eval configs (no hardcoded paths)
  lists/                           # shared model priority lists
  clusters/                        # per-cluster YAML configs
  docs/                            # onboarding + workflow docs
  legacy/                          # frozen v4 scripts
```

Workflow after merge

```bash
git pull
source ~/secrets.env
python eval/unified_eval_listener.py \
  --cluster-config eval/clusters/<cluster>.yaml \
  --preset v2 --priority-file eval/lists/<models>.txt ...
```

Test plan

  • Dry-run on M2 with eval/clusters/m2.yaml — passes
  • DP=4 eval job completed successfully (DCAgent/a1-nl2bash on v2, 300 trials in 5h52m)
  • Pack-jobs dry-run with DP port planning — correct port assignment
  • Syntax check on listener and all Python files
  • Test on Jupiter after merge (pull + dry-run)
  • Test on MBZ/M1 after merge

🤖 Generated with Claude Code

richardzhuang0412 and others added 2 commits April 5, 2026 20:07
Restructure the eval infrastructure so that `git pull` on any cluster
brings everything up to date. No more manual file copying between clusters.

Key changes:
- Move cluster-agnostic scripts from eval/MBZ/ to eval/ (listener, sbatch,
  check_progress, snapshot_download, precreate_snapshots)
- Consolidate harbor eval configs into eval/configs/ with no hardcoded
  jobs_dir (injected at runtime via `--jobs-dir` CLI flag)
- Remove all hardcoded Jupiter/MBZ fallback paths from sbatch — fail
  clearly with error messages if not launched via listener
- Add eval/clusters/ YAML configs for M2, MBZ/M1, Jupiter with ${USER}
  parameterization for multi-user support
- Add eval/lists/ for shared model priority lists
- Add eval/docs/ for all documentation (onboarding, workflow, tutorials)
- Move legacy v4 scripts to eval/legacy/ (frozen, M2 backward compat)
- Add native vLLM data-parallel support (--dp-size flag) with proper
  port planning for pack-jobs and TORCHDYNAMO handling
- Add ${USER}/${HOME} expansion in cluster config YAML loader
- Parameterize dotenv files (m2.env, mbz.env) with $USER
- Update .gitignore for eval/local/, logs, secrets
- Add comprehensive cluster onboarding guide (eval/docs/CLUSTER_ONBOARDING.md)
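
The `${USER}`/`${HOME}` expansion in the cluster-config YAML loader can be sketched roughly as follows (function names are illustrative; the real loader may differ):

```python
import os
import re

def expand_env_vars(text: str) -> str:
    """Replace ${VAR} tokens with values from the environment.

    Unknown variables are left untouched so a typo in the YAML fails
    loudly downstream rather than silently expanding to an empty string.
    """
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), m.group(0)),
                  text)

def load_cluster_config(path: str) -> dict:
    import yaml  # lazy import: only needed when actually loading a config
    with open(path) as f:
        return yaml.safe_load(expand_env_vars(f.read()))
```

Expanding before `yaml.safe_load` keeps the substitution purely textual, so the same `jobs_dir: /scratch/${USER}/jobs` entry works for every user on every cluster.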

Workflow after merge:
  git pull
  source ~/secrets.env
  python eval/unified_eval_listener.py \
    --cluster-config eval/clusters/<cluster>.yaml \
    --preset v2 ...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a cluster-agnostic evaluation framework, consolidating configurations and scripts to improve maintainability across different HPC environments. Key changes include the introduction of a unified listener, cluster-agnostic sbatch scripts, and improved dataset management. I have provided feedback on path resolution, file handle management, and the removal of hardcoded configurations in the new setup. Additionally, I identified duplicate documentation and legacy script references that should be addressed to ensure consistency.

Comment on lines +22 to +23
```python
REPO_DIR = Path(__file__).resolve().parent.parent.parent
LOGS_DIR = REPO_DIR / "eval" / "MBZ" / "logs"
```

critical

This script has a couple of issues that will prevent it from working correctly in the new cluster-agnostic setup:

  1. The REPO_DIR calculation on line 22 is incorrect. Path(__file__).resolve().parent.parent.parent will point to the parent directory of the repository root. It should be Path(__file__).resolve().parent.parent.
  2. The LOGS_DIR on line 23 is hardcoded to eval/MBZ/logs. This script is intended to be cluster-agnostic, but it will fail to find logs for any other cluster.

I recommend making LOGS_DIR configurable, for example by adding a --logs-dir command-line argument, or by reading the cluster configuration YAML like the main listener script does.

Suggested change

```diff
-REPO_DIR = Path(__file__).resolve().parent.parent.parent
-LOGS_DIR = REPO_DIR / "eval" / "MBZ" / "logs"
+REPO_DIR = Path(__file__).resolve().parent.parent
+LOGS_DIR = REPO_DIR / "eval" / "local" / "logs"  # Or make this configurable
```
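
A configurable `--logs-dir` along the lines the review suggests could look like this (a sketch only; the argument default and script name are assumptions, not the merged implementation):

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Check eval job progress")
    parser.add_argument(
        "--logs-dir", type=Path, default=None,
        help="Directory containing eval logs (defaults to <repo>/eval/local/logs)",
    )
    args = parser.parse_args(argv)
    if args.logs_dir is None:
        # Repo root is two levels up from eval/check_progress.py
        repo_dir = Path(__file__).resolve().parent.parent
        args.logs_dir = repo_dir / "eval" / "local" / "logs"
    return args
```

Resolving the default lazily keeps the CLI override authoritative while still giving a sensible cluster-local fallback.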

```diff
@@ -1,4 +1,4 @@
-jobs_dir: /e/data1/datasets/playground/mmlaion/shared/guha1/eval_jobs
+jobs_dir: /e/data1/datasets/playground/mmlaion/shared/zhuang1_eval_jobs
```

high

This configuration file contains a hardcoded jobs_dir. The pull request summary states a goal to 'Consolidate harbor eval configs into eval/configs/ with no hardcoded jobs_dir'. The new files in eval/configs/ follow this principle. This file contradicts that goal. If this file is for legacy purposes, it might be better to move it to the eval/legacy/ directory to make its status clear.

Comment on lines +1 to +150
```python
import os
import sys
import argparse
from huggingface_hub import snapshot_download

def is_valid_task_dir(path):
    """
    Check if a directory is a valid task directory by verifying it has instruction.md
    and excludes Git-related files.
    """
    # Skip Git-related files and directories
    basename = os.path.basename(path)
    if basename.startswith('.git') or basename in {'.gitignore', '.gitattributes', '.github'}:
        return False

    # Check if it's a directory and has instruction.md
    return (os.path.isdir(path) and
            os.path.isfile(os.path.join(path, 'instruction.md')))

def get_dataset_path(repo_id):
    """
    Get the path to the dataset without downloading.
    Returns the path if it exists, None otherwise.

    Args:
        repo_id (str): The repository ID to look for
    """
    # Construct the path using HF_HUB_CACHE environment variable
    hf_cache = os.getenv('HF_HUB_CACHE', os.path.expanduser('~/.cache/huggingface/hub'))
    dataset_path = os.path.join(hf_cache, f"datasets--{repo_id.replace('/', '--')}")

    # Find the latest snapshot
    snapshots_dir = os.path.join(dataset_path, 'snapshots')
    if os.path.exists(snapshots_dir):
        try:
            snapshots = [d for d in os.listdir(snapshots_dir) if os.path.isdir(os.path.join(snapshots_dir, d))]
        except OSError as e:
            print(f"Could not read snapshots directory {snapshots_dir}: {e}", file=sys.stderr)
            snapshots = []
        if snapshots:
            latest_snapshot = sorted(snapshots)[-1]  # Get the latest snapshot
            snapshot_path = os.path.join(snapshots_dir, latest_snapshot)

            # Verify we have valid task directories
            if os.path.exists(snapshot_path):
                task_dirs = [d for d in os.listdir(snapshot_path)
                             if is_valid_task_dir(os.path.join(snapshot_path, d))]

                if task_dirs:
                    print(f"Found dataset at {snapshot_path}")
                    print(f"Found {len(task_dirs)} valid task directories")
                    return snapshot_path
                else:
                    print("No valid task directories found in snapshot")
                    return None

    print("Dataset not found, downloading...")
    return None

def download_sandboxes_dataset(repo_id, local_dir=None, cache_dir=None):
    """
    Download the dataset using snapshot_download.

    Args:
        repo_id (str): The repository ID (e.g., 'mlfoundations-dev/sandboxes-tasks')
        local_dir (str, optional): Local directory to save the dataset
        cache_dir (str, optional): Cache directory for huggingface hub
    """
    try:
        print(f"Starting download of {repo_id}...")

        # Download the entire dataset repository
        local_path = snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=local_dir,
            cache_dir=cache_dir,
        )

        # Remove .gitattributes file if it exists
        gitattributes_path = os.path.join(local_path, '.gitattributes')
        if os.path.exists(gitattributes_path):
            os.remove(gitattributes_path)
            print("Removed .gitattributes file")

        # Verify we have valid task directories
        task_dirs = [d for d in os.listdir(local_path)
                     if is_valid_task_dir(os.path.join(local_path, d))]

        if task_dirs:
            print(f"DATASET_PATH={local_path}")
            print(f"Found {len(task_dirs)} valid task directories")
            return local_path
        else:
            print("No valid task directories found in downloaded dataset")
            return None

    except Exception as e:
        print(f"Error downloading dataset: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(description='Download or locate a Hugging Face dataset')
    parser.add_argument('repo_id', help='Repository ID (e.g., mlfoundations-dev/clean-sandboxes-tasks)')
    parser.add_argument('--local-dir', help='Local directory to save the dataset')
    parser.add_argument('--cache-dir', help='Cache directory for huggingface hub')

    args = parser.parse_args()

    path = None
    if args.local_dir:
        # When --local-dir is specified, download real files (no symlinks)
        # Check if local_dir already has valid task dirs
        if os.path.isdir(args.local_dir):
            task_dirs = [d for d in os.listdir(args.local_dir)
                         if is_valid_task_dir(os.path.join(args.local_dir, d))]
            if task_dirs:
                print(f"Found existing dataset at {args.local_dir} with {len(task_dirs)} tasks")
                path = args.local_dir
        if not path:
            print("Downloading dataset to local dir (real files, no symlinks)...", file=sys.stderr)
            path = download_sandboxes_dataset(
                repo_id=args.repo_id,
                local_dir=args.local_dir,
                cache_dir=args.cache_dir
            )
    else:
        # First try to get existing cached path
        path = get_dataset_path(args.repo_id)
        if not path:
            # If not found, download it
            print("Dataset not found, downloading...", file=sys.stderr)
            path = download_sandboxes_dataset(
                repo_id=args.repo_id,
                local_dir=args.local_dir,
                cache_dir=args.cache_dir
            )

    if path:
        print(f"DATASET_PATH={path}")
        return 0
    else:
        print("Failed to download dataset", file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
```

high

This new unified snapshot_download.py script is missing the file lock mechanism that was added to eval/jupiter/snapshot_download.py in this same pull request. Without the lock, running multiple jobs that use the same dataset can lead to race conditions during download. Please consider adding the fcntl.flock logic to this script as well to ensure safe concurrent execution.
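
The `fcntl.flock` approach the review refers to could be sketched like this (a minimal illustration, assuming a sidecar `.lock` file next to the dataset directory; the helper name is hypothetical):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def download_lock(lock_path):
    """Hold an exclusive advisory lock for the duration of a download.

    A second process opening the same lock file blocks at flock() until
    the first one releases it, so concurrent Slurm jobs that need the
    same dataset never race each other during the download.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Usage would be along the lines of `with download_lock(dataset_path + ".lock"): snapshot_download(...)`, with a fast re-check of the cache after acquiring the lock so the second process skips the download entirely.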

```bash
  --cpus-per-task=32 \
  --job-name data_dp_test \
  --output eval/MBZ/logs/data_dp_test_%j.out \
  eval/MBZ/unified_eval_harbor_v6.sbatch \
```

high

This test script for data-parallel evaluation is using eval/MBZ/unified_eval_harbor_v6.sbatch, which appears to be a legacy script. The new cluster-agnostic DP script is eval/unified_eval_harbor_dp.sbatch. Please update this test to use the new script to ensure it's testing the correct implementation.

Suggested change

```diff
-  eval/MBZ/unified_eval_harbor_v6.sbatch \
+  eval/unified_eval_harbor_dp.sbatch \
```

```python
if not rf.exists():
    return None, None, None, None, None, None
try:
    d = json.load(open(rf))
```

medium

The file opened here is not guaranteed to be closed, especially if json.load() raises an exception. It's better to use a with statement to ensure the file handle is properly managed.

Suggested change

```diff
-    d = json.load(open(rf))
+    with open(rf) as f:
+        d = json.load(f)
```


### 8. Real run (example)
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
```

medium

This command refers to eval/MBZ/unified_eval_listener_v6.py. Based on the PR description, the new cluster-agnostic listener is located at eval/unified_eval_listener.py. Please update this path to avoid confusion.

Suggested change

```diff
-source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
+source ~/secrets.env && python eval/unified_eval_listener.py \
```

Comment on lines +1 to +111
# Jupiter v6 Listener Migration

## What's already done (in this repo, will arrive via `git pull`)

1. **`eval/clusters/jupiter.yaml`** — updated sbatch paths to shared `eval/MBZ/unified_eval_harbor.sbatch`
2. **`eval/jupiter/dcagent_eval_config.yaml`** — updated `jobs_dir` to `zhuang1_eval_jobs`
3. **`eval/jupiter/dcagent_eval_config_no_override.yaml`** — created (swebench/tb2 variant)
4. **`eval/MBZ/unified_eval_harbor.sbatch`** — cluster-agnostic v6 sbatch (shared across clusters)
5. **`eval/MBZ/unified_eval_harbor_dp.sbatch`** — cluster-agnostic DP sbatch
6. **`eval/MBZ/unified_eval_listener_v6.py`** — shared v6 listener
7. **`eval/baseline_model_configs.yaml`** — shared model configs

## Steps to run on Jupiter

### 1. Pull latest code
```bash
source ~/.bashrc; conda activate otagent
cd /e/scratch/jureap59/zhuang1/OpenThoughts-Agent
GIT_TERMINAL_PROMPT=0 git pull
```

### 2. Pin harbor to known-good commit
```bash
cd /e/scratch/jureap59/feuer1/harbor
git fetch && git checkout 6fdb92e7f5707c2b01214933f1622771784e6f67
# Reinstall in your conda env
pip install -e .
```

### 3. Install hf_transfer
```bash
pip install hf_transfer
```

### 4. Create jobs dir (if it doesn't exist)
```bash
mkdir -p /e/data1/datasets/playground/mmlaion/shared/zhuang1_eval_jobs
mkdir -p eval/jupiter/logs
```

### 5. Pre-download datasets
```bash
source ~/secrets.env
python eval/jupiter/snapshot_download.py DCAgent/dev_set_v2
python eval/jupiter/snapshot_download.py DCAgent2/terminal_bench_2
python eval/jupiter/snapshot_download.py DCAgent2/swebench-verified-random-100-folders
```

### 6. Verify secrets.env has all required keys
```bash
source ~/secrets.env
echo "DAYTONA_API_KEY: ${DAYTONA_API_KEY:0:12}..."
echo "SUPABASE_URL: ${SUPABASE_URL:0:20}..."
echo "HF_TOKEN: ${HF_TOKEN:0:8}..."
```

### 7. Dry-run
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
--cluster-config eval/clusters/jupiter.yaml \
--preset v2 \
--priority-file eval/MBZ/lists/a1_retrained.txt \
--baseline-model-config eval/baseline_model_configs.yaml \
--timeout-multiplier 2.0 \
--tp-size 2 \
--enable-thinking \
--slurm-time 12:00:00 \
--max-jobs-submitted 32 \
--dry-run --once --verbose
```

### 8. Real run (example)
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
--cluster-config eval/clusters/jupiter.yaml \
--preset swebench \
--priority-file eval/MBZ/lists/no_eval_models_latest.txt \
--baseline-model-config eval/baseline_model_configs.yaml \
--timeout-multiplier 2.0 \
--tp-size 2 \
--enable-thinking \
--slurm-time 12:00:00 \
--max-jobs-submitted 32 \
--pack-jobs \
--stagger-delay 1 --chain-batch-size 10 \
--no-disk-resume \
--once
```

## Key differences from M2

| Setting | M2 | Jupiter |
|---------|-----|---------|
| Partition | `main` | `booster` |
| Account | (none) | `reformo` |
| Time limit | 24:00:00 | 12:00:00 |
| GPUs/node | 8 | 4 |
| Arch | x86_64 | aarch64 (GH200) |
| Internet on compute | yes | **no** (proxy required) |
| Conda env | otagent/otagent2 | otagent/otagent2 (different paths) |
| Harbor | local install | feuer1's shared install |
| HF cache | `~/.cache/huggingface/hub` | `/e/data1/datasets/playground/ot/hf_hub` |
| Jobs dir | `$PWD/jobs` | `/e/data1/.../zhuang1_eval_jobs` |
| Pre-download | optional (has internet) | **required** (no internet on compute) |

## Proxy note

Jupiter compute nodes have no internet. The v6 sbatch auto-detects proxy settings from `jupiter.yaml`:
- Uses proxychains for HF downloads on compute
- SSH tunnel via `jpbl-s01-02` login node
- `--pre-download` flag on listener pre-downloads models on login node before submission

medium

This file appears to be an exact duplicate of eval/docs/V6_MIGRATION.md. Having duplicate documentation can lead to maintenance issues where one file is updated but the other is forgotten. Consider removing this file and keeping only the one in eval/docs/.

- Fix check_progress.py: correct REPO_DIR (parent.parent not parent^3),
  update LOGS_DIR to eval/local/logs, add --logs-dir CLI arg, fix
  unclosed file handle with `with` statement
- Fix snapshot_download.py: use Jupiter version with fcntl file locking
  to prevent race conditions during concurrent downloads
- Fix test_dp_eval.sh: update sbatch path from legacy eval/MBZ/ to
  eval/unified_eval_harbor.sbatch
- Fix V6_MIGRATION.md: update all eval/MBZ/ paths to new eval/ locations
- Revert eval/jupiter/dcagent_eval_config.yaml to original (legacy file,
  canonical configs are now in eval/configs/)
- Remove duplicate eval/jupiter/V6_MIGRATION.md (kept in eval/docs/)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>