
Migrate eval system to cluster-agnostic git-syncable architecture#25

Open
richardzhuang0412 wants to merge 3 commits into penfever/working from rz/eval-migration

Conversation

@richardzhuang0412
Collaborator

Summary

  • Restructure eval infrastructure so git pull on any cluster brings everything up to date — no manual file copying
  • Move cluster-agnostic scripts (listener, sbatch, check_progress) from eval/MBZ/ to eval/
  • Consolidate harbor eval configs into eval/configs/ with no hardcoded jobs_dir (uses --jobs-dir CLI at runtime)
  • Remove all hardcoded Jupiter/MBZ fallback paths from sbatch — fails clearly if not launched via listener with --cluster-config
  • Add eval/clusters/ YAML configs (M2, MBZ, Jupiter) parameterized with ${USER} for multi-user support
  • Add native vLLM data-parallel support (--dp-size flag) with proper port planning for pack-jobs
  • Add comprehensive cluster onboarding guide (eval/docs/CLUSTER_ONBOARDING.md)
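
The port planning mentioned for `--dp-size` with pack-jobs has to guarantee that packed jobs sharing a node never collide on ports, since each vLLM data-parallel rank serves on its own port. A minimal sketch of the idea (the function name `plan_dp_ports` and the per-job stride are hypothetical, not the actual implementation):

```python
def plan_dp_ports(base_port: int, dp_size: int, job_index: int, stride: int = 100) -> list[int]:
    """Assign a non-overlapping block of ports to one packed job.

    Job N gets the block starting at base_port + N*stride, and each
    data-parallel rank r uses offset r within that block, so two packed
    jobs on the same node can never be handed the same port.
    """
    if dp_size > stride:
        raise ValueError("dp_size exceeds the per-job port stride")
    block_start = base_port + job_index * stride
    return [block_start + rank for rank in range(dp_size)]
```

For example, `plan_dp_ports(8000, 4, 0)` yields `[8000, 8001, 8002, 8003]` while job 1 gets the 8100 block, so the ranges stay disjoint.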

New directory structure

```
eval/
  unified_eval_harbor.sbatch       # cluster-agnostic sbatch
  unified_eval_listener.py         # cluster-agnostic listener
  build_vllm_cmd.sh                # shared vLLM builder
  baseline_model_configs.yaml      # shared model configs
  configs/                         # harbor eval configs (no hardcoded paths)
  lists/                           # shared model priority lists
  clusters/                        # per-cluster YAML configs
  docs/                            # onboarding + workflow docs
  legacy/                          # frozen v4 scripts
```

Workflow after merge

```bash
git pull
source ~/secrets.env
python eval/unified_eval_listener.py \
  --cluster-config eval/clusters/<cluster>.yaml \
  --preset v2 --priority-file eval/lists/<models>.txt ...
```

Test plan

  • Dry-run on M2 with eval/clusters/m2.yaml — passes
  • DP=4 eval job completed successfully (DCAgent/a1-nl2bash on v2, 300 trials in 5h52m)
  • Pack-jobs dry-run with DP port planning — correct port assignment
  • Syntax check on listener and all Python files
  • Test on Jupiter after merge (pull + dry-run)
  • Test on MBZ/M1 after merge

🤖 Generated with Claude Code

richardzhuang0412 and others added 2 commits April 5, 2026 20:07
Restructure the eval infrastructure so that `git pull` on any cluster
brings everything up to date. No more manual file copying between clusters.

Key changes:
- Move cluster-agnostic scripts from eval/MBZ/ to eval/ (listener, sbatch,
  check_progress, snapshot_download, precreate_snapshots)
- Consolidate harbor eval configs into eval/configs/ with no hardcoded
  jobs_dir (injected at runtime via `--jobs-dir` CLI flag)
- Remove all hardcoded Jupiter/MBZ fallback paths from sbatch — fail
  clearly with error messages if not launched via listener
- Add eval/clusters/ YAML configs for M2, MBZ/M1, Jupiter with ${USER}
  parameterization for multi-user support
- Add eval/lists/ for shared model priority lists
- Add eval/docs/ for all documentation (onboarding, workflow, tutorials)
- Move legacy v4 scripts to eval/legacy/ (frozen, M2 backward compat)
- Add native vLLM data-parallel support (--dp-size flag) with proper
  port planning for pack-jobs and TORCHDYNAMO handling
- Add ${USER}/${HOME} expansion in cluster config YAML loader
- Parameterize dotenv files (m2.env, mbz.env) with $USER
- Update .gitignore for eval/local/, logs, secrets
- Add comprehensive cluster onboarding guide (eval/docs/CLUSTER_ONBOARDING.md)
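
The `${USER}`/`${HOME}` expansion in the cluster-config YAML loader can be sketched roughly as follows (function names are illustrative; the real loader may differ):

```python
import os
import re

def expand_env_vars(text: str) -> str:
    """Replace ${VAR} tokens with values from the environment.

    Unknown variables are left untouched so a typo in the YAML fails
    loudly downstream rather than silently expanding to an empty string.
    """
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), m.group(0)),
                  text)

def load_cluster_config(path: str) -> dict:
    import yaml  # lazy import: only needed when actually loading a config
    with open(path) as f:
        return yaml.safe_load(expand_env_vars(f.read()))
```

Expanding before `yaml.safe_load` keeps the substitution purely textual, so the same `jobs_dir: /scratch/${USER}/jobs` entry works for every user on every cluster.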

Workflow after merge:
  git pull
  source ~/secrets.env
  python eval/unified_eval_listener.py \
    --cluster-config eval/clusters/<cluster>.yaml \
    --preset v2 ...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a cluster-agnostic evaluation framework, consolidating configurations and scripts to improve maintainability across different HPC environments. Key changes include the introduction of a unified listener, cluster-agnostic sbatch scripts, and improved dataset management. I have provided feedback on path resolution, file handle management, and the removal of hardcoded configurations in the new setup. Additionally, I identified duplicate documentation and legacy script references that should be addressed to ensure consistency.

Comment on lines +22 to +23
```python
REPO_DIR = Path(__file__).resolve().parent.parent.parent
LOGS_DIR = REPO_DIR / "eval" / "MBZ" / "logs"
```

critical

This script has a couple of issues that will prevent it from working correctly in the new cluster-agnostic setup:

  1. The REPO_DIR calculation on line 22 is incorrect. Path(__file__).resolve().parent.parent.parent will point to the parent directory of the repository root. It should be Path(__file__).resolve().parent.parent.
  2. The LOGS_DIR on line 23 is hardcoded to eval/MBZ/logs. This script is intended to be cluster-agnostic, but it will fail to find logs for any other cluster.

I recommend making LOGS_DIR configurable, for example by adding a --logs-dir command-line argument, or by reading the cluster configuration YAML like the main listener script does.

Suggested change

```diff
-REPO_DIR = Path(__file__).resolve().parent.parent.parent
-LOGS_DIR = REPO_DIR / "eval" / "MBZ" / "logs"
+REPO_DIR = Path(__file__).resolve().parent.parent
+LOGS_DIR = REPO_DIR / "eval" / "local" / "logs"  # Or make this configurable
```
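
A configurable `--logs-dir` along the lines the review suggests could look like this (a sketch only; the argument default and script name are assumptions, not the merged implementation):

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Check eval job progress")
    parser.add_argument(
        "--logs-dir", type=Path, default=None,
        help="Directory containing eval logs (defaults to <repo>/eval/local/logs)",
    )
    args = parser.parse_args(argv)
    if args.logs_dir is None:
        # Repo root is two levels up from eval/check_progress.py
        repo_dir = Path(__file__).resolve().parent.parent
        args.logs_dir = repo_dir / "eval" / "local" / "logs"
    return args
```

Resolving the default lazily keeps the CLI override authoritative while still giving a sensible cluster-local fallback.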

```diff
@@ -1,4 +1,4 @@
-jobs_dir: /e/data1/datasets/playground/mmlaion/shared/guha1/eval_jobs
+jobs_dir: /e/data1/datasets/playground/mmlaion/shared/zhuang1_eval_jobs
```

high

This configuration file contains a hardcoded jobs_dir. The pull request summary states a goal to 'Consolidate harbor eval configs into eval/configs/ with no hardcoded jobs_dir'. The new files in eval/configs/ follow this principle. This file contradicts that goal. If this file is for legacy purposes, it might be better to move it to the eval/legacy/ directory to make its status clear.

Comment on lines +1 to +150
```python
import os
import sys
import argparse
from huggingface_hub import snapshot_download

def is_valid_task_dir(path):
    """
    Check if a directory is a valid task directory by verifying it has instruction.md
    and excludes Git-related files.
    """
    # Skip Git-related files and directories
    basename = os.path.basename(path)
    if basename.startswith('.git') or basename in {'.gitignore', '.gitattributes', '.github'}:
        return False

    # Check if it's a directory and has instruction.md
    return (os.path.isdir(path) and
            os.path.isfile(os.path.join(path, 'instruction.md')))

def get_dataset_path(repo_id):
    """
    Get the path to the dataset without downloading.
    Returns the path if it exists, None otherwise.

    Args:
        repo_id (str): The repository ID to look for
    """
    # Construct the path using HF_HUB_CACHE environment variable
    hf_cache = os.getenv('HF_HUB_CACHE', os.path.expanduser('~/.cache/huggingface/hub'))
    dataset_path = os.path.join(hf_cache, f"datasets--{repo_id.replace('/', '--')}")

    # Find the latest snapshot
    snapshots_dir = os.path.join(dataset_path, 'snapshots')
    if os.path.exists(snapshots_dir):
        try:
            snapshots = [d for d in os.listdir(snapshots_dir) if os.path.isdir(os.path.join(snapshots_dir, d))]
        except OSError as e:
            print(f"Could not read snapshots directory {snapshots_dir}: {e}", file=sys.stderr)
            snapshots = []
        if snapshots:
            latest_snapshot = sorted(snapshots)[-1]  # Get the latest snapshot
            snapshot_path = os.path.join(snapshots_dir, latest_snapshot)

            # Verify we have valid task directories
            if os.path.exists(snapshot_path):
                task_dirs = [d for d in os.listdir(snapshot_path)
                             if is_valid_task_dir(os.path.join(snapshot_path, d))]

                if task_dirs:
                    print(f"Found dataset at {snapshot_path}")
                    print(f"Found {len(task_dirs)} valid task directories")
                    return snapshot_path
                else:
                    print("No valid task directories found in snapshot")
                    return None

    print("Dataset not found, downloading...")
    return None

def download_sandboxes_dataset(repo_id, local_dir=None, cache_dir=None):
    """
    Download the dataset using snapshot_download.

    Args:
        repo_id (str): The repository ID (e.g., 'mlfoundations-dev/sandboxes-tasks')
        local_dir (str, optional): Local directory to save the dataset
        cache_dir (str, optional): Cache directory for huggingface hub
    """
    try:
        print(f"Starting download of {repo_id}...")

        # Download the entire dataset repository
        local_path = snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=local_dir,
            cache_dir=cache_dir,
        )

        # Remove .gitattributes file if it exists
        gitattributes_path = os.path.join(local_path, '.gitattributes')
        if os.path.exists(gitattributes_path):
            os.remove(gitattributes_path)
            print("Removed .gitattributes file")

        # Verify we have valid task directories
        task_dirs = [d for d in os.listdir(local_path)
                     if is_valid_task_dir(os.path.join(local_path, d))]

        if task_dirs:
            print(f"DATASET_PATH={local_path}")
            print(f"Found {len(task_dirs)} valid task directories")
            return local_path
        else:
            print("No valid task directories found in downloaded dataset")
            return None

    except Exception as e:
        print(f"Error downloading dataset: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(description='Download or locate a Hugging Face dataset')
    parser.add_argument('repo_id', help='Repository ID (e.g., mlfoundations-dev/clean-sandboxes-tasks)')
    parser.add_argument('--local-dir', help='Local directory to save the dataset')
    parser.add_argument('--cache-dir', help='Cache directory for huggingface hub')

    args = parser.parse_args()

    path = None
    if args.local_dir:
        # When --local-dir is specified, download real files (no symlinks)
        # Check if local_dir already has valid task dirs
        if os.path.isdir(args.local_dir):
            task_dirs = [d for d in os.listdir(args.local_dir)
                         if is_valid_task_dir(os.path.join(args.local_dir, d))]
            if task_dirs:
                print(f"Found existing dataset at {args.local_dir} with {len(task_dirs)} tasks")
                path = args.local_dir
        if not path:
            print("Downloading dataset to local dir (real files, no symlinks)...", file=sys.stderr)
            path = download_sandboxes_dataset(
                repo_id=args.repo_id,
                local_dir=args.local_dir,
                cache_dir=args.cache_dir
            )
    else:
        # First try to get existing cached path
        path = get_dataset_path(args.repo_id)
        if not path:
            # If not found, download it
            print("Dataset not found, downloading...", file=sys.stderr)
            path = download_sandboxes_dataset(
                repo_id=args.repo_id,
                local_dir=args.local_dir,
                cache_dir=args.cache_dir
            )

    if path:
        print(f"DATASET_PATH={path}")
        return 0
    else:
        print("Failed to download dataset", file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
```

high

This new unified snapshot_download.py script is missing the file lock mechanism that was added to eval/jupiter/snapshot_download.py in this same pull request. Without the lock, running multiple jobs that use the same dataset can lead to race conditions during download. Please consider adding the fcntl.flock logic to this script as well to ensure safe concurrent execution.
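
The `fcntl.flock` approach the review refers to could be sketched like this (a minimal illustration, assuming a sidecar `.lock` file next to the dataset directory; the helper name is hypothetical):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def download_lock(lock_path):
    """Hold an exclusive advisory lock for the duration of a download.

    A second process opening the same lock file blocks at flock() until
    the first one releases it, so concurrent Slurm jobs that need the
    same dataset never race each other during the download.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Usage would be along the lines of `with download_lock(dataset_path + ".lock"): snapshot_download(...)`, with a fast re-check of the cache after acquiring the lock so the second process skips the download entirely.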

```bash
  --cpus-per-task=32 \
  --job-name data_dp_test \
  --output eval/MBZ/logs/data_dp_test_%j.out \
  eval/MBZ/unified_eval_harbor_v6.sbatch \
```

high

This test script for data-parallel evaluation is using eval/MBZ/unified_eval_harbor_v6.sbatch, which appears to be a legacy script. The new cluster-agnostic DP script is eval/unified_eval_harbor_dp.sbatch. Please update this test to use the new script to ensure it's testing the correct implementation.

Suggested change

```diff
-  eval/MBZ/unified_eval_harbor_v6.sbatch \
+  eval/unified_eval_harbor_dp.sbatch \
```

```python
if not rf.exists():
    return None, None, None, None, None, None
try:
    d = json.load(open(rf))
```

medium

The file opened here is not guaranteed to be closed, especially if json.load() raises an exception. It's better to use a with statement to ensure the file handle is properly managed.

Suggested change

```diff
-    d = json.load(open(rf))
+    with open(rf) as f:
+        d = json.load(f)
```


### 8. Real run (example)
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
```

medium

This command refers to eval/MBZ/unified_eval_listener_v6.py. Based on the PR description, the new cluster-agnostic listener is located at eval/unified_eval_listener.py. Please update this path to avoid confusion.

Suggested change

```diff
-source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
+source ~/secrets.env && python eval/unified_eval_listener.py \
```

Comment on lines +1 to +111
# Jupiter v6 Listener Migration

## What's already done (in this repo, will arrive via `git pull`)

1. **`eval/clusters/jupiter.yaml`** — updated sbatch paths to shared `eval/MBZ/unified_eval_harbor.sbatch`
2. **`eval/jupiter/dcagent_eval_config.yaml`** — updated `jobs_dir` to `zhuang1_eval_jobs`
3. **`eval/jupiter/dcagent_eval_config_no_override.yaml`** — created (swebench/tb2 variant)
4. **`eval/MBZ/unified_eval_harbor.sbatch`** — cluster-agnostic v6 sbatch (shared across clusters)
5. **`eval/MBZ/unified_eval_harbor_dp.sbatch`** — cluster-agnostic DP sbatch
6. **`eval/MBZ/unified_eval_listener_v6.py`** — shared v6 listener
7. **`eval/baseline_model_configs.yaml`** — shared model configs

## Steps to run on Jupiter

### 1. Pull latest code
```bash
source ~/.bashrc; conda activate otagent
cd /e/scratch/jureap59/zhuang1/OpenThoughts-Agent
GIT_TERMINAL_PROMPT=0 git pull
```

### 2. Pin harbor to known-good commit
```bash
cd /e/scratch/jureap59/feuer1/harbor
git fetch && git checkout 6fdb92e7f5707c2b01214933f1622771784e6f67
# Reinstall in your conda env
pip install -e .
```

### 3. Install hf_transfer
```bash
pip install hf_transfer
```

### 4. Create jobs dir (if it doesn't exist)
```bash
mkdir -p /e/data1/datasets/playground/mmlaion/shared/zhuang1_eval_jobs
mkdir -p eval/jupiter/logs
```

### 5. Pre-download datasets
```bash
source ~/secrets.env
python eval/jupiter/snapshot_download.py DCAgent/dev_set_v2
python eval/jupiter/snapshot_download.py DCAgent2/terminal_bench_2
python eval/jupiter/snapshot_download.py DCAgent2/swebench-verified-random-100-folders
```

### 6. Verify secrets.env has all required keys
```bash
source ~/secrets.env
echo "DAYTONA_API_KEY: ${DAYTONA_API_KEY:0:12}..."
echo "SUPABASE_URL: ${SUPABASE_URL:0:20}..."
echo "HF_TOKEN: ${HF_TOKEN:0:8}..."
```

### 7. Dry-run
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
--cluster-config eval/clusters/jupiter.yaml \
--preset v2 \
--priority-file eval/MBZ/lists/a1_retrained.txt \
--baseline-model-config eval/baseline_model_configs.yaml \
--timeout-multiplier 2.0 \
--tp-size 2 \
--enable-thinking \
--slurm-time 12:00:00 \
--max-jobs-submitted 32 \
--dry-run --once --verbose
```

### 8. Real run (example)
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
--cluster-config eval/clusters/jupiter.yaml \
--preset swebench \
--priority-file eval/MBZ/lists/no_eval_models_latest.txt \
--baseline-model-config eval/baseline_model_configs.yaml \
--timeout-multiplier 2.0 \
--tp-size 2 \
--enable-thinking \
--slurm-time 12:00:00 \
--max-jobs-submitted 32 \
--pack-jobs \
--stagger-delay 1 --chain-batch-size 10 \
--no-disk-resume \
--once
```

## Key differences from M2

| Setting | M2 | Jupiter |
|---------|-----|---------|
| Partition | `main` | `booster` |
| Account | (none) | `reformo` |
| Time limit | 24:00:00 | 12:00:00 |
| GPUs/node | 8 | 4 |
| Arch | x86_64 | aarch64 (GH200) |
| Internet on compute | yes | **no** (proxy required) |
| Conda env | otagent/otagent2 | otagent/otagent2 (different paths) |
| Harbor | local install | feuer1's shared install |
| HF cache | `~/.cache/huggingface/hub` | `/e/data1/datasets/playground/ot/hf_hub` |
| Jobs dir | `$PWD/jobs` | `/e/data1/.../zhuang1_eval_jobs` |
| Pre-download | optional (has internet) | **required** (no internet on compute) |

## Proxy note

Jupiter compute nodes have no internet. The v6 sbatch auto-detects proxy settings from `jupiter.yaml`:
- Uses proxychains for HF downloads on compute
- SSH tunnel via `jpbl-s01-02` login node
- `--pre-download` flag on listener pre-downloads models on login node before submission

medium

This file appears to be an exact duplicate of eval/docs/V6_MIGRATION.md. Having duplicate documentation can lead to maintenance issues where one file is updated but the other is forgotten. Consider removing this file and keeping only the one in eval/docs/.

- Fix check_progress.py: correct REPO_DIR (parent.parent not parent^3),
  update LOGS_DIR to eval/local/logs, add --logs-dir CLI arg, fix
  unclosed file handle with `with` statement
- Fix snapshot_download.py: use Jupiter version with fcntl file locking
  to prevent race conditions during concurrent downloads
- Fix test_dp_eval.sh: update sbatch path from legacy eval/MBZ/ to
  eval/unified_eval_harbor.sbatch
- Fix V6_MIGRATION.md: update all eval/MBZ/ paths to new eval/ locations
- Revert eval/jupiter/dcagent_eval_config.yaml to original (legacy file,
  canonical configs are now in eval/configs/)
- Remove duplicate eval/jupiter/V6_MIGRATION.md (kept in eval/docs/)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>