Migrate eval system to cluster-agnostic git-syncable architecture #25
richardzhuang0412 wants to merge 3 commits into penfever/working from
Conversation
Restructure the eval infrastructure so that `git pull` on any cluster
brings everything up to date. No more manual file copying between clusters.
Key changes:
- Move cluster-agnostic scripts from eval/MBZ/ to eval/ (listener, sbatch,
check_progress, snapshot_download, precreate_snapshots)
- Consolidate harbor eval configs into eval/configs/ with no hardcoded
jobs_dir (injected at runtime via `--jobs-dir` CLI flag)
- Remove all hardcoded Jupiter/MBZ fallback paths from sbatch — fail
clearly with error messages if not launched via listener
- Add eval/clusters/ YAML configs for M2, MBZ/M1, Jupiter with ${USER}
parameterization for multi-user support
- Add eval/lists/ for shared model priority lists
- Add eval/docs/ for all documentation (onboarding, workflow, tutorials)
- Move legacy v4 scripts to eval/legacy/ (frozen, M2 backward compat)
- Add native vLLM data-parallel support (--dp-size flag) with proper
port planning for pack-jobs and TORCHDYNAMO handling
- Add ${USER}/${HOME} expansion in cluster config YAML loader
- Parameterize dotenv files (m2.env, mbz.env) with $USER
- Update .gitignore for eval/local/, logs, secrets
- Add comprehensive cluster onboarding guide (eval/docs/CLUSTER_ONBOARDING.md)
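The `${USER}`/`${HOME}` expansion mentioned above could be implemented as a small recursive pass over the parsed cluster config. This is only an illustrative sketch of the idea, not the listener's actual loader; the config keys shown are made up:

```python
import os

def expand_vars(node):
    """Recursively expand ${USER}/${HOME}-style variables in a parsed
    cluster-config structure (illustrative; key names are hypothetical)."""
    if isinstance(node, dict):
        return {k: expand_vars(v) for k, v in node.items()}
    if isinstance(node, list):
        return [expand_vars(v) for v in node]
    if isinstance(node, str):
        # os.path.expandvars leaves unknown ${VARS} untouched
        return os.path.expandvars(node)
    return node

# Example of what a loader might do right after yaml.safe_load():
cfg = {"jobs_dir": "/shared/${USER}_eval_jobs", "gpus_per_node": 8}
print(expand_vars(cfg))
```

Because unknown variables pass through unchanged, a config referencing an unset variable fails loudly at path-creation time rather than silently expanding to an empty string.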
Workflow after merge:
```bash
git pull
source ~/secrets.env
python eval/unified_eval_listener.py \
  --cluster-config eval/clusters/<cluster>.yaml \
  --preset v2 ...
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces a cluster-agnostic evaluation framework, consolidating configurations and scripts to improve maintainability across different HPC environments. Key changes include the introduction of a unified listener, cluster-agnostic sbatch scripts, and improved dataset management. I have provided feedback on path resolution, file handle management, and the removal of hardcoded configurations in the new setup. Additionally, I identified duplicate documentation and legacy script references that should be addressed to ensure consistency.
eval/check_progress.py
```python
REPO_DIR = Path(__file__).resolve().parent.parent.parent
LOGS_DIR = REPO_DIR / "eval" / "MBZ" / "logs"
```
This script has a couple of issues that will prevent it from working correctly in the new cluster-agnostic setup:
- The `REPO_DIR` calculation on line 22 is incorrect. `Path(__file__).resolve().parent.parent.parent` will point to the parent directory of the repository root. It should be `Path(__file__).resolve().parent.parent`.
- The `LOGS_DIR` on line 23 is hardcoded to `eval/MBZ/logs`. This script is intended to be cluster-agnostic, but it will fail to find logs for any other cluster.

I recommend making `LOGS_DIR` configurable, for example by adding a `--logs-dir` command-line argument, or by reading the cluster configuration YAML like the main listener script does.
```diff
-REPO_DIR = Path(__file__).resolve().parent.parent.parent
-LOGS_DIR = REPO_DIR / "eval" / "MBZ" / "logs"
+REPO_DIR = Path(__file__).resolve().parent.parent
+LOGS_DIR = REPO_DIR / "eval" / "local" / "logs"  # Or make this configurable
```
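One way the `--logs-dir` suggestion could look. This is a sketch under assumptions: only the flag name comes from the review, and the surrounding argparse structure (`parse_args`, the help text) is invented for illustration:

```python
import argparse
from pathlib import Path

# check_progress.py lives at eval/check_progress.py, so the repo root
# is two .parent hops up from the resolved file path
REPO_DIR = Path(__file__).resolve().parent.parent

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Check eval job progress")
    parser.add_argument(
        "--logs-dir",
        type=Path,
        default=REPO_DIR / "eval" / "local" / "logs",
        help="Directory containing eval logs (override per cluster)",
    )
    return parser.parse_args(argv)

print(parse_args(["--logs-dir", "/tmp/demo_logs"]).logs_dir)  # → /tmp/demo_logs
```

Each cluster (or user) can then point the script at its own log location without editing the source.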
```diff
@@ -1,4 +1,4 @@
-jobs_dir: /e/data1/datasets/playground/mmlaion/shared/guha1/eval_jobs
+jobs_dir: /e/data1/datasets/playground/mmlaion/shared/zhuang1_eval_jobs
```
This configuration file contains a hardcoded jobs_dir. The pull request summary states a goal to 'Consolidate harbor eval configs into eval/configs/ with no hardcoded jobs_dir'. The new files in eval/configs/ follow this principle. This file contradicts that goal. If this file is for legacy purposes, it might be better to move it to the eval/legacy/ directory to make its status clear.
eval/snapshot_download.py
```python
import os
import sys
import argparse
from huggingface_hub import snapshot_download


def is_valid_task_dir(path):
    """
    Check if a directory is a valid task directory by verifying it has instruction.md
    and excludes Git-related files.
    """
    # Skip Git-related files and directories
    basename = os.path.basename(path)
    if basename.startswith('.git') or basename in {'.gitignore', '.gitattributes', '.github'}:
        return False

    # Check if it's a directory and has instruction.md
    return (os.path.isdir(path) and
            os.path.isfile(os.path.join(path, 'instruction.md')))


def get_dataset_path(repo_id):
    """
    Get the path to the dataset without downloading
    Returns the path if it exists, None otherwise

    Args:
        repo_id (str): The repository ID to look for
    """
    # Construct the path using HF_HUB_CACHE environment variable
    hf_cache = os.getenv('HF_HUB_CACHE', os.path.expanduser('~/.cache/huggingface/hub'))
    dataset_path = os.path.join(hf_cache, f"datasets--{repo_id.replace('/', '--')}")

    # Find the latest snapshot
    snapshots_dir = os.path.join(dataset_path, 'snapshots')
    if os.path.exists(snapshots_dir):
        try:
            snapshots = [d for d in os.listdir(snapshots_dir) if os.path.isdir(os.path.join(snapshots_dir, d))]
        except OSError as e:
            print(f"Could not read snapshots directory {snapshots_dir}: {e}", file=sys.stderr)
            snapshots = []
        if snapshots:
            latest_snapshot = sorted(snapshots)[-1]  # Get the latest snapshot
            snapshot_path = os.path.join(snapshots_dir, latest_snapshot)

            # Verify we have valid task directories
            if os.path.exists(snapshot_path):
                task_dirs = [d for d in os.listdir(snapshot_path)
                             if is_valid_task_dir(os.path.join(snapshot_path, d))]

                if task_dirs:
                    print(f"Found dataset at {snapshot_path}")
                    print(f"Found {len(task_dirs)} valid task directories")
                    return snapshot_path
                else:
                    print("No valid task directories found in snapshot")
                    return None

    print("Dataset not found, downloading...")
    return None


def download_sandboxes_dataset(repo_id, local_dir=None, cache_dir=None):
    """
    Download the dataset using snapshot_download

    Args:
        repo_id (str): The repository ID (e.g., 'mlfoundations-dev/sandboxes-tasks')
        local_dir (str, optional): Local directory to save the dataset
        cache_dir (str, optional): Cache directory for huggingface hub
    """
    try:
        print(f"Starting download of {repo_id}...")

        # Download the entire dataset repository
        local_path = snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=local_dir,
            cache_dir=cache_dir,
        )

        # Remove .gitattributes file if it exists
        gitattributes_path = os.path.join(local_path, '.gitattributes')
        if os.path.exists(gitattributes_path):
            os.remove(gitattributes_path)
            print("Removed .gitattributes file")

        # Verify we have valid task directories
        task_dirs = [d for d in os.listdir(local_path)
                     if is_valid_task_dir(os.path.join(local_path, d))]

        if task_dirs:
            print(f"DATASET_PATH={local_path}")
            print(f"Found {len(task_dirs)} valid task directories")
            return local_path
        else:
            print("No valid task directories found in downloaded dataset")
            return None

    except Exception as e:
        print(f"Error downloading dataset: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(description='Download or locate a Hugging Face dataset')
    parser.add_argument('repo_id', help='Repository ID (e.g., mlfoundations-dev/clean-sandboxes-tasks)')
    parser.add_argument('--local-dir', help='Local directory to save the dataset')
    parser.add_argument('--cache-dir', help='Cache directory for huggingface hub')

    args = parser.parse_args()

    path = None
    if args.local_dir:
        # When --local-dir is specified, download real files (no symlinks)
        # Check if local_dir already has valid task dirs
        if os.path.isdir(args.local_dir):
            task_dirs = [d for d in os.listdir(args.local_dir)
                         if is_valid_task_dir(os.path.join(args.local_dir, d))]
            if task_dirs:
                print(f"Found existing dataset at {args.local_dir} with {len(task_dirs)} tasks")
                path = args.local_dir
        if not path:
            print("Downloading dataset to local dir (real files, no symlinks)...", file=sys.stderr)
            path = download_sandboxes_dataset(
                repo_id=args.repo_id,
                local_dir=args.local_dir,
                cache_dir=args.cache_dir
            )
    else:
        # First try to get existing cached path
        path = get_dataset_path(args.repo_id)
        if not path:
            # If not found, download it
            print("Dataset not found, downloading...", file=sys.stderr)
            path = download_sandboxes_dataset(
                repo_id=args.repo_id,
                local_dir=args.local_dir,
                cache_dir=args.cache_dir
            )

    if path:
        print(f"DATASET_PATH={path}")
        return 0
    else:
        print("Failed to download dataset", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())
```
This new unified snapshot_download.py script is missing the file lock mechanism that was added to eval/jupiter/snapshot_download.py in this same pull request. Without the lock, running multiple jobs that use the same dataset can lead to race conditions during download. Please consider adding the fcntl.flock logic to this script as well to ensure safe concurrent execution.
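The locking approach the review refers to could be wrapped around the download call roughly like this. A minimal sketch: the helper name `locked_download` and the lock-file path are illustrative, not the actual code in `eval/jupiter/snapshot_download.py`:

```python
import fcntl
import os

def locked_download(lock_path, download_fn):
    """Serialize dataset downloads across concurrent Slurm jobs using an
    advisory fcntl file lock (sketch; names here are hypothetical)."""
    os.makedirs(os.path.dirname(lock_path) or ".", exist_ok=True)
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return download_fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Usage sketch: serialize the huggingface_hub download behind one lock file
# path = locked_download("/tmp/dev_set_v2.lock",
#                        lambda: snapshot_download(repo_id="DCAgent/dev_set_v2",
#                                                  repo_type="dataset"))
```

Because `flock` locks are advisory and per-file, every job must use the same lock path for a given dataset; the first job downloads while the rest block and then find the completed snapshot in the cache.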
eval/test_dp_eval.sh
```bash
  --cpus-per-task=32 \
  --job-name data_dp_test \
  --output eval/MBZ/logs/data_dp_test_%j.out \
  eval/MBZ/unified_eval_harbor_v6.sbatch \
```
This test script for data-parallel evaluation is using eval/MBZ/unified_eval_harbor_v6.sbatch, which appears to be a legacy script. The new cluster-agnostic DP script is eval/unified_eval_harbor_dp.sbatch. Please update this test to use the new script to ensure it's testing the correct implementation.
```diff
-eval/MBZ/unified_eval_harbor_v6.sbatch \
+eval/unified_eval_harbor_dp.sbatch \
```
eval/check_progress.py
```python
if not rf.exists():
    return None, None, None, None, None, None
try:
    d = json.load(open(rf))
```
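The review body for this hunk appears to have been lost during extraction; judging from the merge summary ("fix unclosed file handle with `with` statement"), the concern is that `json.load(open(rf))` never closes the file handle. A minimal sketch of the fix, with the surrounding function simplified for illustration (the real code returns a 6-tuple):

```python
import json
from pathlib import Path

def load_results(rf: Path):
    """Load a results JSON without leaking a file handle (sketch;
    `rf` follows the snippet above, return shape is simplified)."""
    if not rf.exists():
        return None
    with open(rf) as f:  # handle is closed even if json.load raises
        return json.load(f)
```

Relying on `open(rf)` without `with` leaves closure to garbage collection, which on a long-running listener can accumulate open descriptors.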
eval/docs/V6_MIGRATION.md
### 8. Real run (example)
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
```
This command refers to eval/MBZ/unified_eval_listener_v6.py. Based on the PR description, the new cluster-agnostic listener is located at eval/unified_eval_listener.py. Please update this path to avoid confusion.
```diff
-source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
+source ~/secrets.env && python eval/unified_eval_listener.py \
```
eval/jupiter/V6_MIGRATION.md
# Jupiter v6 Listener Migration

## What's already done (in this repo, will arrive via `git pull`)

1. **`eval/clusters/jupiter.yaml`** — updated sbatch paths to shared `eval/MBZ/unified_eval_harbor.sbatch`
2. **`eval/jupiter/dcagent_eval_config.yaml`** — updated `jobs_dir` to `zhuang1_eval_jobs`
3. **`eval/jupiter/dcagent_eval_config_no_override.yaml`** — created (swebench/tb2 variant)
4. **`eval/MBZ/unified_eval_harbor.sbatch`** — cluster-agnostic v6 sbatch (shared across clusters)
5. **`eval/MBZ/unified_eval_harbor_dp.sbatch`** — cluster-agnostic DP sbatch
6. **`eval/MBZ/unified_eval_listener_v6.py`** — shared v6 listener
7. **`eval/baseline_model_configs.yaml`** — shared model configs

## Steps to run on Jupiter

### 1. Pull latest code
```bash
source ~/.bashrc; conda activate otagent
cd /e/scratch/jureap59/zhuang1/OpenThoughts-Agent
GIT_TERMINAL_PROMPT=0 git pull
```

### 2. Pin harbor to known-good commit
```bash
cd /e/scratch/jureap59/feuer1/harbor
git fetch && git checkout 6fdb92e7f5707c2b01214933f1622771784e6f67
# Reinstall in your conda env
pip install -e .
```

### 3. Install hf_transfer
```bash
pip install hf_transfer
```

### 4. Create jobs dir (if it doesn't exist)
```bash
mkdir -p /e/data1/datasets/playground/mmlaion/shared/zhuang1_eval_jobs
mkdir -p eval/jupiter/logs
```

### 5. Pre-download datasets
```bash
source ~/secrets.env
python eval/jupiter/snapshot_download.py DCAgent/dev_set_v2
python eval/jupiter/snapshot_download.py DCAgent2/terminal_bench_2
python eval/jupiter/snapshot_download.py DCAgent2/swebench-verified-random-100-folders
```

### 6. Verify secrets.env has all required keys
```bash
source ~/secrets.env
echo "DAYTONA_API_KEY: ${DAYTONA_API_KEY:0:12}..."
echo "SUPABASE_URL: ${SUPABASE_URL:0:20}..."
echo "HF_TOKEN: ${HF_TOKEN:0:8}..."
```

### 7. Dry-run
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
  --cluster-config eval/clusters/jupiter.yaml \
  --preset v2 \
  --priority-file eval/MBZ/lists/a1_retrained.txt \
  --baseline-model-config eval/baseline_model_configs.yaml \
  --timeout-multiplier 2.0 \
  --tp-size 2 \
  --enable-thinking \
  --slurm-time 12:00:00 \
  --max-jobs-submitted 32 \
  --dry-run --once --verbose
```

### 8. Real run (example)
```bash
source ~/secrets.env && python eval/MBZ/unified_eval_listener_v6.py \
  --cluster-config eval/clusters/jupiter.yaml \
  --preset swebench \
  --priority-file eval/MBZ/lists/no_eval_models_latest.txt \
  --baseline-model-config eval/baseline_model_configs.yaml \
  --timeout-multiplier 2.0 \
  --tp-size 2 \
  --enable-thinking \
  --slurm-time 12:00:00 \
  --max-jobs-submitted 32 \
  --pack-jobs \
  --stagger-delay 1 --chain-batch-size 10 \
  --no-disk-resume \
  --once
```

## Key differences from M2

| Setting | M2 | Jupiter |
|---------|-----|---------|
| Partition | `main` | `booster` |
| Account | (none) | `reformo` |
| Time limit | 24:00:00 | 12:00:00 |
| GPUs/node | 8 | 4 |
| Arch | x86_64 | aarch64 (GH200) |
| Internet on compute | yes | **no** (proxy required) |
| Conda env | otagent/otagent2 | otagent/otagent2 (different paths) |
| Harbor | local install | feuer1's shared install |
| HF cache | `~/.cache/huggingface/hub` | `/e/data1/datasets/playground/ot/hf_hub` |
| Jobs dir | `$PWD/jobs` | `/e/data1/.../zhuang1_eval_jobs` |
| Pre-download | optional (has internet) | **required** (no internet on compute) |

## Proxy note

Jupiter compute nodes have no internet. The v6 sbatch auto-detects proxy settings from `jupiter.yaml`:
- Uses proxychains for HF downloads on compute
- SSH tunnel via `jpbl-s01-02` login node
- `--pre-download` flag on listener pre-downloads models on login node before submission
- Fix check_progress.py: correct REPO_DIR (`parent.parent`, not three `parent` hops), update LOGS_DIR to `eval/local/logs`, add `--logs-dir` CLI arg, fix unclosed file handle with `with` statement
- Fix snapshot_download.py: use Jupiter version with `fcntl` file locking to prevent race conditions during concurrent downloads
- Fix test_dp_eval.sh: update sbatch path from legacy `eval/MBZ/` to `eval/unified_eval_harbor.sbatch`
- Fix V6_MIGRATION.md: update all `eval/MBZ/` paths to new `eval/` locations
- Revert `eval/jupiter/dcagent_eval_config.yaml` to original (legacy file, canonical configs are now in `eval/configs/`)
- Remove duplicate `eval/jupiter/V6_MIGRATION.md` (kept in `eval/docs/`)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `git pull` on any cluster brings everything up to date — no manual file copying
- Cluster-agnostic scripts moved from `eval/MBZ/` to `eval/`
- Harbor eval configs consolidated into `eval/configs/` with no hardcoded `jobs_dir` (uses `--jobs-dir` CLI at runtime)
- `--cluster-config` selects `eval/clusters/` YAML configs (M2, MBZ, Jupiter) parameterized with `${USER}` for multi-user support
- Native vLLM data-parallel support (`--dp-size` flag) with proper port planning for pack-jobs
- Comprehensive cluster onboarding guide (`eval/docs/CLUSTER_ONBOARDING.md`)

New directory structure
Workflow after merge
Test plan
- `eval/clusters/m2.yaml` — passes

🤖 Generated with Claude Code