Migrate the eval infrastructure from cluster-specific scripts to a unified, cluster-agnostic architecture. A single set of sbatch scripts and one listener work across any SLURM cluster via a per-cluster YAML config.

Key additions:
- `eval/unified_eval_listener.py`: main listener (polls DB, submits SLURM jobs)
- `eval/unified_eval_harbor.sbatch`: single-node eval (cluster-agnostic)
- `eval/unified_eval_harbor_dp.sbatch`: multi-node data-parallel eval
- `eval/clusters/jupiter.yaml`: example cluster config (JSC GH200)
- `eval/docs/CLUSTER_ONBOARDING.md`: setup guide for new clusters
- `eval/configs/`: harbor agent/orchestrator configs
- `eval/check_progress.py`: live eval progress dashboard
- `scripts/database/batch_upload_eval.py`: batch upload utility
- `database/unified_db/utils.py`: DB utility updates

Features:
- GPU/CPU/memory scaling based on TP size (job packing)
- vLLM native data-parallel (`--dp-size`) and multi-node DP (`--dp-nodes`)
- Disk-based resume for incomplete/errored jobs
- Per-model vLLM overrides via `baseline_model_configs.yaml`
- Proxy support for no-internet compute nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
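For context, a per-cluster config of the kind described might look like the following sketch. Every key here is an illustrative assumption, not the actual schema of `eval/clusters/jupiter.yaml`:

```yaml
# Hypothetical cluster config; all field names are assumptions for illustration.
name: jupiter
slurm:
  partition: booster
  account: my-project        # placeholder account name
  gpus_per_node: 4
paths:
  scratch: /p/scratch/my-project/eval
proxy:
  http_proxy: "http://proxy.example.org:3128"   # for no-internet compute nodes
```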
Code Review

This pull request introduces a comprehensive evaluation system for LLM models on HPC clusters, including new scripts for checking job progress, pre-creating Daytona snapshots, and batch uploading results. It refactors model configuration in `baseline_model_configs.yaml` to support groups and new patterns, and updates `build_vllm_cmd.sh` to allow HF overrides and configurable vLLM ports. Significant changes to the `unified_eval_harbor.sbatch` script improve proxy setup, vLLM health checks, and database integration, including a new `is_overlong` parameter. A new `unified_eval_harbor_dp.sbatch` script enables multi-node data-parallel evaluations using Ray. Documentation for cluster onboarding and agent setup prompts was also added.

Review comments highlight critical issues: incorrect path calculations in `eval/precreate_snapshots.py` and `eval/unified_eval_harbor_dp.sbatch`, a potentially imprecise regex in `eval/baseline_model_configs.yaml`, and excessively long retry/sleep intervals in both `unified_eval_harbor.sbatch` and `unified_eval_harbor_dp.sbatch` that could delay feedback on vLLM startup failures.
```python
from pathlib import Path

# Ensure harbor is importable
REPO_ROOT = Path(__file__).resolve().parent.parent.parent
```
The calculation of REPO_ROOT appears to be incorrect. Using .parent.parent.parent will resolve to the directory above the repository root. This will cause subsequent path constructions, such as for HARBOR_SRC, to be invalid, likely causing the script to fail.
```diff
- REPO_ROOT = Path(__file__).resolve().parent.parent.parent
+ REPO_ROOT = Path(__file__).resolve().parent.parent
```
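The off-by-one is easy to check with `pathlib`. A minimal sketch (the paths are made up, assuming the script sits at `<repo>/eval/<name>.py`):

```python
from pathlib import Path

def repo_root(script_path: str) -> Path:
    """For a script at <repo>/eval/<name>.py, two .parent hops reach the repo root."""
    return Path(script_path).resolve().parent.parent

# Hypothetical layout: the script lives one directory below the repo root.
script = "/repo/eval/precreate_snapshots.py"
assert repo_root(script) == Path("/repo")
# Three hops, as in the original code, overshoot to the parent of the repo:
assert Path(script).parent.parent.parent == Path("/")
```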
```bash
done
DATASET_LOCAL_DIR="${DATASET_LOCAL_DIR:-${_DS_DIRS_ARR[0]}/${SAFE_REPO}}"
DOWNLOAD_LOG=$(mktemp /tmp/download_XXXXXX.log)
proxied $PYTHON_BIN "$DCFT/eval/jupiter/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1
```
The path to snapshot_download.py appears to be incorrect. It's pointing to eval/jupiter/snapshot_download.py, but the script was added at eval/snapshot_download.py in this pull request. This will cause the script to fail when it attempts to download the dataset.
```diff
- proxied $PYTHON_BIN "$DCFT/eval/jupiter/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1
+ proxied $PYTHON_BIN "$DCFT/eval/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1
```
```yaml
# Pattern-based configs: matched by regex when no exact model name is found.
# Checked in order; first match wins.
patterns:
  - match: "131k|-lc$"
```
The regex 131k|-lc$ may not be behaving as intended. It matches 131k anywhere in the string, or -lc only at the end. This could lead to inconsistent model configuration matching. For instance, my-model-131k-variant would match, but my-model-lc-variant would not. If the goal is to match models ending with either 131k or -lc, a more precise regex like (131k|-lc)$ should be used. Adding case-insensitivity with (?i) would also be beneficial.
```diff
-   - match: "131k|-lc$"
+   - match: "(?i)(131k|-lc)$"
```
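The difference is easy to demonstrate in Python; a quick sketch (the model names are made up):

```python
import re

loose = re.compile(r"131k|-lc$")         # pattern from the PR
strict = re.compile(r"(?i)(131k|-lc)$")  # anchored, case-insensitive suggestion

# The unanchored alternative matches "131k" anywhere in the name...
assert loose.search("my-model-131k-variant") is not None
# ...while the anchored pattern only matches at the end of the name:
assert strict.search("my-model-131k-variant") is None
assert strict.search("my-model-131K") is not None   # case-insensitive
assert strict.search("my-model-lc") is not None
```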
```bash
# Health check loop (use configurable retries)
MAX_RETRIES=$VLLM_MAX_RETRIES
RETRY_INTERVAL=100
```
The RETRY_INTERVAL of 100 seconds for the vLLM health check is quite long. With the default VLLM_MAX_RETRIES of 20, this could lead to a wait of over 30 minutes before the script determines that vLLM has failed to start. A shorter interval, such as 15 seconds, would provide much faster feedback on startup failures while still being robust against slow startup times.
```diff
- RETRY_INTERVAL=100
+ RETRY_INTERVAL=15
```
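Sketched as a standalone loop, the suggested shorter interval looks like this. The function name, health URL, and defaults are assumptions for illustration, not the PR's exact code:

```shell
#!/usr/bin/env bash
# Sketch of a vLLM health-check loop with a short retry interval.
# URL, variable names, and defaults are assumptions, not the PR's actual code.
wait_for_vllm() {
    local url="$1" max_retries="${2:-20}" interval="${3:-15}" i
    for ((i = 1; i <= max_retries; i++)); do
        # -s silences progress output; -f makes curl fail on HTTP errors
        if curl -sf "$url" > /dev/null 2>&1; then
            echo "vLLM healthy after $i check(s)"
            return 0
        fi
        ((i < max_retries)) && sleep "$interval"
    done
    echo "vLLM not healthy after $max_retries checks" >&2
    return 1
}
```

With a 15-second interval and the default 20 retries, the worst-case wait before failure is about 5 minutes, versus more than 30 minutes at 100 seconds.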
```bash
        tail -50 "$EVAL_LOGS_DIR/vllm_dp_shard${shard_idx}_${SLURM_JOB_ID}.log" 2>/dev/null || true
        exit 1
    fi
    sleep 100
```
The sleep 100 interval for the vLLM health check is very long. With VLLM_MAX_RETRIES set to 20, this could cause the script to wait for over 30 minutes before failing. Reducing this to a more reasonable value, like 15 seconds, would allow for faster feedback on startup issues.
```diff
- sleep 100
+ sleep 15
```
- Fix REPO_ROOT in precreate_snapshots.py (was one level too deep)
- Fix snapshot_download.py path in DP sbatch (eval/jupiter/ → eval/)
- Fix regex pattern for 131k/lc model matching: `(131k|-lc)$`
- Reduce vLLM health check retry interval from 100s to 15s in both sbatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

Key additions

- `eval/unified_eval_listener.py`: main listener (polls DB, submits SLURM jobs)
- `eval/unified_eval_harbor.sbatch`: single-node eval sbatch (cluster-agnostic)
- `eval/unified_eval_harbor_dp.sbatch`: multi-node data-parallel eval sbatch
- `eval/clusters/jupiter.yaml`: example cluster config (JSC GH200)
- `eval/docs/CLUSTER_ONBOARDING.md`: setup guide for new clusters
- `eval/check_progress.py`: live eval progress dashboard
- `scripts/database/batch_upload_eval.py`: batch result upload utility
- `database/unified_db/utils.py`: DB utility updates

Features

- vLLM native data-parallel (`--dp-size`) and multi-node DP (`--dp-nodes`)
- Per-model vLLM overrides via `eval/baseline_model_configs.yaml`

Test plan
🤖 Generated with Claude Code