
Add cluster-agnostic eval system #26

Open
richardzhuang0412 wants to merge 2 commits into main from pr/cluster-agnostic-eval

Conversation

@richardzhuang0412 (Collaborator)

Summary

  • Migrate eval infrastructure from cluster-specific scripts to a unified, cluster-agnostic architecture
  • A single set of sbatch scripts and one listener work across any SLURM cluster via a per-cluster YAML config
  • Add condensed onboarding guide for setting up new clusters

Key additions

  • eval/unified_eval_listener.py — main listener (polls DB, submits SLURM jobs)
  • eval/unified_eval_harbor.sbatch — single-node eval sbatch (cluster-agnostic)
  • eval/unified_eval_harbor_dp.sbatch — multi-node data-parallel eval sbatch
  • eval/clusters/jupiter.yaml — example cluster config (JSC GH200)
  • eval/docs/CLUSTER_ONBOARDING.md — setup guide for new clusters
  • eval/check_progress.py — live eval progress dashboard
  • scripts/database/batch_upload_eval.py — batch result upload utility
  • database/unified_db/utils.py — DB utility updates
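The listener's internals aren't shown in this PR page, but its described role (poll the DB, submit SLURM jobs) can be sketched as follows. All names here are hypothetical illustrations, not the actual `unified_eval_listener.py` API; the DB query and the `sbatch` call are injected as callables so the loop itself stays testable:

```python
import time

def poll_and_submit(fetch_pending, submit, poll_interval=60, max_cycles=None,
                    sleep=time.sleep):
    """Poll a pending-job source and hand each job to a submitter.

    fetch_pending: callable returning a list of job dicts (stands in for the
    DB query the real listener performs).
    submit: callable taking one job dict (stands in for an sbatch invocation).
    Returns the number of jobs submitted.
    """
    count = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for job in fetch_pending():
            submit(job)
            count += 1
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            sleep(poll_interval)
    return count
```

In a real deployment `submit` would wrap `sbatch --export=...` and `fetch_pending` would mark rows in-flight so two listener instances don't double-submit.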

Features

  • GPU/CPU/memory scaling based on TP size (enables job packing on shared nodes)
  • vLLM native data-parallel (--dp-size) and multi-node DP (--dp-nodes)
  • Disk-based resume for incomplete/errored eval jobs
  • Per-model vLLM overrides via eval/baseline_model_configs.yaml
  • Proxy support for no-internet compute nodes (SSH tunnel + proxychains)
  • Preset configs for v2, tb2, swebench, aider, bfcl benchmarks
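The TP-based resource scaling that enables job packing could work roughly like the sketch below. The node totals and the proportional rule are illustrative assumptions, not values taken from this PR:

```python
def scale_resources(tp_size, gpus_per_node=4, cpus_per_node=72, mem_gb_per_node=480):
    """Scale a job's SLURM resource request to its tensor-parallel size.

    Requesting only tp_size GPUs plus a proportional share of CPUs and
    memory lets several small-TP jobs pack onto one shared node instead
    of each claiming a full node.
    """
    if tp_size > gpus_per_node:
        raise ValueError("TP larger than one node needs the multi-node DP path")
    frac = tp_size / gpus_per_node
    return {
        "gres": f"gpu:{tp_size}",
        "cpus": int(cpus_per_node * frac),
        "mem_gb": int(mem_gb_per_node * frac),
    }
```

For example, on a hypothetical 4-GPU node a TP=1 job would request one GPU and a quarter of the CPUs and memory, leaving room for three more such jobs.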

Test plan

  • Dry-run verified on Jupiter (JSC GH200, aarch64)
  • Real eval jobs submitted and completed on Jupiter
  • DP mode (4-shard) tested end-to-end
  • Onboarding doc still to be verified against a fresh cluster setup

🤖 Generated with Claude Code

Migrate the eval infrastructure from cluster-specific scripts to a unified,
cluster-agnostic architecture. A single set of sbatch scripts and one listener
work across any SLURM cluster via a per-cluster YAML config.

Key additions:
- eval/unified_eval_listener.py: main listener (polls DB, submits SLURM jobs)
- eval/unified_eval_harbor.sbatch: single-node eval (cluster-agnostic)
- eval/unified_eval_harbor_dp.sbatch: multi-node data-parallel eval
- eval/clusters/jupiter.yaml: example cluster config (JSC GH200)
- eval/docs/CLUSTER_ONBOARDING.md: setup guide for new clusters
- eval/configs/: harbor agent/orchestrator configs
- eval/check_progress.py: live eval progress dashboard
- scripts/database/batch_upload_eval.py: batch upload utility
- database/unified_db/utils.py: DB utility updates

Features:
- GPU/CPU/memory scaling based on TP size (job packing)
- vLLM native data-parallel (--dp-size) and multi-node DP (--dp-nodes)
- Disk-based resume for incomplete/errored jobs
- Per-model vLLM overrides via baseline_model_configs.yaml
- Proxy support for no-internet compute nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive evaluation system for LLM models on HPC clusters, including new scripts for checking job progress, pre-creating Daytona snapshots, and batch uploading results.

It refactors model configuration in baseline_model_configs.yaml to support groups and new patterns, and updates build_vllm_cmd.sh to allow HF overrides and configurable vLLM ports. Significant changes to the unified_eval_harbor.sbatch script improve proxy setup, vLLM health checks, and database integration, including a new is_overlong parameter. A new unified_eval_harbor_dp.sbatch script enables multi-node data-parallel evaluations using Ray. Documentation for cluster onboarding and agent setup prompts was also added.

Review comments highlight critical issues: an incorrect path calculation in eval/precreate_snapshots.py, a wrong script path in eval/unified_eval_harbor_dp.sbatch, a potentially imprecise regex in eval/baseline_model_configs.yaml, and excessively long retry/sleep intervals in both unified_eval_harbor.sbatch and unified_eval_harbor_dp.sbatch that could delay feedback on vLLM startup failures.

Comment thread eval/precreate_snapshots.py Outdated
from pathlib import Path

# Ensure harbor is importable
REPO_ROOT = Path(__file__).resolve().parent.parent.parent

critical

The calculation of REPO_ROOT appears to be incorrect. Using .parent.parent.parent will resolve to the directory above the repository root. This will cause subsequent path constructions, such as for HARBOR_SRC, to be invalid, likely causing the script to fail.

Suggested change
REPO_ROOT = Path(__file__).resolve().parent.parent.parent
REPO_ROOT = Path(__file__).resolve().parent.parent

Comment thread eval/unified_eval_harbor_dp.sbatch Outdated
done
DATASET_LOCAL_DIR="${DATASET_LOCAL_DIR:-${_DS_DIRS_ARR[0]}/${SAFE_REPO}}"
DOWNLOAD_LOG=$(mktemp /tmp/download_XXXXXX.log)
proxied $PYTHON_BIN "$DCFT/eval/jupiter/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1

critical

The path to snapshot_download.py appears to be incorrect. It's pointing to eval/jupiter/snapshot_download.py, but the script was added at eval/snapshot_download.py in this pull request. This will cause the script to fail when it attempts to download the dataset.

Suggested change
proxied $PYTHON_BIN "$DCFT/eval/jupiter/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1
proxied $PYTHON_BIN "$DCFT/eval/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1

Comment thread eval/baseline_model_configs.yaml Outdated
# Pattern-based configs: matched by regex when no exact model name is found.
# Checked in order; first match wins.
patterns:
- match: "131k|-lc$"

medium

The regex 131k|-lc$ may not be behaving as intended. It matches 131k anywhere in the string, or -lc only at the end. This could lead to inconsistent model configuration matching. For instance, my-model-131k-variant would match, but my-model-lc-variant would not. If the goal is to match models ending with either 131k or -lc, a more precise regex like (131k|-lc)$ should be used. Adding case-insensitivity with (?i) would also be beneficial.

Suggested change
- match: "131k|-lc$"
- match: "(?i)(131k|-lc)$"
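The behavioral difference between the two patterns is easy to verify with Python's re module:

```python
import re

loose = re.compile(r"131k|-lc$")        # original pattern
strict = re.compile(r"(?i)(131k|-lc)$") # anchored, case-insensitive

# "131k" in the loose pattern matches anywhere in the name...
assert loose.search("my-model-131k-variant") is not None
# ...while "-lc" is anchored, so a mid-name "-lc" does not match:
assert loose.search("my-model-lc-variant") is None

# The anchored pattern only matches names that end in 131k or -lc:
assert strict.search("my-model-131k-variant") is None
assert strict.search("my-model-131k") is not None
assert strict.search("my-model-lc") is not None
# (?i) also picks up uppercase variants:
assert strict.search("MY-MODEL-131K") is not None
```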

Comment thread eval/unified_eval_harbor.sbatch Outdated

# Health check loop (use configurable retries)
MAX_RETRIES=$VLLM_MAX_RETRIES
RETRY_INTERVAL=100

medium

The RETRY_INTERVAL of 100 seconds for the vLLM health check is quite long. With the default VLLM_MAX_RETRIES of 20, this could lead to a wait of over 30 minutes before the script determines that vLLM has failed to start. A shorter interval, such as 15 seconds, would provide much faster feedback on startup failures while still being robust against slow startup times.

Suggested change
RETRY_INTERVAL=100
RETRY_INTERVAL=15
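The arithmetic behind this comment: 20 retries at 100 s each is roughly 33 minutes of waiting before a startup failure surfaces, versus 5 minutes at 15 s. A generic wait-for-healthy loop makes the budget explicit; the `probe` callable below is an injected stand-in for curl-ing vLLM's health endpoint, not code from this PR:

```python
import time

def wait_for_healthy(probe, max_retries=20, retry_interval=15, sleep=time.sleep):
    """Retry a health probe until it succeeds or the retry budget runs out.

    probe: callable returning True when the server is up (stands in for
    hitting the vLLM health endpoint). Worst-case wall time before giving
    up is max_retries * retry_interval seconds: 20 * 100s exceeds 33
    minutes, while 20 * 15s fails within 5 minutes.
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_retries + 1):
        if probe():
            return attempt
        if attempt < max_retries:
            sleep(retry_interval)
    raise TimeoutError(f"server not healthy after {max_retries} probes")
```

The same reasoning applies to the `sleep 100` in the DP sbatch flagged below: shrinking the interval, not the retry count, keeps the total startup allowance bounded while tightening time-to-failure.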

Comment thread eval/unified_eval_harbor_dp.sbatch Outdated
tail -50 "$EVAL_LOGS_DIR/vllm_dp_shard${shard_idx}_${SLURM_JOB_ID}.log" 2>/dev/null || true
exit 1
fi
sleep 100

medium

The sleep 100 interval for the vLLM health check is very long. With VLLM_MAX_RETRIES set to 20, this could cause the script to wait for over 30 minutes before failing. Reducing this to a more reasonable value, like 15 seconds, would allow for faster feedback on startup issues.

Suggested change
sleep 100
sleep 15

- Fix REPO_ROOT in precreate_snapshots.py (was one level too deep)
- Fix snapshot_download.py path in DP sbatch (eval/jupiter/ → eval/)
- Fix regex pattern for 131k/lc model matching: (131k|-lc)$
- Reduce vLLM health check retry interval from 100s to 15s in both sbatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>