
Add cluster-agnostic eval system #26

Open
richardzhuang0412 wants to merge 2 commits into main from pr/cluster-agnostic-eval

Conversation

@richardzhuang0412 (Collaborator)

Summary

  • Migrate eval infrastructure from cluster-specific scripts to a unified, cluster-agnostic architecture
  • A single set of sbatch scripts and one listener work across any SLURM cluster via a per-cluster YAML config
  • Add condensed onboarding guide for setting up new clusters

Key additions

  • eval/unified_eval_listener.py — main listener (polls DB, submits SLURM jobs)
  • eval/unified_eval_harbor.sbatch — single-node eval sbatch (cluster-agnostic)
  • eval/unified_eval_harbor_dp.sbatch — multi-node data-parallel eval sbatch
  • eval/clusters/jupiter.yaml — example cluster config (JSC GH200)
  • eval/docs/CLUSTER_ONBOARDING.md — setup guide for new clusters
  • eval/check_progress.py — live eval progress dashboard
  • scripts/database/batch_upload_eval.py — batch result upload utility
  • database/unified_db/utils.py — DB utility updates
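The listener's internals aren't shown in this PR page, but its described role (poll the DB, submit SLURM jobs) can be sketched as follows. All names here are hypothetical illustrations, not the actual `unified_eval_listener.py` API; the DB query and the `sbatch` call are injected as callables so the loop itself stays testable:

```python
import time

def poll_and_submit(fetch_pending, submit, poll_interval=60, max_cycles=None,
                    sleep=time.sleep):
    """Poll a pending-job source and hand each job to a submitter.

    fetch_pending: callable returning a list of job dicts (stands in for the
    DB query the real listener performs).
    submit: callable taking one job dict (stands in for an sbatch invocation).
    Returns the number of jobs submitted.
    """
    count = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for job in fetch_pending():
            submit(job)
            count += 1
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            sleep(poll_interval)
    return count
```

In a real deployment `submit` would wrap `sbatch --export=...` and `fetch_pending` would mark rows in-flight so two listener instances don't double-submit.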

Features

  • GPU/CPU/memory scaling based on TP size (enables job packing on shared nodes)
  • vLLM native data-parallel (--dp-size) and multi-node DP (--dp-nodes)
  • Disk-based resume for incomplete/errored eval jobs
  • Per-model vLLM overrides via eval/baseline_model_configs.yaml
  • Proxy support for no-internet compute nodes (SSH tunnel + proxychains)
  • Preset configs for v2, tb2, swebench, aider, bfcl benchmarks
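The TP-based resource scaling that enables job packing could work roughly like the sketch below. The node totals and the proportional rule are illustrative assumptions, not values taken from this PR:

```python
def scale_resources(tp_size, gpus_per_node=4, cpus_per_node=72, mem_gb_per_node=480):
    """Scale a job's SLURM resource request to its tensor-parallel size.

    Requesting only tp_size GPUs plus a proportional share of CPUs and
    memory lets several small-TP jobs pack onto one shared node instead
    of each claiming a full node.
    """
    if tp_size > gpus_per_node:
        raise ValueError("TP larger than one node needs the multi-node DP path")
    frac = tp_size / gpus_per_node
    return {
        "gres": f"gpu:{tp_size}",
        "cpus": int(cpus_per_node * frac),
        "mem_gb": int(mem_gb_per_node * frac),
    }
```

For example, on a hypothetical 4-GPU node a TP=1 job would request one GPU and a quarter of the CPUs and memory, leaving room for three more such jobs.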

Test plan

  • Dry-run verified on Jupiter (JSC GH200, aarch64)
  • Real eval jobs submitted and completed on Jupiter
  • DP mode (4-shard) tested end-to-end
  • Onboarding doc still to be verified against a fresh cluster setup

🤖 Generated with Claude Code

Migrate the eval infrastructure from cluster-specific scripts to a unified,
cluster-agnostic architecture. A single set of sbatch scripts and one listener
work across any SLURM cluster via a per-cluster YAML config.

Key additions:
- eval/unified_eval_listener.py: main listener (polls DB, submits SLURM jobs)
- eval/unified_eval_harbor.sbatch: single-node eval (cluster-agnostic)
- eval/unified_eval_harbor_dp.sbatch: multi-node data-parallel eval
- eval/clusters/jupiter.yaml: example cluster config (JSC GH200)
- eval/docs/CLUSTER_ONBOARDING.md: setup guide for new clusters
- eval/configs/: harbor agent/orchestrator configs
- eval/check_progress.py: live eval progress dashboard
- scripts/database/batch_upload_eval.py: batch upload utility
- database/unified_db/utils.py: DB utility updates

Features:
- GPU/CPU/memory scaling based on TP size (job packing)
- vLLM native data-parallel (--dp-size) and multi-node DP (--dp-nodes)
- Disk-based resume for incomplete/errored jobs
- Per-model vLLM overrides via baseline_model_configs.yaml
- Proxy support for no-internet compute nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive evaluation system for LLM models on HPC clusters, including new scripts for checking job progress, pre-creating Daytona snapshots, and batch uploading results.

It refactors model configuration in baseline_model_configs.yaml to support groups and new patterns, and updates build_vllm_cmd.sh to allow HF overrides and configurable vLLM ports. Significant changes to the unified_eval_harbor.sbatch script improve proxy setup, vLLM health checks, and database integration, including a new is_overlong parameter. A new unified_eval_harbor_dp.sbatch script enables multi-node data-parallel evaluations using Ray. Documentation for cluster onboarding and agent setup prompts was also added.

Review comments highlight critical issues: an incorrect path calculation in eval/precreate_snapshots.py, a wrong script path in eval/unified_eval_harbor_dp.sbatch, a potentially imprecise regex in eval/baseline_model_configs.yaml, and excessively long retry/sleep intervals in both unified_eval_harbor.sbatch and unified_eval_harbor_dp.sbatch that could delay feedback on vLLM startup failures.

Comment thread eval/precreate_snapshots.py Outdated
from pathlib import Path

# Ensure harbor is importable
REPO_ROOT = Path(__file__).resolve().parent.parent.parent

critical

The calculation of REPO_ROOT appears to be incorrect. Using .parent.parent.parent will resolve to the directory above the repository root. This will cause subsequent path constructions, such as for HARBOR_SRC, to be invalid, likely causing the script to fail.

Suggested change
REPO_ROOT = Path(__file__).resolve().parent.parent.parent
REPO_ROOT = Path(__file__).resolve().parent.parent

Comment thread eval/unified_eval_harbor_dp.sbatch Outdated
done
DATASET_LOCAL_DIR="${DATASET_LOCAL_DIR:-${_DS_DIRS_ARR[0]}/${SAFE_REPO}}"
DOWNLOAD_LOG=$(mktemp /tmp/download_XXXXXX.log)
proxied $PYTHON_BIN "$DCFT/eval/jupiter/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1

critical

The path to snapshot_download.py appears to be incorrect. It's pointing to eval/jupiter/snapshot_download.py, but the script was added at eval/snapshot_download.py in this pull request. This will cause the script to fail when it attempts to download the dataset.

Suggested change
proxied $PYTHON_BIN "$DCFT/eval/jupiter/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1
proxied $PYTHON_BIN "$DCFT/eval/snapshot_download.py" "$REPO_ID" --local-dir "$DATASET_LOCAL_DIR" > "$DOWNLOAD_LOG" 2>&1

Comment thread eval/baseline_model_configs.yaml Outdated
# Pattern-based configs: matched by regex when no exact model name is found.
# Checked in order; first match wins.
patterns:
- match: "131k|-lc$"

medium

The regex 131k|-lc$ may not be behaving as intended. It matches 131k anywhere in the string, or -lc only at the end. This could lead to inconsistent model configuration matching. For instance, my-model-131k-variant would match, but my-model-lc-variant would not. If the goal is to match models ending with either 131k or -lc, a more precise regex like (131k|-lc)$ should be used. Adding case-insensitivity with (?i) would also be beneficial.

Suggested change
- match: "131k|-lc$"
- match: "(?i)(131k|-lc)$"
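The behavioral difference between the two patterns is easy to verify with Python's re module:

```python
import re

loose = re.compile(r"131k|-lc$")        # original pattern
strict = re.compile(r"(?i)(131k|-lc)$") # anchored, case-insensitive

# "131k" in the loose pattern matches anywhere in the name...
assert loose.search("my-model-131k-variant") is not None
# ...while "-lc" is anchored, so a mid-name "-lc" does not match:
assert loose.search("my-model-lc-variant") is None

# The anchored pattern only matches names that end in 131k or -lc:
assert strict.search("my-model-131k-variant") is None
assert strict.search("my-model-131k") is not None
assert strict.search("my-model-lc") is not None
# (?i) also picks up uppercase variants:
assert strict.search("MY-MODEL-131K") is not None
```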

Comment thread eval/unified_eval_harbor.sbatch Outdated

# Health check loop (use configurable retries)
MAX_RETRIES=$VLLM_MAX_RETRIES
RETRY_INTERVAL=100

medium

The RETRY_INTERVAL of 100 seconds for the vLLM health check is quite long. With the default VLLM_MAX_RETRIES of 20, this could lead to a wait of over 30 minutes before the script determines that vLLM has failed to start. A shorter interval, such as 15 seconds, would provide much faster feedback on startup failures while still being robust against slow startup times.

Suggested change
RETRY_INTERVAL=100
RETRY_INTERVAL=15
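The arithmetic behind this comment: 20 retries at 100 s each is roughly 33 minutes of waiting before a startup failure surfaces, versus 5 minutes at 15 s. A generic wait-for-healthy loop makes the budget explicit; the `probe` callable below is an injected stand-in for curl-ing vLLM's health endpoint, not code from this PR:

```python
import time

def wait_for_healthy(probe, max_retries=20, retry_interval=15, sleep=time.sleep):
    """Retry a health probe until it succeeds or the retry budget runs out.

    probe: callable returning True when the server is up (stands in for
    hitting the vLLM health endpoint). Worst-case wall time before giving
    up is max_retries * retry_interval seconds: 20 * 100s exceeds 33
    minutes, while 20 * 15s fails within 5 minutes.
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_retries + 1):
        if probe():
            return attempt
        if attempt < max_retries:
            sleep(retry_interval)
    raise TimeoutError(f"server not healthy after {max_retries} probes")
```

The same reasoning applies to the `sleep 100` in the DP sbatch flagged below: shrinking the interval, not the retry count, keeps the total startup allowance bounded while tightening time-to-failure.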

Comment thread eval/unified_eval_harbor_dp.sbatch Outdated
tail -50 "$EVAL_LOGS_DIR/vllm_dp_shard${shard_idx}_${SLURM_JOB_ID}.log" 2>/dev/null || true
exit 1
fi
sleep 100

medium

The sleep 100 interval for the vLLM health check is very long. With VLLM_MAX_RETRIES set to 20, this could cause the script to wait for over 30 minutes before failing. Reducing this to a more reasonable value, like 15 seconds, would allow for faster feedback on startup issues.

Suggested change
sleep 100
sleep 15

- Fix REPO_ROOT in precreate_snapshots.py (was one level too deep)
- Fix snapshot_download.py path in DP sbatch (eval/jupiter/ → eval/)
- Fix regex pattern for 131k/lc model matching: (131k|-lc)$
- Reduce vLLM health check retry interval from 100s to 15s in both sbatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>