
Route GPU jobs by wall-time + Sol cheat sheet (CLI / PDF / skill) #34

Merged
Shu-Wan merged 4 commits into v1.0-rust from feat/partition-qos-cheatsheet on Jun 11, 2026

@Shu-Wan Shu-Wan commented Jun 11, 2026

Closes #33.

What & why

The skill steered short GPU jobs to public ("GPU → public" reflex), parking 30-minute ablations behind multi-day jobs while hundreds of idle htc A100s sat one partition over. The root cause: the skill keyed partition choice on CPU-vs-GPU and explicitly excluded GPU from htc. This PR fixes the routing, reorganizes the skill to promote SLURM orchestration, and ships a Sol cheat sheet in three forms.

Changes

1. Partition/QOS routing fix (SKILL.md, references/{sessions,solx}.md)
Route by wall-time + priority, not CPU-vs-GPU — GPUs live in htc, public, and general:

  • ≤4h (incl. GPU) → htc · ≤15min & urgent → -p public -q debug · >4h → public · >4h preemptible → -p general -q private
  • Never -p htc -q debug (rejected) or bare -p general (defaults to a QOS it won't grant).
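The routing rules above can be sketched as a small decision function. This is an illustration only, not the skill's or solx's actual code: the names `Route`, `route`, and `is_rejected` are hypothetical, and leaving the QOS unset for htc and public is an assumption, since the PR names a QOS only for the debug and preemptible cases.

```rust
use std::time::Duration;

/// A partition plus optional QOS, mirroring `-p <partition> [-q <qos>]`.
/// (Hypothetical type, not taken from the solx source.)
#[derive(Debug, PartialEq, Eq)]
pub struct Route {
    pub partition: &'static str,
    pub qos: Option<&'static str>,
}

/// Picks a partition from wall-time and priority, never from CPU-vs-GPU.
/// Thresholds are the ones stated in the PR: 15 minutes and 4 hours.
pub fn route(wall: Duration, urgent: bool, preemptible: bool) -> Route {
    const DEBUG_MAX: Duration = Duration::from_secs(15 * 60);
    const HTC_MAX: Duration = Duration::from_secs(4 * 60 * 60);

    if urgent && wall <= DEBUG_MAX {
        // Short and urgent: debug QOS, which must go on public, not htc.
        Route { partition: "public", qos: Some("debug") }
    } else if wall <= HTC_MAX {
        // Anything up to 4h, GPU jobs included. QOS left unset (assumption).
        Route { partition: "htc", qos: None }
    } else if preemptible {
        // Long and preemptible: general always needs an explicit QOS.
        Route { partition: "general", qos: Some("private") }
    } else {
        // Long and not preemptible. QOS left unset (assumption).
        Route { partition: "public", qos: None }
    }
}

/// Flags the two combinations the PR says to never emit:
/// `-p htc -q debug` and a bare `-p general`.
pub fn is_rejected(partition: &str, qos: Option<&str>) -> bool {
    matches!((partition, qos), ("htc", Some("debug")) | ("general", None))
}
```

Under these rules a 30-minute GPU ablation maps to htc regardless of whether it requests a GPU, which is the behavior change this PR is after.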

2. Skill reorg — SLURM orchestration moved above filesystem/keep; new personalized "Know your access" step (sacctmgr show assoc user=$USER); keep trimmed to delegate mechanics to the CLI.
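As a rough sketch of what the "Know your access" step could feed into, the function below parses pipe-delimited association output into a partition-to-QOS map. It assumes the query is run as `sacctmgr -nP show assoc user=$USER format=partition,qos` (no header, `|`-separated fields), which is not necessarily the form the skill uses; `parse_assoc` is a hypothetical helper, and on clusters whose associations are not per-partition the partition field may simply be empty.

```rust
use std::collections::BTreeMap;

/// Parses `partition|qos[,qos...]` lines into a map from partition name to
/// the QOS names granted on it. Lines with an empty partition field are
/// grouped under the empty string. (Hypothetical helper, not part of solx.)
pub fn parse_assoc(output: &str) -> BTreeMap<String, Vec<String>> {
    let mut access: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for line in output.lines().map(str::trim).filter(|l| !l.is_empty()) {
        // Split on the first '|'; a line without one is treated as a
        // partition with no QOS listed.
        let (partition, qos_list) = line.split_once('|').unwrap_or((line, ""));
        let entry = access.entry(partition.to_string()).or_default();
        for qos in qos_list.split(',').filter(|q| !q.is_empty()) {
            // Keep first-seen order and skip duplicates across lines.
            if !entry.iter().any(|known| known == qos) {
                entry.push(qos.to_string());
            }
        }
    }
    access
}
```

Checking a proposed `-p`/`-q` pair against such a map before submitting is one way to avoid asking for a QOS the account was never granted.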

3. Sol cheat sheet (single source: skills/sol-skill/references/cheatsheet.md)

  • CLI: solx cheatsheet (embedded via include_str!, byte-identical to source, wired into bash/zsh/fish completions)
  • PDF: docs/cheatsheet.pdf via scripts/build-cheatsheet.sh
  • centered README nav: cheat sheet · solx docs · skill

4. Eval harness — fixes eval #4 (it encoded the bug) and adds #8/#9/#10; new L3 l3_sbatch_test_only grader validates the agent's header against the live scheduler.
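To illustrate the idea behind a test-only check, here is a minimal sketch that turns a recommended header into `sbatch --test-only` arguments and asks the scheduler whether it accepts them. This is not the grader's implementation: `test_only_args` and `scheduler_accepts` are hypothetical names, the `--wrap=true` placeholder command is an assumption, and treating a nonzero exit status as "rejected" is also an assumption about how sbatch reports invalid combinations.

```rust
use std::process::Command;

/// Builds an argument list for `sbatch --test-only`, which validates a
/// request without submitting a job. (Hypothetical helper.)
pub fn test_only_args(
    partition: &str,
    qos: Option<&str>,
    time: &str,
    gres: Option<&str>,
) -> Vec<String> {
    let mut args = vec![
        "--test-only".to_string(),
        format!("--partition={partition}"),
        format!("--time={time}"),
    ];
    if let Some(q) = qos {
        args.push(format!("--qos={q}"));
    }
    if let Some(g) = gres {
        args.push(format!("--gres={g}"));
    }
    // sbatch needs a script or a wrapped command; `true` is a no-op
    // placeholder (assumption, nothing is executed under --test-only).
    args.push("--wrap=true".to_string());
    args
}

/// Runs sbatch with the given arguments and reports whether it exited
/// successfully. Only meaningful on a host where `sbatch` is installed.
pub fn scheduler_accepts(args: &[String]) -> std::io::Result<bool> {
    Ok(Command::new("sbatch").args(args).status()?.success())
}
```

A regex can confirm that a header contains `-p htc` and `-q debug`, but only the scheduler knows that the pair is invalid, which is why a live check catches what pattern matching misses.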

Verification

  • Eval (30-agent workflow): on the 4 partition evals the fixed skill scores 14/14 vs. 10/14 for the pre-fix baseline. The baseline fails on exactly the GPU→public reflex (evals #4 and #8) and on an invalid -p htc -q debug that the live scheduler rejects (eval #10).
  • Adversarial review (advisor workflow): 21 confirmed findings, all fixed — incl. htc does have H200; highmem wall is 7d not 2d; no myquota wrapper (→ beegfs-ctl --getquota); sq is the whole-cluster queue, not squeue --me; the scheduler nudges toward htc but does not auto-route. Every fact re-verified against the live scheduler.
  • Tests: 143 pass (104 unit + 39 integration); cargo fmt/clippy clean; all 3 completion scripts syntax-valid.

Notes

  • Targets v1.0-rust (main is still the old Python 0.5.1 line).
  • Eval ran on the pre-review skill; the review fixes are factual/wording corrections that don't change partition routing, so the 14/14 verdict stands.

🤖 Generated with Claude Code

Shu-Wan and others added 4 commits June 11, 2026 16:13
Partition/QOS rework: GPUs live in htc/public/general, so route by wall-time, not CPU-vs-GPU. <=4h (incl. GPU) -> htc; <=15m urgent -> -p public -q debug; >4h -> public; >4h preemptible -> -p general -q private. Reorg promotes SLURM orchestration above filesystem/keep and adds a personalized know-your-access step (sacctmgr show assoc). Adds the single-source cheat sheet (skills/sol-skill/references/cheatsheet.md), a rendered docs/cheatsheet.pdf via scripts/build-cheatsheet.sh, and a centered README nav. Corrects facts the adversarial review caught: htc has H200; highmem wall is 7d; no myquota wrapper (use beegfs-ctl --getquota --uid USER); sq is the whole-cluster queue, not squeue --me; the scheduler nudges toward htc but does not auto-route.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes eval #4 (it encoded the bug: expected -p public for a 4h GPU job) and adds #8 (30-min ablation -> htc), #9 (multi-day -> public/general), #10 (smoke test -> debug QOS on public/general). Adds an L3 l3_sbatch_test_only check that validates the recommended header against the live scheduler, catching invalid combos like -p htc -q debug that regex alone misses. Regexes hardened: canonical 4h forms, a100:1 vs a100:10, day and HH:MM:SS walls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prints the Sol cheat sheet as text, embedded from the skill single source via the include_str macro (zero drift; verified byte-identical). Wired into bash/zsh/fish completions and docs; the command-tree completion test is extended to require it. 143 tests pass; fmt and clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…EVELOPMENT

Folds the partition/QOS rework and the cheat sheet (solx cheatsheet, PDF, skill reference) into the unreleased [1.0.0] CHANGELOG entry; documents the new l3_sbatch_test_only grader in DEVELOPMENT.md's L3 row; adds the cheat sheet to ROADMAP's 'what solx does today'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Shu-Wan Shu-Wan merged commit c8e971f into v1.0-rust Jun 11, 2026
3 checks passed
@Shu-Wan Shu-Wan deleted the feat/partition-qos-cheatsheet branch June 11, 2026 23:38