Route GPU jobs by wall-time + Sol cheat sheet (CLI / PDF / skill) by Shu-Wan · Pull Request #34 · Shu-Wan/solx

Shu-Wan · 2026-06-11T23:14:03Z

Closes #33.

What & why

The skill steered short GPU jobs to public (the "GPU → public" reflex), parking 30-minute ablations behind multi-day jobs while hundreds of idle htc A100s sat one partition over. The root cause: the skill keyed partition choice on CPU-vs-GPU and explicitly excluded GPU jobs from htc. This PR fixes the routing, reorganizes the skill to promote SLURM orchestration, and ships a Sol cheat sheet in three forms.

Changes

1. Partition/QOS routing fix (SKILL.md, references/{sessions,solx}.md)
Route by wall-time + priority, not CPU-vs-GPU — GPUs live in htc, public, and general:

≤4h (incl. GPU) → htc · ≤15min & urgent → -p public -q debug · >4h → public · >4h preemptible → -p general -q private
Never -p htc -q debug (rejected) or bare -p general (defaults to a QOS it won't grant).
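
The routing rules above can be read as a small decision function. The Python sketch below is illustrative only: `Route` and `route` are hypothetical names, not part of solx or the skill. It assumes the thresholds exactly as listed (15 minutes, 4 hours) and that htc and public take no explicit QOS, which this PR does not state either way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    partition: str
    qos: Optional[str] = None  # None means "do not pass -q"

    def flags(self) -> str:
        """Render as sbatch flags, e.g. '-p public -q debug'."""
        base = f"-p {self.partition}"
        return f"{base} -q {self.qos}" if self.qos else base


def route(wall_minutes: int, urgent: bool = False, preemptible: bool = False) -> Route:
    """Pick partition/QOS from wall-time and priority (hypothetical helper).

    Whether the job uses a GPU is deliberately not an input.
    """
    if wall_minutes <= 0:
        raise ValueError("wall_minutes must be positive")
    if wall_minutes <= 15 and urgent:
        return Route("public", "debug")    # <=15min & urgent
    if wall_minutes <= 4 * 60:
        return Route("htc")                # <=4h, GPU or not
    if preemptible:
        return Route("general", "private") # >4h, preemptible
    return Route("public")                 # >4h


print(route(30).flags())
print(route(10, urgent=True).flags())
print(route(3 * 24 * 60, preemptible=True).flags())
```

No branch in this sketch can produce -p htc -q debug or a bare -p general, which matches the two "never" cases above.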

2. Skill reorg — SLURM orchestration moved above filesystem/keep; new personalized "Know your access" step (sacctmgr show assoc user=$USER); keep trimmed to delegate mechanics to the CLI.
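
As a rough illustration of the "Know your access" step, the sacctmgr output can be parsed into account/partition/QOS rows. This sketch assumes the pipe-delimited layout that sacctmgr emits with --parsable2 and the three format fields shown; `assoc_cmd`, `parse_assoc`, and `my_access` are hypothetical helper names, not solx commands.

```python
import getpass
import subprocess


def assoc_cmd(user: str) -> list:
    """argv for listing a user's associations in pipe-delimited form."""
    return [
        "sacctmgr", "show", "assoc", f"user={user}",
        "format=Account,Partition,QOS",
        "--parsable2", "--noheader",
    ]


def parse_assoc(text: str) -> list:
    """Turn 'account|partition|qos1,qos2' lines into dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        account, partition, qos = line.split("|")
        rows.append({
            "account": account,
            "partition": partition or None,  # empty field: not partition-specific
            "qos": [q for q in qos.split(",") if q],
        })
    return rows


def my_access() -> list:
    """Run sacctmgr for the current user (only works where Slurm is installed)."""
    out = subprocess.run(
        assoc_cmd(getpass.getuser()),
        capture_output=True, text=True, check=True,
    )
    return parse_assoc(out.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to check the parsing logic off-cluster.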

3. Sol cheat sheet (single source: skills/sol-skill/references/cheatsheet.md)

CLI: solx cheatsheet (embedded via include_str!, byte-identical to source, wired into bash/zsh/fish completions)
PDF: docs/cheatsheet.pdf via scripts/build-cheatsheet.sh
centered README nav: cheat sheet · solx docs · skill

4. Eval harness — fixes eval #4 (it encoded the bug) and adds #8/#9/#10; new L3 l3_sbatch_test_only grader validates the agent's header against the live scheduler.

Verification

Eval (30-agent workflow): on the 4 partition evals the fixed skill scores 14/14 vs the pre-fix baseline 10/14. The baseline fails on exactly the GPU→public reflex (evals #4 and #8) and on an invalid -p htc -q debug that the live scheduler rejects (eval #10).
Adversarial review (advisor workflow): 21 confirmed findings, all fixed — incl. htc does have H200; highmem wall is 7d not 2d; no myquota wrapper (→ beegfs-ctl --getquota); sq is the whole-cluster queue, not squeue --me; the scheduler nudges toward htc but does not auto-route. Every fact re-verified against the live scheduler.
Tests: 143 pass (104 unit + 39 integration); cargo fmt/clippy clean; all 3 completion scripts syntax-valid.

Notes

Targets v1.0-rust (main is still the old Python 0.5.1 line).
Eval ran on the pre-review skill; the review fixes are factual/wording corrections that don't change partition routing, so the 14/14 verdict stands.

🤖 Generated with Claude Code

Partition/QOS rework: GPUs live in htc/public/general, so route by wall-time, not CPU-vs-GPU. <=4h (incl. GPU) -> htc; <=15m urgent -> -p public -q debug; >4h -> public; >4h preemptible -> -p general -q private.
Reorg promotes SLURM orchestration above filesystem/keep and adds a personalized know-your-access step (sacctmgr show assoc).
Adds the single-source cheat sheet (skills/sol-skill/references/cheatsheet.md), a rendered docs/cheatsheet.pdf via scripts/build-cheatsheet.sh, and a centered README nav.
Corrects facts the adversarial review caught: htc has H200; highmem wall is 7d; no myquota wrapper (use beegfs-ctl --getquota --uid USER); sq is the whole-cluster queue, not squeue --me; the scheduler nudges toward htc but does not auto-route.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fixes eval #4 (it encoded the bug: expected -p public for a 4h GPU job) and adds #8 (30-min ablation -> htc), #9 (multi-day -> public/general), #10 (smoke test -> debug QOS on public/general).
Adds an L3 l3_sbatch_test_only check that validates the recommended header against the live scheduler, catching invalid combos like -p htc -q debug that regex alone misses.
Regexes hardened: canonical 4h forms, a100:1 vs a100:10, day and HH:MM:SS walls.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
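
The two grading layers described in this commit can be sketched in Python. This is not the harness's actual code: `parse_header`, `wall_minutes`, and `dry_run_cmd` are hypothetical names, and the patterns assume one option per token on `#SBATCH` lines and GPUs requested via a `gpu:` spec such as `--gres=gpu:a100:1`. The time parser follows the formats sbatch documents (minutes, minutes:seconds, hours:minutes:seconds, and the days-hours variants).

```python
import re


def _opt(short, long):
    # Match "-p htc", "--partition=htc" or "--partition htc" on a #SBATCH line.
    # (?<!\S) keeps "-p" from matching inside e.g. "--ntasks-per-node".
    return re.compile(
        rf"^#SBATCH\b.*?(?<!\S)(?:-{short}\s+|--{long}(?:=|\s+))(\S+)", re.M
    )


_PARTITION = _opt("p", "partition")
_QOS = _opt("q", "qos")
_TIME = _opt("t", "time")
# Capture the full count so "a100:10" is read as 10 and never as 1.
_GPU = re.compile(r"gpu:(?:(?P<type>[A-Za-z]\w*):)?(?P<count>\d+)")


def wall_minutes(spec):
    """Convert a Slurm time spec such as '30', '04:00:00' or '2-00' to minutes."""
    days = 0
    if "-" in spec:
        d, spec = spec.split("-", 1)
        days = int(d)
        parts = [int(p) for p in spec.split(":")]
        parts += [0] * (3 - len(parts))   # days-hours[:minutes[:seconds]]
        h, m, s = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:               # minutes
            h, m, s = 0, parts[0], 0
        elif len(parts) == 2:             # minutes:seconds
            h, m, s = 0, parts[0], parts[1]
        else:                             # hours:minutes:seconds
            h, m, s = parts
    return days * 1440 + h * 60 + m + s / 60


def parse_header(script):
    """Extract the fields a regex-level grader would compare."""
    def first(pattern):
        match = pattern.search(script)
        return match.group(1) if match else None

    time_spec = first(_TIME)
    gpu = _GPU.search(script)
    return {
        "partition": first(_PARTITION),
        "qos": first(_QOS),
        "wall_minutes": wall_minutes(time_spec) if time_spec else None,
        "gpu_type": gpu.group("type") if gpu else None,
        "gpu_count": int(gpu.group("count")) if gpu else 0,
    }


def dry_run_cmd(script_path):
    """argv for scheduler-side validation; sbatch --test-only submits no job."""
    return ["sbatch", "--test-only", script_path]
```

A regex check can confirm that a header names htc with a 4h wall, but only the scheduler knows which partition/QOS pairs it accepts. Running the `dry_run_cmd` argv on a login node and inspecting the exit status is one way to surface a rejected combination; the exact error text is site-specific.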

Prints the Sol cheat sheet as text, embedded from the skill single source via the include_str macro (zero drift; verified byte-identical). Wired into bash/zsh/fish completions and docs; the command-tree completion test is extended to require it. 143 tests pass; fmt and clippy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…EVELOPMENT
Folds the partition/QOS rework and the cheat sheet (solx cheatsheet, PDF, skill reference) into the unreleased [1.0.0] CHANGELOG entry; documents the new l3_sbatch_test_only grader in DEVELOPMENT.md's L3 row; adds the cheat sheet to ROADMAP's 'what solx does today'.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Shu-Wan and others added 4 commits June 11, 2026 16:13

Shu-Wan merged commit c8e971f into v1.0-rust Jun 11, 2026
3 checks passed

Shu-Wan deleted the feat/partition-qos-cheatsheet branch June 11, 2026 23:38

Shu-Wan mentioned this pull request Jun 11, 2026

Add a Sol cheat sheet — CLI (solx cheatsheet), PDF, and skill reference #33

Closed
