Route GPU jobs by wall-time + Sol cheat sheet (CLI / PDF / skill) #34
Merged
Partition/QOS rework: GPUs live in htc/public/general, so route by wall-time, not CPU-vs-GPU. <=4h (incl. GPU) -> htc; <=15m urgent -> -p public -q debug; >4h -> public; >4h preemptible -> -p general -q private. Reorg promotes SLURM orchestration above filesystem/keep and adds a personalized know-your-access step (sacctmgr show assoc). Adds the single-source cheat sheet (skills/sol-skill/references/cheatsheet.md), a rendered docs/cheatsheet.pdf via scripts/build-cheatsheet.sh, and a centered README nav. Corrects facts the adversarial review caught: htc has H200; highmem wall is 7d; no myquota wrapper (use beegfs-ctl --getquota --uid USER); sq is the whole-cluster queue, not squeue --me; the scheduler nudges toward htc but does not auto-route. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes eval #4 (it encoded the bug: expected -p public for a 4h GPU job) and adds #8 (30-min ablation -> htc), #9 (multi-day -> public/general), #10 (smoke test -> debug QOS on public/general). Adds an L3 l3_sbatch_test_only check that validates the recommended header against the live scheduler, catching invalid combos like -p htc -q debug that regex alone misses. Regexes hardened: canonical 4h forms, a100:1 vs a100:10, day and HH:MM:SS walls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prints the Sol cheat sheet as text, embedded from the skill single source via the include_str macro (zero drift; verified byte-identical). Wired into bash/zsh/fish completions and docs; the command-tree completion test is extended to require it. 143 tests pass; fmt and clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…EVELOPMENT Folds the partition/QOS rework and the cheat sheet (solx cheatsheet, PDF, skill reference) into the unreleased [1.0.0] CHANGELOG entry; documents the new l3_sbatch_test_only grader in DEVELOPMENT.md's L3 row; adds the cheat sheet to ROADMAP's 'what solx does today'. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes #33.
What & why
The skill steered short GPU jobs to `public` (a "GPU → `public`" reflex), parking 30-minute ablations behind multi-day jobs while hundreds of idle `htc` A100s sat one partition over. The root cause: the skill keyed partition choice on CPU-vs-GPU and explicitly excluded GPU from `htc`. This PR fixes the routing, reorganizes the skill to promote SLURM orchestration, and ships a Sol cheat sheet in three forms.

Changes
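The wall-time routing rule this PR adopts (≤15 min and urgent → `-p public -q debug`; ≤4h, GPU or not → `htc`; >4h → `public`; >4h preemptible → `-p general -q private`) can be sketched as a small decision function. This is an illustrative sketch only, not code from the skill or from `solx`: the names `Route` and `route` and the minute-based input are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    """A partition plus optional QOS (hypothetical helper type)."""
    partition: str
    qos: Optional[str] = None

    def flags(self) -> str:
        # Render as sbatch-style short flags, e.g. "-p public -q debug".
        base = f"-p {self.partition}"
        return f"{base} -q {self.qos}" if self.qos else base


def route(wall_minutes: int, urgent: bool = False, preemptible: bool = False) -> Route:
    """Pick a partition/QOS from wall-time and priority, ignoring CPU-vs-GPU."""
    if wall_minutes <= 0:
        raise ValueError("wall_minutes must be positive")
    if urgent and wall_minutes <= 15:
        return Route("public", "debug")     # short and urgent: debug QOS
    if wall_minutes <= 4 * 60:
        return Route("htc")                 # anything up to 4h, GPU included
    if preemptible:
        return Route("general", "private")  # long job that tolerates preemption
    return Route("public")                  # long job, not preemptible
```

For example, `route(30).flags()` yields `-p htc` for a 30-minute ablation, regardless of whether the job requests a GPU.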
1. Partition/QOS routing fix (`SKILL.md`, `references/{sessions,solx}.md`)

Route by wall-time + priority, not CPU-vs-GPU. GPUs live in `htc`, `public`, and `general`:

- ≤4h (incl. GPU) → `htc`
- ≤15min & urgent → `-p public -q debug`
- >4h → `public`
- >4h preemptible → `-p general -q private`
- Invalid combinations: `-p htc -q debug` (rejected) or bare `-p general` (defaults to a QOS it won't grant).

2. Skill reorg

SLURM orchestration moved above filesystem/keep; new personalized "Know your access" step (`sacctmgr show assoc user=$USER`); `keep` trimmed to delegate mechanics to the CLI.

3. Sol cheat sheet (single source: `skills/sol-skill/references/cheatsheet.md`)

- `solx cheatsheet` (embedded via `include_str!`, byte-identical to source, wired into bash/zsh/fish completions)
- `docs/cheatsheet.pdf` via `scripts/build-cheatsheet.sh`

4. Eval harness

Fixes eval #4 (it encoded the bug) and adds #8/#9/#10; new L3 `l3_sbatch_test_only` grader validates the agent's header against the live scheduler.

Verification
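A live-scheduler check of this kind can be approximated with `sbatch --test-only`, which validates a request without submitting a job. The sketch below is not the `l3_sbatch_test_only` grader itself: the function names are hypothetical, and treating a nonzero exit status as "rejected" is an assumption about how the local Slurm build reports invalid partition/QOS combinations.

```python
import subprocess
from typing import List, Optional, Tuple


def build_test_only_cmd(
    partition: str,
    time: str,
    qos: Optional[str] = None,
    gres: Optional[str] = None,
) -> List[str]:
    """Build an `sbatch --test-only` command line for a candidate header."""
    cmd = ["sbatch", "--test-only", "-p", partition, "-t", time]
    if qos:
        cmd += ["-q", qos]
    if gres:
        cmd.append(f"--gres={gres}")
    # --wrap supplies a trivial job body so no script file is needed.
    cmd += ["--wrap", "true"]
    return cmd


def scheduler_accepts(cmd: List[str], timeout: float = 30.0) -> Tuple[bool, str]:
    """Run the command on a login node; return (accepted, scheduler message).

    Assumption: a nonzero exit status means the scheduler rejected the request.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, (proc.stderr or proc.stdout).strip()
```

On a cluster login node, `scheduler_accepts(build_test_only_cmd("htc", "0-00:30:00", gres="gpu:a100:1"))` would exercise the recommended header for a 30-minute single-A100 job; passing `qos="debug"` with `htc` is the combination the PR reports as rejected.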
- `-p htc -q debug`, which the live scheduler rejects (#10).
- Facts corrected after adversarial review: `htc` does have H200; `highmem` wall is 7d, not 2d; no `myquota` wrapper (use `beegfs-ctl --getquota`); `sq` is the whole-cluster queue, not `squeue --me`; the scheduler nudges toward `htc` but does not auto-route. Every fact re-verified against the live scheduler.
- `cargo fmt`/`clippy` clean; all 3 completion scripts syntax-valid.

Notes

- `v1.0-rust` (main is still the old Python 0.5.1 line).

🤖 Generated with Claude Code