sol-skill: agent-vs-human command taxonomy + proactive PENDING playbook #37
Merged
4371ec2 to e7cfa4e
Two framing changes to the skill (no new commands), per issue #36:

- Tag status commands by audience. The "Asking the Cluster" table and the cheat sheet wrappers table now separate the agent-parseable form (SLURM-native / --json / -O) from the human-facing my*/show* wrapper, with the rule: agents default to the parseable form; reach for a color-coded wrapper only to show a human (myfairshare's dampened RealFairShare is the noted exception). Free-GPU lookup is sinfo with Gres - GresUsed (not the color-coded showgpus, and not bare %G, which is configured rather than free).
- Add a proactive "job is PENDING" decision tree to the SKILL body: diagnose cause + ETA first (squeue -O "JobID,Reason:50,StartTime" -- the Reason column is widened so a multi-word reason like 'ReqNodeNotAvail, UnavailableNodes:scNNN' isn't truncated), classify the Reason (Priority / ReqNodeNotAvail = node unavailable / Resources), right-size, and confirm a reroute wins before cancelling.

Backing Reason taxonomy in references/slurm.md; compact version in the cheat sheet.

Commands verified against the live Sol scheduler; SKILL design checked against DEVELOPMENT.md (situation-first, decisions-in-SKILL.md). Reason taxonomy and sinfo/squeue forms corrected after an adversarial review. CHANGELOG [Unreleased] entry added; cheatsheet PDF regenerated.

Closes #36.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
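The "classify the Reason" step of the decision tree can be sketched as a small lookup. This is only an illustration, not the skill's actual code: the function name `classify_pending_reason` and the returned labels are hypothetical, and the input is assumed to be the text of the widened `Reason` column.

```python
def classify_pending_reason(reason: str) -> str:
    """Map a squeue Reason string to a coarse class from the playbook.

    Hypothetical helper for illustration; the labels paraphrase the PR text.
    """
    # Multi-word reasons look like "ReqNodeNotAvail, UnavailableNodes:sc013";
    # only the token before the first comma names the reason itself.
    head = reason.split(",", 1)[0].strip()
    if head == "Priority":
        return "priority-bound: report and wait"
    if head == "ReqNodeNotAvail":
        return "node unavailable (drained/down or reserved)"
    if head == "Resources":
        return "capacity-bound: a reroute can help"
    if head.startswith("AssocGrp"):
        return "group cap"
    # Anything else (for example Reservation) is reported verbatim
    # rather than guessed at.
    return "unclassified: report the raw reason"
```

Splitting on the first comma is what makes the widened column matter: a reason cut off mid-token would no longer match any of the known names.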
e7cfa4e to 21ba6bd
This was referenced Jun 15, 2026
Closes #36.
Two framing/packaging changes to `sol-skill` (no new commands), implementing the two patterns from the issue.

## Pattern 1: human-vs-agent command taxonomy

- `SKILL.md` → "Asking the Cluster About Yourself and Your Jobs": columns relabeled to **Agent-parseable form** (parse this) vs **Human wrapper** (to show a user), with an explicit rule of thumb: agents default to the parseable form; reach for a `show*`/`my*` wrapper only to show a human, or for `myfairshare`'s dampened `RealFairShare` (the one noted exception).
- `references/slurm.md` (audience note) and `references/cheatsheet.md` (wrappers table now has a "Parse this (agent)" column).

## Pattern 2: proactive "job is PENDING" playbook

- `SKILL.md` body (per DEVELOPMENT.md "decisions in SKILL.md, detail in references/"): get cause + ETA first → classify `Reason` (`Priority` → priority-bound, report & wait; `ReqNodeNotAvail` → node unavailable; `Resources` → capacity-bound, a reroute can help) → right-size → confirm a reroute wins before cancelling. Punchline: diagnose and report, don't spray partitions.
- `Reason` taxonomy in `references/slurm.md`; compact version in the cheat sheet.

## Verification

- `%G` shows configured GPUs, not free → the "free GPUs" answer now uses `sinfo -h -O "Partition,StateLong,Gres,GresUsed,…"` (free = `Gres` - `GresUsed`).
- `grep 'Reason=[^ ]+'` truncates multi-word reasons (live example: `ReqNodeNotAvail, UnavailableNodes:sc013`) → diagnosis now uses `squeue -O "JobID,Reason:50,StartTime"` (widened column).
- `ReqNodeNotAvail` recharacterized as node unavailable (drained/down or reserved), with the distinct `Reservation` reason noted; the `Resources` ↔ `StartTime` example corrected to match real scheduler behavior; `AssocGrp…` group-cap example fixed to `AssocGrpGRES`.
- `solx` crate suite passes (143 tests), including `cheatsheet_has_the_key_sections`, which guards the `include_str!`'d cheat sheet.
- Cheat sheet PDF regenerated (`scripts/build-cheatsheet.sh`).

## Notes

- `version:` field is not bumped: per DEVELOPMENT.md the version bump and tag happen at release time. Changes are staged under `## [Unreleased]` in the CHANGELOG.
- `evals/evals.json` (gitignored, absent here), skill-creator, and (L3) live-Sol checks. The L2 layer, the `solx` crate suite, was run.

🤖 Generated with Claude Code
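The "free = `Gres` - `GresUsed`" arithmetic from the Verification section can be illustrated with a short parser. This is a sketch under stated assumptions: the helper names `gpu_counts` and `free_gpus` are hypothetical, the field shapes (`gpu:a100:4(S:0-1)`, `gpu:a100:3(IDX:0-2)`, `(null)`) are assumed to resemble typical Slurm output, and the sample values are made up. It subtracts within a single `sinfo` row only and does not aggregate across nodes.

```python
import re
from collections import Counter

# Matches "gpu:<count>" or "gpu:<type>:<count>"; suffixes such as "(IDX:0-2)"
# or "(S:0-1)" are ignored because they never contain "gpu:".
_GPU = re.compile(r"\bgpu:(?:([A-Za-z0-9_.\-]+):)?(\d+)")


def gpu_counts(gres: str) -> Counter:
    """Count GPUs per type in one Gres or GresUsed field (assumed format)."""
    counts = Counter()
    for gpu_type, n in _GPU.findall(gres):
        # Untyped entries ("gpu:4") are grouped under the key "gpu".
        counts[gpu_type or "gpu"] += int(n)
    return counts


def free_gpus(gres: str, gres_used: str) -> dict:
    """Free GPUs per type for one row: configured minus used."""
    configured = gpu_counts(gres)
    used = gpu_counts(gres_used)
    return {t: configured[t] - used[t] for t in configured}
```

For instance, `free_gpus("gpu:a100:4(S:0-1)", "gpu:a100:3(IDX:0-2)")` yields `{"a100": 1}`, whereas reading `%G` alone would only report the 4 configured GPUs.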