sol-skill: agent-vs-human command taxonomy + proactive PENDING playbook#37

Merged: Shu-Wan merged 1 commit into main from skill/issue-36-agent-guidance on Jun 15, 2026
@Shu-Wan Shu-Wan commented Jun 15, 2026

Owner

Closes #36.

Two framing/packaging changes to sol-skill (no new commands), implementing the two patterns from the issue.

Pattern 1 — human-vs-agent command taxonomy

  • SKILL.md → "Asking the Cluster About Yourself and Your Jobs": columns relabeled to Agent-parseable form (parse this) vs Human wrapper (to show a user), with an explicit rule of thumb — agents default to the parseable form; reach for a show*/my* wrapper only to show a human, or for myfairshare's dampened RealFairShare (the one noted exception).
  • The wrapper-first table under Situation-Aware Job Management gets the same audience caveat + a cross-ref, so an agent that lands there first is redirected.
  • Mirrored in references/slurm.md (audience note) and references/cheatsheet.md (wrappers table now has a "Parse this (agent)" column).

Pattern 2 — proactive "job is PENDING" playbook

  • New decision tree in the SKILL.md body (per DEVELOPMENT.md "decisions in SKILL.md, detail in references/"): get cause + ETA first → classify Reason (Priority → priority-bound, report & wait; ReqNodeNotAvail → node unavailable; Resources → capacity-bound, a reroute can help) → right-size → confirm a reroute wins before cancelling. Punchline: diagnose and report, don't spray partitions.
  • Backing Reason taxonomy in references/slurm.md; compact version in the cheat sheet.
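The classification step of the playbook can be sketched as a small lookup. This is a minimal illustration, not the decision tree from SKILL.md itself: the function name `classify_pending`, the returned label strings, and the fallback for unlisted reasons are assumptions made for the example. Only the three Reason values and their meanings come from the description above.

```python
def classify_pending(reason: str) -> str:
    """Map a Slurm pending Reason to the playbook's category (illustrative)."""
    # A reason can carry detail after a comma, for example
    # "ReqNodeNotAvail, UnavailableNodes:sc013"; classify on the leading token.
    head = reason.split(",", 1)[0].strip()
    if head == "Priority":
        return "priority-bound: report and wait"
    if head == "ReqNodeNotAvail":
        return "node unavailable"
    if head == "Resources":
        return "capacity-bound: a reroute can help"
    # Assumption: anything else is reported verbatim rather than guessed at.
    return "unclassified: report the raw reason"


print(classify_pending("ReqNodeNotAvail, UnavailableNodes:sc013"))
```

Classifying on the leading token keeps the detail after the comma available for the report while still matching the category.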

Verification

  • Commands checked against the live Sol scheduler. This caught and fixed several issues during an adversarial self-review before opening the PR:
    • %G shows configured GPUs, not free → the "free GPUs" answer now uses sinfo -h -O "Partition,StateLong,Gres,GresUsed,…" (free = Gres - GresUsed).
    • grep 'Reason=[^ ]+' truncates multi-word reasons (live example: ReqNodeNotAvail, UnavailableNodes:sc013) → diagnosis now uses squeue -O "JobID,Reason:50,StartTime" (widened column).
    • ReqNodeNotAvail recharacterized as node unavailable (drained/down or reserved), with the distinct Reservation reason noted; the ResourcesStartTime example corrected to match real scheduler behavior; AssocGrp… group-cap example fixed to AssocGrpGRES.
  • solx crate suite passes (143 tests), including cheatsheet_has_the_key_sections which guards the include_str!'d cheat sheet.
  • Cheat-sheet PDF regenerated from source (scripts/build-cheatsheet.sh).
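Two of the fixes above can be illustrated in a few lines. This is a sketch under stated assumptions: the field shapes `gpu:a100:4(S:0-1)` and `gpu:a100:1(IDX:0)` are assumed typical of the sinfo Gres and GresUsed columns and are not taken from this PR, and the helper names `gpu_counts` and `free_gpus` are hypothetical.

```python
import re

# Assumed field shape: "gpu[:type]:count" with an optional "(...)" suffix.
_GPU = re.compile(r"gpu(?::([^:,(]+))?:(\d+)")


def gpu_counts(field: str) -> dict[str, int]:
    """Sum GPU counts per type from one Gres or GresUsed field."""
    counts: dict[str, int] = {}
    for gpu_type, n in _GPU.findall(field):
        key = gpu_type or "any"  # untyped entries such as "gpu:4"
        counts[key] = counts.get(key, 0) + int(n)
    return counts


def free_gpus(gres: str, gres_used: str) -> dict[str, int]:
    """free = Gres - GresUsed, computed per GPU type."""
    used = gpu_counts(gres_used)
    return {t: n - used.get(t, 0) for t, n in gpu_counts(gres).items()}


# Why 'Reason=[^ ]+' truncates: the match stops at the first space.
sample = "Reason=ReqNodeNotAvail, UnavailableNodes:sc013"
truncated = re.search(r"Reason=[^ ]+", sample).group(0)

print(free_gpus("gpu:a100:4(S:0-1)", "gpu:a100:1(IDX:0)"))
print(truncated)
```

The regex match on `sample` keeps only `Reason=ReqNodeNotAvail,` and drops the node list, which is the motivation for reading a widened Reason column instead.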

Notes

  • Skill-only doc change: the version: field is not bumped — per DEVELOPMENT.md the version bump and tag happen at release time. Changes are staged under ## [Unreleased] in the CHANGELOG.
  • The behavioral eval harness (L0/L1) was not run: it needs the maintainer-private evals/evals.json (gitignored, absent here), skill-creator, and (L3) live-Sol checks. The L2 layer — the solx crate suite — was run.

🤖 Generated with Claude Code

@Shu-Wan Shu-Wan force-pushed the skill/issue-36-agent-guidance branch from 4371ec2 to e7cfa4e on June 15, 2026 22:52
Two framing changes to the skill (no new commands), per issue #36:

- Tag status commands by audience. The "Asking the Cluster" table and
  the cheat sheet wrappers table now separate the agent-parseable form
  (SLURM-native / --json / -O) from the human-facing my*/show* wrapper,
  with the rule: agents default to the parseable form; reach for a
  color-coded wrapper only to show a human (myfairshare's dampened
  RealFairShare is the noted exception). Free-GPU lookup is sinfo with
  Gres - GresUsed (not the color-coded showgpus, and not bare %G, which
  is configured rather than free).

- Add a proactive "job is PENDING" decision tree to the SKILL body:
  diagnose cause + ETA first (squeue -O "JobID,Reason:50,StartTime" --
  the Reason column is widened so a multi-word reason like
  'ReqNodeNotAvail, UnavailableNodes:scNNN' isn't truncated), classify
  the Reason (Priority / ReqNodeNotAvail = node unavailable / Resources),
  right-size, and confirm a reroute wins before cancelling. Backing
  Reason taxonomy in references/slurm.md; compact version in the cheat
  sheet.

Commands verified against the live Sol scheduler; SKILL design checked
against DEVELOPMENT.md (situation-first, decisions-in-SKILL.md). Reason
taxonomy and sinfo/squeue forms corrected after an adversarial review.
CHANGELOG [Unreleased] entry added; cheatsheet PDF regenerated. Closes #36.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Shu-Wan Shu-Wan force-pushed the skill/issue-36-agent-guidance branch from e7cfa4e to 21ba6bd on June 15, 2026 23:24
@Shu-Wan Shu-Wan merged commit 6bff1e2 into main Jun 15, 2026
3 checks passed
@Shu-Wan Shu-Wan deleted the skill/issue-36-agent-guidance branch June 15, 2026 23:30
Development

Successfully merging this pull request may close these issues.

Skill guidance for AI agents: (1) human-vs-agent command taxonomy, (2) a proactive "job is PENDING" playbook
