feat(agent): Icinga sub-agent on the Agent SDK — skill + profile (migration, phase 2 pilot) by PalmPalm7 · Pull Request #32 · rhpds/parsec

PalmPalm7 · 2026-06-08T18:22:23Z

What

The Phase-2 pilot: run one sub-agent (Icinga) on the Claude Agent SDK, split into the "skill vs agent" model the migration is built around.

skills/icinga-triage/SKILL.md — the Icinga triage workflow as a first-class Agent Skill (state model, the two monitoring GitHub repos, the Step-0→diagnose procedure, gated write ops). Loads strict, 0 warnings, parsec-native, domain: icinga. This is the reusable capability.
src/agent/icinga_sdk.py — build_icinga_sdk_profile() returns the skill + the same backends the legacy tools use: the monitoring-mcp sidecar (SSE) and the GitHub MCP (HTTP). Both are real MCP servers, so the SDK consumes them directly — no per-tool shim. Config-only, SDK-import-light → unit-testable.
runner.py — the SDK branch now applies a per-agent profile (sdk_profile_for): Icinga loads its skill + servers; other agents get an empty profile. The agent = the running instance that loads the skill.

Why

Icinga is the ideal pilot — a clean 3-tool boundary, one MCP, a prompt that maps 1:1 to a SKILL.md, and (from the MLflow baseline) the most expensive legacy sub-agent (~$1.38/query at ~452K uncached input tokens), so it's the biggest cost-saving target for the SDK's prompt-caching.

Scope / safety

Gated by agent.runtime: sdk (default legacy) → zero behavior change by default. The write-capable Icinga tools (acknowledge_problem/schedule_downtime/…) stay gated in the skill ("only when the user explicitly requests").

Dependencies

feat(agent): AgentRunner runtime dispatcher (Agent SDK migration, phase 2) #30 (AgentRunner) — this branch is stacked on it; the runner diff appears here until feat(agent): AgentRunner runtime dispatcher (Agent SDK migration, phase 2) #30 merges.
feat(skills): ship sample Parsec-native skills + bake into image #27 (bake skills/ into the image) — needed for in-cluster skill discovery.

How to test

pytest tests/test_icinga_sdk.py -q   # 7 tests (profile + skill loads strict)

Result (local gate)

black ✓ · ruff ✓ · mypy ✓
pytest tests/test_icinga_sdk.py → 7 passed; full suite → 105 passed, no regressions
The end-to-end Icinga-on-SDK run (and its cost vs legacy) is pending in-cluster verification on the personal NERC cluster — results will be commented below.

Plan: artifacts/parsec-phase2-plan.md.

Route a sub-agent task to the legacy Anthropic loop or the Claude Agent SDK adapter based on agent.runtime (legacy|sdk, default legacy), returning the same structured result dict either way. Additive and dormant: nothing in the request path imports it yet (mirrors how the rhpds#24 adapter shipped behind the flag). A follow-up PR wires the Icinga sub-agent to dispatch through it. With the default legacy runtime it is a transparent pass-through to src.agent.agents.run_sub_agent — zero behavior change. The SDK branch surfaces token/cost/cache usage under data.usage, which the Phase-2 cost benchmark compares against legacy. 11 tests: runtime resolution, legacy pass-through, SDK result normalization, and SDK-unavailable error handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Splits the Icinga sub-agent into a first-class skill + a thin SDK profile, the Phase-2 pilot for running a sub-agent on the Claude Agent SDK. - skills/icinga-triage/SKILL.md: the Icinga triage workflow as an Agent Skill (state model, the two monitoring GitHub repos, the Step-0->diagnose procedure, gated write ops). Loads strict with zero warnings; parsec-native, domain=icinga. This is the "skill = reusable capability" half. - src/agent/icinga_sdk.py: build_icinga_sdk_profile() returns the skill + the monitoring-mcp (SSE) and GitHub (HTTP) MCP servers the legacy query_icinga / github tools already use, so the SDK consumes the same backends directly. Config-only and SDK-import-light (unit-testable). - runner.py: the SDK branch now applies the per-agent profile (sdk_profile_for) — Icinga loads its skill + servers; other agents get an empty profile. The "agent = running instance that loads the skill" half. Still gated by agent.runtime: sdk (default legacy) -> zero behavior change. Skill discovery in-cluster depends on the image baking skills/ (rhpds#27); the runner seam depends on rhpds#30. End-to-end Icinga-on-SDK run is verified in the personal NERC cluster (results to be commented on the PR). 7 Icinga tests + runner suite green; full suite passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PalmPalm7 · 2026-06-08T19:04:58Z

Test results.

Local gate: black ✓ · ruff ✓ · mypy ✓ · pytest tests/test_icinga_sdk.py → 7 passed; full suite 105 passed, no regressions.
The icinga-triage skill loads strict with 0 warnings (parsec-native, domain: icinga).
The legacy-vs-SDK cost A/B (cache-token breakdown on a real Icinga task) is built and ready as an in-cluster Job, but is blocked on cluster capacity right now — wrk-2 is at the 250-pod kubelet limit and wrk-3 is under DiskPressure, so no build/Job pod can schedule. The numbers (and the end-to-end Icinga-on-SDK run) will be posted here once the cluster has a schedulable node.

PalmPalm7 · 2026-06-08T20:01:53Z

In-cluster cost A/B — measured (personal NERC cluster, Vertex claude-sonnet-4-5@20250929, the icinga-triage skill loaded). Job pr2-icinga-ab → RESULT: PASS.

run	input	output	cache_write	cache_read	cost
legacy (bare call)	203	599	0	0	$0.009594
SDK cold	10	1588	28,665	0	$0.140644
SDK warm #1	10	1418	0	28,665	$0.038585
SDK warm #2	10	1539	0	28,665	$0.040864

Findings:

Prompt caching engages — 28,665 tokens written on the cold call, then read from cache on every warm call → cost drops $0.14 → $0.039 (3.6×).
The old "~270× / $0.094" scare is explained: it was the one-time cold cache-write of the ~28.7K-token Claude-Code + skill prefix, not a per-call runaway.
Caveat (honest): the legacy row here is a stripped bare messages.create (203 input tokens, no Parsec tool schemas, single turn) — not the production legacy path ($0.33 median / $1.38 Icinga, from MLflow, with 100–450K uncached input tokens/round). So the raw 4× warm/legacy here is apples-to-oranges. The defensible takeaway: the SDK's warm overhead (~$0.039/call) is two orders of magnitude below the production legacy per-query cost, so caching is very likely a net win on real multi-round queries.
Next: a production-equivalent run with the Icinga MCP reachable (real multi-round tool use) to compare like-for-like.

Harness: parsec-dependencies/pr2-test/test_icinga_ab.py (legacy bare-anthropic vs SDK ×3).

rut31337 · 2026-06-08T23:14:03Z

Code Review — PR #32 (Draft)

Scope: 5 files, +689 lines — Icinga sub-agent as Agent SDK pilot
Effort: High

Findings (5)

1. _allowed_tools whitelists at server-prefix level, granting access to all 13 Icinga actions including destructive writes (src/agent/icinga_sdk.py:73)
["mcp__icinga", "mcp__github"] allows ALL tools from those MCP servers — including acknowledge_problem, schedule_downtime, remove_downtime, remove_acknowledgement, send_custom_notification. Gating is prompt-only. A prompt injection in Icinga alert output or model hallucination could silently acknowledge a real production alert.
Consider: Enumerate read-only tools explicitly and require a separate confirmation flow for write actions.

2. When runtime=sdk, ALL sub-agents route to _run_via_sdk — non-Icinga agents get empty profiles (src/agent/runner.py:85)
sdk_profile_for returns {} for non-Icinga agents, but they still go through _run_via_sdk with no tools, no MCP servers — bare LLM calls with only a system prompt. Cost/security/babylon queries would return hallucinated answers.
Consider: Keep non-Icinga agents on legacy path when their SDK profile is empty.

3. SKILL.md allowed-tools uses legacy wrapper names, not MCP tool names (skills/icinga-triage/SKILL.md:9)
Lists query_icinga, fetch_github_file, search_github_repo — but the SDK path uses MCP servers directly with names like mcp__icinga__get_hosts. The declared contract doesn't match actual tool names.

4. SDK path does not pass conversation_history (src/agent/runner.py:126)
Same issue as PR #30. Follow-up Icinga questions referencing prior conversation context fail silently.

5. Third copy of _section() helper (src/agent/icinga_sdk.py:80)
Now duplicated in agent_sdk_client.py, sdk_tracing.py, and icinga_sdk.py. Extract before merging any of the three PRs.

Cross-PR patterns

_section() duplication appears across PRs feat(agent): AgentRunner runtime dispatcher (Agent SDK migration, phase 2) #30, feat(llm): MLflow tracing for the Agent SDK subprocess (migration, phase 2) #31, feat(agent): Icinga sub-agent on the Agent SDK — skill + profile (migration, phase 2 pilot) #32 — extract to src/llm/_config.py or similar
conversation_history gap originates in PR feat(agent): AgentRunner runtime dispatcher (Agent SDK migration, phase 2) #30's _run_via_sdk — fix there to resolve for all downstream PRs

PalmPalm7 and others added 2 commits June 9, 2026 02:10

PalmPalm7 mentioned this pull request Jun 8, 2026

feat(agent): AgentRunner runtime dispatcher (Agent SDK migration, phase 2) #30

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent): Icinga sub-agent on the Agent SDK — skill + profile (migration, phase 2 pilot)#32

feat(agent): Icinga sub-agent on the Agent SDK — skill + profile (migration, phase 2 pilot)#32
PalmPalm7 wants to merge 2 commits into
rhpds:mainfrom
PalmPalm7:migration/icinga-sdk

PalmPalm7 commented Jun 8, 2026

Uh oh!

PalmPalm7 commented Jun 8, 2026

Uh oh!

PalmPalm7 commented Jun 8, 2026

Uh oh!

rut31337 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

PalmPalm7 commented Jun 8, 2026

What

Why

Scope / safety

Dependencies

How to test

Result (local gate)

Uh oh!

PalmPalm7 commented Jun 8, 2026

Uh oh!

PalmPalm7 commented Jun 8, 2026

Uh oh!

rut31337 commented Jun 8, 2026

Code Review — PR #32 (Draft)

Findings (5)

Cross-PR patterns

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants