Skip to content

feat(agent): Icinga sub-agent on the Agent SDK — skill + profile (migration, phase 2 pilot)#32

Draft
PalmPalm7 wants to merge 2 commits into
rhpds:mainfrom
PalmPalm7:migration/icinga-sdk
Draft

feat(agent): Icinga sub-agent on the Agent SDK — skill + profile (migration, phase 2 pilot)#32
PalmPalm7 wants to merge 2 commits into
rhpds:mainfrom
PalmPalm7:migration/icinga-sdk

Conversation

@PalmPalm7

Copy link
Copy Markdown
Contributor

What

The Phase-2 pilot: run one sub-agent (Icinga) on the Claude Agent SDK, split into the "skill vs agent" model the migration is built around.

  • skills/icinga-triage/SKILL.md — the Icinga triage workflow as a first-class Agent Skill (state model, the two monitoring GitHub repos, the Step-0→diagnose procedure, gated write ops). Loads strict, 0 warnings, parsec-native, domain: icinga. This is the reusable capability.
  • src/agent/icinga_sdk.pybuild_icinga_sdk_profile() returns the skill + the same backends the legacy tools use: the monitoring-mcp sidecar (SSE) and the GitHub MCP (HTTP). Both are real MCP servers, so the SDK consumes them directly — no per-tool shim. Config-only, SDK-import-light → unit-testable.
  • runner.py — the SDK branch now applies a per-agent profile (sdk_profile_for): Icinga loads its skill + servers; other agents get an empty profile. The agent = the running instance that loads the skill.

Why

Icinga is the ideal pilot — a clean 3-tool boundary, one MCP, a prompt that maps 1:1 to a SKILL.md, and (from the MLflow baseline) the most expensive legacy sub-agent (~$1.38/query at ~452K uncached input tokens), so it's the biggest cost-saving target for the SDK's prompt-caching.

Scope / safety

Gated by agent.runtime: sdk (default legacy) → zero behavior change by default. The write-capable Icinga tools (acknowledge_problem/schedule_downtime/…) stay gated in the skill ("only when the user explicitly requests").

Dependencies

How to test

pytest tests/test_icinga_sdk.py -q   # 7 tests (profile + skill loads strict)

Result (local gate)

  • black ✓ · ruff ✓ · mypy
  • pytest tests/test_icinga_sdk.py7 passed; full suite → 105 passed, no regressions
  • The end-to-end Icinga-on-SDK run (and its cost vs legacy) is pending in-cluster verification on the personal NERC cluster — results will be commented below.

Plan: artifacts/parsec-phase2-plan.md.

PalmPalm7 and others added 2 commits June 9, 2026 02:10
Route a sub-agent task to the legacy Anthropic loop or the Claude Agent
SDK adapter based on agent.runtime (legacy|sdk, default legacy), returning
the same structured result dict either way.

Additive and dormant: nothing in the request path imports it yet (mirrors
how the rhpds#24 adapter shipped behind the flag). A follow-up PR wires the
Icinga sub-agent to dispatch through it. With the default legacy runtime
it is a transparent pass-through to src.agent.agents.run_sub_agent — zero
behavior change. The SDK branch surfaces token/cost/cache usage under
data.usage, which the Phase-2 cost benchmark compares against legacy.

11 tests: runtime resolution, legacy pass-through, SDK result
normalization, and SDK-unavailable error handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Splits the Icinga sub-agent into a first-class skill + a thin SDK profile,
the Phase-2 pilot for running a sub-agent on the Claude Agent SDK.

- skills/icinga-triage/SKILL.md: the Icinga triage workflow as an Agent
  Skill (state model, the two monitoring GitHub repos, the Step-0->diagnose
  procedure, gated write ops). Loads strict with zero warnings; parsec-native,
  domain=icinga. This is the "skill = reusable capability" half.
- src/agent/icinga_sdk.py: build_icinga_sdk_profile() returns the skill +
  the monitoring-mcp (SSE) and GitHub (HTTP) MCP servers the legacy
  query_icinga / github tools already use, so the SDK consumes the same
  backends directly. Config-only and SDK-import-light (unit-testable).
- runner.py: the SDK branch now applies the per-agent profile
  (sdk_profile_for) — Icinga loads its skill + servers; other agents get an
  empty profile. The "agent = running instance that loads the skill" half.

Still gated by agent.runtime: sdk (default legacy) -> zero behavior change.
Skill discovery in-cluster depends on the image baking skills/ (rhpds#27); the
runner seam depends on rhpds#30. End-to-end Icinga-on-SDK run is verified in the
personal NERC cluster (results to be commented on the PR).

7 Icinga tests + runner suite green; full suite passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@PalmPalm7

Copy link
Copy Markdown
Contributor Author

Test results.

  • Local gate: black ✓ · ruff ✓ · mypy ✓ · pytest tests/test_icinga_sdk.py7 passed; full suite 105 passed, no regressions.
  • The icinga-triage skill loads strict with 0 warnings (parsec-native, domain: icinga).
  • The legacy-vs-SDK cost A/B (cache-token breakdown on a real Icinga task) is built and ready as an in-cluster Job, but is blocked on cluster capacity right now — wrk-2 is at the 250-pod kubelet limit and wrk-3 is under DiskPressure, so no build/Job pod can schedule. The numbers (and the end-to-end Icinga-on-SDK run) will be posted here once the cluster has a schedulable node.

@PalmPalm7

Copy link
Copy Markdown
Contributor Author

In-cluster cost A/B — measured (personal NERC cluster, Vertex claude-sonnet-4-5@20250929, the icinga-triage skill loaded). Job pr2-icinga-abRESULT: PASS.

run input output cache_write cache_read cost
legacy (bare call) 203 599 0 0 $0.009594
SDK cold 10 1588 28,665 0 $0.140644
SDK warm #1 10 1418 0 28,665 $0.038585
SDK warm #2 10 1539 0 28,665 $0.040864

Findings:

  • Prompt caching engages — 28,665 tokens written on the cold call, then read from cache on every warm call → cost drops $0.14 → $0.039 (3.6×).
  • The old "~270× / $0.094" scare is explained: it was the one-time cold cache-write of the ~28.7K-token Claude-Code + skill prefix, not a per-call runaway.
  • Caveat (honest): the legacy row here is a stripped bare messages.create (203 input tokens, no Parsec tool schemas, single turn) — not the production legacy path ($0.33 median / $1.38 Icinga, from MLflow, with 100–450K uncached input tokens/round). So the raw 4× warm/legacy here is apples-to-oranges. The defensible takeaway: the SDK's warm overhead (~$0.039/call) is two orders of magnitude below the production legacy per-query cost, so caching is very likely a net win on real multi-round queries.
  • Next: a production-equivalent run with the Icinga MCP reachable (real multi-round tool use) to compare like-for-like.

Harness: parsec-dependencies/pr2-test/test_icinga_ab.py (legacy bare-anthropic vs SDK ×3).

@rut31337

rut31337 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Code Review — PR #32 (Draft)

Scope: 5 files, +689 lines — Icinga sub-agent as Agent SDK pilot
Effort: High

Findings (5)

1. _allowed_tools whitelists at server-prefix level, granting access to all 13 Icinga actions including destructive writes (src/agent/icinga_sdk.py:73)
["mcp__icinga", "mcp__github"] allows ALL tools from those MCP servers — including acknowledge_problem, schedule_downtime, remove_downtime, remove_acknowledgement, send_custom_notification. Gating is prompt-only. A prompt injection in Icinga alert output or model hallucination could silently acknowledge a real production alert.
Consider: Enumerate read-only tools explicitly and require a separate confirmation flow for write actions.

2. When runtime=sdk, ALL sub-agents route to _run_via_sdk — non-Icinga agents get empty profiles (src/agent/runner.py:85)
sdk_profile_for returns {} for non-Icinga agents, but they still go through _run_via_sdk with no tools, no MCP servers — bare LLM calls with only a system prompt. Cost/security/babylon queries would return hallucinated answers.
Consider: Keep non-Icinga agents on legacy path when their SDK profile is empty.

3. SKILL.md allowed-tools uses legacy wrapper names, not MCP tool names (skills/icinga-triage/SKILL.md:9)
Lists query_icinga, fetch_github_file, search_github_repo — but the SDK path uses MCP servers directly with names like mcp__icinga__get_hosts. The declared contract doesn't match actual tool names.

4. SDK path does not pass conversation_history (src/agent/runner.py:126)
Same issue as PR #30. Follow-up Icinga questions referencing prior conversation context fail silently.

5. Third copy of _section() helper (src/agent/icinga_sdk.py:80)
Now duplicated in agent_sdk_client.py, sdk_tracing.py, and icinga_sdk.py. Extract before merging any of the three PRs.

Cross-PR patterns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants