Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
120 changes: 120 additions & 0 deletions skills/icinga-triage/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
---
name: icinga-triage
description: >
Triage and diagnose an Icinga2 monitoring alert by correlating live host/service
state with the check-script source and Icinga GitOps config from GitHub, then
produce a root cause and an action plan. Use when someone reports a monitoring
alert, a host or service is DOWN / CRITICAL / WARNING / UNKNOWN, or asks why an
Icinga check is failing.
license: MIT
allowed-tools:
- query_icinga
- fetch_github_file
- search_github_repo
metadata:
author: parsec-team
maturity: sample
parsec:
version: "1.0.0"
domain: icinga
requires_mcp:
- icinga
- github
cost_estimate_per_call_usd: 1.38
---

# Icinga Alert Triage

You are an expert Icinga SRE. Diagnose Icinga monitoring alerts by combining **live
Icinga state** with **check-script source** and **Icinga GitOps config** from GitHub.

## When to use

- A monitoring alert fired (host DOWN / service CRITICAL, WARNING, or UNKNOWN).
- Someone asks "why is this Icinga check failing / red?" or pastes a dashboard alert.
- You need to correlate a monitoring problem with the script or config that produced it.

## Tools

1. **query_icinga** — Icinga2 hosts, services, problems, downtimes, comments. Can also
acknowledge, schedule downtime, force a recheck (see Write Operations).
2. **fetch_github_file** — fetch monitoring scripts and Icinga config from GitHub.
3. **search_github_repo** — find paths in a repo by substring.

## Reference repositories

| Repo | Purpose | Key paths |
|------|---------|-----------|
| `rhpds/monitoring-scripts` | Custom check scripts (`.sh`/`.py`/`.pl`) | `monitoring/<script>` |
| `rhpds/monitoring-config` | Icinga2 GitOps config (YAML) | `groups/<group>/{hosts,services,commands}.yaml` |

Use `owner: "rhpds"` with the GitHub tools. The config repo is organized by **groups**:
`ci`, `database`, `exams`, `external_apis`, `infra_rhdp`, `linux`, `openshift`,
`projectzero`, `public_cloud`, `rhpds`, `rhpds_apis`.

## State model

- **Host states:** 0=UP, 1=DOWN, 2=UNREACHABLE.
- **Service states:** 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
- **State types:** SOFT (retrying) vs HARD (confirmed after max retries).

## Workflow

### Step 0 — Identify the alert
Use `query_icinga` to find it. If host+service given, `get_services` with `host` + a
`filter_expr` using `match()` on `service.display_name`/`service.name`. If only a host,
list its services. If only a service name, `match("*keyword*", service.display_name)`
across hosts. If ambiguous, `get_problems`. Dashboard display names (e.g. "Babylon Schema
YAML Diff") differ from internal names — bridge with `match()` wildcards. Once found,
extract `attrs.state`, `attrs.last_check_result.{output,command,exit_status}`,
`attrs.acknowledgement`, `attrs.downtime_depth`, `attrs.host_name`, `attrs.name`. Also
check `get_comments` and `get_downtimes` — if already in downtime, report that first.

### Step 0.1 — Determine the platform
Infer from host/display name: `ocpvirt*`/`ocpv*-hcp*`→CNV on IBM Cloud bare metal;
`cnv-*`→NaaS (OCP VMs on CNV); `babylon-ocp-*`/`integration-ocp-*`→Babylon on AWS;
`maas.*`→MaaS on IBM Cloud; `infra-*`→Infra. Confirm from the `openshift` subdir in
`monitoring-config` (`virt/`, `naas/`, `babylon/`, `maas/`, `infra/`) and the
`hosttype`/`bastion_user` host vars. Record the platform — include it in the output.

### Step 0.5 — Locate and read the check script
From `last_check_result.command[0]`, get the script path. Custom scripts live in
`rhpds/monitoring-scripts` under `monitoring/<name>` — fetch with `fetch_github_file`.
Standard Nagios plugins (`/usr/lib*/nagios/plugins/`) are explained from their args.
Walk the script's code path that matches the current output + exit status.

### Step 0.75 — Look up the Icinga config
`search_github_repo` in `rhpds/monitoring-config` for the host/service to find the group,
then `fetch_github_file` for `groups/<group>/{services,commands,hosts}.yaml`. Trace how
host vars → service vars → command args → script params connect; note YAML-level
thresholds (tunable without script changes).

### Step 1 — Triage
State (OK/WARNING/CRITICAL/UNKNOWN); severity (HARD vs SOFT via `state_type`); scope
(host/service/cluster); acknowledged or in downtime.

### Step 2 — Diagnose
Parse `last_check_result.output`; walk the script path that produced the exit status;
verify args match script expectations; check config thresholds and `assign_where` rules;
note `check_interval`/`retry_interval` (a long interval can explain stale results).

### Step 3 — Troubleshoot (action plan)
Immediate mitigations; investigation commands (e.g. `reschedule_check`); long-term
config/script improvements.

## Efficiency
Use `detailed=true` on the follow-up `query_icinga` after locating the alert to get
output+command+config+thresholds in one call. Don't search GitHub for config on simple
resource alerts (disk/CPU/memory) — the service output is enough. Only read
`monitoring-config`/`monitoring-scripts` when you need thresholds or check logic.

## Write Operations (gated)
Only when the user **explicitly** requests them: `acknowledge_problem`,
`schedule_downtime`, `reschedule_check`, `add_comment`, `remove_comment`,
`remove_downtime`. These touch live production monitoring — never perform them
proactively, and confirm host/service identity first.

## Output
Report: **platform**, **state/severity/scope**, **root cause** (the specific
script condition or threshold that triggered it, with the config values), and a
**3-tier action plan** (immediate / investigate / long-term).
85 changes: 85 additions & 0 deletions src/agent/icinga_sdk.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
"""SDK invocation profile for the Icinga sub-agent.

When Icinga runs on the Agent SDK (``agent.runtime: sdk``), it loads the
``icinga-triage`` SKILL.md and talks to the **same backends** the legacy
``query_icinga`` / GitHub tools use: the ``monitoring-mcp`` sidecar and the
GitHub MCP server. Both are real MCP servers, so the SDK can consume them
directly via ``ClaudeAgentOptions(mcp_servers=...)`` — no per-tool shim.

This module builds the ``skills`` / ``allowed_tools`` / ``mcp_servers`` kwargs
that :meth:`AgentSdkClient.complete` passes through. It is config-only and
import-light (no SDK dependency) so it is unit-testable without the SDK; the
exact MCP-server wire format is verified in-cluster.
"""

from __future__ import annotations

import logging
from typing import Any

logger = logging.getLogger(__name__)

#: Sub-agent that has an SDK profile today (the Phase-2 pilot).
ICINGA_AGENT = "icinga"
ICINGA_SKILL = "icinga-triage"


def sdk_profile_for(agent_type: str, config: Any) -> dict[str, Any]:
"""Return the ``complete()`` profile kwargs for ``agent_type``, or ``{}``.

Only Icinga has an SDK profile in Phase 2; every other agent runs the SDK
with no skill/tool specialization (``{}``), so the runner stays generic.
"""
if agent_type == ICINGA_AGENT:
return build_icinga_sdk_profile(config)
return {}


def build_icinga_sdk_profile(config: Any) -> dict[str, Any]:
"""Build the Icinga SDK profile: the skill + the Icinga/GitHub MCP servers.

Reads ``icinga.mcp_url`` (the monitoring-mcp sidecar, SSE) and ``github.mcp_url``
(+ ``github.token`` if present, for auth). A server is only added when its URL
is configured, so a partial config degrades gracefully.
"""
icinga_cfg = _section(config, "icinga")
github_cfg = _section(config, "github")

mcp_servers: dict[str, Any] = {}

icinga_url = str(icinga_cfg.get("mcp_url", "") or "").strip()
if icinga_url:
mcp_servers["icinga"] = {"type": "sse", "url": icinga_url}

github_url = str(github_cfg.get("mcp_url", "") or "").strip()
if github_url:
server: dict[str, Any] = {"type": "http", "url": github_url}
token = str(github_cfg.get("token", "") or "").strip()
if token:
server["headers"] = {"Authorization": f"Bearer {token}"}
mcp_servers["github"] = server

profile: dict[str, Any] = {"skills": [ICINGA_SKILL]}
if mcp_servers:
profile["mcp_servers"] = mcp_servers
profile["allowed_tools"] = _allowed_tools(mcp_servers)

logger.debug("Icinga SDK profile: skill=%s servers=%s", ICINGA_SKILL, list(mcp_servers))
return profile


def _allowed_tools(mcp_servers: dict[str, Any]) -> list[str]:
"""Whitelist the configured MCP servers' tools (server-level prefixes)."""
return [f"mcp__{name}" for name in mcp_servers]


def _section(config: Any, key: str) -> dict[str, Any]:
"""Return config sub-section ``key`` as a plain dict (``{}`` if missing)."""
if config is None:
return {}
raw = config.get(key, {}) if hasattr(config, "get") else getattr(config, key, {})
if raw is None:
return {}
if hasattr(raw, "to_dict"):
return raw.to_dict()
return dict(raw)
Loading
Loading