 ____              _                 _        _    ___ 
|  _ \ _   _ _ __ | |__   ___   ___ | | __    / \  |_ _|
| |_) | | | | '_ \| '_ \ / _ \ / _ \| |/ /   / _ \  | | 
|  _ <| |_| | | | | |_) | (_) | (_) |   <   / ___ \ | | 
|_| \_\\__,_|_| |_|_.__/ \___/ \___/|_|\_\ /_/   \_\___|
Your AI SRE, always on call
RunbookAI helps on-call engineers go from alert to likely root cause faster with hypothesis-driven investigation, runbook-aware context, and approval-gated remediation.
Built for SRE and platform teams operating AWS and Kubernetes who need speed without losing auditability.
See RunbookAI's hypothesis-driven investigation in action:
npx @runbook-agent/runbook demo

Watch the agent investigate a simulated incident, forming hypotheses, gathering evidence, and identifying the root cause, all in your terminal.
Use --fast for a quicker demo: npx @runbook-agent/runbook demo --fast
npm install -g @runbook-agent/runbook

Package: @runbook-agent/runbook
# Set your API key
export ANTHROPIC_API_KEY=your-api-key
# Run the setup wizard
runbook init

Then run your first investigation:

runbook investigate PD-12345

Expected output:
Investigation: PD-12345
Hypothesis: checkout-api latency spike caused by Redis connection exhaustion (confidence: 0.86)
Evidence: CloudWatch errors, Redis saturation, pod restart timeline
Next step: apply runbook "Redis Connection Exhaustion" (approval required)
Prerequisites: Node.js 20+, Bun
git clone https://github.com/Runbook-Agent/RunbookAI.git runbook
cd runbook
bun install
bun run dev investigate PD-12345

- Faster triage: Research-first and hypothesis-driven workflows reduce alert-to-understanding time.
- Safer execution: Mutating actions require approval and can include rollback guidance.
- Operational memory: Knowledge retrieval uses your runbooks, postmortems, and architecture notes.
- Full audit trail of queries, hypotheses, and decisions.
- Approval gates for sensitive actions.
- Kubernetes access is read-only by default and can be explicitly enabled.
- Hypothesis-driven incident investigation with branch/prune logic.
- Runtime skill execution with approval-aware workflow steps.
- Dynamic skill and knowledge wiring at runtime.
- Incident integrations for PagerDuty and OpsGenie.
- Optional GitHub/GitLab code-fix candidate retrieval during remediation planning.
- Claude Code integration with context injection and safety hooks.
- MCP server exposing searchable operational knowledge.
Commands below use the installed runbook binary. During local development, use bun run dev <command>.
Run a pre-scripted investigation demo showcasing RunbookAI's hypothesis-driven workflow. No API keys or configuration required.
runbook demo # Normal speed
runbook demo --fast    # 3x speed

Ask questions about your infrastructure in natural language.
runbook ask "What's the status of the checkout-api service?"
runbook ask "Show me RDS instances with high CPU"
runbook ask "Who owns the payments service?"Perform a hypothesis-driven investigation of a PagerDuty or OpsGenie incident.
runbook investigate PD-12345
runbook investigate PD-12345 --auto-remediate
runbook investigate PD-12345 --learn
runbook investigate PD-12345 --learn --apply-runbook-updates

The agent will:
- Gather incident context
- Form initial hypotheses
- Test each hypothesis with targeted queries
- Branch deeper on strong evidence
- Identify root cause with confidence level
- Suggest remediation
With --learn, Runbook also writes learning artifacts to .runbook/learning/<investigation-id>/:
- a draft postmortem-<incident>.md
- knowledge-suggestions.json
- runbook update proposals (or direct updates with --apply-runbook-updates)
Get a quick overview of your infrastructure health.
Sync knowledge from all configured sources (runbooks, post-mortems, etc.).
Search the knowledge base.
runbook knowledge search "redis connection timeout"

Authenticate with Google Drive for knowledge sync.
# Set up OAuth credentials first
export GOOGLE_CLIENT_ID=your-client-id
export GOOGLE_CLIENT_SECRET=your-client-secret
# Run authentication flow
runbook knowledge auth google

This opens a browser for Google OAuth consent and saves the refresh token to your config.
Start Slack mention/event handling for @runbookAI requests in alert channels.
# Local development (Socket Mode)
runbook slack-gateway --mode socket
# HTTP Events API mode
runbook slack-gateway --mode http --port 3001

See setup details in docs/SLACK_GATEWAY.md.
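A minimal local run might look like the following sketch, assuming your Slack app credentials are supplied through the same environment variables the reference configuration points at (SLACK_BOT_TOKEN, SLACK_APP_TOKEN, SLACK_SIGNING_SECRET); the values shown are placeholders:

export SLACK_BOT_TOKEN=xoxb-...        # bot token for your Slack app
export SLACK_APP_TOKEN=xapp-...        # app-level token, required for Socket Mode
export SLACK_SIGNING_SECRET=...        # used to verify HTTP Events API requests
runbook slack-gateway --mode socket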
RunbookAI integrates deeply with Claude Code to provide contextual knowledge during your AI-assisted debugging sessions.
Install Claude Code hooks for automatic context injection:
# Project-scoped install (recommended)
runbook integrations claude enable
# Check installation
runbook integrations claude status

When enabled, RunbookAI automatically:
- Injects relevant context: Detects services and symptoms in your prompts and provides matching runbooks and known issues
- Blocks dangerous commands: Prevents accidental destructive operations (kubectl delete, rm -rf, etc.)
- Tracks session state: Maintains investigation context across prompts
Start an MCP server exposing RunbookAI knowledge as tools Claude Code can query:
# Start MCP server
runbook mcp serve
# List available tools
runbook mcp tools

Available tools: search_runbooks, get_known_issues, search_postmortems, get_knowledge_stats, list_services
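To make these tools visible inside a Claude Code session, you can register the server with Claude Code's MCP registration command; this is a sketch only, and the exact syntax may differ across Claude Code versions:

# Assumption: register the RunbookAI knowledge server as a local MCP server
claude mcp add runbook -- runbook mcp serve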
Save and resume investigation state across sessions:
# List checkpoints for an investigation
runbook checkpoint list --investigation inv-12345
# Show checkpoint details
runbook checkpoint show abc123def456 --investigation inv-12345
# Delete a specific checkpoint
runbook checkpoint delete abc123def456 --investigation inv-12345
# Delete all checkpoints for an investigation
runbook checkpoint delete --investigation inv-12345 --all

See docs/CLAUDE_INTEGRATION.md for full documentation.
Generate learning artifacts directly from a stored Claude session:
runbook integrations claude learn <session-id> --incident-id PD-123

See storage and ingestion architecture in docs/CLAUDE_SESSION_STORAGE_PROPOSAL.md.
Use the setup wizard to generate and update config files:
runbook init

Example output (abridged):
═══════════════════════════════════════════
Runbook Setup Wizard
═══════════════════════════════════════════
Step 1: Choose your AI provider
Step 2: Enter your API key
...
Setup Complete!
Configuration complete! Your settings have been saved to .runbook/services.yaml
This writes .runbook/config.yaml and .runbook/services.yaml. A reference config.yaml looks like:
llm:
  provider: anthropic
  model: claude-sonnet-4-20250514
providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2]
  kubernetes:
    enabled: false
  github:
    enabled: false
    repository: acme/platform # owner/repo
    token: ${GITHUB_TOKEN}
    baseUrl: https://api.github.com
    timeoutMs: 5000
  gitlab:
    enabled: false
    project: acme/platform # path or numeric project ID
    token: ${GITLAB_TOKEN}
    baseUrl: https://gitlab.com/api/v4
    timeoutMs: 5000
operabilityContext:
  enabled: false
  adapter: none
  baseUrl: https://context.company.internal
  apiKey: ${RUNBOOK_OPERABILITY_CONTEXT_API_KEY}
  timeoutMs: 5000
incident:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}
  opsgenie:
    enabled: false
    apiKey: ${OPSGENIE_API_KEY}
slack:
  enabled: false
  botToken: ${SLACK_BOT_TOKEN}
  appToken: ${SLACK_APP_TOKEN}
  signingSecret: ${SLACK_SIGNING_SECRET}
  events:
    enabled: false
    mode: socket
    port: 3001
    alertChannels: [C01234567]
    allowedUsers: [U01234567]
    requireThreadedMentions: true
knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true
    # Confluence Cloud/Server
    - type: confluence
      baseUrl: https://mycompany.atlassian.net
      spaceKey: SRE
      labels: [runbook, postmortem]
      auth:
        email: ${CONFLUENCE_EMAIL}
        apiToken: ${CONFLUENCE_API_TOKEN}
    # Google Drive (requires OAuth - run `runbook knowledge auth google`)
    - type: google_drive
      folderIds: ['your-folder-id']
      clientId: ${GOOGLE_CLIENT_ID}
      clientSecret: ${GOOGLE_CLIENT_SECRET}
      refreshToken: ${GOOGLE_REFRESH_TOKEN}
      includeSubfolders: true
integrations:
  claude:
    sessionStorage:
      # local | s3
      backend: local
      # keep a local copy even if backend is s3
      mirrorLocal: true
      localBaseDir: .runbook/hooks/claude
      s3:
        bucket: your-runbook-session-logs
        prefix: runbook/hooks/claude
        region: us-east-1
        # optional for MinIO/custom S3-compatible endpoints
        endpoint: https://s3.amazonaws.com
        forcePathStyle: false

See PLAN.md for full configuration options.
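The ${...} placeholders above are assumed to resolve from environment variables, so export the ones for the providers and sources you enable before running commands; the values below are placeholders:

export ANTHROPIC_API_KEY=sk-ant-...      # LLM provider
export PAGERDUTY_API_KEY=...             # incident.pagerduty
export GITHUB_TOKEN=ghp_...              # providers.github, only if enabled
export CONFLUENCE_EMAIL=you@company.com  # Confluence knowledge source
export CONFLUENCE_API_TOKEN=...
runbook investigate PD-12345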
Use the built-in simulation utilities to stage deterministic chat + investigate demos:
# Create simulation runbooks and sync knowledge
bun run simulate:setup
# Optional: provision failing AWS resources + trigger PagerDuty incident
bun run simulate:setup -- --with-aws --create-pd-incident
# Cleanup simulation infra/resources
bun run simulate:cleanup

Detailed guide: docs/SIMULATE_INCIDENTS.md
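If you opted into --create-pd-incident, you can point the investigator at the resulting incident before tearing the simulation down; the incident ID below is a placeholder for whatever PagerDuty actually assigns:

runbook investigate PD-67890   # placeholder: use the incident ID created by simulate:setup
bun run simulate:cleanup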
Run real-loop investigation benchmarks against fixture datasets:
bun run eval:investigate -- \
  --fixtures examples/evals/rcaeval-fixtures.generated.json \
  --out .runbook/evals/rcaeval-report.json

Run all benchmark adapters in one command (RCAEval + Rootly + TraceRCA):
bun run eval:all -- \
  --out-dir .runbook/evals/all-benchmarks \
  --rcaeval-input examples/evals/rcaeval-input.sample.json \
  --tracerca-input examples/evals/tracerca-input.sample.json

eval:all now auto-runs dataset bootstrap (src/eval/setup-datasets.ts) before benchmarking.
It will attempt to clone required public dataset repos under examples/evals/datasets/, then continue
with available local inputs and fallback fixtures when network/downloads are unavailable.
To run without bootstrap:
bun run eval:all -- --no-setup

This generates per-benchmark reports plus an aggregate summary:
- .runbook/evals/all-benchmarks/rcaeval-report.json
- .runbook/evals/all-benchmarks/rootly-report.json
- .runbook/evals/all-benchmarks/tracerca-report.json
- .runbook/evals/all-benchmarks/summary.json
See docs/INVESTIGATION_EVAL.md for dataset setup and converter workflows.
Create markdown files in .runbook/runbooks/ with frontmatter:
---
type: runbook
services: [checkout-api, cart-service]
symptoms:
- "Redis connection timeout"
- "Connection pool exhausted"
severity: sev2
---
# Redis Connection Exhaustion
## Symptoms
...
## Quick Diagnosis
...
## Mitigation Steps
...

See examples/runbooks/ for examples.
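Once your knowledge sources have been synced (see the knowledge commands above), searching for one of the declared symptoms should surface the new runbook:

runbook knowledge search "Connection pool exhausted"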
Query/Incident
↓
Knowledge Retrieval (runbooks, post-mortems)
↓
Hypothesis Formation
↓
Targeted Evidence Gathering
↓
Branch (strong evidence) / Prune (no evidence)
↓
Root Cause + Confidence
↓
Remediation (with approval)
↓
Scratchpad (full audit trail)
# Run in development mode
bun run dev ask "test query"
# Type check
bun run typecheck
# Lint
bun run lint
# Format
bun run format

This repository uses Release Please for automated versioning and GitHub releases.
- Merge regular PRs into main. The Release Please workflow updates or opens a release PR with version bumps + changelog updates.
- Merge that release PR.
- Release Please creates a git tag (vX.Y.Z) and publishes a GitHub Release.
- In the same workflow run, npm publish executes automatically when enabled.
/.github/workflows/release-please.yml
- Trigger: push to main (or manual dispatch)
- Responsibility: maintain the release PR, create tags/releases after the release PR merges, then publish to npm when a release is created
Use this local command to run release checks and trigger Release Please:
npm run release

Prerequisites:
- gh CLI installed and authenticated (gh auth login)
- Clean local working tree on main
- Local main synced with origin/main
Helper variants:
- npm run release:dry-run to validate preconditions without triggering the workflow
- npm run release:skip-checks to bypass local checks (typecheck/lint/test/build)
If your GitHub organization blocks write permissions for GITHUB_TOKEN, set a repo secret:
- RELEASE_PLEASE_TOKEN (PAT or fine-grained token with permission to write contents/pull requests/issues)
The release workflow automatically prefers RELEASE_PLEASE_TOKEN when present.
Use npm Trusted Publishing (OIDC), then enable publishing:
- npm package settings: add this repository/workflow as a trusted publisher
  - Provider: GitHub Actions
  - Repository: Runbook-Agent/RunbookAI
  - Workflow filename: release-please.yml
- GitHub repo variable: NPM_PUBLISH_ENABLED=true
Notes:
- No npm token is required in GitHub secrets.
- Publish is skipped unless NPM_PUBLISH_ENABLED=true.
- The release tag must match the package.json version.
- Ensure the package name/access are valid for npm before enabling publish (currently @runbook-agent/runbook in package.json).
- If npm publish logs show "Access token expired or revoked", remove NODE_AUTH_TOKEN/NPM_TOKEN secrets at the org/repo/environment level so trusted publishing can use OIDC.
Release Please uses Conventional Commits for semver bumping:
- fix: -> patch
- feat: -> minor
- feat!: or BREAKING CHANGE: -> major
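For example, commit messages like these would produce the corresponding bumps (the subjects are illustrative, not from this repository's history):

git commit -m "fix: handle empty PagerDuty webhook payloads"   # patch release
git commit -m "feat: add OpsGenie incident enrichment"         # minor release
git commit -m "feat!: require Node.js 20"                      # major release (breaking change)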
- Dynamic runtime skills now execute workflow steps with approval hooks.
- Kubernetes tooling is available as a read-only query surface and can be gated with providers.kubernetes.enabled.
- Investigation evaluation now supports RCAEval, Rootly, and TraceRCA via a unified runner (bun run eval:all).
- Incident simulation tooling uses generic scripts: bun run simulate:setup and bun run simulate:cleanup.
- Claude Code integration includes context hooks, checkpoints, and MCP knowledge tools.
- Operability context provider contract added for external context backends (Sourcegraph/checkpoints style): docs/OPERABILITY_CONTEXT_PROVIDER.md.
- Added operability ingestion commands with local spool replay (runbook operability ingest|replay|status) and automatic Claude hook forwarding.
- Added built-in operability context adapters (sourcegraph, entireio, runbook_context, custom) with config-based provider factory wiring.
- Ingestion setup/runbook for teams: docs/OPERABILITY_INGESTION.md.
- Full implementation details: docs/CHANGES_2026-02-08.md and CODEX_PLAN.md
MIT