Agentic Root Cause Analysis engine for AI-powered autonomous reliability, SRE, and support.
AutoRCA-Core is a graph-based RCA engine that analyzes logs, metrics, traces, configs, and documentation to automatically identify root causes and recommend remediation steps. It's designed as a reference architecture for building autonomous operations and reliability agents.
AutoRCA-Core provides:
- Multi-signal ingestion: Logs, metrics, distributed traces, and config changes
- Graph-based topology: Builds service dependency graphs and causal relationships
- Rule-based reasoning: Deterministic heuristics for identifying root causes
- LLM integration (optional): Enhance analysis with natural language insights
- Autonomous-first design: Built to be called by AI agents, UIs, and automation workflows
Key differentiators:
- Graph-based causal analysis over temporal event correlation
- Works offline with rules-only mode (no LLM required)
- Designed for integration into larger autonomous ops stacks
Who it's for:
- SRE teams investigating production incidents
- DevOps engineers correlating failures across services
- Platform teams building autonomous reliability agents
- Architects designing AI-powered troubleshooting workflows
AutoRCA-Core is part of a broader autonomous operations ecosystem including:
- awesome-autonomous-ops: Curated list of AI ops tools
- Secure-MCP-Gateway: Security-first MCP gateway for ops tools
- Ops-Agent-Desktop: Visual mission control for autonomous ops agents
- ADAPT-Agents: Agent orchestration layer (companion repo)
AutoRCA-Core follows a layered architecture for clarity and extensibility:
┌─────────────────────────────────────────────────────────────┐
│                       CLI / API Layer                       │
│            (autorca CLI, Python API, MCP server)            │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                       Reasoning Layer                       │
│    ┌────────────┐  ┌────────────┐  ┌──────────────────┐     │
│    │   Rules    │  │ LLM (opt)  │  │  Reasoning Loop  │     │
│    │ Heuristics │  │ Interface  │  │  Orchestration   │     │
│    └────────────┘  └────────────┘  └──────────────────┘     │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                     Graph Engine Layer                      │
│    ┌─────────────────────┐   ┌─────────────────────────┐    │
│    │    Graph Builder    │   │      Graph Queries      │    │
│    │ (topology + events) │   │  (causal chains, RCA)   │    │
│    └─────────────────────┘   └─────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                       Ingestion Layer                       │
│    ┌──────┐   ┌─────────┐   ┌────────┐   ┌─────────────┐    │
│    │ Logs │   │ Metrics │   │ Traces │   │   Configs   │    │
│    └──────┘   └─────────┘   └────────┘   └─────────────┘    │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│             Data Sources (files, APIs, streams)             │
└─────────────────────────────────────────────────────────────┘
Key concepts:
- Service Graph: Topology of services and dependencies inferred from traces
- Incident Nodes: Anomalies detected (error spikes, latency, resource exhaustion)
- Causal Chains: Dependency paths showing how failures propagate
- Root Cause Candidates: Ranked list with confidence scores and evidence
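To make these concepts concrete, here is a minimal, self-contained sketch of how a causal chain could be walked over a service graph. It uses plain dictionaries and sets, not AutoRCA-Core's own graph classes; the names `DEPENDS_ON`, `ANOMALOUS`, and `causal_chains` are hypothetical, and the traversal is one plausible heuristic rather than the engine's actual algorithm.

```python
from collections import deque

# Hypothetical dependency edges: each service maps to the services it calls.
DEPENDS_ON = {
    "frontend": ["api-gateway"],
    "api-gateway": ["user-service"],
    "user-service": ["postgres"],
    "postgres": [],
}

# Hypothetical incident nodes: services with a detected anomaly.
ANOMALOUS = {"frontend", "api-gateway", "user-service", "postgres"}


def causal_chains(symptom_service, depends_on, anomalous):
    """Follow anomalous dependencies downward from the symptom.

    A path ends at a service that has no anomalous dependency of its own;
    that last service is treated as a root cause candidate.
    """
    chains = []
    queue = deque([[symptom_service]])
    while queue:
        path = queue.popleft()
        current = path[-1]
        unhealthy_deps = [
            dep for dep in depends_on.get(current, [])
            if dep in anomalous and dep not in path  # skip cycles
        ]
        if not unhealthy_deps:
            chains.append(path)
        for dep in unhealthy_deps:
            queue.append(path + [dep])
    return chains


chains = causal_chains("frontend", DEPENDS_ON, ANOMALOUS)
for chain in chains:
    # Reverse so the chain reads from candidate root cause to symptom.
    print(" → ".join(reversed(chain)))
```

With the sample data this prints a single chain running from `postgres` up to `frontend`. A real engine would also weigh timing and evidence before ranking candidates.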
Prerequisites:
- Python 3.10+
- (Optional) OpenAI or Anthropic API key for LLM-enhanced summaries
# Clone the repository
git clone https://github.com/nik-kale/AutoRCA-Core.git
cd AutoRCA-Core
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install the package
pip install -e .
# Or install with LLM support
pip install -e ".[llm]"
autorca quickstart
This runs RCA on synthetic data simulating a database connection pool exhaustion incident. You'll see:
- Root cause identified: PostgreSQL connection saturation
- Causal chain: postgres → user-service → api-gateway → frontend
- Remediation: Scale connection pool, check for leaks
autorca run \
--logs /path/to/logs \
--metrics /path/to/metrics \
--symptom "Checkout API returning 500 errors" \
--output report.md
Supported formats:
- Logs: JSON Lines, plain text (auto-parsed)
- Metrics: CSV, JSON Lines
- Traces: OpenTelemetry JSON, Jaeger JSON
- Configs: JSON, YAML (deployment/config change events)
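As an illustration of the JSON Lines log format, the sketch below parses a small sample and keeps only the records inside an incident window. The field names (`timestamp`, `service`, `level`, `message`) and the helper `load_jsonl_events` are assumptions made for this example; the schema AutoRCA-Core actually expects may differ.

```python
import json
from datetime import datetime

# Hypothetical sample: one JSON object per line.
SAMPLE_JSONL = """\
{"timestamp": "2025-11-10T10:01:12", "service": "user-service", "level": "ERROR", "message": "connection pool exhausted"}
{"timestamp": "2025-11-10T10:07:45", "service": "api-gateway", "level": "INFO", "message": "health check ok"}
"""


def load_jsonl_events(text, start, end):
    """Parse JSON Lines text and keep records with start <= timestamp <= end."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        ts = datetime.fromisoformat(record["timestamp"])
        if start <= ts <= end:
            events.append(record)
    return events


window_start = datetime(2025, 11, 10, 10, 0, 0)
window_end = datetime(2025, 11, 10, 10, 5, 0)
events = load_jsonl_events(SAMPLE_JSONL, window_start, window_end)
print([e["service"] for e in events])
```

Only the first record falls inside the five-minute window, so only `user-service` is kept.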
from datetime import datetime
from autorca_core import run_rca, DataSourcesConfig
# Define the incident time window
window = (
    datetime(2025, 11, 10, 10, 0, 0),
    datetime(2025, 11, 10, 10, 5, 0),
)
# Configure data sources
sources = DataSourcesConfig(
    logs_dir="./logs",
    metrics_dir="./metrics",
    traces_dir="./traces",
)
# Run RCA (rules-only, no LLM)
result = run_rca(
    incident_window=window,
    primary_symptom="API 500 errors",
    data_sources=sources,
)
# Access results
print(f"Top root cause: {result.root_cause_candidates[0].service}")
print(f"Confidence: {result.root_cause_candidates[0].confidence:.0%}")
print(result.summary)
import os
from autorca_core import run_rca, DataSourcesConfig, AnthropicLLM
# Initialize Anthropic LLM (requires ANTHROPIC_API_KEY env var)
llm = AnthropicLLM(
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
)
# Run RCA with LLM enhancement
# (reuses `window` and `sources` from the previous example)
result = run_rca(
    incident_window=window,
    primary_symptom="API 500 errors",
    data_sources=sources,
    llm=llm,  # Add LLM for enhanced summaries
)
# Get comprehensive AI-generated analysis
print(result.summary)  # Structured RCA with executive summary, impact assessment, and remediation
# Check token usage and costs
stats = llm.get_usage_stats()
print(f"Tokens used: {stats['total_tokens']}, Cost: ${stats['total_cost_usd']:.4f}")
AutoRCA-Core is designed to be a composable building block in AI-powered operations workflows:
- Agent-driven troubleshooting
  - Autonomous agents (e.g., from ADAPT-Agents) call AutoRCA-Core to investigate incidents
  - RCA results guide next actions: gather more data, escalate, or remediate
- MCP exposure via Secure-MCP-Gateway
  - Expose AutoRCA-Core as an MCP tool for Claude Desktop, Ops-Agent-Desktop, or other MCP clients
  - Enable AI assistants to perform RCA with policy controls and human-in-the-loop approvals
- Visual investigation in Ops-Agent-Desktop
  - Ops-Agent-Desktop calls AutoRCA-Core and visualizes causal graphs in real-time
  - Shows live incident timelines and reasoning steps
- Runbook automation
  - Use AutoRCA-Core to detect root causes, then trigger automated remediation via Ansible, Terraform, or K8s operators
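The "RCA results guide next actions" step could look roughly like the sketch below. It is a hedged illustration only: `Candidate` is a stand-in dataclass rather than the library's result type, `next_action` is a hypothetical helper, and the confidence thresholds are arbitrary values that a real agent would tune and gate behind human approval.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Stand-in for a ranked root cause candidate (not the library's class)."""
    service: str
    confidence: float  # 0.0 to 1.0


def next_action(candidates, remediate_at=0.8, escalate_at=0.5):
    """Choose a follow-up step from the most confident candidate.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not candidates:
        return "gather_more_data"
    top = max(candidates, key=lambda c: c.confidence)
    if top.confidence >= remediate_at:
        return f"remediate:{top.service}"
    if top.confidence >= escalate_at:
        return f"escalate:{top.service}"
    return "gather_more_data"


print(next_action([Candidate("postgres", 0.9), Candidate("user-service", 0.4)]))
print(next_action([Candidate("api-gateway", 0.6)]))
print(next_action([]))
```

Keeping the decision in a small pure function makes it easy to test and to wrap with policy checks before any remediation is triggered.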
AutoRCA-Core/
├── autorca_core/                 # Main package
│   ├── ingestion/                # Data loaders (logs, metrics, traces, configs)
│   ├── model/                    # Data models (events, graphs)
│   ├── graph_engine/             # Graph construction and querying
│   ├── reasoning/                # RCA logic (rules, LLM, loop)
│   ├── outputs/                  # Report generation (markdown, JSON, HTML)
│   └── cli/                      # CLI interface
├── examples/                     # Example data and scenarios
│   └── quickstart_local_logs/    # Quickstart synthetic data
├── tests/                        # Test suite
├── docs/                         # Architecture and usage docs
├── pyproject.toml                # Package configuration
├── README.md                     # This file
└── LICENSE                       # MIT license
AutoRCA-Core is designed for extensibility:
Implement custom log/metric parsers by extending ingestion modules:
# autorca_core/ingestion/custom_parser.py
from autorca_core.model.events import LogEvent
def parse_custom_format(line: str) -> LogEvent:
    # Your parsing logic
    ...
Add domain-specific heuristics:
# autorca_core/reasoning/custom_rules.py
from autorca_core.reasoning.rules import RootCauseCandidate
def rule_custom_pattern(graph):
    # Detect custom incident patterns
    ...
    return [RootCauseCandidate(...)]
Implement the LLMInterface protocol:
from autorca_core.reasoning.llm import LLMInterface
class MyCustomLLM:
    def summarize_rca(self, graph, candidates, symptom):
        # Call your LLM
        ...
- Core graph-based RCA engine
- Multi-signal ingestion (logs, metrics, traces, configs)
- Rule-based reasoning with causal chains
- CLI and Python API
- OpenAI and Anthropic LLM integrations
- MCP server for tool exposure
- Prometheus and OpenTelemetry native connectors
- Interactive HTML reports with graph visualizations
- Kubernetes and service mesh topology providers
- Pre-built RCA templates for common incident types (DB saturation, DNS, auth)
Contributions are welcome! This project aims to be a reference architecture for autonomous ops tools.
How to contribute:
- Open issues for bugs or feature requests
- Submit PRs for parsers, heuristics, or integrations
- Share anonymized incident examples for testing
- Suggest improvements to the reasoning engine
See CONTRIBUTING.md for guidelines.
AutoRCA-Core performs read-only analysis by default. It does not execute commands or modify systems.
For production use:
- Validate data sources: Ensure logs/metrics are from trusted sources
- Sanitize sensitive data: Remove PII, secrets, and credentials before analysis
- Use Secure-MCP-Gateway: When exposing AutoRCA-Core as a tool, use policy controls and human approvals
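For the "sanitize sensitive data" step, a pre-processing pass along these lines could run before logs reach the analysis. This is a minimal sketch using only the standard library; the patterns and the `sanitize` helper are illustrative assumptions, are far from exhaustive, and are not part of AutoRCA-Core.

```python
import re

# Illustrative patterns only; real deployments need rules matched to their own data.
REDACTIONS = [
    # Email addresses
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    # key=value or key: value pairs for common secret names (normalized to key=...)
    (re.compile(r"(?i)\b(password|token|api[_-]?key|secret)\s*[=:]\s*\S+"), r"\1=<redacted>"),
    # HTTP bearer tokens
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer <redacted>"),
]


def sanitize(line):
    """Apply every redaction pattern to a single log line."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line


print(sanitize("login failed for alice@example.com password=hunter2"))
print(sanitize("Authorization: Bearer abc.def-123"))
```

Regex-based redaction is a coarse safety net; it complements, rather than replaces, keeping secrets out of logs at the source.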
MIT License - see LICENSE for details.
AutoRCA-Core draws inspiration from:
- Academic research in fault localization and causal inference
- Production RCA workflows at large-scale SaaS and cloud providers
- The growing ecosystem of AI-powered operations tools
Built by Nik Kale as part of an open-source initiative to advance autonomous operations and reliability engineering.
If you find AutoRCA-Core useful:
- ⭐ Star the repo to help others discover it
- 📢 Share it with your SRE, DevOps, and platform teams
- 🐛 Open issues with real-world scenarios (sanitized) to help improve the engine
- 🤝 Contribute parsers, rules, or integrations
For questions and discussions, open a GitHub issue.
AutoRCA-Core: Foundation for autonomous reliability agents. Graph-based RCA over logs, metrics, and traces.