rushikesh611/BeacnAI

BeacnAI

BeacnAI is an open-source Site Reliability Engineering (SRE) agent framework.

It is designed to run continuously on a server, ingest incoming alerts (e.g. from Prometheus or Alertmanager), and autonomously investigate incidents using the same tools your engineers use. Its distinguishing feature is the Learning Loop: after every incident, it reflects on what it found and updates its local memory files and runbooks, so later investigations can draw on what earlier ones learned.

Core Features

  • Two-Tier Capability Model:
    • Skills (Tier 1): Markdown-based runbooks that map to specific alerts using a fuzzy-matching context engine.
    • Tools (Tier 2): Deterministic Python scripts that execute commands (e.g. querying Prometheus, searching Splunk, fetching K8s pods, reading recent GitHub commits, and running shell commands).
  • Persistent Memory:
    • Uses an async SQLite database (~/.beacnai/state.db) in WAL mode with FTS5 search to store all incidents, reflections, and full-text session histories.
    • Injects global context (INFRA.md) and per-service context directly into the agent prompt.
  • Provider-Agnostic: Out-of-the-box support for OpenRouter (default), Anthropic, and local Ollama inference.
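
As a rough illustration of the persistent-memory design above, the following sketch opens a SQLite file in WAL mode and runs a full-text query through FTS5. It uses the synchronous standard-library `sqlite3` module rather than the async driver the project uses, and the table name `incidents_fts`, its columns, and both function names are assumptions made for this example, not BeacnAI's actual schema. It also assumes your Python's SQLite build includes FTS5.

```python
import os
import sqlite3
import tempfile


def open_state_db(path: str) -> sqlite3.Connection:
    """Open (or create) a SQLite database in WAL mode with an FTS5 index."""
    conn = sqlite3.connect(path)
    # WAL lets readers keep working while a writer appends new incidents.
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical schema: one FTS5 virtual table holding incident text.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS incidents_fts "
        "USING fts5(alert_name, summary)"
    )
    return conn


def search_incidents(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return alert names of incidents matching the FTS5 query, best first."""
    rows = conn.execute(
        "SELECT alert_name FROM incidents_fts "
        "WHERE incidents_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_state_db(os.path.join(tmp, "state.db"))
        conn.execute(
            "INSERT INTO incidents_fts VALUES (?, ?)",
            ("HighLatency", "fraud-service timeouts slowed payment checkout"),
        )
        conn.commit()
        print(search_incidents(conn, "payment"))
        conn.close()
```

WAL mode only applies to file-backed databases, which is why the demo writes to a temporary directory instead of `:memory:`.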

Setup and Installation

1. Requirements

  • Python 3.11+
  • uv (recommended) or pip

2. Install

Clone the repository and install the dependencies:

uv sync

Or using standard pip:

pip install -e .

The project installs official provider SDKs for OpenRouter, Anthropic, OpenAI, and Ollama through the package dependencies.

3. Configuration

Create a .env file in the root directory. You can use the provided .env.example as a template.

cp .env.example .env

Set your LLM provider, model, and API key:

BEACNAI_PROVIDER=openrouter
BEACNAI_MODEL=anthropic/claude-3.5-sonnet
OPENROUTER_API_KEY=your_key_here

You can also set a custom database location:

BEACNAI_DB_PATH=/path/to/state.db

Configure your tools by adding your Prometheus URLs, GitHub tokens, etc. Tools are auto-discovered: each one becomes available only when the variables it needs are provided.
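
One way to picture this kind of environment-driven discovery is the sketch below. The variable names `PROMETHEUS_URL` and `GITHUB_TOKEN`, the `TOOL_REQUIREMENTS` table, and the `discover_tools` function are all hypothetical; check `.env.example` for the names BeacnAI actually reads.

```python
import os
from collections.abc import Mapping

# Hypothetical mapping: tool name -> environment variables it requires.
TOOL_REQUIREMENTS: dict[str, tuple[str, ...]] = {
    "prometheus": ("PROMETHEUS_URL",),
    "github": ("GITHUB_TOKEN",),
    "shell": (),  # assumed to need no credentials
}


def discover_tools(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the tools whose required variables are all set and non-empty."""
    return [
        name
        for name, required in TOOL_REQUIREMENTS.items()
        if all(env.get(var) for var in required)
    ]


if __name__ == "__main__":
    # Only Prometheus is configured, so GitHub is skipped.
    print(discover_tools({"PROMETHEUS_URL": "http://localhost:9090"}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.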

Usage

Check System Status

Verify that your provider is connected and see which tools loaded successfully:

python -m beacnai.main status

Run an Investigation Manually

You can kick off a manual investigation via the CLI. The agent will execute a ReAct loop, call tools, output a root cause analysis (RCA), and run its learning loop to save insights.

python -m beacnai.main investigate "We have a payment latency spike in production."
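
For readers unfamiliar with the ReAct pattern, here is a minimal sketch of such a loop: the model either requests a tool call or gives a final answer, and each tool result is appended to the transcript as an observation. The `Action:`/`Final:` text protocol, the `react_loop` signature, and the scripted stand-in for the LLM are assumptions for illustration, not BeacnAI's internals.

```python
from typing import Callable

# A tool takes a string argument and returns an observation string.
Tool = Callable[[str], str]


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: dict[str, Tool],
    max_steps: int = 5,
) -> str:
    """Alternate between model decisions and tool calls until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript).strip()
        transcript += reply + "\n"
        if reply.startswith("Final:"):
            return reply.removeprefix("Final:").strip()
        if reply.startswith("Action:"):
            # Expected form: "Action: <tool name> <argument>"
            name, _, arg = reply.removeprefix("Action:").strip().partition(" ")
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool {name!r}"
            transcript += f"Observation: {observation}\n"
    return "No conclusion reached within the step limit."


def scripted_llm(transcript: str) -> str:
    """Stand-in for a real model: query once, then conclude."""
    if "Observation:" not in transcript:
        return "Action: prometheus p99_latency"
    return "Final: p99 latency is elevated on the payment job."


if __name__ == "__main__":
    tools = {"prometheus": lambda query: "p99=2.4s"}
    print(react_loop("Why is payment latency high?", scripted_llm, tools))
```

The step limit matters in practice: without it, a model that never emits a final answer would keep calling tools indefinitely.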

Run the Webhook Server and Optional Cron Scheduler

Start BeacnAI as a persistent webhook ingestion service. Use --enable-cron to enable scheduled jobs, or set BEACNAI_CRON_ENABLED=true.

python -m beacnai.main serve --host 0.0.0.0 --port 3001 --enable-cron --cron-schedule "0 9 * * MON-FRI"

Enable Gateway Mode

To have BeacnAI accept generic alert payloads behind an API gateway, enable gateway mode. Then POST to /gateway and identify the alert source either by setting the X-Alert-Source header or by including a source field in the JSON body.

BEACNAI_GATEWAY_MODE=true python -m beacnai.main serve --enable-cron

Example gateway request:

curl -X POST http://localhost:3001/gateway \
  -H "Content-Type: application/json" \
  -H "X-Alert-Source: prometheus" \
  -d '{"alerts":[{"labels":{"alertname":"HighCpu","severity":"critical","job":"web"}}]}'

Add Custom Runbooks (Skills)

You can teach the agent new skills by dropping a Markdown file into the skills/ directory. BeacnAI will use fuzzy matching to automatically select the right skill when an alert fires.

Example: skills/sre/latency/SKILL.md

---
name: payment-latency
type: skill
description: Investigates high latency on the payment service.
triggers:
  alert_names:
    - HighLatency
    - PaymentLatencySpike
  keywords:
    - payment latency
---
# Payment Latency Runbook

Check the `fraud-service` first as it is a common bottleneck. 

Architecture Summary

  • Ingestion: An aiohttp webhook listener handles Prometheus alerts.
  • Context Engine: Assembles INFRA.md, per-service .md, and matches the alert to a specific SKILL.md.
  • ReAct Loop: The agent uses an LLM (via the configured provider, e.g. OpenRouter or Ollama) to autonomously select tools and investigate the issue.
  • Learning Loop: A post-incident review is executed and recorded into SQLite, with insights appended to memory.
  • Output: The final RCA is routed to an #incidents Slack channel via SLACK_BOT_TOKEN. Invite the bot to the target channel before using it.

License

MIT

About

Vibe coding an open-source SRE agent framework
