Enterprise-style AI backend for operational ticket triage, incident investigation, evidence-grounded recommendations, and human-review routing.
This project demonstrates how an agentic AI system can decide whether an operational request requires SQL analysis, log search, document retrieval, rule-based validation, or human review. Unlike a basic RAG chatbot, the system performs a controlled decision workflow with tool routing, evidence verification, confidence scoring, trace logging, and benchmark evaluation.
Companies handle operational tickets, incident reports, service metrics, logs, policies, and runbooks every day.
A normal RAG chatbot can retrieve documents, but real operational decisions often require multiple evidence sources.
Example ticket:
EU customers are reporting payment failures after checkout. Should this be escalated?
A useful AI system should check:
- service metrics
- operational logs
- SLA policies
- escalation rules
- confidence level
- need for human review
Given a ticket or operational query, the system:
- Classifies the request type
- Plans which tools are required
- Routes the request to selected tools
- Collects evidence from SQL, logs, and documents
- Applies escalation rules
- Calculates confidence
- Sends uncertain cases to human review
- Stores a full trace of the decision
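
The confidence and human-review steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Evidence` fields, the 0.6/0.4 weights, and the 0.7 review threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of what the tools returned for one ticket."""
    sources_requested: int  # tools the planner selected
    sources_with_data: int  # tools that returned usable evidence
    rules_matched: int      # deterministic rules that fired
    ambiguous: bool         # classifier could not determine a task type


# Assumed cutoff; a real system would tune this against its benchmark.
REVIEW_THRESHOLD = 0.7


def confidence(e: Evidence) -> float:
    """Blend evidence coverage with rule support into a score in [0, 1]."""
    if e.ambiguous or e.sources_requested == 0:
        return 0.0
    coverage = e.sources_with_data / e.sources_requested
    rule_support = min(e.rules_matched, 3) / 3  # saturate after three rules
    return round(0.6 * coverage + 0.4 * rule_support, 2)


def needs_human_review(e: Evidence) -> bool:
    """Route to a human when the score falls below the threshold."""
    return confidence(e) < REVIEW_THRESHOLD


strong = Evidence(sources_requested=4, sources_with_data=4, rules_matched=4, ambiguous=False)
weak = Evidence(sources_requested=3, sources_with_data=1, rules_matched=0, ambiguous=False)
print(confidence(strong), needs_human_review(strong))
print(confidence(weak), needs_human_review(weak))
```

Keeping the threshold as a single named constant makes it easy to measure how human-review accuracy changes when the cutoff moves.
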

    FastAPI API
        |
        v
    LangGraph Agent Workflow
        |
        +--> Task Classifier
        +--> Planner
        +--> Tool Router
                |
                +--> SQL Tool
                +--> Log Search Tool
                +--> RAG Tool
                +--> Rule Validator
                +--> Human Review Tool
        |
        +--> Evidence Verifier
        +--> Confidence Scorer
        +--> Response Generator
        +--> Trace Logger
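
The workflow in the diagram can be approximated in plain Python as a sequence of node functions over a shared state dictionary, with one trace entry recorded per node. This is only a sketch of the pattern: the node bodies are placeholders, the `run_workflow` helper and state keys are hypothetical, and in the actual project the nodes would be wired together with LangGraph rather than a `for` loop.

```python
import time
import uuid
from typing import Callable

State = dict  # shared workflow state, passed from node to node
Node = Callable[[State], State]


def classify(state: State) -> State:
    # Placeholder classifier: a real one would inspect the query text.
    return {**state, "task_type": "escalation_decision"}


def plan(state: State) -> State:
    # Placeholder planner: choose tools for the classified task type.
    tools = ["sql_tool", "log_search_tool", "rag_tool", "rule_validator"]
    return {**state, "tools_planned": tools}


def run_tools(state: State) -> State:
    # Placeholder router: record which tools ran instead of calling them.
    return {**state, "tools_used": list(state["tools_planned"])}


def run_workflow(query: str, nodes: list[Node]) -> State:
    """Run nodes in order and append one trace entry per node."""
    state: State = {
        "query": query,
        "trace_id": f"trace_{uuid.uuid4().hex}",
        "trace": [],
    }
    for node in nodes:
        started = time.perf_counter()
        state = node(state)
        elapsed_ms = (time.perf_counter() - started) * 1000
        state["trace"].append({"node": node.__name__, "elapsed_ms": elapsed_ms})
    return state


result = run_workflow(
    "Should EU payment failures be escalated?",
    [classify, plan, run_tools],
)
print(result["task_type"], result["tools_used"])
print([step["node"] for step in result["trace"]])
```

Because every node takes and returns the same state shape, each one can be unit-tested in isolation and the trace shows exactly which steps ran for a given request.
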
Request:

    {
      "query": "EU customers are reporting payment failures after checkout. Should this be escalated?",
      "ticket_id": "TCK-1001"
    }

Response:

    {
      "ticket_id": "TCK-1001",
      "task_type": "escalation_decision",
      "priority": "P1",
      "recommendation": "Escalate this incident as P1. Payment-service error rate is 18.7%, above the 10% P1 threshold. Error-level logs were found for the affected service.",
      "confidence": 0.95,
      "human_review_required": false,
      "tools_used": [
        "sql_tool",
        "log_search_tool",
        "rag_tool",
        "rule_validator"
      ],
      "matched_rules": [
        "payment_failure_rate_above_10_percent",
        "affected_customers_above_50",
        "enterprise_customers_above_10",
        "error_logs_present"
      ],
      "trace_id": "trace_..."
    }

| Query Type | Example | Selected Tools |
|---|---|---|
| Escalation decision | “Should EU payment failures be escalated?” | SQL, logs, RAG, rules |
| Policy lookup | “What does the SLA policy say?” | RAG |
| Metrics lookup | “What is the payment failure rate in EU?” | SQL |
| Log analysis | “Find timeout errors in payment logs.” | Log search |
| Ambiguous request | “Something seems wrong.” | Human review |
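
The routing table above maps naturally onto a lookup from task type to tool list. The sketch below pairs that lookup with a deliberately naive keyword classifier so the example is runnable. Only `escalation_decision` and the first four tool names appear in the example response; the other task-type names, `human_review_tool`, and the keyword hints are assumptions made for illustration.

```python
# Task type -> tools, following the routing table.
ROUTES: dict[str, list[str]] = {
    "escalation_decision": ["sql_tool", "log_search_tool", "rag_tool", "rule_validator"],
    "policy_lookup": ["rag_tool"],
    "metrics_lookup": ["sql_tool"],
    "log_analysis": ["log_search_tool"],
    "ambiguous": ["human_review_tool"],
}

# Ordered keyword hints; the first matching task type wins.
# Substring matching is crude (e.g. "rate" also matches "generate"),
# so treat this as a stand-in for a real classifier.
KEYWORDS: list[tuple[str, tuple[str, ...]]] = [
    ("escalation_decision", ("escalate", "escalation")),
    ("policy_lookup", ("policy", "runbook", "sla")),
    ("log_analysis", ("log", "timeout")),
    ("metrics_lookup", ("rate", "metric", "latency")),
]


def classify_query(query: str) -> str:
    """Return a task type, falling back to 'ambiguous' when nothing matches."""
    text = query.lower()
    for task_type, hints in KEYWORDS:
        if any(hint in text for hint in hints):
            return task_type
    return "ambiguous"


def select_tools(query: str) -> list[str]:
    """Classify the query, then look up its tools in the routing table."""
    return ROUTES[classify_query(query)]


for q in (
    "Should EU payment failures be escalated?",
    "What does the SLA policy say?",
    "What is the payment failure rate in EU?",
    "Find timeout errors in payment logs.",
    "Something seems wrong.",
):
    print(f"{q!r} -> {select_tools(q)}")
```

Defaulting to the human-review route when no hint matches keeps unrecognized requests from being answered with the wrong tool.
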
- FastAPI backend
- LangGraph-based workflow orchestration
- Dynamic tool routing
- SQL tool for structured service metrics
- Log search tool for operational errors
- RAG tool for policies and runbooks
- Rule-based escalation validator
- Evidence quality verification
- Confidence scoring
- Human-review fallback
- Trace logging for auditability
- Benchmark evaluation
- Pytest test suite
- Dockerized local deployment
Python
FastAPI
LangGraph
SQLAlchemy
SQLite
Pydantic
Sentence Transformers
FAISS
Pandas
Pytest
Docker
Create and activate a virtual environment:

    python -m venv .venv

Windows PowerShell:

    .\.venv\Scripts\Activate.ps1

Linux/macOS:

    source .venv/bin/activate

Install dependencies:

    pip install -r requirements.txt

Seed the database:

    python -m app.db.seed

Build the retrieval index:

    python -m app.retrieval.build_index

Run the API:

    uvicorn app.main:app --reload

Open Swagger UI:

http://127.0.0.1:8000/docs

To run with Docker instead:

    docker compose up --build

Then open:

http://127.0.0.1:8000/docs

Run the benchmark evaluation:

    python -m evaluation.run_evaluation

The evaluation measures:

- task classification accuracy
- tool routing accuracy
- human-review accuracy
- priority accuracy
- decision accuracy
- average latency
- p95 latency
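
As a rough illustration of how metrics like these can be computed, the sketch below implements exact-match accuracy and a nearest-rank p95 over a list of latencies. The function names and sample values are made up for the example and are not taken from `evaluation/run_evaluation`; the project may use a different percentile definition.

```python
from statistics import mean


def accuracy(expected: list[str], predicted: list[str]) -> float:
    """Fraction of positions where the prediction equals the expected label."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy over zero cases")
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


def p95(latencies_ms: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("cannot compute p95 over zero samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


# Made-up benchmark rows for illustration only.
expected_types = ["escalation_decision", "policy_lookup", "metrics_lookup", "ambiguous"]
predicted_types = ["escalation_decision", "policy_lookup", "log_analysis", "ambiguous"]
latencies_ms = [120, 95, 110, 300, 105, 98, 102, 115, 99, 101]

print("task classification accuracy:", accuracy(expected_types, predicted_types))
print("average latency (ms):", mean(latencies_ms))
print("p95 latency (ms):", p95(latencies_ms))
```

With only ten samples the nearest-rank p95 is simply the slowest request, which is one reason a larger benchmark gives more meaningful latency figures.
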
Generated files:

    evaluation/results.csv
    evaluation/metrics_summary.json
    evaluation/error_analysis.md

Run the tests:

    pytest

Test coverage includes:

- classifier
- planner
- SQL tool
- RAG tool
- rule validator
- API endpoints
- human-review behavior
- trace generation

Expected result:

    28 passed
---
## Design Decisions
Key design choices:
- SQL is used for structured metrics.
- Log search is used for operational events.
- RAG is used for policies and runbooks.
- Escalation logic is handled by deterministic rules.
- Low-confidence or ambiguous cases go to human review.
- Every request is stored as a trace for auditability.
- Evaluation measures workflow behavior, not only final text output.
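
To make the "deterministic rules" choice concrete, here is a small sketch of a rule validator. The rule names and thresholds mirror the `matched_rules` in the example response; the `IncidentMetrics` fields, the sample numbers other than 18.7, and the way rules map to P1 are assumptions rather than the project's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentMetrics:
    """Hypothetical inputs gathered by the SQL and log search tools."""
    payment_failure_rate_pct: float
    affected_customers: int
    enterprise_customers: int
    error_log_count: int


# Each rule is a named predicate, so the response can list exactly which ones fired.
RULES = {
    "payment_failure_rate_above_10_percent": lambda m: m.payment_failure_rate_pct > 10.0,
    "affected_customers_above_50": lambda m: m.affected_customers > 50,
    "enterprise_customers_above_10": lambda m: m.enterprise_customers > 10,
    "error_logs_present": lambda m: m.error_log_count > 0,
}


def matched_rules(m: IncidentMetrics) -> list[str]:
    """Return the names of all rules that hold, in definition order."""
    return [name for name, check in RULES.items() if check(m)]


def is_p1(m: IncidentMetrics) -> bool:
    # Assumption: crossing the 10% failure-rate threshold alone implies P1.
    return "payment_failure_rate_above_10_percent" in matched_rules(m)


# 18.7 comes from the example response; the other values are invented.
incident = IncidentMetrics(
    payment_failure_rate_pct=18.7,
    affected_customers=120,
    enterprise_customers=14,
    error_log_count=37,
)
print(matched_rules(incident))
print("P1:", is_p1(incident))
```

Because the rules are plain predicates with no model in the loop, the same inputs always produce the same escalation outcome, which is what makes the decision auditable.
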
Detailed docs:
```text
docs/architecture.md
docs/design_decisions.md
docs/agent_workflow.md
docs/evaluation_methodology.md
docs/failure_cases.md
docs/production_considerations.md
```
Potential production extensions:

- PostgreSQL instead of SQLite
- Qdrant or pgvector instead of FAISS
- OpenTelemetry tracing
- authentication and RBAC
- GitHub Actions CI
- larger benchmark dataset
- hybrid search and reranking
- human-review dashboard
- integration with Jira, ServiceNow, Datadog, or Grafana





