

Huzefa Husain edited this page Dec 7, 2025 · 2 revisions

Project 2 – T-RAG: Trace-Native RAG for Root Cause

1. Overview

Trace-Native RAG (T-RAG) is an AI-powered root cause analysis system for complex distributed environments. It leverages telemetry traces collected from services (via OpenTelemetry and CAAT’s eBPF runtime) and augments a large language model (LLM) with a vector-based memory of past spans. By grounding the LLM in actual trace data and similar historical contexts, T-RAG generates structured explanations of “what failed and why,” closing the loop between observability and cognitive reasoning.

2. Architecture

```mermaid
flowchart LR
    %% Classes
    classDef source fill:#dae8fc,stroke:#6c8ebf,stroke-width:1px,color:#000;
    classDef proc   fill:#fff2cc,stroke:#d6b656,stroke-width:1px,color:#000;
    classDef store  fill:#d5e8d4,stroke:#82b366,stroke-width:1px,color:#000;
    classDef llm    fill:#f8cecc,stroke:#b85450,stroke-width:1px,color:#000;
    classDef output fill:#e1d5e7,stroke:#9673a6,stroke-width:1px,color:#000;
    classDef caat   fill:#fde9d9,stroke:#d79b00,stroke-width:1px,color:#000;

    %% Nodes
    subgraph CAAT["CAAT Layer (Project 1)"]
        RL["RL Telemetry Optimizer\n(Budget Engine)"]
        TP["Telemetry Pipeline\n(eBPF / OpenTelemetry)"]
    end

    TL["Trace Loader\n(Span Parsing & Summaries)"]
    VM["Vector Memory Store\n(Embeddings & Similarity)"]
    LLM["LLM Reasoner\n(Trace-Native RAG)"]
    RC["Root Cause Report\n(JSON + Narrative RCA)"]

    %% Apply classes
    class TP,RL caat;
    class TL proc;
    class VM store;
    class LLM llm;
    class RC output;

    %% Main flow
    TP --> TL
    TL --> VM
    VM --> LLM
    LLM --> RC

    %% Control / feedback influence
    RL -. "adjusts sampling / policies" .-> TP
```

Architecture Description

  • Telemetry Pipeline (CAAT): Provides trace data from eBPF, OpenTelemetry, and service instrumentation.
  • Trace Loader: Converts raw spans into structured span summaries.
  • Vector Memory Store: Embeds spans using Sentence-Transformers; supports similarity search.
  • LLM Reasoner: Augmented prompt + context retrieved from vector memory → produces structured RCA.
  • Root Cause Report: JSON output with fields such as root_cause, service_chain, reasoning.

3. Implementation Overview

  • Code stored in projects/t_rag/src/t_rag/.
  • Includes components:
    • trace_loader.py – span parsing and summarization
    • vector_memory.py – embedding storage and similarity search
    • llm_reasoner.py – RAG prompt assembly and LLM calls
    • service.py – CLI entry point
    • config.py – configuration
  • Example trace: projects/t_rag/examples/sample_trace.json
  • Dependencies: requirements.txt
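
The LLM Reasoner assembles a prompt from the failing trace and the retrieved context, then expects a structured JSON reply carrying the report fields from section 2. A hedged sketch of that flow follows; the prompt wording, helper names, and canned reply are hypothetical and not the project's actual `llm_reasoner.py` code.

```python
import json


def build_rca_prompt(failing_summary: str, retrieved: list[str]) -> str:
    # Hypothetical prompt layout; the real reasoner may phrase this differently.
    context = "\n".join(f"- {s}" for s in retrieved)
    return (
        "You are a root-cause analyst for distributed traces.\n"
        f"Failing trace:\n{failing_summary}\n"
        f"Similar past spans:\n{context}\n"
        'Reply as JSON with keys "root_cause", "service_chain", "reasoning".'
    )


def parse_rca(llm_reply: str) -> dict:
    # Validate that the reply carries the report fields listed in section 2.
    report = json.loads(llm_reply)
    missing = {"root_cause", "service_chain", "reasoning"} - report.keys()
    if missing:
        raise ValueError(f"LLM reply missing fields: {missing}")
    return report


# Canned reply in place of a real OpenAI call:
reply = ('{"root_cause": "payment-service connection timeout", '
         '"service_chain": ["checkout-service", "payment-service"], '
         '"reasoning": "Checkout span shows a 500 caused by a downstream timeout."}')
report = parse_rca(reply)
```

Validating the reply before emitting the report keeps downstream consumers of the JSON output safe from malformed LLM responses.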

4. Quickstart

```shell
cd MindOps
pip install -r projects/t_rag/requirements.txt
export OPENAI_API_KEY="sk-xxxx"
python -m t_rag.service --trace projects/t_rag/examples/sample_trace.json
```

5. Integration with CAAT (Project 1)

T-RAG consumes CAAT-optimized trace data: CAAT decides what to collect (cost-aware sampling), and T-RAG decides what it means (LLM-based reasoning).
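
To make that division of labor concrete, here is a hypothetical CAAT-style sampling policy sketched in Python: error spans (the ones T-RAG reasons about) are always retained, while healthy spans are deterministically down-sampled to stay within budget. The policy shape and field names are illustrative, not CAAT's actual budget engine.

```python
def keep_span(span: dict, ok_sample_rate: float = 0.10) -> bool:
    # Always keep error spans: T-RAG needs them for root cause analysis.
    if span.get("status") == "ERROR":
        return True
    # Deterministically down-sample healthy spans by hashing the trace id,
    # so every span of a sampled trace is kept or dropped together.
    bucket = int(span["trace_id"], 16) % 100
    return bucket < ok_sample_rate * 100


spans = [
    {"trace_id": "0a", "status": "ERROR"},  # kept: error span
    {"trace_id": "ff", "status": "OK"},     # 0xff % 100 = 55 -> dropped
    {"trace_id": "03", "status": "OK"},     # 0x03 % 100 = 3  -> kept
]
kept = [s for s in spans if keep_span(s)]
```

Hashing on the trace id (rather than sampling spans independently) mirrors head-based trace sampling: the trace either survives intact for T-RAG or is dropped whole.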


6. Roadmap

  • persistent vector database (Weaviate, Pinecone)
  • multi-signal ingestion (logs + metrics)
  • multi-agent RCA
