Berkay2002/agentic-rag-test-scope-analysis

Agentic GraphRAG for Test Scope Analysis

Master's Thesis Research Project
Conducted in collaboration with Ericsson
Linköping University, 2026

A research implementation of an agentic Retrieval-Augmented Generation (RAG) system combining Knowledge Graphs, Vector Search, and Human-in-the-Loop workflows for test scope analysis in telecommunications software systems.

⚠️ Academic Research Project

This repository contains the implementation for a Master's Thesis. The code is made publicly available for:

  • Academic transparency and reproducible research
  • Peer review and thesis evaluation
  • Educational purposes and research community benefit

Important: This is a research artifact, not a production-ready commercial product. See LICENSE, DISCLAIMER.md, and CONTRIBUTING.md for usage terms and restrictions.

Overview

This system implements an agentic RAG architecture designed to address three research questions:

  • RQ1: What Knowledge Graph ontology effectively represents software engineering entities and relationships for test scope analysis?
  • RQ2: How do different retrieval strategies (vector, keyword, graph, hybrid) compare for test-related queries?
  • RQ3: How can Human-in-the-Loop (HITL) agent workflows improve retrieval quality and user control?

Key Features

Agent & Orchestration

  • LangChain create_agent API: Modern high-level agent API with built-in middleware support
  • HITL Workflows: PostgresSaver checkpointing for human intervention and tool approval
  • Middleware Support: Built-in ModelCallLimit, ToolCallLimit, HumanInTheLoop, and PII detection middleware
  • LangSmith Integration: Full observability and debugging
  • Gemini Thinking Configuration: Adjustable reasoning depth via thinking_level (low/medium/high/minimal) or legacy budgets
  • Headless Mode: Scriptable CLI with text/json/stream-json output for automation
  • Conversation Export: Export chat transcripts with optional tool call details

Storage & Retrieval

  • Dual Storage Architecture: Neo4j (knowledge graph) + PostgreSQL (pgvector + pg_search BM25)
  • 4 Retrieval Tools (using @tool decorator pattern with factory functions):
    • Vector Search (pgvector semantic similarity)
    • Keyword Search (pg_search BM25 lexical matching)
    • Graph Traversal (Neo4j Cypher queries)
    • Hybrid Search (RRF fusion of vector + keyword)

Document Loading & Processing

  • AI-Powered Document Parsing: Docling integration (15+ formats)
    • DocLayNet layout analysis (IBM Research AI model)
    • TableFormer for table structure recognition
    • HybridChunker for semantic chunking
  • AST-Based Code Parsing: Tree-sitter for structure-aware code analysis
    • Extract functions, classes, methods with full context
    • Preserve parent-child relationships and docstrings
    • Multi-language support (Python, Java, etc.)
  • TGF (Test Governance Framework) Loader: CSV loader for Ericsson test execution results
    • Parses TGF CSV format with test metadata
    • Maps to TestCase entities with relationships
    • Validates and normalizes test results and types
  • Storage Writers: Modular writers for PostgreSQL/BM25 (default) and Neo4j (explicit graph materialization)
    • Retry logic with exponential backoff
    • Idempotent upserts for reliability
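
The two reliability ideas above (retries with exponential backoff and idempotent upserts) can be sketched as follows. This is a minimal illustration, not the project's storage writers: `retry_with_backoff`, `upsert`, and their parameters are hypothetical names, and a plain dictionary stands in for a database table.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting base_delay, 2*base_delay, 4*base_delay, ...
    between failed attempts. The last failure is re-raised."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")


def upsert(store: dict[str, dict], entity: dict) -> None:
    """Insert or overwrite by primary key (hypothetical 'id' field).

    Writing the same entity twice leaves one row, so a retried write is harmless.
    """
    store[entity["id"]] = entity
```

Idempotency is what makes retrying safe: if a write succeeded but its acknowledgement was lost, repeating it does not create a duplicate.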

Evaluation

  • Comprehensive Metrics: Precision@k, Recall@k, MAP, MRR, F1@k
  • Agentic Evaluation Framework: Full agent pipeline testing with dynamic strategy comparison
  • Multi-Strategy Comparison: Evaluate vector, keyword, hybrid, graph, and agent strategies
  • Entity Extraction: Automatic extraction of entity IDs from agent responses
  • Tool Usage Tracking: Monitor tool selection patterns and execution statistics
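
Entity extraction, as listed above, turns a free-text agent answer into a ranked list of IDs that the metrics can score. A rough sketch of the idea is shown here, assuming IDs look like the `REQ_AUTH_005` example used elsewhere in this README. The pattern and the function name are illustrative assumptions, not the project's `entity_extractor` module.

```python
import re

# Hypothetical ID shape modelled on "REQ_AUTH_005":
# uppercase prefix, uppercase area code, numeric suffix.
ENTITY_ID_PATTERN = re.compile(r"\b[A-Z]+_[A-Z]+_\d+\b")


def extract_entity_ids(response: str) -> list[str]:
    """Return entity IDs in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for match in ENTITY_ID_PATTERN.finditer(response):
        seen.setdefault(match.group(0), None)
    return list(seen)
```

Order of first appearance is kept because rank-based metrics such as MAP and MRR depend on the order in which the agent mentions entities.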

Architecture

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#0366d6','primaryTextColor':'#fff','primaryBorderColor':'#0366d6','lineColor':'#6a737d','secondaryColor':'#f6f8fa','tertiaryColor':'#fff','fontSize':'16px'}}}%%
graph TB
    subgraph Agent["LangChain create_agent API"]
        Middleware["Middleware<br/>(HITL, PII, Limits)"]
        ReActLoop["ReAct<br/>Loop"]
        ToolExecution["Tool<br/>Execution"]
        
        Middleware --> ReActLoop
        ReActLoop --> ToolExecution
        ToolExecution -.-> ReActLoop
    end
    
    subgraph Tools["Retrieval Tools (@tool)"]
        VectorSearch["Vector<br/>Search"]
        KeywordSearch["Keyword<br/>Search"]
        GraphTraverse["Graph<br/>Traverse"]
        HybridSearch["Hybrid<br/>Search"]
    end
    
    subgraph Storage["Storage Layer"]
        Neo4j["Neo4j<br/>(Knowledge Graph)"]
        PostgreSQL["PostgreSQL<br/>(pgvector + pg_search)"]
    end
    
    ToolExecution --> VectorSearch
    ToolExecution --> KeywordSearch
    ToolExecution --> GraphTraverse
    ToolExecution --> HybridSearch
    
    VectorSearch --> PostgreSQL
    KeywordSearch --> PostgreSQL
    GraphTraverse --> Neo4j
    HybridSearch --> PostgreSQL

Installation

Prerequisites

  • Python 3.11+
  • Neo4j 5.20+ (with APOC plugin for graph operations)
  • PostgreSQL 15+ with pgvector and pg_search extensions
  • Poetry 1.8+

Setup

  1. Clone the repository:
git clone <repository-url>
cd agentic-rag-test-scope-analysis
  2. Install dependencies with Poetry:
poetry install
  3. Create environment file:
cp .env.example .env
# Edit .env with your API keys and database credentials
  4. Initialize database schemas:
poetry run agrag init

Configuration

See .env.example for all configuration options. Key settings:

  • GOOGLE_API_KEY: Required for LLM and embeddings
  • NEO4J_URI, NEO4J_PASSWORD: Neo4j connection
  • NEON_CONNECTION_STRING: PostgreSQL/Neon connection
  • LANGCHAIN_API_KEY: Optional, for LangSmith tracing
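
Before running `agrag init`, it can help to check that the settings above are present. The helper below is a hypothetical pre-flight check, not part of the CLI. Only GOOGLE_API_KEY is stated as required above; treating the Neo4j and Neon variables as required too is an assumption.

```python
import os
from typing import Mapping

# Assumption: the database settings are needed alongside the API key.
REQUIRED = ("GOOGLE_API_KEY", "NEO4J_URI", "NEO4J_PASSWORD", "NEON_CONNECTION_STRING")
OPTIONAL = ("LANGCHAIN_API_KEY",)  # only needed for LangSmith tracing


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```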

Usage

CLI Commands

See src/agrag/cli/README.md for a deeper CLI reference, including headless mode.

Interactive Chat (Recommended)

# Start interactive chat mode (like Claude Code, Copilot CLI)
# Safe by default - you approve each tool before execution
poetry run agrag chat

# Resume a previous conversation
poetry run agrag chat --thread-id my-session

# YOLO mode - autonomous execution without approvals (use with caution)
poetry run agrag chat --yolo

The interactive chat mode provides a conversational interface with:

  • Natural language conversation
  • Automatic conversation persistence (resume anytime)
  • Real-time streaming responses with progress indicators
  • Built-in commands (/help, /stats, /exit, etc.)
  • Safe by default - you approve each tool execution (HITL mode)

Modes:

  • Default (Safe Mode): Agent asks for approval before each tool execution

    • ✅ You control everything, safer, better for learning
    • Best for: normal usage, exploring, sensitive data
  • YOLO Mode (--yolo): Agent executes autonomously without asking

    • ⚡ Faster but uncontrolled
    • Best for: trusted workflows, demos, when you're confident

Available chat commands:

  • /help - Show help
  • /clear - Clear screen
  • /history - View conversation history
  • /stats - Show session statistics (messages, tool calls, duration)
  • /reset - Start new conversation
  • /save - Save conversation to file
  • /export - Export conversation transcript (use --verbose for tool args/results)
  • /verbose - Toggle tool call arguments in output
  • /thinking [level|preset|tokens] - Set Gemini thinking level (low, medium, high, minimal) or legacy budget
  • /exit or /quit - Exit chat
  • Ctrl+C (press twice) - Exit immediately

Headless Mode (Scripting & Automation)

# Direct prompt
poetry run agrag -p "What tests cover handover requirements?"

# Pipe stdin
echo "Summarize this" | poetry run agrag

# Combine prompt with stdin
cat README.md | poetry run agrag -p "Summarize this documentation"

# JSON output
poetry run agrag -p "List test cases" --output-format json

# Streaming JSON (JSONL)
poetry run agrag -p "Analyze dependencies" --output-format stream-json

# Persistent headless session (requires Postgres checkpointer)
poetry run agrag -p "List handover requirements" --thread-id eval-001
poetry run agrag -p "Now show tests that verify those" --thread-id eval-001

Headless mode runs without the interactive UI and is intended for scripts, CI, and pipelines. When --thread-id is provided, the CLI attempts to use the Postgres checkpointer to resume the same session across invocations. If the checkpointer is unavailable, it falls back to in-memory state, which does not persist between separate invocations.

Query the System

# Single query (non-interactive)
poetry run agrag query "What tests cover requirement REQ_AUTH_005?"

# With streaming output
poetry run agrag query "Find handover-related test cases" --stream

# With HITL checkpointing
poetry run agrag query "Show dependencies for TestLoginTimeout" --checkpoint --thread-id my-session

Load Documents and Code

# Load documents with Docling (AI-powered parsing)
poetry run agrag load docs /path/to/requirements --use-chunker

# Load with specific formats and options
poetry run agrag load docs /path/to/docs \
  --formats pdf,docx,xlsx \
  --use-chunker \
  --table-mode accurate \
  --max-pages 100

# Load code repository
poetry run agrag load repo /path/to/repository --languages python

# Load with filtering
poetry run agrag load repo /path/to/repo \
  --languages python,java \
  --include "src/**/*.py" \
  --exclude "tests/**"

# Show loading statistics
poetry run agrag load stats

Data Management

# Generate synthetic test data
poetry run agrag generate --requirements 50 --testcases 200

# Generate with evaluation dataset
poetry run agrag generate --requirements 50 --testcases 200 --with-eval

# Generate and immediately ingest (resets databases first)
poetry run agrag generate --requirements 50 --testcases 200 --ingest

# Ingest data into databases
poetry run agrag ingest data/mock/synthetic_dataset.json

Data layout (override root with AGRAG_DATA_ROOT):

  • data/mock/ for synthetic datasets and evaluation queries
  • data/ericsson/ for real Ericsson exports (see dataset_template.json)

Initialize Databases

poetry run agrag init

Run Evaluation

# Evaluate all strategies on a dataset
poetry run agrag evaluate --dataset data/mock/eval_queries.json --output results.json

# Evaluate specific strategy
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy vector

# Evaluate with synthetic capability suite
poetry run agrag evaluate --suite synthetic-capability

# Evaluate the full agent with dynamic tool selection (RQ2)
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy agent

# Evaluate fixed baselines (RAG vs GraphRAG)
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy rag
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy graphrag

# Evaluate with verbose per-query metrics
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy all --verbose

Show Configuration

poetry run agrag info

Programmatic Usage

from agrag.core import create_agent_graph, create_initial_state
from agrag.middleware import get_pii_middleware

# Create agent with default settings
graph = create_agent_graph()

# Alternatively, create the agent with middleware configuration
# (this rebinds `graph`, replacing the default agent above)
graph = create_agent_graph(
    middleware=get_pii_middleware(),  # Optional PII protection
    max_model_calls=10,               # Limit model invocations
    max_tool_calls=20,                # Limit tool executions
    enable_hitl=True,                 # Enable human-in-the-loop
)

# Run query
initial_state = create_initial_state("Find tests for handover failures")
result = graph.invoke(initial_state)

print(result["final_answer"])

Knowledge Graph Ontology

The system uses a custom ontology for software engineering entities:

Entity Types

  • ChangeRequest: Change requests linked to touched files
  • File: Source files in the codebase
  • Component: Higher-level components/modules
  • Requirement: System requirements with priorities
  • TestCase: Test cases (unit, integration, protocol, etc.)
  • Function: Code functions with signatures
  • Class: Code classes with methods
  • Module: Code modules/packages

Relationship Types

  • TOUCHES: ChangeRequest → File
  • PART_OF: File → Component
  • VERIFIES: TestCase → Requirement
  • COVERS: TestCase → Function
  • CALLS: Function → Function
  • DEFINED_IN: Function → File
  • INHERITS_FROM: Class → Class
  • BELONGS_TO: Class/Function → Module
  • DEPENDS_ON: Module → Module
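
The relationship list above doubles as a schema: each relationship type constrains which labels may appear at its ends. The standalone sketch below encodes that table as a plain dictionary and checks an edge against it. It is illustrative only and is not the project's `agrag.kg.ontology` module; `SCHEMA` and `is_valid_edge` are hypothetical names.

```python
# relationship type -> (allowed source labels, allowed target labels)
SCHEMA: dict[str, tuple[set[str], set[str]]] = {
    "TOUCHES": ({"ChangeRequest"}, {"File"}),
    "PART_OF": ({"File"}, {"Component"}),
    "VERIFIES": ({"TestCase"}, {"Requirement"}),
    "COVERS": ({"TestCase"}, {"Function"}),
    "CALLS": ({"Function"}, {"Function"}),
    "DEFINED_IN": ({"Function"}, {"File"}),
    "INHERITS_FROM": ({"Class"}, {"Class"}),
    "BELONGS_TO": ({"Class", "Function"}, {"Module"}),
    "DEPENDS_ON": ({"Module"}, {"Module"}),
}


def is_valid_edge(source_label: str, rel_type: str, target_label: str) -> bool:
    """Check that an edge respects the direction and labels in SCHEMA."""
    allowed = SCHEMA.get(rel_type)
    if allowed is None:
        return False  # unknown relationship type
    sources, targets = allowed
    return source_label in sources and target_label in targets
```

A check like this could run before graph writes so that, for example, a reversed `Requirement → TestCase` VERIFIES edge is rejected rather than silently stored.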

Retrieval Tools

1. Vector Search

Semantic search using PostgreSQL pgvector (HNSW index, 768-dim embeddings, cosine similarity).

Best for:

  • Conceptual queries
  • Finding semantically similar content
  • Understanding meanings and intent

Example:

from agrag.tools import create_vector_search_tool
from agrag.storage import PostgresClient

postgres_client = PostgresClient()

tool = create_vector_search_tool(postgres_client)
result = tool.invoke({
    "query": "tests related to handover failures",
    "k": 10,
    "node_type": "TestCase"
})
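
To make the ranking criterion concrete, the sketch below computes cosine similarity by brute force over a small in-memory corpus. It is a conceptual stand-in: pgvector's HNSW index performs an approximate version of this search inside PostgreSQL, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], corpus: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (id, similarity) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Exact search like this scales linearly with corpus size, which is the cost an HNSW index avoids by trading a little recall for speed.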

2. Keyword Search

Lexical search using the BM25 (Best Matching 25) probabilistic ranking algorithm via the PostgreSQL pg_search extension.

Best for:

  • Exact keyword matches
  • Specific identifiers (test IDs, function names)
  • Error codes and technical terms

Example:

from agrag.tools import create_keyword_search_tool
from agrag.storage import PostgresClient, BM25RetrieverManager

postgres_client = PostgresClient()
bm25_manager = BM25RetrieverManager()

tool = create_keyword_search_tool(postgres_client, bm25_manager)
result = tool.invoke({
    "query": "TestLoginTimeout",
    "k": 10
})
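
For readers unfamiliar with BM25, the sketch below scores a tiny in-memory corpus with the Okapi BM25 formula. It only illustrates the ranking function: the whitespace tokenizer, the `k1`/`b` defaults, and the non-negative IDF variant are assumptions, and pg_search's actual tokenization and parameters may differ.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, docs: dict[str, str], k1: float = 1.2, b: float = 0.75) -> dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return {}
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs or 1.0
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores: dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight; this variant never goes negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores[doc_id] = score
    return scores
```

Because scoring depends on exact token matches, an identifier such as `TestLoginTimeout` only scores in documents that contain that token, which is why keyword search suits IDs and error codes.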

3. Graph Traversal

Multi-hop graph traversal for structural relationships.

Best for:

  • Dependency analysis
  • Coverage tracing
  • Structural queries

Example:

from agrag.tools import create_graph_traverse_tool
from agrag.storage import Neo4jClient
from agrag.kg.ontology import NodeLabel, RelationshipType

neo4j_client = Neo4jClient()
tool = create_graph_traverse_tool(neo4j_client)

result = tool.invoke({
    "start_node_id": "REQ_AUTH_005",
    "start_node_label": NodeLabel.REQUIREMENT,
    "relationship_types": [RelationshipType.VERIFIES],
    "depth": 2
})
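
The `depth` parameter bounds how many hops the traversal may take from the start node. The toy breadth-first search below shows that behaviour on an in-memory edge list. It is a conceptual stand-in for a depth-limited graph query, not the tool's Cypher. Edges are followed in both directions here, which is an assumption: it is what lets a traversal that starts at a Requirement reach the TestCases that point at it via VERIFIES.

```python
from collections import deque

Edge = tuple[str, str, str]  # (source_id, relationship_type, target_id)


def traverse(edges: list[Edge], start: str, rel_types: set[str], depth: int) -> dict[str, int]:
    """Return {node_id: hop_count} for nodes within `depth` hops of `start`,
    following only the given relationship types, in either direction."""
    neighbours: dict[str, set[str]] = {}
    for src, rel, dst in edges:
        if rel in rel_types:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == depth:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    del hops[start]  # the start node is not a result
    return hops
```

With `depth=2` and VERIFIES edges, a requirement reaches its tests at hop 1 and the other requirements those tests verify at hop 2.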

4. Hybrid Search

RRF fusion of vector + keyword search.

Best for:

  • Complex queries needing both semantic and lexical matching
  • Balancing precision and recall

Example:

from agrag.tools import create_hybrid_search_tool
from agrag.storage import PostgresClient

postgres_client = PostgresClient()

tool = create_hybrid_search_tool(postgres_client)
result = tool.invoke({
    "query": "tests for LTE signaling with timeout errors",
    "k": 10,
    "rrf_k": 60
})
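
Reciprocal Rank Fusion (RRF) combines ranked lists using only rank positions: each document receives 1 / (rrf_k + rank) from every list it appears in, and the contributions are summed. The sketch below illustrates that formula in plain Python. The function name is hypothetical, and the project describes its own fusion as PostgreSQL-native, so this is a conceptual equivalent rather than the tool's code.

```python
def rrf_fuse(rankings: list[list[str]], rrf_k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Ranks are 1-based. Ties are broken by ID so the output is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["a", "b", "c"]   # example output of vector search
keyword_hits = ["b", "d", "a"]  # example output of keyword search
fused = rrf_fuse([vector_hits, keyword_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, the cosine and BM25 scores never need to be put on a common scale. A larger `rrf_k` flattens the difference between top and lower ranks.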

Evaluation

The system includes comprehensive evaluation metrics:

  • Precision@k: Fraction of the top-k results that are relevant
  • Recall@k: Fraction of all relevant items that appear in the top-k results
  • F1@k: Harmonic mean of Precision@k and Recall@k
  • Average Precision (AP): Mean of the precision values at each rank where a relevant item appears
  • Mean Average Precision (MAP): Average AP across queries
  • Reciprocal Rank (RR): Inverse of the rank of the first relevant item
  • Mean Reciprocal Rank (MRR): Average RR across queries

Example:

from agrag.evaluation import evaluate_retrieval

retrieved = ["id1", "id2", "id3", "id4"]
relevant = {"id1", "id3", "id5"}

metrics = evaluate_retrieval(retrieved, relevant, k_values=[1, 3, 5, 10])
# Returns: {
#   "precision@1": 1.0,
#   "recall@1": 0.33,
#   "precision@3": 0.67,
#   ...
# }
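
For reference, the per-query metrics can be written in a few lines each. The functions below follow the standard textbook definitions and are a sketch, not the code in `agrag.evaluation`. In particular, dividing Precision@k by k even when fewer than k items were retrieved, and dividing AP by the total number of relevant items, are assumptions about conventions.

```python
from collections.abc import Sequence


def precision_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for item in retrieved[:k] if item in relevant) / len(relevant)


def f1_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Harmonic mean of Precision@k and Recall@k."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(retrieved: Sequence[str], relevant: set[str]) -> float:
    """Sum of precision at each relevant rank, divided by the number of relevant items."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0
```

MAP and MRR are then the arithmetic means of `average_precision` and `reciprocal_rank` over all queries in the evaluation set.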

Project Structure

src/agrag/
├── cli/              # CLI application
│   ├── main.py       # Commands: query, chat, load, init, generate, evaluate, etc.
│   ├── interactive.py # Interactive chat interface
│   ├── display.py    # Output formatting utilities
│   ├── hitl.py       # Human-in-the-loop utilities
│   ├── thinking.py   # Gemini thinking configuration
│   ├── commands.py   # Command handler helpers
│   ├── headless.py   # Headless (non-interactive) execution
│   └── README.md     # CLI usage guide
├── config/           # Configuration and logging
├── core/             # Agent using create_agent API
│   ├── state.py      # AgentState definition
│   ├── graph.py      # create_agent_graph() with middleware support
│   └── checkpointing.py # HITL checkpointing
├── data/             # Data generation and ingestion
│   ├── generators/   # Synthetic data generation (TGF-compatible)
│   ├── loaders/      # Document and code loaders
│   │   ├── base.py           # Abstract base classes
│   │   ├── document_loader.py # Docling integration
│   │   ├── code_loader.py    # Repository walker
│   │   ├── tgf_loader.py     # TGF CSV loader for Ericsson test data
│   │   └── splitters/        # Text splitters
│   │       ├── code_splitter.py     # AST-based
│   │       ├── markdown_splitter.py # Header-based
│   │       └── semantic_splitter.py # Embeddings-based
│   ├── storage_writers.py # Postgres/BM25/Neo4j writers
│   └── ingestion.py  # Data ingestion pipeline
├── evaluation/       # Evaluation framework
│   ├── metrics.py    # P@k, R@k, MAP, MRR calculations
│   ├── agentic_evaluator.py # Full agent pipeline evaluation
│   ├── entity_extractor.py  # Extract entity IDs from responses
│   └── tool_tracker.py      # Tool usage statistics
├── kg/               # Knowledge graph ontology
├── middleware/       # Agent middleware
│   └── pii.py        # PII detection and redaction
├── models/           # LLM and embedding wrappers
├── storage/          # Database clients
│   ├── neo4j_client.py
│   ├── postgres_client.py
│   └── bm25_retriever.py
└── tools/            # Retrieval tools (@tool decorator pattern)
    ├── vector_search.py   # create_vector_search_tool()
    ├── keyword_search.py  # create_keyword_search_tool()
    ├── graph_traverse.py  # create_graph_traverse_tool()
    ├── hybrid_search.py   # create_hybrid_search_tool()
    └── schemas.py

Development

Running Tests

poetry run pytest

Code Quality

# Format code
poetry run black src/ tests/

# Lint
poetry run ruff check src/ tests/

# Type checking (when enabled)
poetry run mypy src/

Working with Loaders

Document Loading with Docling

from agrag.data.loaders import DoclingDocumentLoader

# Load PDF with AI-powered parsing
loader = DoclingDocumentLoader(
    file_path="requirements.pdf",
    use_chunker=True,  # Enable semantic chunking
    table_mode="accurate",  # Use TableFormer for tables
)
documents = loader.load()

# Each document has rich metadata
for doc in documents:
    print(doc.metadata)  # headings, page_number, bbox, etc.

Code Repository Loading

from agrag.data.loaders import CodeLoader

# Load Python repository
loader = CodeLoader(
    repo_path="/path/to/repo",
    languages=["python"],
    include_patterns=["src/**/*.py"],
)
code_docs = loader.load()

# Each document represents a function/class with AST metadata
for doc in code_docs:
    print(doc.metadata)  # type, name, signature, line_start, complexity, etc.

Custom Text Splitting

from agrag.data.loaders.splitters import CodeSplitter, MarkdownSplitter

# AST-based code splitting
code_splitter = CodeSplitter(language="python")
functions = code_splitter.split_code(source_code)

# Header-based markdown splitting
md_splitter = MarkdownSplitter(max_chunk_size=1000)
sections = md_splitter.split_text(markdown_content)

TGF (Test Governance Framework) Loading

from agrag.data.loaders import TGFLoader

# Load TGF CSV with test execution results
loader = TGFLoader("data/tgf_export.csv")
test_cases = loader.load()

# Entities include relationships to requirements and functions
for tc in test_cases:
    print(tc.to_test_case_entity())  # id, name, test_type, etc.
    print(tc.requirement_ids)  # Related requirement IDs
    print(tc.function_names)   # Functions under test

Database Management

Storage Writes

from agrag.data.storage_writers import PostgresWriter, BM25Writer, GraphWriter

postgres_writer = PostgresWriter()
bm25_writer = BM25Writer()

# Write entities to PostgreSQL + BM25
postgres_count = postgres_writer.write_entities_batch(
    entities=my_entities, entity_type="Requirement", batch_size=100
)
bm25_count = bm25_writer.write_entities_batch(
    entities=my_entities, entity_type="Requirement", batch_size=100
)

# Persist BM25 index
bm25_writer.persist_index("data/bm25_index.pkl")

# Optional: materialize graph entities explicitly
# graph_writer = GraphWriter()
# graph_writer.write_entities_batch(entities=my_entities, entity_type="Requirement")

Research Questions

RQ1: Knowledge Graph Ontology

The system implements a domain-specific ontology covering:

  • 8 entity types (ChangeRequest, File, Component, Requirement, TestCase, Function, Class, Module)
  • 9 relationship types (TOUCHES, PART_OF, VERIFIES, COVERS, CALLS, DEFINED_IN, INHERITS_FROM, BELONGS_TO, DEPENDS_ON)
  • Rich metadata (priorities, test types, signatures, etc.)
  • Vector embeddings for all entities (768-dim)

RQ2: Retrieval Strategy Comparison

Seven retrieval strategies are implemented and evaluated (including fixed baselines):

  • Vector Search: Semantic similarity (PostgreSQL pgvector HNSW)
  • Keyword Search: Lexical matching (PostgreSQL pg_search BM25)
  • Graph Traversal: Structural relationships (Neo4j Cypher queries)
  • Hybrid Search: RRF fusion of vector + keyword (PostgreSQL-native)
  • Fixed RAG: Retrieval-only baseline using hybrid search
  • Fixed GraphRAG: Hybrid retrieval + graph traversal baseline
  • Agent: Dynamic strategy selection via LLM reasoning

The agentic evaluation framework (AgenticEvaluator) enables:

  • Full agent pipeline testing with dynamic tool selection
  • Comparison of static strategies vs. agent-driven retrieval
  • Tool usage pattern analysis and execution statistics
  • Entity extraction from natural language responses

The evaluation framework supports comparative analysis with Precision@k, Recall@k, MAP, MRR, and F1@k.

RQ3: HITL Workflows

LangChain create_agent API with PostgresSaver checkpointing enables:

  • Conversation persistence across sessions
  • Human approval before tool execution (via HumanInTheLoopMiddleware)
  • State inspection and modification
  • Thread-based conversation management
  • Safe mode (default) vs YOLO mode for autonomous execution
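
The difference between safe mode and YOLO mode comes down to whether an approval step sits in front of each tool call. The plain-Python sketch below shows that gate in isolation. It is a conceptual stand-in and does not use or reproduce the HumanInTheLoopMiddleware API; every name in it is hypothetical.

```python
from typing import Any, Callable


def run_tool_with_approval(
    tool_name: str,
    tool_args: dict,
    tool: Callable[..., Any],
    approve: Callable[[str, dict], bool],
    auto_approve: bool = False,
) -> dict:
    """Run `tool` only if the human approves, or if auto_approve is set.

    auto_approve=False corresponds to safe mode, True to YOLO mode.
    """
    if not auto_approve and not approve(tool_name, tool_args):
        return {"status": "rejected", "tool": tool_name, "result": None}
    return {"status": "executed", "tool": tool_name, "result": tool(**tool_args)}


def ask_on_terminal(tool_name: str, tool_args: dict) -> bool:
    """Example approval callback that prompts on the terminal."""
    answer = input(f"Run {tool_name} with {tool_args}? [y/N] ")
    return answer.strip().lower() == "y"
```

Passing the approval step in as a callback keeps the gate testable: a script can supply a function that always approves or always rejects instead of prompting.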

License

See the LICENSE file for details.

Acknowledgments

Support & Contact

This is an academic research project with ongoing maintenance.