Agentic GraphRAG for Test Scope Analysis

Master's Thesis Research Project
Conducted in collaboration with Ericsson
Linköping University, 2026

A research implementation of an agentic Retrieval-Augmented Generation (RAG) system combining Knowledge Graphs, Vector Search, and Human-in-the-Loop workflows for test scope analysis in telecommunications software systems.

⚠️ Academic Research Project

This repository contains the implementation for a Master's Thesis. The code is made publicly available for:

Academic transparency and reproducible research
Peer review and thesis evaluation
Educational purposes and research community benefit

Important: This is a research artifact, not a production-ready commercial product. See LICENSE, DISCLAIMER.md, and CONTRIBUTING.md for usage terms and restrictions.

Overview

This system implements an agentic RAG architecture built to address three research questions:

RQ1: What Knowledge Graph ontology effectively represents software engineering entities and relationships for test scope analysis?
RQ2: How do different retrieval strategies (vector, keyword, graph, hybrid) compare for test-related queries?
RQ3: How can Human-in-the-Loop (HITL) agent workflows improve retrieval quality and user control?

Key Features

Agent & Orchestration

LangChain create_agent API: Modern high-level agent API with built-in middleware support
HITL Workflows: PostgresSaver checkpointing for human intervention and tool approval
Middleware Support: Built-in ModelCallLimit, ToolCallLimit, HumanInTheLoop, and PII detection middleware
LangSmith Integration: Full observability and debugging
Gemini Thinking Configuration: Adjustable reasoning depth via thinking_level (low/medium/high/minimal) or legacy budgets
Headless Mode: Scriptable CLI with text/json/stream-json output for automation
Conversation Export: Export chat transcripts with optional tool call details

Storage & Retrieval

Dual Storage Architecture: Neo4j (knowledge graph) + PostgreSQL (pgvector + pg_search BM25)
4 Retrieval Tools (using @tool decorator pattern with factory functions):
- Vector Search (pgvector semantic similarity)
- Keyword Search (pg_search BM25 lexical matching)
- Graph Traversal (Neo4j Cypher queries)
- Hybrid Search (RRF fusion of vector + keyword)

Document Loading & Processing

AI-Powered Document Parsing: Docling integration (15+ formats)
- DocLayNet layout analysis (IBM Research AI model)
- TableFormer for table structure recognition
- HybridChunker for semantic chunking
AST-Based Code Parsing: Tree-sitter for structure-aware code analysis
- Extract functions, classes, methods with full context
- Preserve parent-child relationships and docstrings
- Multi-language support (Python, Java, etc.)
TGF (Test Governance Framework) Loader: CSV loader for Ericsson test execution results
- Parses TGF CSV format with test metadata
- Maps to TestCase entities with relationships
- Validates and normalizes test results and types
Storage Writers: Modular writers for PostgreSQL/BM25 (default) and Neo4j (explicit graph materialization)
- Retry logic with exponential backoff
- Idempotent upserts for reliability
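
The retry and upsert behaviour listed for the storage writers can be sketched in a few lines. This is a minimal illustration, not the repository's writer code: `retry_with_backoff`, `upsert_entity`, the in-memory `store`, and the `priority` field are hypothetical names chosen for the example, and the delay values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: tuple[type[Exception], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, waiting base_delay * 2**attempt after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


def upsert_entity(store: dict[str, dict], entity: dict) -> None:
    """Insert or overwrite by ID, so replaying the same write is harmless."""
    store[entity["id"]] = entity


# Usage: a write that fails twice with a transient error, then succeeds.
store: dict[str, dict] = {}
delays: list[float] = []
calls = {"n": 0}


def flaky_write() -> int:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    upsert_entity(store, {"id": "REQ_AUTH_005", "priority": "high"})
    return len(store)


# Pass delays.append instead of time.sleep so the example runs instantly.
count = retry_with_backoff(flaky_write, sleep=delays.append)
```

Because the write is keyed by entity ID, a retry that repeats an already-applied write leaves the store unchanged, which is what makes retrying safe.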

Evaluation

Comprehensive Metrics: Precision@k, Recall@k, MAP, MRR, F1@k
Agentic Evaluation Framework: Full agent pipeline testing with dynamic strategy comparison
Multi-Strategy Comparison: Evaluate vector, keyword, hybrid, graph, and agent strategies
Entity Extraction: Automatic extraction of entity IDs from agent responses
Tool Usage Tracking: Monitor tool selection patterns and execution statistics

Architecture

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#0366d6','primaryTextColor':'#fff','primaryBorderColor':'#0366d6','lineColor':'#6a737d','secondaryColor':'#f6f8fa','tertiaryColor':'#fff','fontSize':'16px'}}}%%
graph TB
    subgraph Agent["LangChain create_agent API"]
        Middleware["Middleware<br/>(HITL, PII, Limits)"]
        ReActLoop["ReAct<br/>Loop"]
        ToolExecution["Tool<br/>Execution"]
        
        Middleware --> ReActLoop
        ReActLoop --> ToolExecution
        ToolExecution -.-> ReActLoop
    end
    
    subgraph Tools["Retrieval Tools (@tool)"]
        VectorSearch["Vector<br/>Search"]
        KeywordSearch["Keyword<br/>Search"]
        GraphTraverse["Graph<br/>Traverse"]
        HybridSearch["Hybrid<br/>Search"]
    end
    
    subgraph Storage["Storage Layer"]
        Neo4j["Neo4j<br/>(Knowledge Graph)"]
        PostgreSQL["PostgreSQL<br/>(pgvector + pg_search)"]
    end
    
    ToolExecution --> VectorSearch
    ToolExecution --> KeywordSearch
    ToolExecution --> GraphTraverse
    ToolExecution --> HybridSearch
    
    VectorSearch --> PostgreSQL
    KeywordSearch --> PostgreSQL
    GraphTraverse --> Neo4j
    HybridSearch --> PostgreSQL

Installation

Prerequisites

Python 3.11+
Neo4j 5.20+ (with APOC plugin for graph operations)
PostgreSQL 15+ with pgvector and pg_search extensions
Poetry 1.8+

Setup

Clone the repository:

git clone <repository-url>
cd agentic-rag-test-scope-analysis

Install dependencies with Poetry:

poetry install

Create environment file:

cp .env.example .env
# Edit .env with your API keys and database credentials

Initialize database schemas:

poetry run agrag init

Configuration

See .env.example for all configuration options. Key settings:

GOOGLE_API_KEY: Required for LLM and embeddings
NEO4J_URI, NEO4J_PASSWORD: Neo4j connection
NEON_CONNECTION_STRING: PostgreSQL/Neon connection
LANGCHAIN_API_KEY: Optional, for LangSmith tracing
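
A quick pre-flight check of these settings might look like the sketch below. It assumes that the Neo4j and PostgreSQL variables are required alongside `GOOGLE_API_KEY` (the list above only marks `GOOGLE_API_KEY` as required and `LANGCHAIN_API_KEY` as optional); `missing_settings` is a hypothetical helper, not part of the `agrag` package.

```python
from collections.abc import Mapping

# Assumption: the database settings are treated as required for a full run.
REQUIRED_SETTINGS = (
    "GOOGLE_API_KEY",
    "NEO4J_URI",
    "NEO4J_PASSWORD",
    "NEON_CONNECTION_STRING",
)
OPTIONAL_SETTINGS = ("LANGCHAIN_API_KEY",)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required setting names that are absent or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Usage with an explicit mapping; pass os.environ to check the real environment.
example_env = {"GOOGLE_API_KEY": "placeholder", "NEO4J_URI": "bolt://localhost:7687"}
problems = missing_settings(example_env)
```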

Usage

CLI Commands

See src/agrag/cli/README.md for a deeper CLI reference, including headless mode.

Interactive Chat (Recommended)

# Start interactive chat mode (like Claude Code, Copilot CLI)
# Safe by default - you approve each tool before execution
poetry run agrag chat

# Resume a previous conversation
poetry run agrag chat --thread-id my-session

# YOLO mode - autonomous execution without approvals (use with caution)
poetry run agrag chat --yolo

The interactive chat mode provides a conversational interface with:

Natural language conversation
Automatic conversation persistence (resume anytime)
Real-time streaming responses with progress indicators
Built-in commands (/help, /stats, /exit, etc.)
Safe by default - you approve each tool execution (HITL mode)

Modes:

Default (Safe Mode): Agent asks for approval before each tool execution
- ✅ You approve every tool call: safer and better for learning
- Best for: normal usage, exploring, sensitive data
YOLO Mode (--yolo): Agent executes autonomously without asking
- ⚡ Faster but uncontrolled
- Best for: trusted workflows, demos, when you're confident
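
The difference between the two modes comes down to whether an approval step sits in front of each tool call. The sketch below illustrates that idea in plain Python; it is not how the repository wires HumanInTheLoopMiddleware, and `run_tool_with_approval`, `fake_vector_search`, and the returned status dictionary are hypothetical.

```python
from typing import Any, Callable


def run_tool_with_approval(
    tool: Callable[..., Any],
    args: dict[str, Any],
    approve: Callable[[str, dict[str, Any]], bool],
    yolo: bool = False,
) -> dict[str, Any]:
    """Run tool(**args) only if yolo is set or the approve callback says yes."""
    if not yolo and not approve(tool.__name__, args):
        return {"status": "rejected", "result": None}
    return {"status": "executed", "result": tool(**args)}


def fake_vector_search(query: str, k: int) -> list[str]:
    """Stand-in tool that returns k placeholder hits."""
    return [f"{query}-hit-{i}" for i in range(k)]


# Safe mode: the callback (here a stub that always declines) gates execution.
declined = run_tool_with_approval(
    fake_vector_search, {"query": "handover", "k": 2}, approve=lambda name, a: False
)

# YOLO mode: the callback is never consulted.
autonomous = run_tool_with_approval(
    fake_vector_search, {"query": "handover", "k": 2}, approve=lambda name, a: False, yolo=True
)
```

In an interactive session the `approve` callback would prompt the user; in this sketch it is a stub so the example runs unattended.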

Available chat commands:

/help - Show help
/clear - Clear screen
/history - View conversation history
/stats - Show session statistics (messages, tool calls, duration)
/reset - Start new conversation
/save - Save conversation to file
/export - Export conversation transcript (use --verbose for tool args/results)
/verbose - Toggle tool call arguments in output
/thinking [level|preset|tokens] - Set Gemini thinking level (low, medium, high, minimal) or legacy budget
/exit or /quit - Exit chat
Ctrl+C (press twice) - Exit immediately

Headless Mode (Scripting & Automation)

# Direct prompt
poetry run agrag -p "What tests cover handover requirements?"

# Pipe stdin
echo "Summarize this" | poetry run agrag

# Combine prompt with stdin
cat README.md | poetry run agrag -p "Summarize this documentation"

# JSON output
poetry run agrag -p "List test cases" --output-format json

# Streaming JSON (JSONL)
poetry run agrag -p "Analyze dependencies" --output-format stream-json

# Persistent headless session (requires Postgres checkpointer)
poetry run agrag -p "List handover requirements" --thread-id eval-001
poetry run agrag -p "Now show tests that verify those" --thread-id eval-001

Headless mode runs without the interactive UI and is intended for scripts, CI, and pipelines. When --thread-id is provided, the CLI will attempt to use the Postgres checkpointer to resume the same session across invocations (falls back to in-memory if unavailable).
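
For pipelines written in Python, the headless invocations above can be wrapped in a small helper. This is a sketch under the assumption that the CLI is launched through `poetry run agrag` as shown; `build_headless_command` and `run_headless` are hypothetical helpers, and the structure of the JSON output is not assumed here, so the raw stdout is returned as text.

```python
import subprocess
from typing import Optional

OUTPUT_FORMATS = ("text", "json", "stream-json")


def build_headless_command(
    prompt: str,
    output_format: Optional[str] = None,
    thread_id: Optional[str] = None,
) -> list[str]:
    """Assemble the argv for a headless run using the flags shown above."""
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    command = ["poetry", "run", "agrag", "-p", prompt]
    if output_format is not None:
        command += ["--output-format", output_format]
    if thread_id is not None:
        command += ["--thread-id", thread_id]
    return command


def run_headless(prompt: str, **options: Optional[str]) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_headless_command(prompt, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


command = build_headless_command(
    "List handover requirements", output_format="json", thread_id="eval-001"
)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.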

Query the System

# Single query (non-interactive)
poetry run agrag query "What tests cover requirement REQ_AUTH_005?"

# With streaming output
poetry run agrag query "Find handover-related test cases" --stream

# With HITL checkpointing
poetry run agrag query "Show dependencies for TestLoginTimeout" --checkpoint --thread-id my-session

Load Documents and Code

# Load documents with Docling (AI-powered parsing)
poetry run agrag load docs /path/to/requirements --use-chunker

# Load with specific formats and options
poetry run agrag load docs /path/to/docs \
  --formats pdf,docx,xlsx \
  --use-chunker \
  --table-mode accurate \
  --max-pages 100

# Load code repository
poetry run agrag load repo /path/to/repository --languages python

# Load with filtering
poetry run agrag load repo /path/to/repo \
  --languages python,java \
  --include "src/**/*.py" \
  --exclude "tests/**"

# Show loading statistics
poetry run agrag load stats

Data Management

# Generate synthetic test data
poetry run agrag generate --requirements 50 --testcases 200

# Generate with evaluation dataset
poetry run agrag generate --requirements 50 --testcases 200 --with-eval

# Generate and immediately ingest (resets databases first)
poetry run agrag generate --requirements 50 --testcases 200 --ingest

# Ingest data into databases
poetry run agrag ingest data/mock/synthetic_dataset.json

Data layout (override root with AGRAG_DATA_ROOT):

data/mock/ for synthetic datasets and evaluation queries
data/ericsson/ for real Ericsson exports (see dataset_template.json)

Initialize Databases

poetry run agrag init

Run Evaluation

# Evaluate all strategies on a dataset
poetry run agrag evaluate --dataset data/mock/eval_queries.json --output results.json

# Evaluate specific strategy
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy vector

# Evaluate with synthetic capability suite
poetry run agrag evaluate --suite synthetic-capability

# Evaluate the full agent with dynamic tool selection (RQ2)
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy agent

# Evaluate fixed baselines (RAG vs GraphRAG)
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy rag
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy graphrag

# Evaluate with verbose per-query metrics
poetry run agrag evaluate --dataset data/mock/eval_queries.json --strategy all --verbose

Show Configuration

poetry run agrag info

Programmatic Usage

from agrag.core import create_agent_graph, create_initial_state
from agrag.middleware import get_pii_middleware

# Create agent with default settings
graph = create_agent_graph()

# Or create agent with middleware configuration
graph = create_agent_graph(
    middleware=get_pii_middleware(),  # Optional PII protection
    max_model_calls=10,               # Limit model invocations
    max_tool_calls=20,                # Limit tool executions
    enable_hitl=True,                 # Enable human-in-the-loop
)

# Run query
initial_state = create_initial_state("Find tests for handover failures")
result = graph.invoke(initial_state)

print(result["final_answer"])

Knowledge Graph Ontology

The system uses a custom ontology for software engineering entities:

Entity Types

ChangeRequest: Change requests linked to touched files
File: Source files in the codebase
Component: Higher-level components/modules
Requirement: System requirements with priorities
TestCase: Test cases (unit, integration, protocol, etc.)
Function: Code functions with signatures
Class: Code classes with methods
Module: Code modules/packages

Relationship Types

TOUCHES: ChangeRequest → File
PART_OF: File → Component
VERIFIES: TestCase → Requirement
COVERS: TestCase → Function
CALLS: Function → Function
DEFINED_IN: Function → File
INHERITS_FROM: Class → Class
BELONGS_TO: Class/Function → Module
DEPENDS_ON: Module → Module
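
The relationship list above doubles as a schema: each relationship type constrains which labels may appear at its source and target. A minimal sketch of that constraint as a lookup table follows; `ALLOWED_EDGES` and `is_valid_edge` are hypothetical names, and the repository's own ontology module (`agrag.kg.ontology`) may represent this differently.

```python
# relationship type -> (allowed source labels, allowed target labels)
ALLOWED_EDGES: dict[str, tuple[set[str], set[str]]] = {
    "TOUCHES": ({"ChangeRequest"}, {"File"}),
    "PART_OF": ({"File"}, {"Component"}),
    "VERIFIES": ({"TestCase"}, {"Requirement"}),
    "COVERS": ({"TestCase"}, {"Function"}),
    "CALLS": ({"Function"}, {"Function"}),
    "DEFINED_IN": ({"Function"}, {"File"}),
    "INHERITS_FROM": ({"Class"}, {"Class"}),
    "BELONGS_TO": ({"Class", "Function"}, {"Module"}),
    "DEPENDS_ON": ({"Module"}, {"Module"}),
}


def is_valid_edge(source_label: str, relationship: str, target_label: str) -> bool:
    """Check a (source, relationship, target) triple against the ontology table."""
    if relationship not in ALLOWED_EDGES:
        return False
    sources, targets = ALLOWED_EDGES[relationship]
    return source_label in sources and target_label in targets


ok = is_valid_edge("TestCase", "VERIFIES", "Requirement")
reversed_direction = is_valid_edge("Requirement", "VERIFIES", "TestCase")
```

A check like this can run before graph writes to reject edges whose direction or endpoint types do not match the ontology.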

Retrieval Tools

1. Vector Search

Semantic search using PostgreSQL pgvector (HNSW index, 768-dim embeddings, cosine similarity).

Best for:

Conceptual queries
Finding semantically similar content
Understanding meanings and intent

Example:

from agrag.tools import create_vector_search_tool
from agrag.storage import PostgresClient

postgres_client = PostgresClient()

tool = create_vector_search_tool(postgres_client)
result = tool.invoke({
    "query": "tests related to handover failures",
    "k": 10,
    "node_type": "TestCase"
})
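
To make the ranking criterion concrete, the sketch below scores candidates by cosine similarity with an exact brute-force scan. It only illustrates the metric: in the system described here the search runs inside PostgreSQL with pgvector and an approximate HNSW index, and the two-dimensional vectors and the `top_k` helper are hypothetical stand-ins for 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float], embeddings: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k entity IDs most similar to query_vec, best first."""
    scored = [(eid, cosine_similarity(query_vec, vec)) for eid, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings keyed by hypothetical entity IDs.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.0], embeddings, k=2)
```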

2. Keyword Search

Lexical search using the BM25 (Best Matching 25) probabilistic ranking algorithm via the PostgreSQL pg_search extension.

Best for:

Exact keyword matches
Specific identifiers (test IDs, function names)
Error codes and technical terms

Example:

from agrag.tools import create_keyword_search_tool
from agrag.storage import PostgresClient, BM25RetrieverManager

postgres_client = PostgresClient()
bm25_manager = BM25RetrieverManager()

tool = create_keyword_search_tool(postgres_client, bm25_manager)
result = tool.invoke({
    "query": "TestLoginTimeout",
    "k": 10
})
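
For readers unfamiliar with BM25, the scoring idea can be sketched in plain Python. This is an illustration only: it uses whitespace tokenization, the common defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant, all of which are assumptions; pg_search's tokenizer and parameters may differ. The document IDs and texts are made up.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bm25_scores(
    query: str, documents: dict[str, str], k1: float = 1.2, b: float = 0.75
) -> dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    if not documents:
        return {}
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    query_terms = tokenize(query)
    scores: dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalisation: longer-than-average documents are penalised.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores[doc_id] = score
    return scores


documents = {
    "tc1": "TestLoginTimeout verifies login timeout handling",
    "tc2": "handover failure between cells",
    "tc3": "login page renders",
}
scores = bm25_scores("TestLoginTimeout", documents)
best = max(scores, key=scores.get)
```

Exact identifiers score well under BM25 because they are rare across the corpus (high IDF) and match literally, which is why this tool suits test IDs and function names.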

3. Graph Traversal

Multi-hop graph traversal for structural relationships.

Best for:

Dependency analysis
Coverage tracing
Structural queries

Example:

from agrag.tools import create_graph_traverse_tool
from agrag.storage import Neo4jClient
from agrag.kg.ontology import NodeLabel, RelationshipType

neo4j_client = Neo4jClient()
tool = create_graph_traverse_tool(neo4j_client)

result = tool.invoke({
    "start_node_id": "REQ_AUTH_005",
    "start_node_label": NodeLabel.REQUIREMENT,
    "relationship_types": [RelationshipType.VERIFIES],
    "depth": 2
})
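
Conceptually, a depth-limited traversal is a breadth-first search restricted to the chosen relationship types. The sketch below runs that search over an in-memory edge list instead of Neo4j. It assumes edges may be followed in either direction (so a Requirement can reach the TestCase that VERIFIES it); the `traverse` helper and the sample IDs other than `REQ_AUTH_005` are hypothetical.

```python
from collections import deque

Edge = tuple[str, str, str]  # (source_id, relationship_type, target_id)


def traverse(
    edges: list[Edge], start_id: str, relationship_types: list[str], depth: int
) -> dict[str, int]:
    """Return {node_id: hop_count} for nodes within depth hops of start_id."""
    allowed = set(relationship_types)
    neighbours: dict[str, set[str]] = {}
    for source, relationship, target in edges:
        if relationship in allowed:
            # Assumption: relationships are traversable in both directions.
            neighbours.setdefault(source, set()).add(target)
            neighbours.setdefault(target, set()).add(source)
    hops = {start_id: 0}
    queue = deque([start_id])
    while queue:
        node = queue.popleft()
        if hops[node] == depth:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    del hops[start_id]
    return hops


edges = [
    ("TC_1", "VERIFIES", "REQ_AUTH_005"),
    ("TC_1", "VERIFIES", "REQ_AUTH_006"),
    ("TC_1", "COVERS", "login"),
    ("TC_2", "VERIFIES", "REQ_AUTH_006"),
]
reached = traverse(edges, "REQ_AUTH_005", ["VERIFIES"], depth=2)
# reached == {"TC_1": 1, "REQ_AUTH_006": 2}
```

`TC_2` is three hops away and `login` is only reachable through a COVERS edge, so both fall outside this query.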

4. Hybrid Search

RRF fusion of vector + keyword search.

Best for:

Complex queries needing both semantic and lexical matching
Balancing precision and recall

Example:

from agrag.tools import create_hybrid_search_tool
from agrag.storage import PostgresClient

postgres_client = PostgresClient()

tool = create_hybrid_search_tool(postgres_client)
result = tool.invoke({
    "query": "tests for LTE signaling with timeout errors",
    "k": 10,
    "rrf_k": 60
})
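
Reciprocal Rank Fusion (RRF) combines ranked lists by summing 1 / (rrf_k + rank) for each item across the lists, so items ranked highly by both retrievers rise to the top. The sketch below shows the formula on two toy rankings; the hybrid tool is described as PostgreSQL-native, so this Python function (`reciprocal_rank_fusion`, a hypothetical name) only illustrates the arithmetic.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], rrf_k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each appearance adds 1 / (rrf_k + rank), rank from 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


vector_ranking = ["a", "b", "c"]   # toy output of vector search
keyword_ranking = ["b", "d", "a"]  # toy output of keyword search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking], rrf_k=60)
fused_ids = [doc_id for doc_id, _ in fused]
# fused_ids == ["b", "a", "d", "c"]
```

A larger `rrf_k` flattens the contribution of top ranks, while a smaller value makes the first few positions dominate.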

Evaluation

The system includes comprehensive evaluation metrics:

Precision@k: Fraction of the top-k retrieved items that are relevant
Recall@k: Fraction of all relevant items that appear in the top-k
F1@k: Harmonic mean of Precision@k and Recall@k
Average Precision (AP): Mean of the precision values at each rank where a relevant item appears
Mean Average Precision (MAP): Mean of AP across queries
Reciprocal Rank (RR): Inverse of the rank of the first relevant item (1/rank)
Mean Reciprocal Rank (MRR): Mean of RR across queries

Example:

from agrag.evaluation import evaluate_retrieval

retrieved = ["id1", "id2", "id3", "id4"]
relevant = {"id1", "id3", "id5"}

metrics = evaluate_retrieval(retrieved, relevant, k_values=[1, 3, 5, 10])
# Returns: {
#   "precision@1": 1.0,
#   "recall@1": 0.33,
#   "precision@3": 0.67,
#   ...
# }
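
The per-query metrics can be written out directly from their definitions. The sketch below is not the code in `agrag.evaluation`; the function names are hypothetical, precision divides by k even when fewer than k items were retrieved, AP is normalised by the number of relevant items, and retrieved IDs are assumed to be unique. These conventions are assumptions and the repository may choose differently.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items found in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in retrieved[:k] if item in relevant) / len(relevant)


def f1_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Sum of precision at each relevant hit, divided by the relevant count."""
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["id1", "id2", "id3", "id4"]
relevant = {"id1", "id3", "id5"}
p3 = precision_at_k(retrieved, relevant, 3)   # 2 of the top 3 are relevant
ap = average_precision(retrieved, relevant)   # (1/1 + 2/3) / 3
```

MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all evaluation queries.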

Project Structure

src/agrag/
├── cli/              # CLI application
│   ├── main.py       # Commands: query, chat, load, init, generate, evaluate, etc.
│   ├── interactive.py # Interactive chat interface
│   ├── display.py    # Output formatting utilities
│   ├── hitl.py       # Human-in-the-loop utilities
│   ├── thinking.py   # Gemini thinking configuration
│   ├── commands.py   # Command handler helpers
│   ├── headless.py   # Headless (non-interactive) execution
│   └── README.md     # CLI usage guide
├── config/           # Configuration and logging
├── core/             # Agent using create_agent API
│   ├── state.py      # AgentState definition
│   ├── graph.py      # create_agent_graph() with middleware support
│   └── checkpointing.py # HITL checkpointing
├── data/             # Data generation and ingestion
│   ├── generators/   # Synthetic data generation (TGF-compatible)
│   ├── loaders/      # Document and code loaders
│   │   ├── base.py           # Abstract base classes
│   │   ├── document_loader.py # Docling integration
│   │   ├── code_loader.py    # Repository walker
│   │   ├── tgf_loader.py     # TGF CSV loader for Ericsson test data
│   │   └── splitters/        # Text splitters
│   │       ├── code_splitter.py     # AST-based
│   │       ├── markdown_splitter.py # Header-based
│   │       └── semantic_splitter.py # Embeddings-based
│   ├── storage_writers.py # Postgres/BM25/Neo4j writers
│   └── ingestion.py  # Data ingestion pipeline
├── evaluation/       # Evaluation framework
│   ├── metrics.py    # P@k, R@k, MAP, MRR calculations
│   ├── agentic_evaluator.py # Full agent pipeline evaluation
│   ├── entity_extractor.py  # Extract entity IDs from responses
│   └── tool_tracker.py      # Tool usage statistics
├── kg/               # Knowledge graph ontology
├── middleware/       # Agent middleware
│   └── pii.py        # PII detection and redaction
├── models/           # LLM and embedding wrappers
├── storage/          # Database clients
│   ├── neo4j_client.py
│   ├── postgres_client.py
│   └── bm25_retriever.py
└── tools/            # Retrieval tools (@tool decorator pattern)
    ├── vector_search.py   # create_vector_search_tool()
    ├── keyword_search.py  # create_keyword_search_tool()
    ├── graph_traverse.py  # create_graph_traverse_tool()
    ├── hybrid_search.py   # create_hybrid_search_tool()
    └── schemas.py

Development

Running Tests

poetry run pytest

Code Quality

# Format code
poetry run black src/ tests/

# Lint
poetry run ruff check src/ tests/

# Type checking (when enabled)
poetry run mypy src/

Working with Loaders

Document Loading with Docling

from agrag.data.loaders import DoclingDocumentLoader

# Load PDF with AI-powered parsing
loader = DoclingDocumentLoader(
    file_path="requirements.pdf",
    use_chunker=True,  # Enable semantic chunking
    table_mode="accurate",  # Use TableFormer for tables
)
documents = loader.load()

# Each document has rich metadata
for doc in documents:
    print(doc.metadata)  # headings, page_number, bbox, etc.

Code Repository Loading

from agrag.data.loaders import CodeLoader

# Load Python repository
loader = CodeLoader(
    repo_path="/path/to/repo",
    languages=["python"],
    include_patterns=["src/**/*.py"],
)
code_docs = loader.load()

# Each document represents a function/class with AST metadata
for doc in code_docs:
    print(doc.metadata)  # type, name, signature, line_start, complexity, etc.

Custom Text Splitting

from agrag.data.loaders.splitters import CodeSplitter, MarkdownSplitter

# AST-based code splitting
code_splitter = CodeSplitter(language="python")
functions = code_splitter.split_code(source_code)

# Header-based markdown splitting
md_splitter = MarkdownSplitter(max_chunk_size=1000)
sections = md_splitter.split_text(markdown_content)
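
To show what structure-aware splitting yields, the sketch below extracts functions, classes, and methods from Python source together with their parent and docstring. It uses the standard-library `ast` module as a stand-in for Tree-sitter, so it handles Python only; `split_python_source` and the chunk dictionary keys are hypothetical and do not mirror `CodeSplitter`'s actual output.

```python
import ast
from typing import Optional

DEFINITION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per class/function/method, in source order."""
    chunks: list[dict] = []

    def visit(node: ast.AST, parent: Optional[ast.AST]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, DEFINITION_NODES):
                if isinstance(child, ast.ClassDef):
                    kind = "class"
                elif isinstance(parent, ast.ClassDef):
                    kind = "method"
                else:
                    kind = "function"
                chunks.append({
                    "kind": kind,
                    "name": child.name,
                    "parent": getattr(parent, "name", None),
                    "line_start": child.lineno,
                    "line_end": child.end_lineno,
                    "docstring": ast.get_docstring(child),
                    "text": ast.get_source_segment(source, child),
                })
                visit(child, child)   # nested definitions get this node as parent
            else:
                visit(child, parent)  # keep the enclosing definition as parent

    visit(ast.parse(source), None)
    return chunks


sample = '''
class Handover:
    """Handover logic."""

    def start(self):
        """Begin handover."""
        return True


def helper():
    return 1
'''
chunks = split_python_source(sample)
summary = [(c["kind"], c["name"], c["parent"]) for c in chunks]
```

Keeping the parent name on each chunk is what lets a method be retrieved together with the class it belongs to.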

TGF (Test Governance Framework) Loading

from agrag.data.loaders import TGFLoader

# Load TGF CSV with test execution results
loader = TGFLoader("data/tgf_export.csv")
test_cases = loader.load()

# Entities include relationships to requirements and functions
for tc in test_cases:
    print(tc.to_test_case_entity())  # id, name, test_type, etc.
    print(tc.requirement_ids)  # Related requirement IDs
    print(tc.function_names)   # Functions under test

Database Management

Storage Writes

from agrag.data.storage_writers import PostgresWriter, BM25Writer, GraphWriter

postgres_writer = PostgresWriter()
bm25_writer = BM25Writer()

# Write entities to PostgreSQL + BM25
postgres_count = postgres_writer.write_entities_batch(
    entities=my_entities, entity_type="Requirement", batch_size=100
)
bm25_count = bm25_writer.write_entities_batch(
    entities=my_entities, entity_type="Requirement", batch_size=100
)

# Persist BM25 index
bm25_writer.persist_index("data/bm25_index.pkl")

# Optional: materialize graph entities explicitly
# graph_writer = GraphWriter()
# graph_writer.write_entities_batch(entities=my_entities, entity_type="Requirement")

Research Questions

RQ1: Knowledge Graph Ontology

The system implements a domain-specific ontology covering:

8 entity types (ChangeRequest, File, Component, Requirement, TestCase, Function, Class, Module)
9 relationship types (TOUCHES, PART_OF, VERIFIES, COVERS, CALLS, DEFINED_IN, INHERITS_FROM, BELONGS_TO, DEPENDS_ON)
Rich metadata (priorities, test types, signatures, etc.)
Vector embeddings for all entities (768-dim)

RQ2: Retrieval Strategy Comparison

Seven retrieval strategies implemented and evaluated (including fixed baselines):

Vector Search: Semantic similarity (PostgreSQL pgvector HNSW)
Keyword Search: Lexical matching (PostgreSQL pg_search BM25)
Graph Traversal: Structural relationships (Neo4j Cypher queries)
Hybrid Search: RRF fusion of vector + keyword (PostgreSQL-native)
Fixed RAG: Retrieval-only baseline using hybrid search
Fixed GraphRAG: Hybrid retrieval + graph traversal baseline
Agent: Dynamic strategy selection via LLM reasoning

The agentic evaluation framework (AgenticEvaluator) enables:

Full agent pipeline testing with dynamic tool selection
Comparison of static strategies vs. agent-driven retrieval
Tool usage pattern analysis and execution statistics
Entity extraction from natural language responses

Evaluation framework supports comparative analysis with Precision@k, Recall@k, MAP, MRR, and F1@k.
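
Scoring an agent's free-text answer with the same metrics requires turning the answer into a list of entity IDs first. A minimal sketch of that step follows. It assumes IDs look like `REQ_AUTH_005` (uppercase segments joined by underscores), which is inferred from the examples in this README; `ENTITY_ID_PATTERN`, `extract_entity_ids`, and the `TC_HO_*` IDs are hypothetical, and the repository's extractor may use different rules.

```python
import re

# Assumption: IDs are uppercase alphanumeric segments joined by underscores.
ENTITY_ID_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b")


def extract_entity_ids(response: str) -> list[str]:
    """Return the distinct entity IDs in response, in order of first mention."""
    seen: dict[str, None] = {}
    for match in ENTITY_ID_PATTERN.finditer(response):
        seen.setdefault(match.group(0), None)
    return list(seen)


answer = (
    "REQ_AUTH_005 is verified by TC_HO_001 and TC_HO_002. "
    "TC_HO_001 also exercises the timeout path."
)
ids = extract_entity_ids(answer)
# ids == ["REQ_AUTH_005", "TC_HO_001", "TC_HO_002"]
```

Preserving first-mention order matters because rank-sensitive metrics such as MAP and MRR treat the extracted list as a ranking.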

RQ3: HITL Workflows

LangChain create_agent API with PostgresSaver checkpointing enables:

Conversation persistence across sessions
Human approval before tool execution (via HumanInTheLoopMiddleware)
State inspection and modification
Thread-based conversation management
Safe mode (default) vs YOLO mode for autonomous execution
