GraphRAG Benchmark

Three-Pipeline RAG Benchmark on Biomedical Literature

Results • Architecture • Setup • Usage • Benchmark • Report

Problem Statement

Retrieval-Augmented Generation (RAG) improves LLM answer quality by supplying external context, but standard vector-based RAG can be token-inefficient: it retrieves entire document chunks even when only a few sentences are relevant, which wastes tokens and inflates cost.

Can knowledge graphs (GraphRAG) reduce token consumption while maintaining answer quality?

This project answers that question with a controlled 3-pipeline benchmark on 500 synthetic PubMed-style medical abstracts.

Key Results

Pipeline	Avg Tokens	Avg Latency	Judge Pass Rate	Token vs Basic RAG
LLM-Only	252	3.84s	100%	-67.5%
Basic RAG (FAISS)	774	3.35s	100%	baseline
GraphRAG (TigerGraph)	594	15.01s	100%	-23.3%

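GraphRAG saves 180 tokens per query (23.3%) vs Basic RAG while keeping a 100% LLM-judge pass rate across all 35 benchmark queries. The trade-off is latency: 15.01s on average vs 3.35s for Basic RAG, with 11.76s of that spent in retrieval.

At 1M queries, the per-query saving adds up to roughly 180 million fewer tokens than Basic RAG.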

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Streamlit Dashboard                                │
│                    (Live Query + Saved Results Viewer)                       │
└──────┬──────────────────────────┬──────────────────────────┬────────────────┘
       │                          │                          │
       ▼                          ▼                          ▼
┌──────────────┐    ┌──────────────────────┐    ┌──────────────────────────┐
│  LLM-Only    │    │     Basic RAG        │    │  TigerGraph GraphRAG     │
│              │    │                      │    │                          │
│  ┌────────┐  │    │  ┌────────┐  ┌─────┐ │    │  ┌────────┐  ┌────────┐  │
│  │ Ollama │  │    │  │nomic-  │  │FAISS│ │    │  │nomic-  │  │GraphRAG│  │
│  │ qwen2  │  │    │  │embed-  │→ │Index│ │    │  │embed-  │→ │Search  │  │
│  │ 0.5b   │  │    │  │text    │  │     │ │    │  │text    │  │API     │  │
│  └────────┘  │    │  └────────┘  └─────┘ │    │  └────────┘  └───┬────┘  │
│       │      │    │       │         │     │    │       │          │       │
└───────┼──────┘    └───────┼─────────┼─────┘    └───────┼──────────┼───────┘
        │                   │         │                  │          │
        ▼                   ▼         ▼                  ▼          ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Ollama (local, zero API costs)                     │
│                    qwen2:0.5b  +  nomic-embed-text                          │
└─────────────────────────────────────────────────────────────────────────────┘

Pipeline Descriptions

Pipeline	Retrieval	Context	Generation
LLM-Only	None	Model knowledge only	Direct prompt to Ollama
Basic RAG	FAISS vector similarity	Top-5 full document chunks	Prompt + context → Ollama
GraphRAG	TigerGraph graph traversal + entity disambiguation	Graph-connected entities only	Prompt + graph context → Ollama
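The difference between chunk context and graph-connected context in the table above can be illustrated with a toy in-memory graph. This is only a sketch: it does not call the TigerGraph GraphRAG Search API, and the graph contents, `graph_context`, and `build_prompt` are hypothetical names chosen for illustration.

```python
from collections import deque

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
GRAPH = {
    "metformin": [("treats", "type 2 diabetes")],
    "type 2 diabetes": [("risk_factor_for", "cardiovascular disease")],
    "cardiovascular disease": [],
}

def graph_context(seeds, graph, max_hops=1):
    """Collect 'subject relation object' facts within max_hops of the seed entities."""
    facts = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        entity, depth = queue.popleft()
        if depth >= max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in graph.get(entity, []):
            facts.append(f"{entity} {relation} {neighbor}")
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts

def build_prompt(question, facts):
    """Place only the graph-derived facts in the prompt, not whole document chunks."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

With `max_hops=1` the context for the seed `metformin` is a single short fact, whereas a chunk-based retriever would pass along five full abstracts. That is the mechanism by which a graph pipeline can end up with a smaller prompt.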

Features

Three complete pipelines — LLM-Only baseline, Basic RAG (FAISS), and TigerGraph GraphRAG
Live Streamlit dashboard — Compare all 3 pipelines on any question side-by-side
Full benchmark suite — 35 queries across 4 question types (summary, relational, lookup, aggregation)
LLM-as-a-Judge evaluation — Automated PASS/FAIL grading using the same Ollama model
BERTScore evaluation — Semantic similarity scoring (configurable, optional)
Comprehensive metrics — Token usage, latency, retrieval time, and answer quality
Zero API costs — All LLM calls and embeddings run locally via Ollama
500 synthetic PubMed abstracts — 15 disease areas, multi-disease relational queries
Interactive charts — Plotly bar charts for tokens, latency, and accuracy comparison
Saved results viewer — Load and explore previous benchmark runs

Tech Stack

Component	Technology
GraphRAG Backend	Official TigerGraph GraphRAG Docker containers (ECC + Search API)
Graph Database	TigerGraph Cloud (free tier)
LLM Inference	Ollama + qwen2:0.5b
Embeddings	Ollama + nomic-embed-text (768d)
Vector Store	FAISS (IndexFlatL2)
Dashboard	Streamlit + Plotly
Evaluation	LLM-as-a-Judge (qwen2:0.5b) + BERTScore (optional)
Data	Synthetic PubMed dataset generator
Packaging	Python 3.10+, pip
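FAISS's IndexFlatL2 performs exact, brute-force nearest-neighbor search using squared L2 distance. The sketch below reproduces that behavior in plain Python so the retrieval step is easy to inspect. It is a stand-in for illustration, not the code in `pipelines/basic_rag.py`; `squared_l2` and `top_k` are hypothetical helper names.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, vectors, k=5):
    """Return (index, squared_distance) for the k nearest vectors, closest first."""
    scored = ((squared_l2(query, vector), i) for i, vector in enumerate(vectors))
    return [(i, distance) for distance, i in heapq.nsmallest(k, scored)]
```

With FAISS itself, the equivalent calls would be along the lines of `index = faiss.IndexFlatL2(768)`, `index.add(embeddings)`, and `distances, ids = index.search(query_embeddings, 5)`, where the inputs are float32 NumPy arrays with one 768-dimensional row per vector.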

Setup

Prerequisites

Ollama installed and running
Python 3.10+
Docker (for TigerGraph GraphRAG containers)
TigerGraph Cloud account (free tier)

1. Clone and install

git clone <your-repo-url>
cd graphrag-benchmark
pip install -r requirements.txt

2. Pull Ollama models

ollama pull qwen2:0.5b
ollama pull nomic-embed-text

3. Configure environment

cp .env.example .env
# Edit .env with your TigerGraph Cloud credentials
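Before starting the containers it can help to check that the `.env` file is complete. The sketch below parses `KEY=VALUE` lines and reports missing entries. The key names in `REQUIRED_KEYS` are hypothetical placeholders; the real names are whatever `.env.example` defines.

```python
# Hypothetical key names; replace them with the keys listed in .env.example.
REQUIRED_KEYS = ("TG_HOST", "TG_USERNAME", "TG_PASSWORD")

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

A typical use would be `missing_keys(parse_env(open(".env").read()))`, aborting with a clear message if the returned list is not empty.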

4. Start GraphRAG Docker containers

cd graphrag
docker compose up -d graphrag graphrag-ecc

5. Generate dataset and build FAISS index

python scripts/build_dataset.py
python scripts/build_faiss_index.py

Usage

Quick Start (single command)

python setup_and_run.py

Run the full benchmark (35 queries)

python scripts/run_benchmark.py

Launch the Streamlit dashboard

streamlit run dashboard/app.py

Open http://localhost:8501 and:

Type a medical question
Click "Run All Pipelines"
Compare LLM-Only, Basic RAG, and GraphRAG results side-by-side

Single demo query

python scripts/demo_query.py

Generate benchmark report

python scripts/generate_report.py

Benchmark Results

Detailed Metrics (35 queries)

Metric	LLM-Only	Basic RAG	GraphRAG
Total tokens	8,806	27,085	20,779
Avg tokens per query	251.6	773.9	593.7
Min tokens	64	634	501
Max tokens	706	1,135	791
Avg prompt tokens	15.2	651.4	499.4
Avg completion tokens	236.4	122.5	94.3
Avg total latency	3.84s	3.35s	15.01s
Avg retrieval time	0.00s	0.02s	11.76s
LLM Judge pass rate	100%	100%	100%

Cross-Pipeline Comparisons

Comparison	Token Impact	Latency Impact
GraphRAG vs Basic RAG	-23.3% (-180 tokens)	+11.66s
GraphRAG vs LLM-Only	+136.0% (+342 tokens)	+11.17s
Basic RAG vs LLM-Only	+207.6% (+522 tokens)	-0.49s
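The comparison rows follow directly from the per-pipeline averages in the Detailed Metrics table. A minimal sketch of that arithmetic is shown here; `AVERAGES` copies the reported averages, and `compare` is a hypothetical helper rather than a function from this repository.

```python
# Averages copied from the Detailed Metrics table (tokens per query, seconds).
AVERAGES = {
    "LLM-Only": {"tokens": 251.6, "latency": 3.84},
    "Basic RAG": {"tokens": 773.9, "latency": 3.35},
    "GraphRAG": {"tokens": 593.7, "latency": 15.01},
}

def compare(candidate, baseline, averages=AVERAGES):
    """Token and latency change of candidate relative to baseline."""
    c, b = averages[candidate], averages[baseline]
    token_delta = c["tokens"] - b["tokens"]
    return {
        "token_pct": round(100 * token_delta / b["tokens"], 1),
        "token_delta": round(token_delta),
        "latency_delta": round(c["latency"] - b["latency"], 2),
    }
```

For example, `compare("GraphRAG", "Basic RAG")` gives a token change of -23.3% (-180 tokens) and a latency change of +11.66s, matching the first row of the table.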

Query Type Breakdown

Query Type	Queries	LLM-Only	Basic RAG	GraphRAG
Summary	10	296 tok, 4.25s	829 tok, 3.60s	578 tok, 12.42s
Relational	10	262 tok, 3.79s	763 tok, 3.32s	651 tok, 12.76s
Lookup	10	185 tok, 3.47s	784 tok, 3.35s	545 tok, 12.19s
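A per-type breakdown like the one above is a group-by over per-query records. The sketch below assumes each record is a dict with `query_type`, `tokens`, and `latency` fields; that record layout and the `breakdown` helper are assumptions for illustration, not the format written by `scripts/run_benchmark.py`.

```python
from collections import defaultdict
from statistics import mean

def breakdown(records):
    """Group per-query records by type: {type: (count, avg_tokens, avg_latency)}."""
    groups = defaultdict(list)
    for record in records:
        groups[record["query_type"]].append(record)
    return {
        query_type: (
            len(rows),
            round(mean(row["tokens"] for row in rows)),
            round(mean(row["latency"] for row in rows), 2),
        )
        for query_type, rows in groups.items()
    }
```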

Metrics Explained

Tokens

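Total LLM tokens consumed (prompt + completion). Measured via tiktoken with cl100k_base encoding. Because cl100k_base is not Qwen2's own tokenizer, the counts are a consistent estimate across pipelines rather than the exact number of tokens the model processed. Lower is better for cost efficiency.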
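A minimal sketch of this measurement follows. `count_tokens` takes the encoder as a parameter so it can be exercised without tiktoken installed; both helper names are hypothetical, and the whitespace fallback is an assumption added only to keep the sketch runnable offline.

```python
def count_tokens(prompt, completion, encode):
    """Return prompt, completion, and total token counts using the given encoder."""
    prompt_tokens = len(encode(prompt))
    completion_tokens = len(encode(completion))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

def load_encoder():
    """Prefer tiktoken's cl100k_base; fall back to whitespace splitting."""
    try:
        import tiktoken
        return tiktoken.get_encoding("cl100k_base").encode
    except Exception:  # tiktoken missing, or encoding files not downloadable
        return str.split
```

In a pipeline this would be called as `count_tokens(prompt, answer, load_encoder())`, once per query, with the results averaged afterwards.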

Latency

End-to-end wall-clock time in seconds. Includes retrieval, generation, and post-processing.
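One way to separate retrieval time from total latency is to take timestamps around each stage. The sketch below assumes the pipeline can be expressed as a `retrieve` callable and a `generate` callable; both parameters and the `timed_run` name are hypothetical.

```python
import time

def timed_run(question, retrieve, generate):
    """Run retrieval then generation, recording wall-clock time for each stage."""
    start = time.perf_counter()
    context = retrieve(question)
    retrieval_done = time.perf_counter()
    answer = generate(question, context)
    end = time.perf_counter()
    return {
        "answer": answer,
        "retrieval_time": retrieval_done - start,
        "total_latency": end - start,
    }
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, so the measured intervals are not affected by system clock adjustments.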

LLM-as-a-Judge

Automated quality evaluation using the same qwen2:0.5b model. The model is asked to grade each answer PASS/FAIL based on factual accuracy, completeness, and consistency with ground truth. No human annotation required.
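The grading step can be sketched as a prompt builder, a verdict parser, and a call to Ollama's local REST endpoint. The prompt wording, the function names, and the rule that ambiguous replies count as FAIL are assumptions for illustration; the actual logic lives in `evaluation/llm_judge.py` and may differ.

```python
import json
import re
import urllib.request

def build_judge_prompt(question, answer, ground_truth):
    """Assemble a grading prompt (hypothetical wording)."""
    return (
        "You are grading an answer to a medical question.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Answer: {answer}\n"
        "Judge factual accuracy, completeness, and consistency with the ground truth.\n"
        "Reply with exactly one word: PASS or FAIL."
    )

def parse_verdict(reply):
    """Return 'PASS' or 'FAIL'; replies with neither or both words count as FAIL."""
    found = set(re.findall(r"\b(PASS|FAIL)\b", reply.upper()))
    return "PASS" if found == {"PASS"} else "FAIL"

def ollama_generate(prompt, model="qwen2:0.5b", host="http://localhost:11434"):
    """Call Ollama's /api/generate endpoint (non-streaming) and return the text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

def judge(question, answer, ground_truth, generate=ollama_generate):
    """Grade one answer; generate is injectable so the parser can be tested offline."""
    return parse_verdict(generate(build_judge_prompt(question, answer, ground_truth)))
```

Treating unparseable replies as FAIL is a conservative choice: a small model that ignores the output format cannot inflate the pass rate by accident.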

BERTScore

Optional semantic similarity metric using microsoft/deberta-xlarge-mnli. Computes precision, recall, and F1 between predicted and ground truth answers at the token level. Useful for fine-grained quality assessment.
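The token-level matching behind BERTScore can be sketched with precomputed embeddings: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 combines the two. This toy version takes plain lists of vectors and omits the contextual embedding model, IDF weighting, and baseline rescaling that a real BERTScore implementation may apply; the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_match_scores(candidate_vecs, reference_vecs):
    """BERTScore-style greedy matching over precomputed token embeddings."""
    if not candidate_vecs or not reference_vecs:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```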

Project Structure

graphrag-benchmark/
├── config/
│   └── schema.gsql              # TigerGraph graph schema
├── dashboard/
│   └── app.py                   # Streamlit web dashboard
├── data/pubmed/                 # Generated dataset (gitignored)
├── evaluation/
│   ├── llm_judge.py             # LLM-as-a-Judge evaluator
│   └── bertscore_eval.py        # BERTScore evaluator
├── graphrag/                    # TigerGraph GraphRAG Docker config
│   ├── configs/server_config.json
│   └── docker-compose.yml
├── outputs/                     # Benchmark results (gitignored except samples)
├── pipelines/
│   ├── llm_only.py              # Pipeline 1: Direct LLM
│   ├── basic_rag.py             # Pipeline 2: FAISS + LLM
│   ├── graphrag_pipeline.py     # Pipeline 3: TigerGraph GraphRAG
│   └── graphrag_local.py        # Local NetworkX GraphRAG (alternative)
├── reports/
│   └── benchmark_report.md      # Full benchmark report
├── scripts/
│   ├── build_dataset.py         # PubMed dataset generator
│   ├── build_faiss_index.py     # FAISS index builder
│   ├── run_benchmark.py         # Full 35-query benchmark
│   ├── demo_query.py            # Quick demo script
│   ├── generate_report.py       # Report generator
│   └── run_bertscore.py         # BERTScore evaluation
├── .env.example                 # Environment template
├── setup_and_run.py             # One-command setup
├── requirements.txt
├── pyproject.toml
└── README.md

Demo

Dashboard Screenshots

Screenshots are still to be added; see the screenshots/ directory.

Live Query Mode: Three side-by-side pipeline results with metrics
Saved Results Mode: Full benchmark summary with expandable query details
Metrics Comparison: Plotly bar charts for tokens, latency, and accuracy

Demo Video Script

See demo_video_script.md for a 5-7 minute walkthrough covering:

Problem statement and motivation
Three-pipeline architecture
Live dashboard demonstration
Evaluation methodology
Key results and findings

Future Improvements

Multi-hop reasoning queries — Test GraphRAG on deeper (3+ hop) queries
Real PubMed dataset — Replace synthetic data with actual PubMed abstracts
GPU acceleration — Speed up GraphRAG ECC with GPU-enabled Ollama
Additional LLMs — Compare across different models (Llama 3, Mistral, Gemma)
Cost modeling — Dollar-based cost comparison at scale
Query caching — Cache embeddings and search results for repeated queries
Hybrid retrieval — Combine vector + graph retrieval for optimal results
CI/CD pipeline — Automated benchmark runs on schedule

License

Distributed under the MIT License. See LICENSE for more information.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Acknowledgments

Built for the TigerGraph GraphRAG Hackathon
Powered by TigerGraph Cloud free tier
Ollama for local LLM inference
FAISS for vector search
