mayank0030/graphrag_benchmark

GraphRAG Benchmark

Three-Pipeline RAG Benchmark on Biomedical Literature

Results | Architecture | Setup | Usage | Benchmark | Report

License: MIT | Python 3.10+ | Ollama qwen2:0.5b | TigerGraph Cloud | TigerGraph Hackathon


Problem Statement

Retrieval-Augmented Generation (RAG) improves LLM answer quality by supplying external context, but standard vector-based RAG is token-inefficient: it retrieves entire document chunks even when only a few sentences are relevant, which wastes tokens and inflates cost.

Can knowledge graphs (GraphRAG) reduce token consumption while maintaining answer quality?

This project answers that question with a controlled three-pipeline benchmark on 500 synthetic PubMed-style medical abstracts.


Key Results

| Pipeline | Avg Tokens | Avg Latency | Judge Pass Rate | Tokens vs Basic RAG |
|---|---|---|---|---|
| LLM-Only | 252 | 3.84s | 100% | -67.5% |
| Basic RAG (FAISS) | 774 | 3.35s | 100% | baseline |
| GraphRAG (TigerGraph) | 594 | 15.01s | 100% | -23.3% |

GraphRAG uses 180 fewer tokens per query than Basic RAG (23.3%) while keeping a 100% LLM-judge pass rate across all 35 benchmark queries.

At 1M queries, that difference adds up to roughly 180 million tokens saved relative to Basic RAG, a meaningful cost reduction at scale.
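The scale projection above is simple arithmetic, sketched below for reference. The averages come from the Key Results table; the function names and the per-million-token price are hypothetical and only illustrate the calculation (the benchmark itself runs locally with no API cost).

```python
# Average total tokens per query, taken from the Key Results table.
AVG_TOKENS = {"LLM-Only": 252, "Basic RAG": 774, "GraphRAG": 594}


def tokens_saved(baseline: str, candidate: str, num_queries: int) -> int:
    """Total tokens saved by `candidate` relative to `baseline` over `num_queries`."""
    return (AVG_TOKENS[baseline] - AVG_TOKENS[candidate]) * num_queries


def projected_cost_saving(tokens: int, usd_per_million_tokens: float) -> float:
    """Dollar value of `tokens` at a HYPOTHETICAL per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


saved = tokens_saved("Basic RAG", "GraphRAG", 1_000_000)
print(f"{saved:,} tokens saved")                      # 180,000,000 tokens saved
print(f"${projected_cost_saving(saved, 0.50):,.2f}")  # $90.00 at an assumed $0.50 per 1M tokens
```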


Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Streamlit Dashboard                                │
│                    (Live Query + Saved Results Viewer)                       │
└──────┬──────────────────────────┬──────────────────────────┬────────────────┘
       │                          │                          │
       ▼                          ▼                          ▼
┌──────────────┐    ┌──────────────────────┐    ┌──────────────────────────┐
│  LLM-Only    │    │     Basic RAG        │    │  TigerGraph GraphRAG     │
│              │    │                      │    │                          │
│  ┌────────┐  │    │  ┌────────┐  ┌─────┐ │    │  ┌────────┐  ┌────────┐  │
│  │ Ollama │  │    │  │nomic-  │  │FAISS│ │    │  │nomic-  │  │GraphRAG│  │
│  │ qwen2  │  │    │  │embed-  │→ │Index│ │    │  │embed-  │→ │Search  │  │
│  │ 0.5b   │  │    │  │text    │  │     │ │    │  │text    │  │API     │  │
│  └────────┘  │    │  └────────┘  └─────┘ │    │  └────────┘  └───┬────┘  │
│       │      │    │       │         │     │    │       │          │       │
└───────┼──────┘    └───────┼─────────┼─────┘    └───────┼──────────┼───────┘
        │                   │         │                  │          │
        ▼                   ▼         ▼                  ▼          ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Ollama (local, zero API costs)                     │
│                    qwen2:0.5b  +  nomic-embed-text                          │
└─────────────────────────────────────────────────────────────────────────────┘

Pipeline Descriptions

| Pipeline | Retrieval | Context | Generation |
|---|---|---|---|
| LLM-Only | None | Model knowledge only | Direct prompt to Ollama |
| Basic RAG | FAISS vector similarity | Top-5 full document chunks | Prompt + context → Ollama |
| GraphRAG | TigerGraph graph traversal + entity disambiguation | Graph-connected entities only | Prompt + graph context → Ollama |
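To make the difference between the pipelines concrete, here is a minimal sketch of how context reaches the prompt. It is not the code in `pipelines/`: the function names and prompt wording are assumptions, and the brute-force L2 search is a pure-Python stand-in for what a flat FAISS index computes.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_chunks(query_vec, chunk_vecs, chunks, k: int = 5) -> List[str]:
    """Exact nearest-neighbour search by L2 distance (stand-in for a flat index)."""
    ranked = sorted(range(len(chunks)), key=lambda i: l2_distance(query_vec, chunk_vecs[i]))
    return [chunks[i] for i in ranked[:k]]


def build_prompt(question: str, context: Sequence[str] = ()) -> str:
    """LLM-Only passes no context; the RAG pipelines prepend retrieved text."""
    if not context:
        return f"Question: {question}\nAnswer:"
    joined = "\n\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Toy 2-D "embeddings"; real ones would come from nomic-embed-text (768d).
chunks = ["chunk about diabetes", "chunk about asthma", "chunk about gout"]
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
retrieved = top_k_chunks([0.9, 0.9], vectors, chunks, k=2)
print(build_prompt("What treats asthma?", retrieved))
```

Every retrieved chunk is pasted into the prompt in full, which is why Basic RAG's prompt size grows with `k` and chunk length, while GraphRAG passes only graph-connected entities.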

Features

  • Three complete pipelines — LLM-Only baseline, Basic RAG (FAISS), and TigerGraph GraphRAG
  • Live Streamlit dashboard — Compare all 3 pipelines on any question side-by-side
  • Full benchmark suite — 35 queries across 4 question types (summary, relational, lookup, aggregation)
  • LLM-as-a-Judge evaluation — Automated PASS/FAIL grading using the same Ollama model
  • BERTScore evaluation — Semantic similarity scoring (configurable, optional)
  • Comprehensive metrics — Token usage, latency, retrieval time, and answer quality
  • Zero API costs — All LLM calls and embeddings run locally via Ollama
  • 500 synthetic PubMed abstracts — 15 disease areas, multi-disease relational queries
  • Interactive charts — Plotly bar charts for tokens, latency, and accuracy comparison
  • Saved results viewer — Load and explore previous benchmark runs

Tech Stack

| Component | Technology |
|---|---|
| GraphRAG Backend | Official TigerGraph GraphRAG Docker containers (ECC + Search API) |
| Graph Database | TigerGraph Cloud (free tier) |
| LLM Inference | Ollama + qwen2:0.5b |
| Embeddings | Ollama + nomic-embed-text (768d) |
| Vector Store | FAISS (IndexFlatL2) |
| Dashboard | Streamlit + Plotly |
| Evaluation | LLM-as-a-Judge (qwen2:0.5b) + BERTScore (optional) |
| Data | Synthetic PubMed dataset generator |
| Packaging | Python 3.10+, pip |

Setup

Prerequisites

  • Ollama installed and running
  • Python 3.10+
  • Docker (for TigerGraph GraphRAG containers)
  • TigerGraph Cloud account (free tier)

1. Clone and install

git clone <your-repo-url>
cd graphrag-benchmark
pip install -r requirements.txt

2. Pull Ollama models

ollama pull qwen2:0.5b
ollama pull nomic-embed-text

3. Configure environment

cp .env.example .env
# Edit .env with your TigerGraph Cloud credentials

4. Start GraphRAG Docker containers

cd graphrag
docker compose up -d graphrag graphrag-ecc
cd ..   # return to the repository root before running the scripts below

5. Generate dataset and build FAISS index

python scripts/build_dataset.py
python scripts/build_faiss_index.py

Usage

Quick Start (single command)

python setup_and_run.py

Run the full benchmark (35 queries)

python scripts/run_benchmark.py

Launch the Streamlit dashboard

streamlit run dashboard/app.py

Open http://localhost:8501 and:

  1. Type a medical question
  2. Click "Run All Pipelines"
  3. Compare LLM-Only, Basic RAG, and GraphRAG results side-by-side

Single demo query

python scripts/demo_query.py

Generate benchmark report

python scripts/generate_report.py

Benchmark Results

Detailed Metrics (35 queries)

| Metric | LLM-Only | Basic RAG | GraphRAG |
|---|---|---|---|
| Total tokens | 8,806 | 27,085 | 20,779 |
| Avg tokens per query | 251.6 | 773.9 | 593.7 |
| Min tokens | 64 | 634 | 501 |
| Max tokens | 706 | 1,135 | 791 |
| Avg prompt tokens | 15.2 | 651.4 | 499.4 |
| Avg completion tokens | 236.4 | 122.5 | 94.3 |
| Avg total latency | 3.84s | 3.35s | 15.01s |
| Avg retrieval time | 0.00s | 0.02s | 11.76s |
| LLM Judge pass rate | 100% | 100% | 100% |

Cross-Pipeline Comparisons

| Comparison | Token Impact | Latency Impact |
|---|---|---|
| GraphRAG vs Basic RAG | -23.3% (-180 tokens) | +11.66s |
| GraphRAG vs LLM-Only | +136.0% (+342 tokens) | +11.17s |
| Basic RAG vs LLM-Only | +207.6% (+522 tokens) | -0.49s |
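The comparison figures follow directly from the per-pipeline averages in the Detailed Metrics table. A minimal sketch of the derivation, with a hypothetical `compare` helper and dictionary layout (not taken from `scripts/generate_report.py`):

```python
# Averages copied from the Detailed Metrics table.
METRICS = {
    "LLM-Only": {"avg_tokens": 251.6, "avg_latency_s": 3.84},
    "Basic RAG": {"avg_tokens": 773.9, "avg_latency_s": 3.35},
    "GraphRAG": {"avg_tokens": 593.7, "avg_latency_s": 15.01},
}


def compare(candidate: str, baseline: str) -> dict:
    """Relative and absolute change of `candidate` against `baseline`."""
    cand, base = METRICS[candidate], METRICS[baseline]
    token_delta = cand["avg_tokens"] - base["avg_tokens"]
    return {
        "token_pct": round(100 * token_delta / base["avg_tokens"], 1),
        "token_delta": round(token_delta),
        "latency_delta_s": round(cand["avg_latency_s"] - base["avg_latency_s"], 2),
    }


print(compare("GraphRAG", "Basic RAG"))
# {'token_pct': -23.3, 'token_delta': -180, 'latency_delta_s': 11.66}
```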

Query Type Breakdown

| Query Type | Queries | LLM-Only | Basic RAG | GraphRAG |
|---|---|---|---|---|
| Summary | 10 | 296 tok, 4.25s | 829 tok, 3.60s | 578 tok, 12.42s |
| Relational | 10 | 262 tok, 3.79s | 763 tok, 3.32s | 651 tok, 12.76s |
| Lookup | 10 | 185 tok, 3.47s | 784 tok, 3.35s | 545 tok, 12.19s |

Metrics Explained

Tokens

Total LLM tokens consumed (prompt + completion). Measured via tiktoken with cl100k_base encoding. Lower is better for cost efficiency.
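A minimal sketch of this accounting, assuming the prompt and completion are counted separately and then summed. The helper names are hypothetical; `tiktoken.get_encoding("cl100k_base")` is the real tiktoken call. Note that cl100k_base is an OpenAI encoding rather than qwen2's own tokenizer, so the counts are a consistent yardstick across pipelines rather than the exact number of tokens the model processes.

```python
from typing import Callable, Dict, List


def cl100k_encoder() -> Callable[[str], List[int]]:
    """Return tiktoken's cl100k_base encoder (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the rest of the sketch has no dependency

    return tiktoken.get_encoding("cl100k_base").encode


def count_usage(prompt: str, completion: str, encode: Callable[[str], list]) -> Dict[str, int]:
    """Count prompt and completion tokens with the supplied encoder."""
    prompt_tokens = len(encode(prompt))
    completion_tokens = len(encode(completion))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


# Whitespace splitting stands in for the real encoder in this demo;
# pass cl100k_encoder() instead to get tiktoken counts.
print(count_usage("What treats asthma?", "Inhaled corticosteroids.", str.split))
```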

Latency

End-to-end wall-clock time in seconds. Includes retrieval, generation, and post-processing.
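Since the results report retrieval time separately from total latency, one plausible way to capture both is to take timestamps around each stage. This is a sketch under that assumption; `timed_run` and the `retrieve`/`generate` callables are hypothetical names, not the project's API.

```python
import time
from typing import Callable, Sequence


def timed_run(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str, Sequence[str]], str],
) -> dict:
    """Run one query and record retrieval and end-to-end wall-clock time."""
    start = time.perf_counter()
    context = retrieve(question)
    after_retrieval = time.perf_counter()
    answer = generate(question, context)
    end = time.perf_counter()
    return {
        "answer": answer,
        "retrieval_s": after_retrieval - start,
        "total_s": end - start,
    }


# Stub stages; a real pipeline would call FAISS or TigerGraph, then Ollama.
result = timed_run("q", lambda q: ["ctx"], lambda q, ctx: f"answer using {len(ctx)} chunk(s)")
print(result["answer"], f'{result["total_s"]:.4f}s')
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring short intervals.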

LLM-as-a-Judge

Automated quality evaluation using the same qwen2:0.5b model. The model is asked to grade each answer PASS/FAIL based on factual accuracy, completeness, and consistency with ground truth. No human annotation required.
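The grading step can be sketched as a prompt plus a strict verdict parser. The template wording and function names below are assumptions for illustration, not the contents of `evaluation/llm_judge.py`; the `llm` argument is any callable that sends a prompt to the model and returns its text.

```python
import re
from typing import Callable

# Hypothetical judge prompt; the project's actual wording may differ.
JUDGE_TEMPLATE = (
    "You are grading an answer to a medical question.\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Answer: {answer}\n"
    "Judge factual accuracy, completeness, and consistency with the ground truth.\n"
    "Reply with exactly one word: PASS or FAIL."
)


def parse_verdict(raw: str) -> str:
    """Return 'PASS' or 'FAIL'; output without a clear verdict counts as FAIL."""
    match = re.search(r"\b(PASS|FAIL)\b", raw.upper())
    return match.group(1) if match else "FAIL"


def judge(question: str, answer: str, ground_truth: str, llm: Callable[[str], str]) -> str:
    """Format the judge prompt, call the model, and parse its verdict."""
    prompt = JUDGE_TEMPLATE.format(question=question, ground_truth=ground_truth, answer=answer)
    return parse_verdict(llm(prompt))


# A stub model keeps the example runnable without Ollama.
print(judge("What treats asthma?", "Inhaled corticosteroids.", "Inhaled corticosteroids.",
            llm=lambda prompt: "Verdict: pass"))  # PASS
```

Defaulting unparseable output to FAIL is a conservative choice: a small model that rambles instead of answering cannot inflate the pass rate.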

BERTScore

Optional semantic similarity metric using microsoft/deberta-xlarge-mnli. Computes precision, recall, and F1 between predicted and ground truth answers at the token level. Useful for fine-grained quality assessment.
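The token-level matching behind BERTScore can be illustrated on toy vectors: each candidate token is matched to its most similar reference token (precision), and each reference token to its most similar candidate token (recall). This sketch assumes token embeddings are already available and omits the contextual encoder, IDF weighting, and baseline rescaling that the real metric supports.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_match_score(candidate: Sequence[Vector], reference: Sequence[Vector]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy cosine matching of token vectors."""
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two candidate tokens, one reference token: the extra candidate token
# lowers precision, while the single reference token is fully covered.
p, r, f1 = greedy_match_score(candidate=[[1.0, 0.0], [0.0, 1.0]], reference=[[1.0, 0.0]])
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.50 R=1.00 F1=0.67
```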


Project Structure

graphrag-benchmark/
├── config/
│   └── schema.gsql              # TigerGraph graph schema
├── dashboard/
│   └── app.py                   # Streamlit web dashboard
├── data/pubmed/                 # Generated dataset (gitignored)
├── evaluation/
│   ├── llm_judge.py             # LLM-as-a-Judge evaluator
│   └── bertscore_eval.py        # BERTScore evaluator
├── graphrag/                    # TigerGraph GraphRAG Docker config
│   ├── configs/server_config.json
│   └── docker-compose.yml
├── outputs/                     # Benchmark results (gitignored except samples)
├── pipelines/
│   ├── llm_only.py              # Pipeline 1: Direct LLM
│   ├── basic_rag.py             # Pipeline 2: FAISS + LLM
│   ├── graphrag_pipeline.py     # Pipeline 3: TigerGraph GraphRAG
│   └── graphrag_local.py        # Local NetworkX GraphRAG (alternative)
├── reports/
│   └── benchmark_report.md      # Full benchmark report
├── scripts/
│   ├── build_dataset.py         # PubMed dataset generator
│   ├── build_faiss_index.py     # FAISS index builder
│   ├── run_benchmark.py         # Full 35-query benchmark
│   ├── demo_query.py            # Quick demo script
│   ├── generate_report.py       # Report generator
│   └── run_bertscore.py         # BERTScore evaluation
├── .env.example                 # Environment template
├── setup_and_run.py             # One-command setup
├── requirements.txt
├── pyproject.toml
└── README.md

Demo

Dashboard Screenshots

Screenshots to be added — see the screenshots/ directory.

  • Live Query Mode: Three side-by-side pipeline results with metrics
  • Saved Results Mode: Full benchmark summary with expandable query details
  • Metrics Comparison: Plotly bar charts for tokens, latency, and accuracy

Demo Video Script

See demo_video_script.md for a 5-7 minute walkthrough covering:

  • Problem statement and motivation
  • Three-pipeline architecture
  • Live dashboard demonstration
  • Evaluation methodology
  • Key results and findings

Future Improvements

  • Multi-hop reasoning queries — Test GraphRAG on deeper (3+ hop) queries
  • Real PubMed dataset — Replace synthetic data with actual PubMed abstracts
  • GPU acceleration — Speed up GraphRAG ECC with GPU-enabled Ollama
  • Additional LLMs — Compare across different models (Llama 3, Mistral, Gemma)
  • Cost modeling — Dollar-based cost comparison at scale
  • Query caching — Cache embeddings and search results for repeated queries
  • Hybrid retrieval — Combine vector + graph retrieval for optimal results
  • CI/CD pipeline — Automated benchmark runs on schedule

License

Distributed under the MIT License. See LICENSE for more information.


Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.


Acknowledgments
