Three-Pipeline RAG Benchmark on Biomedical Literature
Retrieval-Augmented Generation (RAG) improves LLM answer quality by supplying external context, but standard vector-based RAG can be token-inefficient: it retrieves entire document chunks even when only a few sentences are relevant, which wastes tokens and inflates cost.
Can knowledge graphs (GraphRAG) reduce token consumption while maintaining answer quality?
This project answers that question with a controlled 3-pipeline benchmark on 500 synthetic PubMed-style medical abstracts.
| Pipeline | Avg Tokens | Avg Latency | Judge Pass Rate | Token vs Basic RAG |
|---|---|---|---|---|
| LLM-Only | 252 | 3.84s | 100% | -67.5% |
| Basic RAG (FAISS) | 774 | 3.35s | 100% | baseline |
| GraphRAG (TigerGraph) | 594 | 15.01s | 100% | -23.3% |
GraphRAG uses 180 fewer tokens per query than Basic RAG (23.3%) while holding a 100% LLM-judge pass rate across all 35 benchmark queries, at the cost of higher average latency (15.01s vs 3.35s).
If these averages hold at scale, that is roughly 180 million tokens saved per 1M queries.
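The scale projection is plain arithmetic over the table averages. A minimal sketch, assuming the 35-query averages carry over to larger workloads; `projected_token_savings` is a hypothetical helper name, not part of the repository:

```python
def projected_token_savings(avg_baseline: float, avg_candidate: float, num_queries: int) -> float:
    """Tokens saved over num_queries, assuming the per-query averages hold."""
    return (avg_baseline - avg_candidate) * num_queries


# Rounded per-query averages taken from the results table.
BASIC_RAG_AVG_TOKENS = 774
GRAPHRAG_AVG_TOKENS = 594

saved = projected_token_savings(BASIC_RAG_AVG_TOKENS, GRAPHRAG_AVG_TOKENS, 1_000_000)
print(f"{saved:,.0f} tokens saved per 1M queries")
```

Multiplying the result by a per-token price would turn this into a dollar estimate; no price is assumed here.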
    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                             Streamlit Dashboard                             │
    │                     (Live Query + Saved Results Viewer)                     │
    └──────┬──────────────────────────┬──────────────────────────┬────────────────┘
           │                          │                          │
           ▼                          ▼                          ▼
    ┌──────────────┐     ┌───────────────────────┐     ┌──────────────────────────┐
    │   LLM-Only   │     │       Basic RAG       │     │   TigerGraph GraphRAG    │
    │              │     │                       │     │                          │
    │  ┌────────┐  │     │  ┌────────┐  ┌─────┐  │     │  ┌────────┐  ┌────────┐  │
    │  │ Ollama │  │     │  │nomic-  │  │FAISS│  │     │  │nomic-  │  │GraphRAG│  │
    │  │ qwen2  │  │     │  │embed-  │→ │Index│  │     │  │embed-  │→ │Search  │  │
    │  │ 0.5b   │  │     │  │text    │  │     │  │     │  │text    │  │API     │  │
    │  └────────┘  │     │  └────────┘  └─────┘  │     │  └────────┘  └───┬────┘  │
    │       │      │     │       │         │     │     │       │          │       │
    └───────┼──────┘     └───────┼─────────┼─────┘     └───────┼──────────┼───────┘
            │                    │         │                   │          │
            ▼                    ▼         ▼                   ▼          ▼
    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                       Ollama (local, zero API costs)                        │
    │                        qwen2:0.5b + nomic-embed-text                        │
    └─────────────────────────────────────────────────────────────────────────────┘
| Pipeline | Retrieval | Context | Generation |
|---|---|---|---|
| LLM-Only | None | Model knowledge only | Direct prompt to Ollama |
| Basic RAG | FAISS vector similarity | Top-5 full document chunks | Prompt + context → Ollama |
| GraphRAG | TigerGraph graph traversal + entity disambiguation | Graph-connected entities only | Prompt + graph context → Ollama |
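The Basic RAG row can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's code: FAISS `IndexFlatL2` performs exact (brute-force) L2 search, so a pure-Python squared-L2 ranking stands in for it here. The function names, the prompt template, and the toy 2-dimensional vectors and sentences are all made up for the example (the real pipeline uses 768-dimensional nomic-embed-text embeddings).

```python
from typing import List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, the quantity an exact L2 index ranks by."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_chunks(query_vec: Sequence[float],
                 chunk_vecs: Sequence[Sequence[float]],
                 chunks: Sequence[str],
                 k: int = 5) -> List[Tuple[float, str]]:
    """Return the k chunks closest to the query, nearest first."""
    scored = sorted((l2_squared(query_vec, v), c) for v, c in zip(chunk_vecs, chunks))
    return scored[:k]


def build_prompt(question: str, context_chunks: Sequence[str]) -> str:
    """Assemble a context-plus-question prompt (hypothetical template)."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy data for illustration only.
chunks = [
    "Metformin lowers blood glucose.",
    "Statins reduce LDL cholesterol.",
    "Insulin regulates glucose uptake.",
]
chunk_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
query_vec = [1.0, 0.0]

hits = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
prompt = build_prompt("How is blood glucose controlled?", [text for _, text in hits])
print(prompt)
```

The GraphRAG pipeline differs only in where the context comes from: per the table, it passes graph-connected entities instead of full chunks, which is the source of the smaller prompt.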
- Three complete pipelines — LLM-Only baseline, Basic RAG (FAISS), and TigerGraph GraphRAG
- Live Streamlit dashboard — Compare all 3 pipelines on any question side-by-side
- Full benchmark suite — 35 queries across 4 question types (summary, relational, lookup, aggregation)
- LLM-as-a-Judge evaluation — Automated PASS/FAIL grading using the same Ollama model
- BERTScore evaluation — Semantic similarity scoring (configurable, optional)
- Comprehensive metrics — Token usage, latency, retrieval time, and answer quality
- Zero API costs — All LLM calls and embeddings run locally via Ollama
- 500 synthetic PubMed abstracts — 15 disease areas, multi-disease relational queries
- Interactive charts — Plotly bar charts for tokens, latency, and accuracy comparison
- Saved results viewer — Load and explore previous benchmark runs
| Component | Technology |
|---|---|
| GraphRAG Backend | Official TigerGraph GraphRAG Docker containers (ECC + Search API) |
| Graph Database | TigerGraph Cloud (free tier) |
| LLM Inference | Ollama + qwen2:0.5b |
| Embeddings | Ollama + nomic-embed-text (768d) |
| Vector Store | FAISS (IndexFlatL2) |
| Dashboard | Streamlit + Plotly |
| Evaluation | LLM-as-a-Judge (qwen2:0.5b) + BERTScore (optional) |
| Data | Synthetic PubMed dataset generator |
| Packaging | Python 3.10+, pip |
- Ollama installed and running
- Python 3.10+
- Docker (for TigerGraph GraphRAG containers)
- TigerGraph Cloud account (free tier)
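Clone the repository and install dependencies:

    git clone <your-repo-url>
    cd graphrag-benchmark
    pip install -r requirements.txt

Pull the Ollama models:

    ollama pull qwen2:0.5b
    ollama pull nomic-embed-text

Create the environment file, then edit .env with your TigerGraph Cloud credentials:

    cp .env.example .env

Start the GraphRAG containers:

    cd graphrag
    docker compose up -d graphrag graphrag-ecc

Build the dataset and the FAISS index (run from the repository root):

    python scripts/build_dataset.py
    python scripts/build_faiss_index.py

Run the one-command setup script:

    python setup_and_run.py

Run the full benchmark:

    python scripts/run_benchmark.py

Launch the dashboard:

    streamlit run dashboard/app.py

Open http://localhost:8501 and: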
- Type a medical question
- Click "Run All Pipelines"
- Compare LLM-Only, Basic RAG, and GraphRAG results side-by-side
Run a quick demo query:

    python scripts/demo_query.py

Generate the benchmark report:

    python scripts/generate_report.py

| Metric | LLM-Only | Basic RAG | GraphRAG |
|---|---|---|---|
| Total tokens | 8,806 | 27,085 | 20,779 |
| Avg tokens per query | 251.6 | 773.9 | 593.7 |
| Min tokens | 64 | 634 | 501 |
| Max tokens | 706 | 1,135 | 791 |
| Avg prompt tokens | 15.2 | 651.4 | 499.4 |
| Avg completion tokens | 236.4 | 122.5 | 94.3 |
| Avg total latency | 3.84s | 3.35s | 15.01s |
| Avg retrieval time | 0.00s | 0.02s | 11.76s |
| LLM Judge pass rate | 100% | 100% | 100% |
| Comparison | Token Impact | Latency Impact |
|---|---|---|
| GraphRAG vs Basic RAG | -23.3% (-180 tokens) | +11.66s |
| GraphRAG vs LLM-Only | +136.0% (+342 tokens) | +11.17s |
| Basic RAG vs LLM-Only | +207.6% (+522 tokens) | -0.49s |
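The token column of the comparison table follows from the per-query averages. A short sketch that should reproduce those figures; `relative_change` is a hypothetical helper name, and the averages are copied from the metrics table:

```python
def relative_change(candidate: float, baseline: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# Average tokens per query, from the metrics table.
avg_tokens = {"LLM-Only": 251.6, "Basic RAG": 773.9, "GraphRAG": 593.7}

comparisons = [
    ("GraphRAG", "Basic RAG"),
    ("GraphRAG", "LLM-Only"),
    ("Basic RAG", "LLM-Only"),
]

for candidate, baseline in comparisons:
    pct = relative_change(avg_tokens[candidate], avg_tokens[baseline])
    diff = avg_tokens[candidate] - avg_tokens[baseline]
    print(f"{candidate} vs {baseline}: {pct:+.1f}% ({diff:+.0f} tokens)")
```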
| Query Type | Queries | LLM-Only | Basic RAG | GraphRAG |
|---|---|---|---|---|
| Summary | 10 | 296 tok, 4.25s | 829 tok, 3.60s | 578 tok, 12.42s |
| Relational | 10 | 262 tok, 3.79s | 763 tok, 3.32s | 651 tok, 12.76s |
| Lookup | 10 | 185 tok, 3.47s | 784 tok, 3.35s | 545 tok, 12.19s |
Token usage: total LLM tokens consumed per query (prompt + completion), measured with tiktoken using the cl100k_base encoding. Lower is better for cost efficiency.
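Token accounting of this kind can be sketched as follows. The `TokenUsage` class and `measure_usage` function are hypothetical names, and the encoder is passed in as a parameter so the example runs without extra dependencies. With tiktoken installed, `tiktoken.get_encoding("cl100k_base").encode` would be the encoder matching the description above. Since cl100k_base is an OpenAI encoding rather than the Qwen2 tokenizer, counts obtained this way are best read as a consistent yardstick across pipelines.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def measure_usage(prompt: str, completion: str,
                  encode: Callable[[str], Sequence]) -> TokenUsage:
    """Count prompt and completion tokens with the supplied encoder."""
    return TokenUsage(len(encode(prompt)), len(encode(completion)))


# With tiktoken installed, the encoder would be obtained like this:
#   import tiktoken
#   encode = tiktoken.get_encoding("cl100k_base").encode
# A whitespace splitter stands in below; it is NOT a real tokenizer.
usage = measure_usage(
    "What treats type 2 diabetes?",
    "Metformin is a common first-line therapy.",
    str.split,
)
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```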
Latency: end-to-end wall-clock time per query, in seconds, covering retrieval, generation, and post-processing.
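One way to capture both total latency and retrieval time is to wrap each stage in a monotonic timer. This is a sketch under assumptions: `timed` and `run_with_timings` are hypothetical helpers, and the `retrieve` and `generate` callables stand in for whatever a pipeline actually does at each stage.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_with_timings(question: str,
                     retrieve: Callable[[str], Any],
                     generate: Callable[[str, Any], str]) -> Dict[str, Any]:
    """Run retrieval then generation, recording per-stage and total time."""
    start = time.perf_counter()
    context, retrieval_s = timed(retrieve, question)
    answer, generation_s = timed(generate, question, context)
    total_s = time.perf_counter() - start  # includes any overhead between stages
    return {
        "answer": answer,
        "retrieval_s": retrieval_s,
        "generation_s": generation_s,
        "total_s": total_s,
    }


# Stub stages for illustration; an LLM-Only pipeline would return no context.
result = run_with_timings(
    "example question",
    retrieve=lambda q: ["stub context"],
    generate=lambda q, ctx: f"stub answer using {len(ctx)} chunk(s)",
)
print(result)
```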
LLM-as-a-Judge: automated quality evaluation using the same qwen2:0.5b model that generates the answers. The judge grades each answer PASS or FAIL on factual accuracy, completeness, and consistency with the ground truth, so no human annotation is required.
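A PASS/FAIL judge of this kind reduces to a grading prompt plus a verdict parser. The sketch below is illustrative only: the prompt wording, the function names, and the rule that ambiguous replies count as FAIL are assumptions, not the contents of `evaluation/llm_judge.py`. The model call is injected as a plain callable so the example runs without Ollama.

```python
import re
from typing import Callable, Iterable

# Hypothetical grading prompt; the repository's actual wording is not shown here.
JUDGE_TEMPLATE = (
    "You are grading an answer to a biomedical question.\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Candidate answer: {answer}\n"
    "Reply with exactly one word: PASS or FAIL."
)


def parse_verdict(raw: str) -> str:
    """Map a free-form reply to PASS or FAIL; ambiguous replies count as FAIL."""
    labels = set(re.findall(r"\b(PASS|FAIL)\b", raw.upper()))
    return "PASS" if labels == {"PASS"} else "FAIL"


def judge(question: str, ground_truth: str, answer: str,
          llm: Callable[[str], str]) -> str:
    """Grade one answer by sending the filled-in prompt to the supplied model."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, ground_truth=ground_truth, answer=answer
    )
    return parse_verdict(llm(prompt))


def pass_rate(verdicts: Iterable[str]) -> float:
    """Percentage of PASS verdicts."""
    items = list(verdicts)
    return 100.0 * sum(v == "PASS" for v in items) / len(items)


# Stub model for illustration; a real run would call the local Ollama model.
verdict = judge("example question", "example truth", "example answer",
                llm=lambda prompt: " pass \n")
print(verdict, pass_rate([verdict, "FAIL"]))
```

Treating unparseable output as FAIL is the conservative choice here, since a small model may not always reply with a single word.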
BERTScore (optional): semantic similarity metric computed with microsoft/deberta-xlarge-mnli. It reports precision, recall, and F1 between the predicted and ground-truth answers, based on token-level embedding matches. Useful for fine-grained quality assessment.
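The precision, recall, and F1 values come from greedy matching between token embeddings: each candidate token is paired with its most similar reference token (precision), and each reference token with its most similar candidate token (recall). The toy sketch below shows only that matching step on hand-written 2-dimensional vectors. Real BERTScore uses contextual embeddings from the model and supports options such as idf weighting and baseline rescaling, none of which are modeled here; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_match_score(candidate: List[Vector],
                       reference: List[Vector]) -> Tuple[float, float, float]:
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings: the candidate has one matching token and one unrelated token.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0]]

p, r, f1 = greedy_match_score(candidate_tokens, reference_tokens)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case the extra, unrelated candidate token lowers precision while recall stays perfect, which is the behavior that makes the metric useful for spotting padded answers.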
graphrag-benchmark/
├── config/
│ └── schema.gsql # TigerGraph graph schema
├── dashboard/
│ └── app.py # Streamlit web dashboard
├── data/pubmed/ # Generated dataset (gitignored)
├── evaluation/
│ ├── llm_judge.py # LLM-as-a-Judge evaluator
│ └── bertscore_eval.py # BERTScore evaluator
├── graphrag/ # TigerGraph GraphRAG Docker config
│ ├── configs/server_config.json
│ └── docker-compose.yml
├── outputs/ # Benchmark results (gitignored except samples)
├── pipelines/
│ ├── llm_only.py # Pipeline 1: Direct LLM
│ ├── basic_rag.py # Pipeline 2: FAISS + LLM
│ ├── graphrag_pipeline.py # Pipeline 3: TigerGraph GraphRAG
│ └── graphrag_local.py # Local NetworkX GraphRAG (alternative)
├── reports/
│ └── benchmark_report.md # Full benchmark report
├── scripts/
│ ├── build_dataset.py # PubMed dataset generator
│ ├── build_faiss_index.py # FAISS index builder
│ ├── run_benchmark.py # Full 35-query benchmark
│ ├── demo_query.py # Quick demo script
│ ├── generate_report.py # Report generator
│ └── run_bertscore.py # BERTScore evaluation
├── .env.example # Environment template
├── setup_and_run.py # One-command setup
├── requirements.txt
├── pyproject.toml
└── README.md
Screenshots to be added — see the screenshots/ directory.
- Live Query Mode: Three side-by-side pipeline results with metrics
- Saved Results Mode: Full benchmark summary with expandable query details
- Metrics Comparison: Plotly bar charts for tokens, latency, and accuracy
See demo_video_script.md for a 5-7 minute walkthrough covering:
- Problem statement and motivation
- Three-pipeline architecture
- Live dashboard demonstration
- Evaluation methodology
- Key results and findings
- Multi-hop reasoning queries — Test GraphRAG on deeper (3+ hop) queries
- Real PubMed dataset — Replace synthetic data with actual PubMed abstracts
- GPU acceleration — Speed up GraphRAG ECC with GPU-enabled Ollama
- Additional LLMs — Compare across different models (Llama 3, Mistral, Gemma)
- Cost modeling — Dollar-based cost comparison at scale
- Query caching — Cache embeddings and search results for repeated queries
- Hybrid retrieval — Combine vector + graph retrieval for optimal results
- CI/CD pipeline — Automated benchmark runs on schedule
Distributed under the MIT License. See LICENSE for more information.
Contributions are welcome! See CONTRIBUTING.md for guidelines.
- Built for the TigerGraph GraphRAG Hackathon
- Powered by TigerGraph Cloud free tier
- Ollama for local LLM inference
- FAISS for vector search