Samruddhi-jadh/ProbeGuard

ProbeGuard

Adversarial pre-deployment evaluation and root cause attribution for RAG pipelines.

RAGAS tells you your pipeline failed. ProbeGuard finds the failures before your users do. RAGSurgeon tells you which component caused each failure and what to fix.



The problem

RAG pipelines in production fail at their edges — not their center. Standard evaluation frameworks test pipelines on natural, well-formed queries and report average-case metrics. They have no mechanism to reach the edge cases that real users find after deployment.

Four documented failure modes are systematically missed:

  • Chunk boundary fracture — the answer exists in the corpus but spans two chunks; the retriever cannot surface it
  • Retrieval gap — the retriever degrades silently when queries are rephrased with synonyms or jargon
  • Corpus contradiction — conflicting documents corrupt the generated answer without triggering any quality alert
  • Prompt hijack — adversarial instructions embedded in retrieved content redirect the LLM

When evaluation does surface a failure, existing tools report a score and stop. They do not attribute the failure to a specific pipeline component, and they do not prescribe a fix. Engineering teams are left with a number and no diagnosis.

ProbeGuard closes both blind spots before a single user sees the pipeline.


Key results

Evaluated on HotpotQA with a production-grade RAG pipeline (BGE-base-en-v1.5 + cross-encoder re-ranker + Llama 3.3 70B):

| Metric | Value |
| --- | --- |
| FCS (Failure Coverage Score) | 0.806 |
| Adversarial failure rate | 52.0% |
| Retrieval gap rate | 40.0% ±26.0 (95% CI) |
| Corpus contradiction rate | 20.0% ±22.7 (95% CI) |
| Standard eval failure rate | 90.0% ±19.3 (95% CI) |

FCS = 0.806 means that standard evaluation misses 80.6% of the structural failure boundary that ProbeGuard finds. A pipeline can pass standard evaluation with high faithfulness scores while roughly 80% of its structural vulnerabilities remain undetected.
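The ± values in the table are consistent with Wilson score half-widths computed on 10 queries per probe. That sample size is inferred from the reported numbers and is not stated in this README, so treat it as an assumption. A minimal sketch, with `wilson_interval` as a hypothetical helper name:

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = failures / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# n = 10 is an assumption inferred from the reported half-widths.
for name, failures in [("retrieval gap", 4), ("corpus contradiction", 2), ("standard eval", 9)]:
    low, high = wilson_interval(failures, 10)
    print(f"{name}: rate {failures / 10:.1%}, half-width {(high - low) / 2:.3f}")
```

Under that assumption the half-widths come out at roughly 0.260, 0.227 and 0.193, which matches the table. Note that a Wilson interval is centered on a value pulled toward 0.5 rather than on the observed rate, so the "rate ± half-width" notation is a simplification.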


The Failure Coverage Score

FCS is a novel metric introduced by this work. It quantifies what fraction of a RAG pipeline's true failure boundary is left unexplored by standard evaluation.

FCS = adversarial_unique_rate / adversarial_union_rate

where, for each probe p:
  probe_rate_p            = failure rate reported by probe p
  unique_rate_p           = probe_rate_p * (1 - overlap_weight_p)
  adversarial_unique_rate = 1 - prod_p(1 - unique_rate_p)
  adversarial_union_rate  = 1 - prod_p(1 - probe_rate_p)

Each probe carries an overlap weight — the fraction of its failures that standard answer-checking could theoretically detect:

| Probe | Overlap weight | Rationale |
| --- | --- | --- |
| Chunk fracture | 0.15 | Cross-chunk failures are structurally invisible to answer-checking |
| Retrieval gap | 0.15 | Silent retriever misses never surface in generated answers |
| Corpus contradiction | 0.40 | Incoherent answers are sometimes caught by faithfulness metrics |
| Prompt hijack | 0.40 | Off-topic answers are sometimes flagged by answer relevance metrics |

FCS = 0.0 means standard evaluation found everything (impossible in practice). FCS = 0.67 means it missed 67% of the structural failure boundary. FCS = 1.0 means standard evaluation found nothing that ProbeGuard finds.
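As a worked example, the formula can be applied to the rates reported under Key results. The sketch below assumes the chunk fracture and prompt hijack rates were 0 in that run. This README does not state those two rates, but the assumption reproduces the reported 52.0% adversarial failure rate. The function name `failure_coverage_score` is hypothetical and is not taken from `fcs_metric.py`.

```python
from math import prod

# Overlap weights copied from the table above.
OVERLAP_WEIGHTS = {
    "chunk_fracture": 0.15,
    "retrieval_gap": 0.15,
    "corpus_contradiction": 0.40,
    "prompt_hijack": 0.40,
}


def failure_coverage_score(probe_rates: dict[str, float]) -> float:
    """FCS = adversarial_unique_rate / adversarial_union_rate."""
    unique_rates = [rate * (1 - OVERLAP_WEIGHTS[name]) for name, rate in probe_rates.items()]
    unique = 1 - prod(1 - u for u in unique_rates)
    union = 1 - prod(1 - rate for rate in probe_rates.values())
    if union == 0:
        # No probe flagged anything, so the ratio is undefined.
        # Returning 0.0 here is a choice made for this sketch only.
        return 0.0
    return unique / union


# Chunk fracture and prompt hijack rates of 0.0 are assumptions (see lead-in).
rates = {
    "chunk_fracture": 0.0,
    "retrieval_gap": 0.40,
    "corpus_contradiction": 0.20,
    "prompt_hijack": 0.0,
}
print(round(failure_coverage_score(rates), 3))  # 0.806
```

Step by step: the union rate is 1 - (0.60 x 0.80) = 0.52, the unique rate is 1 - (0.66 x 0.88) = 0.4192, and 0.4192 / 0.52 ≈ 0.806.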


Four adversarial probes

Chunk fracture

HotpotQA bridge questions require reasoning across two Wikipedia articles. If the retriever cannot connect facts across a chunk boundary, the answer is unreachable regardless of LLM capability. The probe compares retrieval under chunked indexing (150 tokens) versus full-document indexing and flags fractures when the chunked retriever drops a gold document that full-document retrieval surfaces.
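The fracture rule described above can be sketched as a set comparison. This is an illustration rather than the code in `probes/chunk_fracture.py`: it assumes chunk hits have already been mapped back to their parent document ids, and the names `is_fractured` and `cases` are hypothetical.

```python
def is_fractured(gold_ids: set[str], chunked_hits: list[str], full_doc_hits: list[str]) -> bool:
    """True when full-document retrieval surfaces a gold document that chunked retrieval drops."""
    found_full = gold_ids & set(full_doc_hits)
    found_chunked = gold_ids & set(chunked_hits)
    return bool(found_full - found_chunked)


# Toy cases: (gold ids, docs surfaced via the chunk index, docs surfaced via the full-document index)
cases = [
    ({"A", "B"}, ["A", "C"], ["A", "B"]),  # B is lost under chunking: fracture
    ({"A", "B"}, ["A", "B"], ["A", "B"]),  # both gold documents kept: no fracture
    ({"A", "B"}, ["C", "D"], ["C", "D"]),  # missed either way: a retrieval miss, not a fracture
]
fracture_rate = sum(is_fractured(*case) for case in cases) / len(cases)
print(f"fracture rate: {fracture_rate:.2f}")
```

The third case shows why the comparison is differential: a document that neither index surfaces points at the retriever or the corpus, not at the chunk boundary.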

Retrieval gap

Real users phrase the same information need in many ways. The probe generates lexical paraphrases of each query (declarative form, keyword extraction, question-word variants) and measures the rank delta across variants. A gap is flagged when the mean rank delta exceeds a threshold, indicating that the retriever is fragile to phrasing variation.
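One way to read the rank-delta rule is sketched below. The threshold value, the treatment of a missed document as "one past the end of the list", and all function names are assumptions made for illustration. The real values live in `config.py` and are not given in this README.

```python
from statistics import mean


def gold_rank(gold_id: str, ranking: list[str]) -> int:
    """1-based rank of the gold document; a miss counts as one past the end of the list."""
    return ranking.index(gold_id) + 1 if gold_id in ranking else len(ranking) + 1


def mean_rank_delta(gold_id: str, original_ranking: list[str], variant_rankings: list[list[str]]) -> float:
    """Average drop in rank (positive = worse) when the query is paraphrased."""
    base = gold_rank(gold_id, original_ranking)
    return mean(gold_rank(gold_id, ranking) - base for ranking in variant_rankings)


def has_retrieval_gap(gold_id: str, original_ranking: list[str],
                      variant_rankings: list[list[str]], threshold: float = 1.5) -> bool:
    # threshold = 1.5 is a placeholder, not ProbeGuard's configured value.
    return mean_rank_delta(gold_id, original_ranking, variant_rankings) > threshold


original = ["d1", "d2", "d3"]          # gold document d1 at rank 1
variants = [["d2", "d1", "d3"],        # paraphrase 1: d1 drops to rank 2
            ["d3", "d2", "d4"]]        # paraphrase 2: d1 is missed entirely
print(mean_rank_delta("d1", original, variants), has_retrieval_gap("d1", original, variants))
```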

Corpus contradiction

Su et al. (2024) showed adversarially crafted documents can outrank legitimate ones in retrieval. The probe injects a programmatically generated contradiction document and runs NLI-based detection (DeBERTa-v3) to measure how much the contradiction shifts the pipeline's output.
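A rough sketch of the scoring step follows. It assumes vulnerability is measured by asking an NLI scorer whether the answer produced after injection contradicts the clean answer. The scorer is passed in as a callable so the example runs without model downloads. `toy_scorer` is a stand-in, and a real run would back it with the NLI model listed in the technology stack. All names and the 0.5 threshold are hypothetical.

```python
from typing import Callable, Sequence

# Returns an estimated probability that the hypothesis contradicts the premise.
NliScorer = Callable[[str, str], float]


def contradiction_vulnerability_rate(clean_answers: Sequence[str],
                                     poisoned_answers: Sequence[str],
                                     contradiction_prob: NliScorer,
                                     threshold: float = 0.5) -> float:
    """Fraction of queries whose answer flips to a contradicting one after injection."""
    if len(clean_answers) != len(poisoned_answers) or not clean_answers:
        raise ValueError("need two non-empty answer lists of equal length")
    flipped = sum(contradiction_prob(clean, poisoned) > threshold
                  for clean, poisoned in zip(clean_answers, poisoned_answers))
    return flipped / len(clean_answers)


def toy_scorer(premise: str, hypothesis: str) -> float:
    # Stand-in for an NLI model: treats any change in wording as a contradiction.
    return 0.0 if premise.strip().lower() == hypothesis.strip().lower() else 1.0


clean = ["Paris", "1969", "Marie Curie", "Everest", "Blue"]
poisoned = ["Paris", "1969", "Marie Curie", "Everest", "Red"]
print(contradiction_vulnerability_rate(clean, poisoned, toy_scorer))
```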

Prompt hijack

The probe injects instruction-bearing documents into the retrieved set and measures semantic deviation between clean and hijacked responses. A hijack is flagged when mean semantic deviation exceeds baseline variance.
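The deviation check can be sketched with plain cosine distance between response embeddings. Two assumptions are made here: semantic deviation is taken to be 1 minus cosine similarity, and the baseline is taken to be the deviation observed between repeated clean runs. This README does not define either precisely, and the function names are hypothetical.

```python
import math
from statistics import mean


def cosine_deviation(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return 1 - dot / norm


def is_hijacked(clean_vecs: list[list[float]], hijacked_vecs: list[list[float]],
                baseline_deviation: float) -> bool:
    """Flag a hijack when the mean clean-vs-hijacked deviation exceeds the baseline."""
    deviations = [cosine_deviation(c, h) for c, h in zip(clean_vecs, hijacked_vecs)]
    return mean(deviations) > baseline_deviation


# Toy 2-d embeddings; real ones would come from the embedding model.
clean_vecs = [[1.0, 0.0], [0.0, 1.0]]
hijacked_vecs = [[0.0, 1.0], [0.0, 1.0]]  # first response moved, second unchanged
print(is_hijacked(clean_vecs, hijacked_vecs, baseline_deviation=0.1))
```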


Architecture

Input query
    |
    v
+---------------------------------------------+
|          RAG pipeline under test            |
|                                             |
|  BGE embed -> ChromaDB top-8                |
|           -> Cross-encoder re-rank -> top-4 |
|           -> Llama 3.3 70B (multi-hop)      |
|           -> Answer                         |
+---------------------------------------------+
    |
    v
+---------------------------------------------+
|        ProbeGuard adversarial layer         |
|                                             |
|  ChunkFractureProbe  -> fracture_rate (CF%) |
|  RetrievalGapProbe   -> gap_rate      (RG%) |
|  ContradictionProbe  -> vuln_rate     (CC%) |
|  PromptHijackProbe   -> hijack_rate   (PH%) |
|                                             |
|  FCSCalculator       -> FCS score           |
|                         + interpretation    |
|                         + recommendation    |
+---------------------------------------------+
    |
    v
+---------------------------------------------+
|        RAGSurgeon attribution layer         |
|                                             |
|  Differential component isolation:          |
|  failure_cause in {retriever, chunker,      |
|                    corpus, prompt}          |
|  + structured fix recommendation per cause  |
+---------------------------------------------+
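The retrieve-then-re-rank stage in the first box (top-8 candidates narrowed to top-4) can be sketched with injected callables. This is a shape illustration only: `toy_retrieve` and `toy_score` are stand-ins for the vector store and the cross-encoder, and none of these names come from `base_pipeline.py`.

```python
from typing import Callable


def retrieve_and_rerank(query: str,
                        retrieve: Callable[[str, int], list[str]],
                        rerank_score: Callable[[str, str], float],
                        k_retrieve: int = 8,
                        k_final: int = 4) -> list[str]:
    """Fetch k_retrieve candidates, re-score each against the query, keep the best k_final."""
    candidates = retrieve(query, k_retrieve)
    ranked = sorted(candidates, key=lambda passage: rerank_score(query, passage), reverse=True)
    return ranked[:k_final]


CORPUS = [
    "paris is the capital of france", "berlin is in germany", "france borders spain",
    "the capital of germany is berlin", "cats sleep a lot", "spain is sunny",
    "capital letters start sentences", "france has a capital", "rome is old", "tea is warm",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    # Stand-in for a vector search: returns the first k passages.
    return CORPUS[:k]


def toy_score(query: str, passage: str) -> float:
    # Stand-in for a cross-encoder: counts shared words.
    return float(len(set(query.split()) & set(passage.split())))


context = retrieve_and_rerank("capital of france", toy_retrieve, toy_score)
print(context)
```

The four surviving passages would then be placed in the LLM prompt. Keeping the retriever and scorer as parameters is also what makes it possible to swap one component at a time when isolating a failure.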

How it relates to existing work

RAGAS (Es et al., 2023) established reference-free RAG evaluation via faithfulness, answer relevance, and context relevance. Its fundamental limitation: it evaluates reactively on the queries you provide. It has no mechanism to find failures you did not think to test, and no concept of chunk boundary failure or corpus contamination.

HotpotQA (Yang et al., 2018) demonstrated that multi-hop QA is where retrieval systems break most severely. Every bridge question in HotpotQA systematically requires the cross-document connections that chunking disrupts — making it the natural stress test for chunk fracture. ProbeGuard operationalises this observation as a quantifiable probe.

RAG Adversarial Poisoning (Su et al., 2024) showed adversarial documents can outrank legitimate ones in retrieval, and that mixing adversarial and guiding contexts creates non-linear corruption. ProbeGuard implements their taxonomy as a systematic probe and extends it with component-level attribution.

ProbeGuard is the first framework that converts all three insights into a systematic pre-deployment evaluation tool with component-level attribution and a reproducible coverage metric.


Comparison with existing tools

| Capability | RAGAS | TruLens | RAGXplain | ProbeGuard |
| --- | --- | --- | --- | --- |
| Pre-deployment adversarial testing | No | No | No | Yes |
| Chunk boundary failure detection | No | No | No | Yes |
| Novel failure coverage metric (FCS) | No | No | No | Yes |
| Component-level root cause attribution | No | Partial | Partial | Yes |
| Structured fix recommendations | No | No | No | Yes |
| Statistical confidence intervals | No | No | No | Yes |
| Reference-free evaluation | Yes | Yes | Yes | Yes |

Technology stack

| Component | Technology |
| --- | --- |
| Vector store | ChromaDB (persistent, cosine similarity) |
| Embedding model | BAAI/bge-base-en-v1.5 (768-dim, asymmetric encoding) |
| Re-ranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| LLM | Llama 3.3 70B via Groq API |
| NLI model | cross-encoder/nli-deberta-v3-base |
| Standard eval | ROUGE-L with Wilson 95% confidence intervals |
| Reproducibility | Fixed seed 42, deterministic temperature 0.0 |
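For readers unfamiliar with the standard-evaluation metric, ROUGE-L scores a generated answer by the longest common subsequence (LCS) it shares with the reference. The sketch below is a simplified version that assumes whitespace tokenization, lowercasing, and a plain F1 combination of precision and recall. ProbeGuard's actual scoring settings are not given in this README.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens (a simplification of full ROUGE tooling)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat", "the dog sat"))
```

In the example the LCS is "the ... sat" (length 2), so precision and recall are both 2/3 and the F1 score is 2/3.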

Benchmark datasets

| Dataset | Type | Failure modes targeted |
| --- | --- | --- |
| HotpotQA | Multi-hop QA (113k pairs) | Chunk fracture: bridge questions stress chunk boundaries |
| TriviaQA | Single-hop QA | Retrieval gap: jargon and synonym sensitivity |
| MS-MARCO | Passage retrieval | Retrieval gap: passage-level query variation |

Quick start

# Clone and install
git clone https://github.com/YOUR_USERNAME/probeguard.git
cd probeguard
python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
pip install -r requirements.txt

# Configure API keys
cp .env.example .env
# Add GROQ_API_KEY and ANTHROPIC_API_KEY

# Verify all components are healthy
python smoke_test/dry_run1.py
python smoke_test/dry_run2.py
python smoke_test/dry_run3.py
python smoke_test/dry_run4.py

# Run fast experiment (~35 min, all 4 probes active)
python probeguard/experiments/hotpotqa_eval.py --fast

# Run paper-grade experiment (~90 min)
python probeguard/experiments/hotpotqa_eval.py --n-docs 500 --n-queries 100

Project structure

probeguard/
├── probes/
│   ├── chunk_fracture.py        # Cross-chunk reasoning failure detection
│   ├── retrieval_gap.py         # Retrieval sensitivity to query paraphrases
│   ├── corpus_contradiction.py  # Conflicting document injection + NLI scoring
│   └── prompt_hijack.py         # Adversarial instruction detection
├── experiments/
│   ├── base_experiment.py       # Shared infrastructure, CI computation, seed
│   └── hotpotqa_eval.py         # HotpotQA benchmark experiment
├── fcs_metric.py                # Failure Coverage Score — novel metric
├── base_pipeline.py             # Production RAG pipeline under test
├── config.py                    # Hyperparameters, model config, thresholds
└── core.py                      # ProbeGuard orchestration and evaluation loop
smoke_test/
├── dry_run1.py                  # Data loading verification
├── dry_run2.py                  # Component health check
├── dry_run3.py                  # Latency benchmark
└── dry_run4.py                  # Probe sensitivity invariant tests

Reproducing results

All experiments use fixed seed 42. Results include Wilson 95% confidence intervals. To reproduce:

python probeguard/experiments/hotpotqa_eval.py \
  --n-docs 250 --n-queries 50 --force-rebuild

Expected: FCS ≈ 0.806, RG% = 40%, CC% = 20%, runtime ~35 minutes


Status and roadmap

Completed

  • Four-probe adversarial evaluation framework, all probes operational
  • Failure Coverage Score — novel metric with overlap-weighted unique rate formula
  • Production RAG pipeline: BGE embeddings, cross-encoder re-ranking, Llama 3.3 70B, multi-hop decomposition prompt
  • HotpotQA benchmark: FCS = 0.806, RG = 40%, CC = 20%
  • Wilson 95% confidence intervals on all probe rates
  • All 4 smoke tests passing with sub-0.5s query latency

In active development

  • ChunkFracture calibration for larger corpora (target CF% = 35-50% on HotpotQA)
  • TriviaQA and MS-MARCO experiments (datasets 2 and 3 of 5)
  • RAGSurgeon component attribution — differential isolation of retriever, chunker, corpus, and prompt failures

Planned

  • Natural Questions and SQuAD experiments to complete 5-pipeline benchmark
  • Mean FCS validation across all five pipelines (target: 0.67 ± 0.05)
  • pip install probeguard open-source package release
  • LangChain and LlamaIndex native integration
  • arXiv preprint submission
