ProbeGuard

Adversarial pre-deployment evaluation and root cause attribution for RAG pipelines.

RAGAS tells you your pipeline failed. ProbeGuard finds the failures before your users do. RAGSurgeon tells you which component caused each failure and what to fix.

The problem

RAG pipelines in production fail at their edges — not their center. Standard evaluation frameworks test pipelines on natural, well-formed queries and report average-case metrics. They have no mechanism to reach the edge cases that real users find after deployment.

Four documented failure modes are systematically missed:

Chunk boundary fracture — the answer exists in the corpus but spans two chunks; the retriever cannot surface it
Retrieval gap — the retriever degrades silently when queries are rephrased with synonyms or jargon
Corpus contradiction — conflicting documents corrupt the generated answer without triggering any quality alert
Prompt hijack — adversarial instructions embedded in retrieved content redirect the LLM

When evaluation does surface a failure, existing tools report a score and stop. They do not attribute the failure to a specific pipeline component, and they do not prescribe a fix. Engineering teams are left with a number and no diagnosis.

ProbeGuard closes both blind spots before a single user sees the pipeline.

Key results

Evaluated on HotpotQA with a production-grade RAG pipeline (BGE-base-en-v1.5 + cross-encoder re-ranker + Llama 3.3 70B):

Metric	Value
FCS (Failure Coverage Score)	0.806
Adversarial failure rate	52.0%
Retrieval gap rate	40.0% ±26.0 (95% CI)
Corpus contradiction rate	20.0% ±22.7 (95% CI)
Standard eval failure rate	90.0% ±19.3 (95% CI)

FCS = 0.806: standard evaluation misses 80.6% of the structural failure boundary that ProbeGuard finds. A pipeline can pass standard evaluation with high faithfulness scores while roughly 80% of its structural vulnerabilities remain undetected.

The Failure Coverage Score

FCS is a novel metric introduced by this work. It quantifies what fraction of a RAG pipeline's true failure boundary is left unexplored by standard evaluation.

FCS = adversarial_unique_rate / adversarial_union_rate

where:
  unique_rate_p           = probe_rate_p * (1 - overlap_weight_p)
  adversarial_unique_rate = 1 - prod(1 - unique_rate_p)
  adversarial_union_rate  = 1 - prod(1 - probe_rate_p)

Each probe carries an overlap weight — the fraction of its failures that standard answer-checking could theoretically detect:

Probe	Overlap weight	Rationale
Chunk fracture	0.15	Cross-chunk failures are structurally invisible to answer-checking
Retrieval gap	0.15	Silent retriever misses never surface in generated answers
Corpus contradiction	0.40	Incoherent answers are sometimes caught by faithfulness metrics
Prompt hijack	0.40	Off-topic answers are sometimes flagged by answer relevance metrics

FCS = 0.0 means standard evaluation found everything (impossible in practice). FCS = 0.67 means it missed 67% of the structural failure boundary. FCS = 1.0 means standard evaluation found nothing that ProbeGuard finds.
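
As a worked example, the formula and overlap weights above can be sketched in a few lines of Python. The function name and dictionary keys are illustrative and are not taken from fcs_metric.py. The inputs assume RG = 0.40 and CC = 0.20 from the results table, with CF and PH set to 0. That last choice is an assumption, made because it is consistent with the reported 52.0% adversarial failure rate.

```python
from math import prod

# Overlap weights copied from the table above.
OVERLAP_WEIGHTS = {
    "chunk_fracture": 0.15,
    "retrieval_gap": 0.15,
    "corpus_contradiction": 0.40,
    "prompt_hijack": 0.40,
}


def failure_coverage_score(probe_rates, overlap_weights=OVERLAP_WEIGHTS):
    """Return FCS = adversarial_unique_rate / adversarial_union_rate."""
    unique_rates = [
        rate * (1.0 - overlap_weights[name]) for name, rate in probe_rates.items()
    ]
    adversarial_unique_rate = 1.0 - prod(1.0 - u for u in unique_rates)
    adversarial_union_rate = 1.0 - prod(1.0 - r for r in probe_rates.values())
    if adversarial_union_rate == 0.0:
        # Assumed handling: the ratio is undefined when no probe found a failure.
        raise ValueError("FCS is undefined when all probe rates are zero")
    return adversarial_unique_rate / adversarial_union_rate


# Assumed inputs: RG and CC from the results table, CF and PH set to 0.
rates = {
    "chunk_fracture": 0.0,
    "retrieval_gap": 0.40,
    "corpus_contradiction": 0.20,
    "prompt_hijack": 0.0,
}
print(round(failure_coverage_score(rates), 3))  # 0.806
```

With these inputs the union rate is 1 - (0.60 x 0.80) = 0.52 and the unique rate is 1 - (0.66 x 0.88) = 0.4192, which gives 0.4192 / 0.52 = 0.806.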

Four adversarial probes

Chunk fracture

HotpotQA bridge questions require reasoning across two Wikipedia articles. If the retriever cannot connect facts across a chunk boundary, the answer is unreachable regardless of LLM capability. The probe compares retrieval under chunked indexing (150 tokens) versus full-document indexing and flags fractures when the chunked retriever drops a gold document that full-document retrieval surfaces.
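
The fracture rule described above can be sketched as a set comparison. This is a minimal illustration, not the code in chunk_fracture.py: the function names and the (gold, chunked, full-document) tuple layout are assumptions, and the retrieved document IDs would come from the two index variants.

```python
def is_fracture(gold_doc_ids, chunked_hits, full_doc_hits):
    """True when full-document retrieval surfaces a gold document
    that chunked retrieval drops."""
    gold = set(gold_doc_ids)
    surfaced_full = gold & set(full_doc_hits)
    surfaced_chunked = gold & set(chunked_hits)
    return bool(surfaced_full - surfaced_chunked)


def fracture_rate(cases):
    """Fraction of (gold, chunked_hits, full_doc_hits) cases flagged as fractures."""
    if not cases:
        return 0.0
    return sum(is_fracture(*case) for case in cases) / len(cases)


cases = [
    (["A", "B"], ["A", "C"], ["A", "B"]),  # gold doc B is lost under chunking
    (["D", "E"], ["D", "E"], ["D", "E"]),  # both index variants find both gold docs
]
print(fracture_rate(cases))  # 0.5
```

A query where neither index surfaces the gold document is not counted here, since that miss cannot be attributed to chunking.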

Retrieval gap

Real users phrase the same information need in many ways. The probe generates lexical paraphrases of each query — declarative form, keyword extraction, question-word variants — and measures rank delta across variants. A gap is flagged when mean rank delta exceeds threshold, indicating the retriever is fragile to phrasing variation.
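
A minimal sketch of the rank-delta check follows. The threshold of 2.0 and the miss rank of 9 (one past the top-8 retrieval depth) are assumed values chosen for illustration; the actual settings live in config.py and are not shown here.

```python
from statistics import mean


def rank_of(doc_id, ranked_ids, miss_rank):
    """1-based rank of doc_id, or miss_rank when it was not retrieved."""
    return ranked_ids.index(doc_id) + 1 if doc_id in ranked_ids else miss_rank


def mean_rank_delta(gold_id, original_ranking, variant_rankings, miss_rank=9):
    """Average rank change of the gold document across paraphrased queries."""
    base = rank_of(gold_id, original_ranking, miss_rank)
    return mean(rank_of(gold_id, r, miss_rank) - base for r in variant_rankings)


def is_retrieval_gap(gold_id, original_ranking, variant_rankings,
                     threshold=2.0, miss_rank=9):
    """Flag a gap when the mean rank delta exceeds the (assumed) threshold."""
    delta = mean_rank_delta(gold_id, original_ranking, variant_rankings, miss_rank)
    return delta > threshold


original = ["g", "a", "b"]                      # gold doc "g" at rank 1
variants = [["a", "g", "b"], ["a", "b", "c"]]   # rank 2, then missing
print(mean_rank_delta("g", original, variants))  # 4.5
```

Here the gold document slips from rank 1 to rank 2 under one paraphrase and disappears under the other, so the mean delta is (1 + 8) / 2 = 4.5 and the query is flagged.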

Corpus contradiction

Su et al. (2024) showed adversarially crafted documents can outrank legitimate ones in retrieval. The probe injects a programmatically generated contradiction document and runs NLI-based detection (DeBERTa-v3) to measure how much the contradiction shifts the pipeline's output.
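
One way to turn the NLI signal into a vulnerability rate is sketched below. The scorer is passed in as a callable, so the example runs without downloading a model. `nli_contradiction_prob`, the 0.5 threshold, and the toy scorer are all hypothetical; in practice the callable would wrap the DeBERTa-v3 NLI model listed in the technology stack.

```python
def contradiction_vulnerability_rate(answer_pairs, nli_contradiction_prob,
                                     threshold=0.5):
    """Fraction of queries whose answer flips after the contradiction is injected.

    answer_pairs: list of (clean_answer, poisoned_answer) strings.
    nli_contradiction_prob: hypothetical callable returning P(contradiction)
    for a (premise, hypothesis) pair.
    """
    if not answer_pairs:
        return 0.0
    flagged = sum(
        1 for clean, poisoned in answer_pairs
        if nli_contradiction_prob(clean, poisoned) > threshold
    )
    return flagged / len(answer_pairs)


def toy_nli(premise, hypothesis):
    """Stand-in scorer for demonstration only: any textual change counts
    as a contradiction. A real NLI model would replace this."""
    return 0.0 if premise.strip().lower() == hypothesis.strip().lower() else 1.0


pairs = [
    ("Paris", "Paris"),   # answer unchanged by the injected document
    ("1969", "1972"),     # answer shifted by the injected document
]
print(contradiction_vulnerability_rate(pairs, toy_nli))  # 0.5
```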

Prompt hijack

The probe injects instruction-bearing documents into the retrieved set and measures semantic deviation between clean and hijacked responses. A hijack is flagged when mean semantic deviation exceeds baseline variance.
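
The deviation check can be illustrated with cosine distance between response embeddings. This sketch assumes that semantic deviation means 1 minus cosine similarity and that the baseline is a precomputed scalar. Neither detail is confirmed by the description above, and the vectors are toy values rather than real embeddings.

```python
from math import sqrt
from statistics import mean


def cosine_deviation(u, v):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def is_hijacked(clean_vecs, hijacked_vecs, baseline):
    """Flag a hijack when mean deviation between paired clean and hijacked
    responses exceeds the baseline (assumed to be a scalar measured on
    repeated clean runs)."""
    deviations = [cosine_deviation(c, h) for c, h in zip(clean_vecs, hijacked_vecs)]
    return mean(deviations) > baseline


clean = [[1.0, 0.0], [0.0, 1.0]]
hijacked = [[0.0, 1.0], [0.0, 1.0]]  # first response redirected, second unchanged
print(is_hijacked(clean, hijacked, baseline=0.1))  # True
```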

Architecture

Input query
    |
    v
+---------------------------------------------+
|          RAG pipeline under test            |
|                                             |
|  BGE embed -> ChromaDB top-8               |
|           -> Cross-encoder re-rank -> top-4 |
|           -> Llama 3.3 70B (multi-hop)      |
|           -> Answer                         |
+---------------------------------------------+
    |
    v
+---------------------------------------------+
|        ProbeGuard adversarial layer         |
|                                             |
|  ChunkFractureProbe  -> fracture_rate (CF%) |
|  RetrievalGapProbe   -> gap_rate      (RG%) |
|  ContradictionProbe  -> vuln_rate     (CC%) |
|  PromptHijackProbe   -> hijack_rate   (PH%) |
|                                             |
|  FCSCalculator       -> FCS score           |
|                         + interpretation    |
|                         + recommendation    |
+---------------------------------------------+
    |
    v
+---------------------------------------------+
|        RAGSurgeon attribution layer         |
|                                             |
|  Differential component isolation:          |
|  failure_cause in {retriever, chunker,      |
|                    corpus, prompt}          |
|  + structured fix recommendation per cause  |
+---------------------------------------------+
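
The top-8 then top-4 flow in the first box is a standard two-stage retrieval pattern and can be sketched as follows. The scorers are pluggable placeholders (a toy token-overlap function here) standing in for the BGE embedding similarity and the cross-encoder. None of these names come from base_pipeline.py.

```python
def retrieve_then_rerank(query, corpus, embed_score, rerank_score,
                         retrieve_k=8, final_k=4):
    """Two-stage retrieval: wide first-stage recall, then a narrower re-rank."""
    # Stage 1: keep the retrieve_k documents the first-stage scorer ranks highest.
    candidates = sorted(
        corpus, key=lambda doc: embed_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: re-score only those candidates and keep the final_k best.
    return sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:final_k]


def toy_overlap(query, doc):
    """Placeholder scorer: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


corpus = [f"note {i} about chunking" for i in range(10)] + ["retrieval gap note"]
top = retrieve_then_rerank("retrieval gap", corpus, toy_overlap, toy_overlap)
print(len(top), top[0])
```

The re-ranker only ever sees the first-stage candidates, so a document dropped at stage 1 cannot be recovered at stage 2. This is why the retrieval probes target the first stage.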

How it relates to existing work

RAGAS (Es et al., 2023) established reference-free RAG evaluation via faithfulness, answer relevance, and context relevance. Its fundamental limitation: it evaluates reactively on the queries you provide. It has no mechanism to find failures you did not think to test, and no concept of chunk boundary failure or corpus contamination.

HotpotQA (Yang et al., 2018) demonstrated that multi-hop QA is where retrieval systems break most severely. Every bridge question in HotpotQA systematically requires the cross-document connections that chunking disrupts — making it the natural stress test for chunk fracture. ProbeGuard operationalises this observation as a quantifiable probe.

RAG Adversarial Poisoning (Su et al., 2024) showed adversarial documents can outrank legitimate ones in retrieval, and that mixing adversarial and guiding contexts creates non-linear corruption. ProbeGuard implements their taxonomy as a systematic probe and extends it with component-level attribution.

ProbeGuard is the first framework that converts all three insights into a systematic pre-deployment evaluation tool with component-level attribution and a reproducible coverage metric.

Comparison with existing tools

Capability	RAGAS	TruLens	RAGXplain	ProbeGuard
Pre-deployment adversarial testing	No	No	No	Yes
Chunk boundary failure detection	No	No	No	Yes
Novel failure coverage metric (FCS)	No	No	No	Yes
Component-level root cause attribution	No	Partial	Partial	Yes
Structured fix recommendations	No	No	No	Yes
Statistical confidence intervals	No	No	No	Yes
Reference-free evaluation	Yes	Yes	Yes	Yes

Technology stack

Component	Technology
Vector store	ChromaDB (persistent, cosine similarity)
Embedding model	BAAI/bge-base-en-v1.5 (768-dim, asymmetric encoding)
Re-ranker	cross-encoder/ms-marco-MiniLM-L-6-v2
LLM	Llama 3.3 70B via Groq API
NLI model	cross-encoder/nli-deberta-v3-base
Standard eval	ROUGE-L with Wilson 95% confidence intervals
Reproducibility	Fixed seed 42, deterministic temperature 0.0

Benchmark datasets

Dataset	Type	Failure modes targeted
HotpotQA	Multi-hop QA (113k pairs)	Chunk fracture — bridge questions stress chunk boundaries
TriviaQA	Single-hop QA	Retrieval gap — jargon and synonym sensitivity
MS-MARCO	Passage retrieval	Retrieval gap — passage-level query variation

Quick start

# Clone and install
git clone https://github.com/YOUR_USERNAME/probeguard.git
cd probeguard
python -m venv venv
venv\Scripts\activate          # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt

# Configure API keys
cp .env.example .env
# Add GROQ_API_KEY and ANTHROPIC_API_KEY

# Verify all components are healthy
python smoke_test/dry_run1.py
python smoke_test/dry_run2.py
python smoke_test/dry_run3.py
python smoke_test/dry_run4.py

# Run fast experiment (~35 min, all 4 probes active)
python probeguard/experiments/hotpotqa_eval.py --fast

# Run paper-grade experiment (~90 min)
python probeguard/experiments/hotpotqa_eval.py --n-docs 500 --n-queries 100

Project structure

probeguard/
├── probes/
│   ├── chunk_fracture.py        # Cross-chunk reasoning failure detection
│   ├── retrieval_gap.py         # Retrieval sensitivity to query paraphrases
│   ├── corpus_contradiction.py  # Conflicting document injection + NLI scoring
│   └── prompt_hijack.py         # Adversarial instruction detection
├── experiments/
│   ├── base_experiment.py       # Shared infrastructure, CI computation, seed
│   └── hotpotqa_eval.py         # HotpotQA benchmark experiment
├── fcs_metric.py                # Failure Coverage Score — novel metric
├── base_pipeline.py             # Production RAG pipeline under test
├── config.py                    # Hyperparameters, model config, thresholds
└── core.py                      # ProbeGuard orchestration and evaluation loop
smoke_test/
├── dry_run1.py                  # Data loading verification
├── dry_run2.py                  # Component health check
├── dry_run3.py                  # Latency benchmark
└── dry_run4.py                  # Probe sensitivity invariant tests

Reproducing results

All experiments use fixed seed 42. Results include Wilson 95% confidence intervals. To reproduce:

python probeguard/experiments/hotpotqa_eval.py \
  --n-docs 250 --n-queries 50 --force-rebuild

Expected: FCS ≈ 0.806, RG% = 40%, CC% = 20%, runtime ~35 minutes
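
For readers checking the reported intervals, the Wilson score interval can be computed as sketched below. This is the textbook formula, not the code from base_experiment.py. The sample size of 10 in the example is an assumption: it is not stated in this README, but it is consistent with the reported ±26.0 half-width for a 40% rate.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed sample: 4 flagged queries out of 10 (a 40% retrieval gap rate).
low, high = wilson_interval(4, 10)
print(f"[{low:.3f}, {high:.3f}] half-width {(high - low) / 2:.3f}")
```

Under that assumption the interval is roughly 0.168 to 0.687, a half-width of about 0.26. The Wilson interval is centered slightly away from the observed rate, so the "rate ± half-width" form in the results table is an approximation.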

Status and roadmap

Completed

Four-probe adversarial evaluation framework, all probes operational
Failure Coverage Score — novel metric with overlap-weighted unique rate formula
Production RAG pipeline: BGE embeddings, cross-encoder re-ranking, Llama 3.3 70B, multi-hop decomposition prompt
HotpotQA benchmark: FCS = 0.806, RG = 40%, CC = 20%
Wilson 95% confidence intervals on all probe rates
All 4 smoke tests passing with sub-0.5s query latency

In active development

ChunkFracture calibration for larger corpora (target CF% = 35-50% on HotpotQA)
TriviaQA and MS-MARCO experiments (datasets 2 and 3 of 5)
RAGSurgeon component attribution — differential isolation of retriever, chunker, corpus, and prompt failures

Planned

Natural Questions and SQuAD experiments to complete 5-pipeline benchmark
Mean FCS validation across all five pipelines (target: 0.67 ± 0.05)
Open-source package release (pip install probeguard)
LangChain and LlamaIndex native integration
arXiv preprint submission
