Adversarial pre-deployment evaluation and root cause attribution for RAG pipelines.
RAGAS tells you your pipeline failed. ProbeGuard finds the failures before your users do. RAGSurgeon tells you which component caused each failure and what to fix.
RAG pipelines in production fail at their edges — not their center. Standard evaluation frameworks test pipelines on natural, well-formed queries and report average-case metrics. They have no mechanism to reach the edge cases that real users find after deployment.
Four documented failure modes are systematically missed:
- Chunk boundary fracture — the answer exists in the corpus but spans two chunks; the retriever cannot surface it
- Retrieval gap — the retriever degrades silently when queries are rephrased with synonyms or jargon
- Corpus contradiction — conflicting documents corrupt the generated answer without triggering any quality alert
- Prompt hijack — adversarial instructions embedded in retrieved content redirect the LLM
When evaluation does surface a failure, existing tools report a score and stop. They do not attribute the failure to a specific pipeline component, and they do not prescribe a fix. Engineering teams are left with a number and no diagnosis.
ProbeGuard closes both blind spots before a single user sees the pipeline.
Evaluated on HotpotQA with a production-grade RAG pipeline (BGE-base-en-v1.5 + cross-encoder re-ranker + Llama 3.3 70B):
| Metric | Value |
|---|---|
| FCS (Failure Coverage Score) | 0.806 |
| Adversarial failure rate | 52.0% |
| Retrieval gap rate | 40.0% ±26.0 (95% CI) |
| Corpus contradiction rate | 20.0% ±22.7 (95% CI) |
| Standard eval failure rate | 90.0% ±19.3 (95% CI) |
FCS = 0.806 — standard evaluation misses 80.6% of the structural failure boundary that ProbeGuard finds. A pipeline can pass standard evaluation with high faithfulness scores while 80% of its structural vulnerabilities remain undetected.
FCS is a novel metric introduced by this work. It quantifies what fraction of a RAG pipeline's true failure boundary is left unexplored by standard evaluation.
FCS = adversarial_unique_rate / adversarial_union_rate
where, for each probe p:
    unique_rate_p = probe_rate_p * (1 - overlap_weight_p)
    adversarial_unique_rate = 1 - prod_p(1 - unique_rate_p)
    adversarial_union_rate = 1 - prod_p(1 - probe_rate_p)
Each probe carries an overlap weight — the fraction of its failures that standard answer-checking could theoretically detect:
| Probe | Overlap weight | Rationale |
|---|---|---|
| Chunk fracture | 0.15 | Cross-chunk failures are largely invisible to answer-checking |
| Retrieval gap | 0.15 | Silent retriever misses rarely surface in generated answers |
| Corpus contradiction | 0.40 | Incoherent answers are sometimes caught by faithfulness metrics |
| Prompt hijack | 0.40 | Off-topic answers are sometimes flagged by answer relevance metrics |
FCS = 0.0 means standard evaluation found everything (impossible in practice). FCS = 0.67 means it missed 67% of the structural failure boundary. FCS = 1.0 means standard evaluation found nothing that ProbeGuard finds.
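The formula and overlap weights above can be worked through in a few lines. This is a minimal sketch, not the contents of `fcs_metric.py`: the function name `failure_coverage_score` and the dictionary keys are hypothetical. The chunk fracture and prompt hijack rates are assumed to be zero, because that assumption reproduces the 52.0% adversarial failure rate reported in the results table.

```python
from math import prod

# Overlap weights copied from the table above.
OVERLAP_WEIGHTS = {
    "chunk_fracture": 0.15,
    "retrieval_gap": 0.15,
    "corpus_contradiction": 0.40,
    "prompt_hijack": 0.40,
}


def failure_coverage_score(probe_rates, overlap_weights=OVERLAP_WEIGHTS):
    """Return FCS = adversarial_unique_rate / adversarial_union_rate."""
    # Discount each probe's rate by the share standard evaluation could catch.
    unique_rates = [
        rate * (1.0 - overlap_weights[name]) for name, rate in probe_rates.items()
    ]
    unique = 1.0 - prod(1.0 - u for u in unique_rates)
    union = 1.0 - prod(1.0 - r for r in probe_rates.values())
    if union == 0.0:
        raise ValueError("FCS is undefined when no probe finds a failure")
    return unique / union


rates = {
    "chunk_fracture": 0.0,         # assumed, not stated in the results table
    "retrieval_gap": 0.40,         # reported
    "corpus_contradiction": 0.20,  # reported
    "prompt_hijack": 0.0,          # assumed, not stated in the results table
}
fcs = failure_coverage_score(rates)
print(round(fcs, 3))  # prints 0.806
```

With these inputs the union rate is 1 - (0.6 x 0.8) = 0.52 and the unique rate is 1 - (0.66 x 0.88) = 0.4192, giving 0.4192 / 0.52 ≈ 0.806.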
HotpotQA bridge questions require reasoning across two Wikipedia articles. If the retriever cannot connect facts across a chunk boundary, the answer is unreachable regardless of LLM capability. The probe compares retrieval under chunked indexing (150 tokens) versus full-document indexing and flags fractures when the chunked retriever drops a gold document that full-document retrieval surfaces.
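The fracture rule described above can be sketched as a set comparison. This is an illustration under assumptions, not the code in `chunk_fracture.py`: the names `is_fracture` and `fracture_rate` are hypothetical, and chunk-level hits are assumed to have already been mapped back to their parent document ids.

```python
def is_fracture(gold_doc_ids, chunked_hits, full_doc_hits):
    """Flag a fracture when full-document retrieval surfaces a gold document
    that chunked retrieval drops."""
    chunked = set(chunked_hits)
    full = set(full_doc_hits)
    return any(g in full and g not in chunked for g in gold_doc_ids)


def fracture_rate(cases):
    """Fraction of (gold, chunked_hits, full_doc_hits) cases flagged as fractures."""
    flags = [is_fracture(gold, chunked, full) for gold, chunked, full in cases]
    return sum(flags) / len(flags) if flags else 0.0


cases = [
    (["A", "B"], ["A", "C"], ["A", "B"]),  # B lost only under chunking: fracture
    (["A", "B"], ["A", "B"], ["A", "B"]),  # both indexes find both documents
    (["A", "B"], ["A"], ["A"]),            # B missed by both: not a chunking issue
    (["D"], ["E"], ["D"]),                 # D lost only under chunking: fracture
]
rate = fracture_rate(cases)
print(rate)  # prints 0.5
```

The third case shows why the full-document index is needed as a control: a document missed under both indexing schemes points at the retriever rather than the chunker.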
Real users phrase the same information need in many ways. The probe generates lexical paraphrases of each query (declarative form, keyword extraction, question-word variants) and measures rank delta across variants. A gap is flagged when the mean rank delta exceeds the configured threshold, indicating that the retriever is fragile to phrasing variation.
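A minimal sketch of the rank-delta check follows. It assumes the delta is measured on the rank of a known gold document, that a miss is scored as one position past the retrieval depth, and that a positive delta means the paraphrase ranked the document lower. The function names and the threshold value of 2.0 are hypothetical; `top_k=8` mirrors the retrieval depth used by the pipeline under test.

```python
from statistics import mean


def gold_rank(ranked_doc_ids, gold_id, miss_rank):
    """1-based rank of the gold document, or miss_rank if it was not retrieved."""
    try:
        return ranked_doc_ids.index(gold_id) + 1
    except ValueError:
        return miss_rank


def mean_rank_delta(original_ranking, variant_rankings, gold_id, top_k=8):
    """Average change in gold-document rank when the query is paraphrased."""
    miss_rank = top_k + 1
    base = gold_rank(original_ranking, gold_id, miss_rank)
    deltas = [gold_rank(r, gold_id, miss_rank) - base for r in variant_rankings]
    return mean(deltas)


def is_retrieval_gap(original_ranking, variant_rankings, gold_id, threshold=2.0):
    """Flag a gap when the mean rank delta exceeds the (hypothetical) threshold."""
    return mean_rank_delta(original_ranking, variant_rankings, gold_id) > threshold


original = ["gold", "a", "b"]
variants = [
    ["a", "gold", "b"],  # gold drops from rank 1 to rank 2
    ["a", "b", "c"],     # gold not retrieved: scored as rank 9
    ["gold", "a", "b"],  # unchanged
]
delta = mean_rank_delta(original, variants, "gold")
gap = is_retrieval_gap(original, variants, "gold")
print(delta, gap)  # prints 3 True
```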
Su et al. (2024) showed adversarially crafted documents can outrank legitimate ones in retrieval. The probe injects a programmatically generated contradiction document and runs NLI-based detection (DeBERTa-v3) to measure how much the contradiction shifts the pipeline's output.
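One way to turn the NLI step into a rate is to compare each clean answer with the answer produced after injection and count the pairs the classifier labels as contradictory. The sketch below assumes that reading. `contradiction_vulnerability_rate` is a hypothetical name, and `toy_nli` is a deliberately crude stand-in so the example runs without downloading a model; in practice the callable would wrap the NLI model listed in the stack.

```python
def contradiction_vulnerability_rate(clean_answers, poisoned_answers, nli_label):
    """Fraction of queries whose answer contradicts the clean answer after injection.

    nli_label(premise, hypothesis) must return "entailment", "neutral",
    or "contradiction".
    """
    if len(clean_answers) != len(poisoned_answers):
        raise ValueError("answer lists must be aligned per query")
    if not clean_answers:
        return 0.0
    flipped = sum(
        nli_label(clean, poisoned) == "contradiction"
        for clean, poisoned in zip(clean_answers, poisoned_answers)
    )
    return flipped / len(clean_answers)


def toy_nli(premise, hypothesis):
    """Stand-in classifier: identical answers entail, anything else contradicts."""
    same = premise.strip().lower() == hypothesis.strip().lower()
    return "entailment" if same else "contradiction"


clean = ["Paris", "1969", "Marie Curie"]
poisoned = ["Paris", "1972", "Marie Curie"]  # second answer shifted by the injected document
vuln_rate = contradiction_vulnerability_rate(clean, poisoned, toy_nli)
print(round(vuln_rate, 3))  # prints 0.333
```

Passing the classifier in as a callable keeps the rate computation independent of any particular model.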
The probe injects instruction-bearing documents into the retrieved set and measures semantic deviation between clean and hijacked responses. A hijack is flagged when mean semantic deviation exceeds baseline variance.
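Semantic deviation can be sketched as one minus the cosine similarity between embeddings of the clean and hijacked responses. That definition is an assumption, as are the function names below. The text above does not say how the baseline is computed, so the sketch takes it as a parameter. The vectors are toy two-dimensional embeddings; real ones would come from an embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def mean_semantic_deviation(clean_vecs, hijacked_vecs):
    """Mean of (1 - cosine similarity) over paired clean and hijacked responses."""
    deviations = [
        1.0 - cosine_similarity(c, h) for c, h in zip(clean_vecs, hijacked_vecs)
    ]
    return sum(deviations) / len(deviations)


def is_hijacked(clean_vecs, hijacked_vecs, baseline):
    """Flag a hijack when the mean deviation exceeds the supplied baseline."""
    return mean_semantic_deviation(clean_vecs, hijacked_vecs) > baseline


clean_vecs = [[1.0, 0.0], [0.0, 1.0]]
hijacked_vecs = [[1.0, 0.0], [1.0, 0.0]]  # second response moved to an orthogonal direction
deviation = mean_semantic_deviation(clean_vecs, hijacked_vecs)
flagged = is_hijacked(clean_vecs, hijacked_vecs, baseline=0.1)
print(deviation, flagged)  # prints 0.5 True
```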
                  Input query
                       |
                       v
+---------------------------------------------+
|  RAG pipeline under test                    |
|                                             |
|  BGE embed -> ChromaDB top-8                |
|    -> Cross-encoder re-rank -> top-4        |
|    -> Llama 3.3 70B (multi-hop)             |
|    -> Answer                                |
+---------------------------------------------+
                       |
                       v
+---------------------------------------------+
|  ProbeGuard adversarial layer               |
|                                             |
|  ChunkFractureProbe -> fracture_rate (CF%)  |
|  RetrievalGapProbe  -> gap_rate (RG%)       |
|  ContradictionProbe -> vuln_rate (CC%)      |
|  PromptHijackProbe  -> hijack_rate (PH%)    |
|                                             |
|  FCSCalculator      -> FCS score            |
|                        + interpretation     |
|                        + recommendation     |
+---------------------------------------------+
                       |
                       v
+---------------------------------------------+
|  RAGSurgeon attribution layer               |
|                                             |
|  Differential component isolation:          |
|    failure_cause in {retriever, chunker,    |
|                      corpus, prompt}        |
|  + structured fix recommendation per cause  |
+---------------------------------------------+
RAGAS (Es et al., 2023) established reference-free RAG evaluation via faithfulness, answer relevance, and context relevance. Its fundamental limitation: it evaluates reactively on the queries you provide. It has no mechanism to find failures you did not think to test, and no concept of chunk boundary failure or corpus contamination.
HotpotQA (Yang et al., 2018) demonstrated that multi-hop QA is where retrieval systems break most severely. Every bridge question in HotpotQA systematically requires the cross-document connections that chunking disrupts — making it the natural stress test for chunk fracture. ProbeGuard operationalises this observation as a quantifiable probe.
RAG Adversarial Poisoning (Su et al., 2024) showed adversarial documents can outrank legitimate ones in retrieval, and that mixing adversarial and guiding contexts creates non-linear corruption. ProbeGuard implements their taxonomy as a systematic probe and extends it with component-level attribution.
ProbeGuard is the first framework that converts all three insights into a systematic pre-deployment evaluation tool with component-level attribution and a reproducible coverage metric.
| Capability | RAGAS | TruLens | RAGXplain | ProbeGuard |
|---|---|---|---|---|
| Pre-deployment adversarial testing | No | No | No | Yes |
| Chunk boundary failure detection | No | No | No | Yes |
| Novel failure coverage metric (FCS) | No | No | No | Yes |
| Component-level root cause attribution | No | Partial | Partial | Yes |
| Structured fix recommendations | No | No | No | Yes |
| Statistical confidence intervals | No | No | No | Yes |
| Reference-free evaluation | Yes | Yes | Yes | Yes |
| Component | Technology |
|---|---|
| Vector store | ChromaDB (persistent, cosine similarity) |
| Embedding model | BAAI/bge-base-en-v1.5 (768-dim, asymmetric encoding) |
| Re-ranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| LLM | Llama 3.3 70B via Groq API |
| NLI model | cross-encoder/nli-deberta-v3-base |
| Standard eval | ROUGE-L with Wilson 95% confidence intervals |
| Reproducibility | Fixed seed 42, deterministic temperature 0.0 |
| Dataset | Type | Failure modes targeted |
|---|---|---|
| HotpotQA | Multi-hop QA (113k pairs) | Chunk fracture — bridge questions stress chunk boundaries |
| TriviaQA | Single-hop QA | Retrieval gap — jargon and synonym sensitivity |
| MS-MARCO | Passage retrieval | Retrieval gap — passage-level query variation |
# Clone and install
git clone https://github.com/YOUR_USERNAME/probeguard.git
cd probeguard
python -m venv venv && venv\Scripts\activate   # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
# Configure API keys
cp .env.example .env
# Add GROQ_API_KEY and ANTHROPIC_API_KEY
# Verify all components are healthy
python smoke_test/dry_run1.py
python smoke_test/dry_run2.py
python smoke_test/dry_run3.py
python smoke_test/dry_run4.py
# Run fast experiment (~35 min, all 4 probes active)
python probeguard/experiments/hotpotqa_eval.py --fast
# Run paper-grade experiment (~90 min)
python probeguard/experiments/hotpotqa_eval.py --n-docs 500 --n-queries 100

probeguard/
├── probes/
│ ├── chunk_fracture.py # Cross-chunk reasoning failure detection
│ ├── retrieval_gap.py # Retrieval sensitivity to query paraphrases
│ ├── corpus_contradiction.py # Conflicting document injection + NLI scoring
│ └── prompt_hijack.py # Adversarial instruction detection
├── experiments/
│ ├── base_experiment.py # Shared infrastructure, CI computation, seed
│ └── hotpotqa_eval.py # HotpotQA benchmark experiment
├── fcs_metric.py # Failure Coverage Score — novel metric
├── base_pipeline.py # Production RAG pipeline under test
├── config.py # Hyperparameters, model config, thresholds
└── core.py # ProbeGuard orchestration and evaluation loop
smoke_test/
├── dry_run1.py # Data loading verification
├── dry_run2.py # Component health check
├── dry_run3.py # Latency benchmark
└── dry_run4.py # Probe sensitivity invariant tests
All experiments use fixed seed 42. Results include Wilson 95% confidence intervals. To reproduce:
python probeguard/experiments/hotpotqa_eval.py \
  --n-docs 250 --n-queries 50 --force-rebuild

Expected: FCS = 0.80, RG% = 40%, CC% = 20%, runtime ~35 minutes
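The Wilson 95% intervals attached to the probe rates follow the standard Wilson score formula, sketched below. The function name is hypothetical and the sample size of 10 is an assumption made for illustration; the per-probe sample size is not stated here.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(failures=4, n=10)  # n=10 is an assumed sample size
half_width = (high - low) / 2
print(round(low, 3), round(high, 3), round(half_width, 3))
```

For 4 failures in 10 trials the interval is roughly 0.17 to 0.69, a half-width near 0.26. That is close to the ±26.0 reported for the retrieval gap rate, though this does not confirm the sample size actually used. The interval is centered slightly above the observed 40%, which is expected for the Wilson method at small n.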
Completed
- Four-probe adversarial evaluation framework, all probes operational
- Failure Coverage Score — novel metric with overlap-weighted unique rate formula
- Production RAG pipeline: BGE embeddings, cross-encoder re-ranking, Llama 3.3 70B, multi-hop decomposition prompt
- HotpotQA benchmark: FCS = 0.806, RG = 40%, CC = 20%
- Wilson 95% confidence intervals on all probe rates
- All 4 smoke tests passing with sub-0.5s query latency
In active development
- ChunkFracture calibration for larger corpora (target CF% = 35-50% on HotpotQA)
- TriviaQA and MS-MARCO experiments (datasets 2 and 3 of 5)
- RAGSurgeon component attribution — differential isolation of retriever, chunker, corpus, and prompt failures
Planned
- Natural Questions and SQuAD experiments to complete 5-pipeline benchmark
- Mean FCS validation across all five pipelines (target: 0.67 ± 0.05)
- `pip install probeguard` open-source package release
- LangChain and LlamaIndex native integration
- arXiv preprint submission