FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Anonymous Repository for Double-Blind Review

Overview

FinGround is a three-stage verify-then-ground pipeline for financial document question answering that detects and mitigates LLM hallucinations through:

Finance-Aware Hybrid Retrieval — Complexity-routed retrieval over text and tables with structure-aware chunking
Atomic Financial Claim Verification — Decomposition into six claim types (numerical, temporal, entity-attribute, comparative, regulatory, computational) with type-routed verification including formula reconstruction
Grounded Regeneration — Targeted rewriting of unsupported claims with paragraph- and table-cell-level citations
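
As a rough illustration of the type-routed verification described above, the six claim types can be modelled as an enum with a lookup table that selects a verifier per type. The enum members mirror the taxonomy listed here; the verifier names and the routing table are hypothetical placeholders, not identifiers from the FinGround code.

```python
from enum import Enum


class ClaimType(Enum):
    """The six claim types named in the overview."""
    NUMERICAL = "numerical"
    TEMPORAL = "temporal"
    ENTITY_ATTRIBUTE = "entity-attribute"
    COMPARATIVE = "comparative"
    REGULATORY = "regulatory"
    COMPUTATIONAL = "computational"


# Hypothetical routing table: the verifier names are illustrative assumptions.
VERIFIER_FOR_TYPE = {
    ClaimType.NUMERICAL: "exact_value_match",
    ClaimType.TEMPORAL: "period_alignment",
    ClaimType.ENTITY_ATTRIBUTE: "entailment_check",
    ClaimType.COMPARATIVE: "ordered_value_comparison",
    ClaimType.REGULATORY: "entailment_check",
    ClaimType.COMPUTATIONAL: "formula_reconstruction",
}


def route(claim_type: ClaimType) -> str:
    """Return the name of the verifier that would handle this claim type."""
    return VERIFIER_FOR_TYPE[claim_type]


print(route(ClaimType.COMPUTATIONAL))
```

Keeping the routing in a single table makes it easy to check that every claim type has a verifier assigned.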

An efficient 8B distilled detector reaches 91.4 F1 (96.2% of the GPT-4o teacher's 95.0) at 18× lower per-claim latency, enabling deployment at $0.003 per query in production financial services environments.

🏭 Main Work

Deployment Context

Aspect	Details
Status	Pilot (4-week feasibility study with 24 financial analysts)
Scale	~25,000 queries/day target (500 filings/day × 50 queries/filing)
Latency	P50: 2.4s, P95: 3.8s (full pipeline, A100)
Throughput	8.4 QPS on single A100 (28× headroom over target)
Infrastructure	vLLM serving, Kubernetes orchestration, A100/A10G GPU
Cost	$0.003/query (15.7× lower than GPT-4o teacher)
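
The scale and throughput rows can be related with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes queries arrive evenly over 24 hours, which is an assumption made here and not a statement from the pilot. Under that assumption the headroom comes out at roughly 29×, close to the 28× quoted; the exact figure depends on how the target rate is defined.

```python
SECONDS_PER_DAY = 24 * 60 * 60

filings_per_day = 500
queries_per_filing = 50
measured_qps = 8.4  # single A100, from the table above

queries_per_day = filings_per_day * queries_per_filing

# Assumption: load is spread uniformly over the day (no peak-hour factor).
average_target_qps = queries_per_day / SECONDS_PER_DAY
headroom = measured_qps / average_target_qps

print(f"target volume: {queries_per_day:,} queries/day")
print(f"average target rate: {average_target_qps:.3f} QPS")
print(f"headroom: {headroom:.1f}x")
```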

Production Metrics

Metric	GPT-4o Teacher	FinGround (8B)	Improvement
Detection F1	95.0	91.4	96.2% retention
Per-claim Latency	6.1s	340ms (p95)	18× faster
Pipeline Latency (p95)	8.2s	3.8s	2.2× faster
Cost per Query	$0.047	$0.003	15.7× cheaper
HalRate (FinQA)	—	3.6%	78% reduction vs GPT-4o+CoT
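
The Improvement column for the first four rows follows directly from the two value columns. The short check below recomputes those ratios from the numbers in the table; the dictionary keys are illustrative names chosen for this sketch.

```python
teacher = {"f1": 95.0, "claim_latency_s": 6.1, "pipeline_p95_s": 8.2, "cost_usd": 0.047}
student = {"f1": 91.4, "claim_latency_s": 0.340, "pipeline_p95_s": 3.8, "cost_usd": 0.003}

f1_retention = student["f1"] / teacher["f1"]
claim_speedup = teacher["claim_latency_s"] / student["claim_latency_s"]
pipeline_speedup = teacher["pipeline_p95_s"] / student["pipeline_p95_s"]
cost_ratio = teacher["cost_usd"] / student["cost_usd"]

print(f"F1 retention:      {f1_retention:.1%}")
print(f"per-claim speedup: {claim_speedup:.1f}x")
print(f"pipeline speedup:  {pipeline_speedup:.1f}x")
print(f"cost ratio:        {cost_ratio:.1f}x")
```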

System Architecture

[User Query] → [Complexity Classifier] → [Hybrid Text+Table Retrieval]
                                                    ↓
[Generated Answer] → [Claim Decomposition] → [Claim-Evidence Alignment]
                          ↓                          ↓
              [6-Type Financial Taxonomy]    [Verdict Classifier]
                                                    ↓
                           ┌────────────────────────┼────────────────┐
                     [Supported]            [Contradicted]     [Unverifiable]
                        ↓ (pass-through)         ↓                   ↓
                                          [Targeted Span Regeneration]
                                                    ↓
                                          [Citation Attachment]
                                                    ↓
                                          [Grounded Answer]
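
The branch at the bottom of the diagram can be sketched as follows: supported claims pass through unchanged, while contradicted and unverifiable claims are handed to a regeneration step. The class and function names are hypothetical, and the regenerator here is a stub; a real one would rewrite the span from retrieved evidence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNVERIFIABLE = "unverifiable"


@dataclass
class Claim:
    text: str
    verdict: Verdict


def ground_answer(claims: List[Claim], regenerate: Callable[[Claim], str]) -> List[str]:
    """Pass supported claims through; rewrite all others with `regenerate`."""
    grounded = []
    for claim in claims:
        if claim.verdict is Verdict.SUPPORTED:
            grounded.append(claim.text)
        else:
            grounded.append(regenerate(claim))
    return grounded


def stub_regenerate(claim: Claim) -> str:
    """Placeholder regenerator (hypothetical): only tags the span for rewriting."""
    return f"[needs regeneration: {claim.verdict.value}] {claim.text}"


claims = [
    Claim("Revenue grew year over year.", Verdict.SUPPORTED),
    Claim("Operating margin was 31%.", Verdict.CONTRADICTED),
]
for line in ground_answer(claims, stub_regenerate):
    print(line)
```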

Lessons Learned

Claim-type routing is the bottleneck, not verification difficulty: Computational claims have the highest hallucination rate (28.4%) despite being most amenable to automated verification when properly typed. Generic NLI completely fails on ratio and margin verification even with correct evidence retrieved.
Hedged financial language dominates false positives: Terms like "approximately" and "roughly" in risk disclosures account for 52% of false positives. Calibrated confidence thresholds specific to claim types and financial language patterns are essential for production deployment.
Table-cell-level citations outperform paragraph-level references: Pilot analysts strongly preferred cell-level citations (e.g., "Table 3, Row: Operating Income, Col: FY2024"), enabling verification in seconds rather than minutes. Structure-aware chunking that preserves row-column relationships is critical infrastructure.
Retrieval-equalized evaluation is essential for RAG research: Conflating retrieval improvements with verification improvements inflates perceived system value. When retrieval is held fixed across systems, atomic verification alone still accounts for a further 68–76% reduction in HalRate.
Distillation efficiency plateaus early: 1,600 examples yield 88.6% F1; 3,200 yield 91.4%; 6,400 yield only 92.1% (+0.7). Near-optimal training requires ~2,500–3,000 examples, making domain adaptation practical.
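
The first three lessons can be illustrated together with a small sketch: a computational claim is checked by recomputing the ratio from cited table cells, the tolerance is loosened when the claim uses hedged language, and each input carries a cell-level citation. All names, the tolerance values, and the example figures are assumptions made for illustration; they are not taken from the FinGround code or paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellCitation:
    """Points at a single table cell."""
    table: str
    row: str
    col: str

    def __str__(self) -> str:
        return f"{self.table}, Row: {self.row}, Col: {self.col}"


HEDGE_TERMS = ("approximately", "roughly")  # terms named in the lessons above

# Illustrative tolerances in percentage points (assumed, not from the paper).
EXACT_TOLERANCE_PP = 0.05
HEDGED_TOLERANCE_PP = 0.5


def verify_margin_claim(claim_text, claimed_pct, numerator, denominator):
    """Recompute a ratio from cited cells and compare it with the claimed percentage."""
    recomputed_pct = 100.0 * numerator / denominator
    hedged = any(term in claim_text.lower() for term in HEDGE_TERMS)
    tolerance = HEDGED_TOLERANCE_PP if hedged else EXACT_TOLERANCE_PP
    supported = abs(recomputed_pct - claimed_pct) <= tolerance
    return supported, recomputed_pct


# Made-up example values: operating income 300, revenue 1000.
evidence = [
    (CellCitation("Table 3", "Operating Income", "FY2024"), 300.0),
    (CellCitation("Table 3", "Revenue", "FY2024"), 1000.0),
]
income, revenue = evidence[0][1], evidence[1][1]

for text in ("Operating margin was 30.4%.", "Operating margin was approximately 30.4%."):
    ok, value = verify_margin_claim(text, 30.4, income, revenue)
    cites = "; ".join(str(c) for c, _ in evidence)
    print(f"{text!r}: supported={ok}, recomputed={value:.1f}% [{cites}]")
```

With these assumed tolerances the unhedged claim is rejected and the hedged one is accepted, which is the kind of type- and language-specific calibration the second lesson calls for.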

Reproducibility

While production data and internal deployment details cannot be released, we provide:

Inference and pipeline code
Evaluation scripts with benchmark reproduction
Latency benchmarking infrastructure
Serving code (FastAPI + vLLM)
Configuration files for all experiments
Prompts and templates
Model weights (to be released upon acceptance)
Not released: production training data (proprietary)

Quick Start

Installation

# Clone repository
git clone https://anonymous.4open.science/r/FinGround/ finground
cd finground

# Create environment
python -m venv .venv && source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

Run Inference

from finground.pipeline import FinGroundPipeline

pipeline = FinGroundPipeline.from_pretrained("finground-8b")

result = pipeline.verify(
    query="What was Apple's gross margin in Q4 2023?",
    answer="Apple's gross margin was 42.3% in Q4 2023.",
    documents=["path/to/10K.pdf"],
)

print(result.grounded_answer)
print(result.claims)  # Per-claim verdicts with citations

Run Evaluation

# Reproduce main results (Table 2 in paper)
python -m finground.evaluation.run_eval \
    --config configs/eval_finqa.yaml \
    --output results/

# Reproduce detection results (Table 1)
python -m finground.evaluation.run_eval \
    --config configs/eval_finhalu.yaml \
    --output results/

# Run latency benchmark
python -m finground.evaluation.latency_benchmark \
    --config configs/benchmark.yaml

Start Serving

# Local development
uvicorn finground.serving.server:app --host 0.0.0.0 --port 8080

# Docker
docker build -f Dockerfile.serving -t finground-serving .
docker run --gpus all -p 8080:8080 finground-serving

# Kubernetes
kubectl apply -f kubernetes/

Project Structure

finground/
├── src/finground/
│   ├── __init__.py
│   ├── pipeline/              # End-to-end pipeline
│   │   ├── __init__.py
│   │   └── core.py            # FinGroundPipeline
│   ├── models/                # Model definitions
│   │   ├── __init__.py
│   │   ├── claim_decomposer.py
│   │   ├── verdict_classifier.py
│   │   ├── complexity_classifier.py
│   │   └── regenerator.py
│   ├── data/                  # Data loading and processing
│   │   ├── __init__.py
│   │   ├── datasets.py
│   │   └── taxonomy.py
│   ├── evaluation/            # Evaluation and benchmarking
│   │   ├── __init__.py
│   │   ├── metrics.py
│   │   ├── run_eval.py
│   │   └── latency_benchmark.py
│   ├── serving/               # Production serving
│   │   ├── __init__.py
│   │   └── server.py
│   └── utils/                 # Utilities
│       ├── __init__.py
│       └── reproducibility.py
├── configs/                   # Experiment configurations
│   ├── eval_finqa.yaml
│   ├── eval_tatqa.yaml
│   ├── eval_finhalu.yaml
│   ├── eval_financebench.yaml
│   ├── distillation.yaml
│   └── benchmark.yaml
├── tests/                     # Unit and integration tests
│   ├── test_pipeline.py
│   ├── test_taxonomy.py
│   └── test_metrics.py
├── docs/                      # Documentation
│   └── COMPUTE_RESOURCES.md
├── kubernetes/                # Deployment configs
│   └── deployment.yaml
├── paper/                     # Paper source
│   ├── finground_6page.tex
│   └── references.bib
├── Dockerfile.serving
├── pyproject.toml
├── LICENSE
└── README.md

Evaluation Results Summary

Hallucination Detection (FinHalu)

System	Precision	Recall	F1
SelfCheckGPT	69.4±2.1	76.5±1.8	72.8±1.6
HHEM	78.9±1.8	73.8±2.0	76.3±1.5
FActScore	74.2±2.0	79.3±1.7	76.7±1.5
Self-RAG	81.2±1.7	77.1±1.9	79.1±1.4
GPT-4o (teacher)	94.1±0.9	95.9±0.7	95.0±0.6
FinGround (8B)	92.7±1.1	90.2±1.3	91.4±1.2
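
As a sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The snippet below recomputes F1 from the reported mean precision and recall. F1 averaged over runs need not equal F1 of the averaged precision and recall exactly, so small differences would be expected; here every row agrees to within 0.1.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "SelfCheckGPT": (69.4, 76.5, 72.8),
    "HHEM": (78.9, 73.8, 76.3),
    "FActScore": (74.2, 79.3, 76.7),
    "Self-RAG": (81.2, 77.1, 79.1),
    "GPT-4o (teacher)": (94.1, 95.9, 95.0),
    "FinGround (8B)": (92.7, 90.2, 91.4),
}

for system, (p, r, f1) in reported.items():
    print(f"{system:<18} reported={f1:.1f} recomputed={f1_score(p, r):.1f}")
```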

End-to-End Hallucination Rates

System	FinQA	TAT-QA	FinanceBench
Vanilla RAG	34.7%	31.5%	43.8%
GPT-4o + CoT	18.6%	15.2%	22.4%
FinGround	3.6%	3.8%	4.9%
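
Relative reductions can be derived from this table as (baseline − system) / baseline. The sketch below applies that definition, which is an assumption about how reductions are computed, to each benchmark. Under it, the reductions relative to GPT-4o + CoT fall between about 75% and 81% depending on the benchmark.

```python
hal_rate = {
    "Vanilla RAG": {"FinQA": 34.7, "TAT-QA": 31.5, "FinanceBench": 43.8},
    "GPT-4o + CoT": {"FinQA": 18.6, "TAT-QA": 15.2, "FinanceBench": 22.4},
    "FinGround": {"FinQA": 3.6, "TAT-QA": 3.8, "FinanceBench": 4.9},
}


def relative_reduction(baseline: float, system: float) -> float:
    """Percentage reduction of `system` relative to `baseline`."""
    return 100.0 * (baseline - system) / baseline


for baseline in ("Vanilla RAG", "GPT-4o + CoT"):
    for bench, rate in hal_rate[baseline].items():
        reduction = relative_reduction(rate, hal_rate["FinGround"][bench])
        print(f"vs {baseline:<13} {bench:<13} {reduction:.1f}% lower")
```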

Hardware Requirements

Configuration	GPU	Memory	Use Case
Recommended	NVIDIA A100 (80GB)	32GB RAM	Full pipeline, benchmarking
Production	NVIDIA A10G (24GB)	16GB RAM	Serving (FP16)
Minimum	Any GPU with 24GB+ VRAM	16GB RAM	Inference only
CPU-only	None	32GB RAM	Evaluation scripts only
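
The 24GB minimum is consistent with a rough weight-memory estimate for an 8B-parameter model in FP16 (2 bytes per parameter). This sketch counts weights only and treats "8B" as exactly 8 billion parameters, which is an assumption; KV cache, activations, and the retrieval components need additional memory.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


params = 8e9  # nominal "8B"; the exact parameter count is an assumption
fp16_gib = weight_memory_gib(params, 2)

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"remaining on a 24 GiB card: {24 - fp16_gib:.1f} GiB")
```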

Limitations

Evaluated on English-language U.S. SEC filings only; other languages and jurisdictions require validation
Evaluation depends on GPT-family models; detection F1 degrades by 3–4 points on answers from non-GPT generators
3.6-point F1 gap from teacher, particularly on nuanced computational claims
Pilot involved 24 analysts at a single firm with self-reported observations
Detection recall drops to 71.4% on hallucinated values within ±5% of ground truth

Responsible NLP Checklist

Limitations: Documented above and in paper Limitations section
Risks: Automation bias risk; outputs are not financial advice; liability for AI-verified claims unresolved
Compute: ~32 A100 GPU-hours for distillation training; ~8 GPU-hours per new claim type
Human Evaluation: 3 financial domain experts, IRB-exempt observational study
AI Assistants: GPT-4o used for teacher annotations with two-pass consistency checks; all outputs reviewed by authors
