bettyguo/FinGround


FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Anonymous Repository for Double-Blind Review

Python 3.10+ | Code style: black


Overview

FinGround is a three-stage verify-then-ground pipeline for financial document question answering that detects and mitigates LLM hallucinations through:

  1. Finance-Aware Hybrid Retrieval — Complexity-routed retrieval over text and tables with structure-aware chunking
  2. Atomic Financial Claim Verification — Decomposition into six claim types (numerical, temporal, entity-attribute, comparative, regulatory, computational) with type-routed verification including formula reconstruction
  3. Grounded Regeneration — Targeted rewriting of unsupported claims with paragraph- and table-cell-level citations
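
One way to picture the type-routed verification in stage 2 is a dispatch table keyed by claim type. The sketch below is illustrative only: `ClaimType`, `Claim`, `VERIFIERS`, `route`, and the two toy verifiers are hypothetical names chosen for this example, not the repository's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class ClaimType(Enum):
    """The six claim types named in the overview."""
    NUMERICAL = "numerical"
    TEMPORAL = "temporal"
    ENTITY_ATTRIBUTE = "entity-attribute"
    COMPARATIVE = "comparative"
    REGULATORY = "regulatory"
    COMPUTATIONAL = "computational"


@dataclass
class Claim:
    text: str
    claim_type: ClaimType


def verify_with_nli(claim: Claim, evidence: str) -> str:
    # Toy stand-in for a generic entailment check (substring match only).
    return "supported" if claim.text.lower() in evidence.lower() else "unverifiable"


def verify_by_formula(claim: Claim, evidence: str) -> str:
    # Toy stand-in: a real verifier would recompute the figure from table cells.
    return "unverifiable"


# Hypothetical routing table: computational claims get a dedicated verifier,
# every other type falls back to the generic check.
VERIFIERS: Dict[ClaimType, Callable[[Claim, str], str]] = {
    ClaimType.COMPUTATIONAL: verify_by_formula,
}


def route(claim: Claim, evidence: str) -> str:
    """Pick the verifier registered for the claim's type and run it."""
    verifier = VERIFIERS.get(claim.claim_type, verify_with_nli)
    return verifier(claim, evidence)
```

Keeping the routing in a plain dictionary means a new claim type only needs an enum member and, optionally, a registered verifier.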

An efficient 8B distilled detector reaches 91.4 F1 (96.2% of the GPT-4o teacher's 95.0) at 18× lower per-claim latency, enabling $0.003/query deployment in production financial services environments.


🏭 Main Work

Deployment Context

| Aspect | Details |
| --- | --- |
| Status | Pilot (4-week feasibility study with 24 financial analysts) |
| Scale | ~25,000 queries/day target (500 filings/day × 50 queries/filing) |
| Latency | P50: 2.4s, P95: 3.8s (full pipeline, A100) |
| Throughput | 8.4 QPS on single A100 (28× headroom over target) |
| Infrastructure | vLLM serving, Kubernetes orchestration, A100/A10G GPU |
| Cost | $0.003/query (15.7× lower than GPT-4o teacher) |
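
The scale and throughput figures can be related with a short back-of-the-envelope calculation. The sketch below assumes the daily load is spread uniformly over 24 hours, which the source does not state. Under that assumption the headroom comes out close to, but not exactly, the reported 28×; the difference may reflect rounding or a different load assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

filings_per_day = 500
queries_per_filing = 50
measured_qps = 8.4  # single A100, as reported in the deployment table

target_queries_per_day = filings_per_day * queries_per_filing

# Assumption: queries arrive uniformly across the day (no peak-hour factor).
target_qps = target_queries_per_day / SECONDS_PER_DAY
headroom = measured_qps / target_qps

print(f"target: {target_queries_per_day} queries/day = {target_qps:.3f} QPS")
print(f"headroom: {headroom:.1f}x")
```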

Production Metrics

| Metric | GPT-4o Teacher | FinGround (8B) | Improvement |
| --- | --- | --- | --- |
| Detection F1 | 95.0 | 91.4 | 96.2% retention |
| Per-claim Latency | 6.1s | 340ms (p95) | 18× faster |
| Pipeline Latency (p95) | 8.2s | 3.8s | 2.2× faster |
| Cost per Query | $0.047 | $0.003 | 15.7× cheaper |
| HalRate (FinQA) | | 3.6% | 78% reduction vs GPT-4o+CoT |

System Architecture

[User Query] → [Complexity Classifier] → [Hybrid Text+Table Retrieval]
                                                    ↓
[Generated Answer] → [Claim Decomposition] → [Claim-Evidence Alignment]
                          ↓                          ↓
              [6-Type Financial Taxonomy]    [Verdict Classifier]
                                                    ↓
                           ┌────────────────────────┼────────────────┐
                     [Supported]            [Contradicted]     [Unverifiable]
                        ↓ (pass-through)         ↓                   ↓
                                          [Targeted Span Regeneration]
                                                    ↓
                                          [Citation Attachment]
                                                    ↓
                                          [Grounded Answer]
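
The lower half of the diagram (verdict branching, targeted regeneration, citation attachment) can be sketched as a small control-flow function. This is a minimal illustration, not the repository's implementation: `VerifiedClaim`, `ground_answer`, and `toy_regenerate` are hypothetical names, and the claim texts and figures are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VerifiedClaim:
    text: str
    verdict: str  # "supported", "contradicted", or "unverifiable"
    citation: Optional[str] = None


def ground_answer(
    claims: List[VerifiedClaim],
    regenerate: Callable[[VerifiedClaim], VerifiedClaim],
) -> str:
    """Pass supported claims through; rewrite the rest, then attach citations."""
    grounded = []
    for claim in claims:
        if claim.verdict != "supported":
            claim = regenerate(claim)  # targeted span regeneration
        suffix = f" [{claim.citation}]" if claim.citation else ""
        grounded.append(claim.text + suffix)
    return " ".join(grounded)


def toy_regenerate(claim: VerifiedClaim) -> VerifiedClaim:
    # Stand-in for the regenerator model: returns a fixed, cited rewrite.
    return VerifiedClaim(
        text="Gross margin was 44.1%.",
        verdict="supported",
        citation="Table 3, Row: Gross Margin, Col: FY2023",
    )


claims = [
    VerifiedClaim("Revenue grew year over year.", "supported", "Item 7, para. 2"),
    VerifiedClaim("Gross margin was 42.3%.", "contradicted"),
]
print(ground_answer(claims, toy_regenerate))
```

Only the contradicted claim is rewritten; the supported claim keeps its original wording, which mirrors the pass-through branch in the diagram.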

Lessons Learned

  1. Claim-type routing is the bottleneck, not verification difficulty: Computational claims have the highest hallucination rate (28.4%) despite being most amenable to automated verification when properly typed. Generic NLI completely fails on ratio and margin verification even with correct evidence retrieved.

  2. Hedged financial language dominates false positives: Terms like "approximately" and "roughly" in risk disclosures account for 52% of false positives. Calibrated confidence thresholds specific to claim types and financial language patterns are essential for production deployment.

  3. Table-cell-level citations outperform paragraph-level references: Pilot analysts strongly preferred cell-level citations (e.g., "Table 3, Row: Operating Income, Col: FY2024"), enabling verification in seconds rather than minutes. Structure-aware chunking that preserves row-column relationships is critical infrastructure.

  4. Retrieval-equalized evaluation is essential for RAG research: Conflating retrieval improvements with verification improvements inflates perceived system value. Under controlled retrieval, atomic verification contributes 68–76% additional HalRate reduction independently.

  5. Distillation efficiency plateaus early: 1,600 examples yield 88.6% F1; 3,200 yield 91.4%; 6,400 yield only 92.1% (+0.7). Near-optimal training requires ~2,500–3,000 examples, making domain adaptation practical.
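
Lessons 1 and 2 suggest that a computational claim is better checked by recomputing the figure than by generic NLI, and that hedged wording deserves a looser threshold. The sketch below illustrates that idea for a gross-margin claim. The function names, the hedge word list, and both tolerance values are assumptions made for this example; the source does not specify the calibrated thresholds.

```python
import re

# Hypothetical tolerances in percentage points; real values would be calibrated.
EXACT_TOLERANCE = 0.05
HEDGED_TOLERANCE = 0.5
HEDGE_WORDS = ("approximately", "roughly", "about", "around")


def gross_margin_pct(revenue: float, cost_of_revenue: float) -> float:
    """Reconstruct gross margin (%) from two table cells."""
    return (revenue - cost_of_revenue) / revenue * 100.0


def check_margin_claim(claim: str, revenue: float, cost_of_revenue: float) -> str:
    """Compare the first percentage in the claim with the recomputed margin."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", claim)
    if match is None:
        return "unverifiable"  # no percentage to compare against
    claimed = float(match.group(1))
    hedged = any(word in claim.lower() for word in HEDGE_WORDS)
    tolerance = HEDGED_TOLERANCE if hedged else EXACT_TOLERANCE
    actual = gross_margin_pct(revenue, cost_of_revenue)
    return "supported" if abs(claimed - actual) <= tolerance else "contradicted"


# Invented cell values: revenue 1000, cost of revenue 577 (margin of about 42.3%).
print(check_margin_claim("Gross margin was 42.3%.", 1000.0, 577.0))
print(check_margin_claim("Gross margin was approximately 42%.", 1000.0, 577.0))
print(check_margin_claim("Gross margin was 42%.", 1000.0, 577.0))
```

With these toy thresholds the hedged "approximately 42%" is accepted while the unhedged "42%" is flagged, which is the kind of false-positive control lesson 2 describes.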

Reproducibility

Production data and internal deployment details cannot be released. The repository provides:

  • Inference and pipeline code
  • Evaluation scripts with benchmark reproduction
  • Latency benchmarking infrastructure
  • Serving code (FastAPI + vLLM)
  • Configuration files for all experiments
  • Prompts and templates

Not included in this release:

  • Model weights (available upon acceptance)
  • Production training data (proprietary)

Quick Start

Installation

# Clone repository
git clone https://anonymous.4open.science/r/FinGround/ finground
cd finground

# Create environment
python -m venv .venv && source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

Run Inference

from finground.pipeline import FinGroundPipeline

pipeline = FinGroundPipeline.from_pretrained("finground-8b")

result = pipeline.verify(
    query="What was Apple's gross margin in Q4 2023?",
    answer="Apple's gross margin was 42.3% in Q4 2023.",
    documents=["path/to/10K.pdf"],
)

print(result.grounded_answer)
print(result.claims)  # Per-claim verdicts with citations
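
The fields of the per-claim results are not documented in this README. As a purely hypothetical sketch of what a verdict with a table-cell citation could look like, the example below uses the citation format quoted in Lessons Learned; `CellCitation` and `ClaimVerdict` are invented names, not types exported by `finground`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellCitation:
    """Hypothetical table-cell citation; not the repository's actual type."""
    table: int
    row: str
    col: str

    def __str__(self) -> str:
        return f"Table {self.table}, Row: {self.row}, Col: {self.col}"


@dataclass(frozen=True)
class ClaimVerdict:
    """Hypothetical per-claim record pairing a verdict with its citation."""
    text: str
    verdict: str
    citation: CellCitation


record = ClaimVerdict(
    text="Operating income rose in FY2024.",
    verdict="supported",
    citation=CellCitation(table=3, row="Operating Income", col="FY2024"),
)
print(f"{record.verdict}: {record.text} [{record.citation}]")
```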

Run Evaluation

# Reproduce main results (Table 2 in paper)
python -m finground.evaluation.run_eval \
    --config configs/eval_finqa.yaml \
    --output results/

# Reproduce detection results (Table 1)
python -m finground.evaluation.run_eval \
    --config configs/eval_finhalu.yaml \
    --output results/

# Run latency benchmark
python -m finground.evaluation.latency_benchmark \
    --config configs/benchmark.yaml
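
For readers who want to sanity-check P50/P95 figures outside the benchmark script, the following is a minimal sketch using only the standard library. It is not the code in `latency_benchmark.py`; `measure_latencies` and `summarize` are hypothetical helpers, and the choice of inclusive interpolation is an assumption.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies(fn: Callable[[], object], runs: int = 50) -> List[float]:
    """Time `fn` repeatedly and return per-call latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Return P50 and P95 using inclusive percentile interpolation."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Passing a zero-argument callable that wraps a single pipeline call to `measure_latencies`, then feeding the result to `summarize`, gives end-to-end percentiles; warm-up runs should be discarded first.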

Start Serving

# Local development
uvicorn finground.serving.server:app --host 0.0.0.0 --port 8080

# Docker
docker build -f Dockerfile.serving -t finground-serving .
docker run --gpus all -p 8080:8080 finground-serving

# Kubernetes
kubectl apply -f kubernetes/

Project Structure

finground/
├── src/finground/
│   ├── __init__.py
│   ├── pipeline/              # End-to-end pipeline
│   │   ├── __init__.py
│   │   └── core.py            # FinGroundPipeline
│   ├── models/                # Model definitions
│   │   ├── __init__.py
│   │   ├── claim_decomposer.py
│   │   ├── verdict_classifier.py
│   │   ├── complexity_classifier.py
│   │   └── regenerator.py
│   ├── data/                  # Data loading and processing
│   │   ├── __init__.py
│   │   ├── datasets.py
│   │   └── taxonomy.py
│   ├── evaluation/            # Evaluation and benchmarking
│   │   ├── __init__.py
│   │   ├── metrics.py
│   │   ├── run_eval.py
│   │   └── latency_benchmark.py
│   ├── serving/               # Production serving
│   │   ├── __init__.py
│   │   └── server.py
│   └── utils/                 # Utilities
│       ├── __init__.py
│       └── reproducibility.py
├── configs/                   # Experiment configurations
│   ├── eval_finqa.yaml
│   ├── eval_tatqa.yaml
│   ├── eval_finhalu.yaml
│   ├── eval_financebench.yaml
│   ├── distillation.yaml
│   └── benchmark.yaml
├── tests/                     # Unit and integration tests
│   ├── test_pipeline.py
│   ├── test_taxonomy.py
│   └── test_metrics.py
├── docs/                      # Documentation
│   └── COMPUTE_RESOURCES.md
├── kubernetes/                # Deployment configs
│   └── deployment.yaml
├── paper/                     # Paper source
│   ├── finground_6page.tex
│   └── references.bib
├── Dockerfile.serving
├── pyproject.toml
├── LICENSE
└── README.md

Evaluation Results Summary

Hallucination Detection (FinHalu)

| System | Precision | Recall | F1 |
| --- | --- | --- | --- |
| SelfCheckGPT | 69.4±2.1 | 76.5±1.8 | 72.8±1.6 |
| HHEM | 78.9±1.8 | 73.8±2.0 | 76.3±1.5 |
| FActScore | 74.2±2.0 | 79.3±1.7 | 76.7±1.5 |
| Self-RAG | 81.2±1.7 | 77.1±1.9 | 79.1±1.4 |
| GPT-4o (teacher) | 94.1±0.9 | 95.9±0.7 | 95.0±0.6 |
| FinGround (8B) | 92.7±1.1 | 90.2±1.3 | 91.4±1.2 |
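
The F1 column is consistent with the harmonic mean of the reported precision and recall. The sketch below recomputes it for two rows. It assumes F1 is derived from the mean precision and recall; the reported values may instead be averaged over runs, so small differences would not be surprising.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall as reported for FinHalu.
reported = {
    "GPT-4o (teacher)": (94.1, 95.9),
    "FinGround (8B)": (92.7, 90.2),
}
for system, (p, r) in reported.items():
    print(f"{system}: F1 = {f1_score(p, r):.1f}")
```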

End-to-End Hallucination Rates

| System | FinQA | TAT-QA | FinanceBench |
| --- | --- | --- | --- |
| Vanilla RAG | 34.7% | 31.5% | 43.8% |
| GPT-4o + CoT | 18.6% | 15.2% | 22.4% |
| FinGround | 3.6% | 3.8% | 4.9% |
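
The relative reduction against GPT-4o + CoT can be derived directly from these rates. The sketch below computes it per benchmark and on average. That the average lands near the 78% quoted under Production Metrics is an observation from this arithmetic, not something the source states.

```python
hal_rates = {
    "FinQA": {"gpt4o_cot": 18.6, "finground": 3.6},
    "TAT-QA": {"gpt4o_cot": 15.2, "finground": 3.8},
    "FinanceBench": {"gpt4o_cot": 22.4, "finground": 4.9},
}


def relative_reduction(baseline: float, system: float) -> float:
    """Percentage drop in hallucination rate relative to the baseline."""
    return (baseline - system) / baseline * 100.0


reductions = {
    name: relative_reduction(rates["gpt4o_cot"], rates["finground"])
    for name, rates in hal_rates.items()
}
mean_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower HalRate than GPT-4o + CoT")
print(f"mean: {mean_reduction:.1f}%")
```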

Hardware Requirements

| Configuration | GPU | Memory | Use Case |
| --- | --- | --- | --- |
| Recommended | NVIDIA A100 (80GB) | 32GB RAM | Full pipeline, benchmarking |
| Production | NVIDIA A10G (24GB) | 16GB RAM | Serving (FP16) |
| Minimum | Any GPU with 24GB+ VRAM | 16GB RAM | Inference only |
| CPU-only | None | 32GB RAM | Evaluation scripts only |

Limitations

  • Evaluated on English-language U.S. SEC filings only; other languages and jurisdictions require validation
  • Evaluation depends on GPT-family models; F1 degrades by 3–4 points on non-GPT generators
  • 3.6-point F1 gap from teacher, particularly on nuanced computational claims
  • Pilot involved 24 analysts at a single firm with self-reported observations
  • Detection recall drops to 71.4% on hallucinated values within ±5% of ground truth

Responsible NLP Checklist

  • Limitations: Documented above and in paper Limitations section
  • Risks: Automation bias risk; outputs are not financial advice; liability for AI-verified claims unresolved
  • Compute: ~32 A100 GPU-hours for distillation training; ~8 GPU-hours per new claim type
  • Human Evaluation: 3 financial domain experts, IRB-exempt observational study
  • AI Assistants: GPT-4o used for teacher annotations with two-pass consistency checks; all outputs reviewed by authors
