Anonymous Repository for Double-Blind Review
FinGround is a three-stage verify-then-ground pipeline for financial document question answering that detects and mitigates LLM hallucinations through:
- Finance-Aware Hybrid Retrieval — Complexity-routed retrieval over text and tables with structure-aware chunking
- Atomic Financial Claim Verification — Decomposition into six claim types (numerical, temporal, entity-attribute, comparative, regulatory, computational) with type-routed verification including formula reconstruction
- Grounded Regeneration — Targeted rewriting of unsupported claims with paragraph- and table-cell-level citations
An efficient 8B distilled detector retains 91.4% F1 at 18× lower per-claim latency, enabling $0.003/query deployment for production financial services environments.
| Aspect | Details |
|---|---|
| Status | Pilot (4-week feasibility study with 24 financial analysts) |
| Scale | ~25,000 queries/day target (500 filings/day × 50 queries/filing) |
| Latency | P50: 2.4s, P95: 3.8s (full pipeline, A100) |
| Throughput | 8.4 QPS on single A100 (28× headroom over target) |
| Infrastructure | vLLM serving, Kubernetes orchestration, A100/A10G GPU |
| Cost | $0.003/query (15.7× lower than GPT-4o teacher) |
Distilled FinGround detector compared with the GPT-4o teacher:

| Metric | GPT-4o Teacher | FinGround (8B) | Improvement |
|---|---|---|---|
| Detection F1 | 95.0 | 91.4 | 96.2% retention |
| Per-claim Latency | 6.1s | 340ms (p95) | 18× faster |
| Pipeline Latency (p95) | 8.2s | 3.8s | 2.2× faster |
| Cost per Query | $0.047 | $0.003 | 15.7× cheaper |
| HalRate (FinQA) | — | 3.6% | 78% reduction vs GPT-4o+CoT |
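
The retention, speedup, and cost figures in the Improvement column are plain ratios of the two value columns. A minimal sketch that recomputes them from the numbers in the table; the dictionary keys are illustrative names, not identifiers from the codebase:

```python
# Values copied from the comparison table above.
# Note: the student per-claim latency is reported at p95.
teacher = {"f1": 95.0, "claim_latency_s": 6.1, "pipeline_p95_s": 8.2, "cost_usd": 0.047}
student = {"f1": 91.4, "claim_latency_s": 0.340, "pipeline_p95_s": 3.8, "cost_usd": 0.003}

f1_retention_pct = 100 * student["f1"] / teacher["f1"]
claim_speedup = teacher["claim_latency_s"] / student["claim_latency_s"]
pipeline_speedup = teacher["pipeline_p95_s"] / student["pipeline_p95_s"]
cost_ratio = teacher["cost_usd"] / student["cost_usd"]

print(f"F1 retention:      {f1_retention_pct:.1f}%")
print(f"Per-claim speedup: {claim_speedup:.0f}x")
print(f"Pipeline speedup:  {pipeline_speedup:.1f}x")
print(f"Cost ratio:        {cost_ratio:.1f}x")
```

The HalRate row is left out because its baseline (GPT-4o + CoT) is not one of the two columns in this table.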
[User Query] → [Complexity Classifier] → [Hybrid Text+Table Retrieval]
                                                       ↓
[Generated Answer] → [Claim Decomposition] → [Claim-Evidence Alignment]
                               ↓                         ↓
                  [6-Type Financial Taxonomy]  [Verdict Classifier]
                                                         ↓
                                ┌────────────────────────┼────────────────┐
                           [Supported]            [Contradicted]   [Unverifiable]
                                ↓ (pass-through)         ↓                ↓
                                                    [Targeted Span Regeneration]
                                                                 ↓
                                                       [Citation Attachment]
                                                                 ↓
                                                         [Grounded Answer]
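
The verdict routing at the bottom of the diagram can be sketched in a few lines. This is an illustrative outline only: `Claim`, `Verdict`, `ground_answer`, and the stub callables are hypothetical names, not the classes in `src/finground/`, and it assumes that supported claims also pass through citation attachment.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, List, Optional


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNVERIFIABLE = "unverifiable"


@dataclass(frozen=True)
class Claim:
    text: str
    verdict: Verdict
    citation: Optional[str] = None  # paragraph or table-cell reference


def ground_answer(
    claims: List[Claim],
    regenerate: Callable[[Claim], Claim],
    cite: Callable[[Claim], str],
) -> List[Claim]:
    """Pass supported claims through, regenerate the rest, then attach citations."""
    grounded = []
    for claim in claims:
        if claim.verdict is not Verdict.SUPPORTED:
            claim = regenerate(claim)  # targeted span regeneration
        grounded.append(replace(claim, citation=cite(claim)))  # citation attachment
    return grounded


# Stubs standing in for the real regenerator and citation lookup (assumptions).
def stub_regenerate(claim: Claim) -> Claim:
    return Claim(text="[rewritten] " + claim.text, verdict=Verdict.SUPPORTED)


def stub_cite(claim: Claim) -> str:
    return "Table 3, Row: Operating Income, Col: FY2024"


claims = [
    Claim("Revenue grew 8%.", Verdict.SUPPORTED),
    Claim("Operating margin was 50%.", Verdict.CONTRADICTED),
]
result = ground_answer(claims, stub_regenerate, stub_cite)
for c in result:
    print(c.text, "|", c.citation)
```

Only the unsupported claim is rewritten; the supported one is returned unchanged apart from its citation, which mirrors the pass-through branch of the diagram.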
- Claim-type routing is the bottleneck, not verification difficulty: Computational claims have the highest hallucination rate (28.4%) despite being most amenable to automated verification when properly typed. Generic NLI completely fails on ratio and margin verification even with correct evidence retrieved.
- Hedged financial language dominates false positives: Terms like "approximately" and "roughly" in risk disclosures account for 52% of false positives. Calibrated confidence thresholds specific to claim types and financial language patterns are essential for production deployment.
- Table-cell-level citations outperform paragraph-level references: Pilot analysts strongly preferred cell-level citations (e.g., "Table 3, Row: Operating Income, Col: FY2024"), enabling verification in seconds rather than minutes. Structure-aware chunking that preserves row-column relationships is critical infrastructure.
- Retrieval-equalized evaluation is essential for RAG research: Conflating retrieval improvements with verification improvements inflates perceived system value. Under controlled retrieval, atomic verification contributes 68–76% additional HalRate reduction independently.
- Distillation efficiency plateaus early: 1,600 examples yield 88.6% F1; 3,200 yield 91.4%; 6,400 yield only 92.1% (+0.7). Near-optimal training requires ~2,500–3,000 examples, making domain adaptation practical.
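
To make the first and third findings concrete, here is a minimal sketch of how a margin claim could be checked by formula reconstruction over table cells that keep their row and column labels. It is not the repository's verifier: `Cell`, `verify_gross_margin`, the 0.1-point tolerance, and the figures are all assumptions made for illustration. The only fixed ingredient is the standard definition gross margin = (revenue − cost of revenue) / revenue.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Cell:
    """One table cell stored together with its row and column labels."""
    table: str
    row: str
    col: str
    value: float

    def citation(self) -> str:
        # Same shape as the cell-level citation example in the findings above.
        return f"{self.table}, Row: {self.row}, Col: {self.col}"


def verify_gross_margin(
    claimed_pct: float,
    revenue: Cell,
    cogs: Cell,
    tol_pct_points: float = 0.1,  # hypothetical tolerance
) -> Tuple[bool, float, List[str]]:
    """Recompute gross margin from the cited cells and compare with the claim."""
    recomputed = 100.0 * (revenue.value - cogs.value) / revenue.value
    supported = abs(recomputed - claimed_pct) <= tol_pct_points
    return supported, recomputed, [revenue.citation(), cogs.citation()]


# Made-up figures, for illustration only.
revenue = Cell("Table 3", "Revenue", "FY2024", 1000.0)
cogs = Cell("Table 3", "Cost of Revenue", "FY2024", 580.0)

ok, margin, cites = verify_gross_margin(42.0, revenue, cogs)
bad, _, _ = verify_gross_margin(45.0, revenue, cogs)
print(ok, margin, cites)
print(bad)
```

A text-entailment check sees only that the evidence mentions revenue and cost figures; recomputing the ratio is what separates the 42.0% claim from the 45.0% one, and the cell labels give the analyst the exact cells to inspect.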
While production data and internal deployment details cannot be released, we provide:
- Inference and pipeline code
- Evaluation scripts with benchmark reproduction
- Latency benchmarking infrastructure
- Serving code (FastAPI + vLLM)
- Configuration files for all experiments
- Prompts and templates
- Model weights (not yet included; available upon acceptance)
- Production training data (not released; proprietary)
# Clone repository
git clone https://anonymous.4open.science/r/FinGround/
cd finground
# Create environment
python -m venv .venv && source .venv/bin/activate
# Install dependencies
pip install -e ".[dev]"

from finground.pipeline import FinGroundPipeline
pipeline = FinGroundPipeline.from_pretrained("finground-8b")
result = pipeline.verify(
    query="What was Apple's gross margin in Q4 2023?",
    answer="Apple's gross margin was 42.3% in Q4 2023.",
    documents=["path/to/10K.pdf"],
)
print(result.grounded_answer)
print(result.claims)  # Per-claim verdicts with citations

# Reproduce main results (Table 2 in paper)
python -m finground.evaluation.run_eval \
--config configs/eval_finqa.yaml \
--output results/
# Reproduce detection results (Table 1)
python -m finground.evaluation.run_eval \
--config configs/eval_finhalu.yaml \
--output results/
# Run latency benchmark
python -m finground.evaluation.latency_benchmark \
--config configs/benchmark.yaml

# Local development
uvicorn finground.serving.server:app --host 0.0.0.0 --port 8080
# Docker
docker build -f Dockerfile.serving -t finground-serving .
docker run --gpus all -p 8080:8080 finground-serving
# Kubernetes
kubectl apply -f kubernetes/

finground/
├── src/finground/
│   ├── __init__.py
│   ├── pipeline/                 # End-to-end pipeline
│   │   ├── __init__.py
│   │   └── core.py               # FinGroundPipeline
│   ├── models/                   # Model definitions
│   │   ├── __init__.py
│   │   ├── claim_decomposer.py
│   │   ├── verdict_classifier.py
│   │   ├── complexity_classifier.py
│   │   └── regenerator.py
│   ├── data/                     # Data loading and processing
│   │   ├── __init__.py
│   │   ├── datasets.py
│   │   └── taxonomy.py
│   ├── evaluation/               # Evaluation and benchmarking
│   │   ├── __init__.py
│   │   ├── metrics.py
│   │   ├── run_eval.py
│   │   └── latency_benchmark.py
│   ├── serving/                  # Production serving
│   │   ├── __init__.py
│   │   └── server.py
│   └── utils/                    # Utilities
│       ├── __init__.py
│       └── reproducibility.py
├── configs/                      # Experiment configurations
│   ├── eval_finqa.yaml
│   ├── eval_tatqa.yaml
│   ├── eval_finhalu.yaml
│   ├── eval_financebench.yaml
│   ├── distillation.yaml
│   └── benchmark.yaml
├── tests/                        # Unit and integration tests
│   ├── test_pipeline.py
│   ├── test_taxonomy.py
│   └── test_metrics.py
├── docs/                         # Documentation
│   └── COMPUTE_RESOURCES.md
├── kubernetes/                   # Deployment configs
│   └── deployment.yaml
├── paper/                        # Paper source
│   ├── finground_6page.tex
│   └── references.bib
├── Dockerfile.serving
├── pyproject.toml
├── LICENSE
└── README.md

Hallucination detection results (Table 1 in the paper):

| System | Precision | Recall | F1 |
|---|---|---|---|
| SelfCheckGPT | 69.4±2.1 | 76.5±1.8 | 72.8±1.6 |
| HHEM | 78.9±1.8 | 73.8±2.0 | 76.3±1.5 |
| FActScore | 74.2±2.0 | 79.3±1.7 | 76.7±1.5 |
| Self-RAG | 81.2±1.7 | 77.1±1.9 | 79.1±1.4 |
| GPT-4o (teacher) | 94.1±0.9 | 95.9±0.7 | 95.0±0.6 |
| FinGround (8B) | 92.7±1.1 | 90.2±1.3 | 91.4±1.2 |
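
As a quick consistency check, F1 is the harmonic mean of precision and recall, so each F1 value can be approximately recomputed from the other two columns. The sketch below uses the central values from the table and ignores the ± terms, so agreement is expected only to about the rounding precision:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), central values from the table above.
rows = {
    "SelfCheckGPT": (69.4, 76.5, 72.8),
    "HHEM": (78.9, 73.8, 76.3),
    "FActScore": (74.2, 79.3, 76.7),
    "Self-RAG": (81.2, 77.1, 79.1),
    "GPT-4o (teacher)": (94.1, 95.9, 95.0),
    "FinGround (8B)": (92.7, 90.2, 91.4),
}

for name, (p, r, reported) in rows.items():
    print(f"{name:18s} recomputed F1 = {f1(p, r):.1f} (reported {reported})")
```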

Hallucination rate (HalRate) by benchmark:

| System | FinQA | TAT-QA | FinanceBench |
|---|---|---|---|
| Vanilla RAG | 34.7% | 31.5% | 43.8% |
| GPT-4o + CoT | 18.6% | 15.2% | 22.4% |
| FinGround | 3.6% | 3.8% | 4.9% |

Hardware configurations:

| Configuration | GPU | Memory | Use Case |
|---|---|---|---|
| Recommended | NVIDIA A100 (80GB) | 32GB RAM | Full pipeline, benchmarking |
| Production | NVIDIA A10G (24GB) | 16GB RAM | Serving (FP16) |
| Minimum | Any GPU with 24GB+ VRAM | 16GB RAM | Inference only |
| CPU-only | None | 32GB RAM | Evaluation scripts only |
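
The 24GB VRAM floor is consistent with a back-of-the-envelope estimate of FP16 weight size. The sketch below assumes "8B" means exactly 8 billion parameters and counts weights only; KV cache, activations, and runtime overhead must fit in the remaining headroom and are not estimated here.

```python
params = 8e9            # assumption: "8B" taken as exactly 8 billion parameters
bytes_per_param = 2     # FP16 stores each parameter in 2 bytes
vram_gb = 24            # A10G / minimum configuration from the table above

weights_gb = params * bytes_per_param / 1e9   # decimal gigabytes
headroom_gb = vram_gb - weights_gb

print(f"FP16 weights:   {weights_gb:.0f} GB")
print(f"Left for cache: {headroom_gb:.0f} GB")
```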
- Evaluated on English-language U.S. SEC filings only; other languages and jurisdictions require validation
- Evaluation depends largely on GPT-generated answers; detection F1 degrades by 3–4 points on answers from non-GPT generators
- 3.6-point F1 gap from teacher, particularly on nuanced computational claims
- Pilot involved 24 analysts at a single firm with self-reported observations
- Detection recall drops to 71.4% on hallucinated values within ±5% of ground truth
- Limitations: Documented above and in the paper's Limitations section
- Risks: Automation bias risk; outputs are not financial advice; liability for AI-verified claims unresolved
- Compute: ~32 A100 GPU-hours for distillation training; ~8 GPU-hours per new claim type
- Human Evaluation: 3 financial domain experts, IRB-exempt observational study
- AI Assistants: GPT-4o used for teacher annotations with two-pass consistency checks; all outputs reviewed by authors