Agentic Math Tool

Pair project (w/ Laura Bengs) as part of the Master's block course Infrastructure for Advanced Analytics and Machine Learning in SS2025 @ LMU Munich.

Abstract

Agents are an emergent field in AI, which is concerned with equipping large language models (LLMs) with further capabilities such as tool calling, thereby eliminating the need for users to orchestrate individual steps and extending the capabilities of LLMs beyond conventional chatbots.

In this project, we build a Math Agent using LangGraph that can solve mathematical problems with two tools:

a document extractor (Docling)
a symbolic calculator (NumExpr)

We hypothesize that augmenting LLMs with a symbolic calculator could improve accuracy for mathematical tasks, mitigating the inherent tendency of LLMs to hallucinate due to its autoregressive nature.

We evaluate four different models on the GSM8K dataset:

Open-weight models (via Ollama):
- Qwen 2.5-7B
- Qwen 2.5-14B
Closed-source API models:
- Claude 3.5 Sonnet (Anthropic)
- Claude 3.7 Sonnet (Anthropic)

Key Findings

Tool calling improves correctness across all models.
Smaller models with tools can match or outperform larger models without tools.
Tool calling increases latency and token usage.
Our implementation performs competitively compared to GSM8K benchmarks.

Methods & System Design

Frameworks & Ecosystem

LangChain – LLM app framework
LangGraph – graph-based agent orchestration
LangSmith – debugging & evaluation platform

Tools

Document Extractor: Docling
- PDF, DOCX, image, and OCR support
Math Calculator: NumExpr
- Fast numerical expression evaluation, memory-efficient

Evaluation

Dataset: GSM8K (subset of 100 problems)
Environment: Google Colab (NVIDIA T4, 15GB GPU, 12.7GB RAM)

Results (String Input w/ vs. w/o Tools)

Model	Tool Calling	Accuracy	P50 Latency	P99 Latency
Claude 3.5 Sonnet	✅	0.95	4.08s	11.18s
Claude 3.5 Sonnet	❌	0.60	0.52s	1.29s
Claude 3.7 Sonnet	✅	0.92	5.45s	14.41s
Claude 3.7 Sonnet	❌	0.68	0.73s	1.53s
Qwen 2.5-14B	✅	0.46	0.38s	5.34s
Qwen 2.5-14B	❌	0.32	0.31s	0.55s
Qwen-7B	✅	0.59	1.31s	6.60s
Qwen-7B	❌	0.24	0.16s	0.43s

Overall: Tool calling improves correctness significantly, but at the cost of latency, which could mainly be due to the increased token count when invoking tools. While there is a significant speed tradeoff to be made in search of accuracy, future work attempting to embed symbolic math engines into the native architecture of LLMs could potentially mitigate this tradeoff.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README.md		README.md
mathagent.ipynb		mathagent.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agentic Math Tool

Abstract

Key Findings

Methods & System Design

Frameworks & Ecosystem

Tools

Evaluation

Results (String Input w/ vs. w/o Tools)

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Agentic Math Tool

Abstract

Key Findings

Methods & System Design

Frameworks & Ecosystem

Tools

Evaluation

Results (String Input w/ vs. w/o Tools)

About

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages