Pair project (w/ Laura Bengs) as part of the Master's block course Infrastructure for Advanced Analytics and Machine Learning in SS2025 @ LMU Munich.
Agents are an emerging area of AI concerned with equipping large language models (LLMs) with additional capabilities such as tool calling. This removes the need for users to orchestrate individual steps and extends LLMs beyond conventional chatbots.
In this project, we build a Math Agent using LangGraph that can solve mathematical problems with two tools: a document extractor (Docling) and a math calculator (NumExpr).
We hypothesize that augmenting LLMs with a symbolic calculator improves accuracy on mathematical tasks by mitigating the tendency of LLMs to hallucinate, which stems from their autoregressive nature.
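To make the calculator idea concrete, here is a minimal sketch of an arithmetic tool that the model could call instead of computing numbers itself. In the project this role is filled by NumExpr; the sketch swaps in a small evaluator built on Python's standard `ast` module so that it runs without extra dependencies. The function name `calculate` and the set of allowed operators are assumptions made for illustration, not the project's actual code.

```python
import ast
import operator

# Whitelisted operators; any other syntax (calls, names, attributes) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without using eval().

    Hypothetical stand-in for the NumExpr-backed calculator tool.
    """

    def _eval(node: ast.AST):
        # Numeric literals only (bool is excluded even though it subclasses int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    tree = ast.parse(expression.strip(), mode="eval")
    return float(_eval(tree.body))


if __name__ == "__main__":
    # The model emits the expression; the tool returns the exact result.
    print(calculate("(16 - 3 - 4) * 2"))  # 18.0
```

The point of the design is that the model only has to produce the expression; the arithmetic itself is done deterministically, so digit-level mistakes in generated text cannot affect the result.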
We evaluate four different models on the GSM8K dataset:
- Open-weight models (via Ollama):
  - Qwen 2.5-7B
  - Qwen 2.5-14B
- Closed-source API models:
  - Claude 3.5 Sonnet (Anthropic)
  - Claude 3.7 Sonnet (Anthropic)
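- Tool calling improves accuracy for every model, by 0.14 to 0.35 on our 100-problem subset.
- Smaller models with tools can match or outperform larger models without tools: Qwen 2.5-7B with tools (0.59) beats Qwen 2.5-14B without tools (0.32) and comes close to Claude 3.5 Sonnet without tools (0.60).
- Tool calling increases latency and token usage.
- With tool calling, the Claude models reach 0.92 to 0.95 accuracy on our subset, which is competitive with published GSM8K benchmark results.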
- LangChain – LLM app framework
- LangGraph – graph-based agent orchestration
- LangSmith – debugging & evaluation platform
- Document Extractor: Docling
  - PDF, DOCX, image, and OCR support
- Math Calculator: NumExpr
  - Fast numerical expression evaluation, memory-efficient
- Dataset: GSM8K (subset of 100 problems)
- Environment: Google Colab (NVIDIA T4, 15GB GPU, 12.7GB RAM)
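For the evaluation on GSM8K, accuracy can be computed by comparing the final number in the model's reply with the reference answer. GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below relies on that format; how this project extracted the model's answer is not stated here, so taking the last number in the reply is an assumption, and the helper names (`parse_number`, `gold_answer`, `accuracy`) are hypothetical.

```python
import math
import re

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_number(text: str):
    """Return the last number found in text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gold_answer(reference: str):
    """Extract the final answer that follows '####' in a GSM8K reference solution."""
    return parse_number(reference.split("####")[-1])


def accuracy(predictions, references) -> float:
    """Fraction of predictions whose final number equals the reference answer."""
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = parse_number(prediction)  # assumption: last number is the answer
        expected = gold_answer(reference)
        if predicted is not None and expected is not None:
            correct += math.isclose(predicted, expected)
    return correct / len(references)


if __name__ == "__main__":
    predictions = [
        "She makes 9 * 2 = 18 dollars. The answer is $18.",
        "I think it is 7",
    ]
    references = [
        "16 - 3 - 4 = 9 eggs.\n9 * 2 = 18\n#### 18",
        "2 + 1 = 3\n#### 3",
    ]
    print(accuracy(predictions, references))  # 0.5
```

On a 100-problem subset each problem contributes exactly 0.01 to the score, so differences of one or two points between runs should be read with caution.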
| Model | Tool Calling | Accuracy | P50 Latency | P99 Latency |
|---|---|---|---|---|
| Claude 3.5 Sonnet | ✅ | 0.95 | 4.08s | 11.18s |
| Claude 3.5 Sonnet | ❌ | 0.60 | 0.52s | 1.29s |
| Claude 3.7 Sonnet | ✅ | 0.92 | 5.45s | 14.41s |
| Claude 3.7 Sonnet | ❌ | 0.68 | 0.73s | 1.53s |
| Qwen 2.5-14B | ✅ | 0.46 | 0.38s | 5.34s |
| Qwen 2.5-14B | ❌ | 0.32 | 0.31s | 0.55s |
| Qwen 2.5-7B | ✅ | 0.59 | 1.31s | 6.60s |
| Qwen 2.5-7B | ❌ | 0.24 | 0.16s | 0.43s |
Overall: Tool calling improves accuracy for every model (by 0.14 to 0.35), but at the cost of latency. For example, P50 latency rises from 0.52s to 4.08s for Claude 3.5 Sonnet and from 0.16s to 1.31s for Qwen 2.5-7B, most likely because invoking tools increases the token count. Future work that embeds symbolic math engines into the native architecture of LLMs could potentially mitigate this speed-accuracy tradeoff.
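The P50 and P99 latency figures can be reproduced from per-problem wall-clock timings. The sketch below uses the nearest-rank definition of a percentile; the exact method behind the reported numbers is not stated, so this definition and the helper names (`measure`, `percentile`) are assumptions.

```python
import math
import time


def measure(fn, inputs):
    """Call fn on every input and return the wall-clock latency of each call in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Placeholder workload; in practice fn would be one agent invocation per problem.
    latencies = measure(lambda n: sum(range(n)), [10_000] * 100)
    print(f"P50={percentile(latencies, 50):.6f}s  P99={percentile(latencies, 99):.6f}s")
```

With 100 problems, the nearest-rank P99 is simply the second-slowest run, so a single slow request can move it noticeably while leaving P50 almost unchanged.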