ngoax/agent
Agentic Math Tool

Pair project (w/ Laura Bengs) as part of the Master's block course Infrastructure for Advanced Analytics and Machine Learning in SS2025 @ LMU Munich.

Abstract

Agents are an emerging field in AI concerned with equipping large language models (LLMs) with additional capabilities such as tool calling, extending LLMs beyond conventional chatbots and eliminating the need for users to orchestrate individual steps themselves.

In this project, we build a Math Agent using LangGraph that solves mathematical problems with the help of two tools: a document extractor and a numerical calculator (see Tools below).

We hypothesize that augmenting LLMs with a symbolic calculator could improve accuracy on mathematical tasks, mitigating the inherent tendency of LLMs to hallucinate due to their autoregressive nature.
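Conceptually, the agent alternates between the model and tool execution until the model stops requesting tools. The sketch below illustrates that loop with plain Python; it is not the project's LangGraph code, and `fake_model`, `run_agent`, and the message format are illustrative assumptions.

```python
# Minimal sketch of a tool-calling agent loop (illustrative only;
# the actual project uses LangGraph -- names below are made up).

def calculator(expression: str) -> str:
    """Stand-in math tool; the project uses NumExpr instead."""
    # Demo only: eval() with empty builtins, not safe for untrusted input.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def fake_model(messages):
    """Stub LLM: requests the calculator once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "calculator",
                              "args": {"expression": "18 * 7 + 4"}}}
    last = next(m for m in reversed(messages) if m["role"] == "tool")
    return {"content": f"The answer is {last['content']}."}

def run_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        reply = fake_model(messages)
        if "tool_call" in reply:              # model wants a tool
            call = reply["tool_call"]
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": result})
        else:                                 # final answer reached
            return reply["content"]

print(run_agent("What is 18 * 7 + 4?"))
```

In the real system the stub model is replaced by an LLM whose structured output decides between tool calls and a final answer.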

We evaluate four different models on the GSM8K dataset:

  • Open-weight models (via Ollama):
    • Qwen 2.5-7B
    • Qwen 2.5-14B
  • Closed-source API models:
    • Claude 3.5 Sonnet (Anthropic)
    • Claude 3.7 Sonnet (Anthropic)

Key Findings

  1. Tool calling improves correctness across all models.
  2. Smaller models with tools can match or outperform larger models without tools.
  3. Tool calling increases latency and token usage.
  4. Our implementation performs competitively compared to GSM8K benchmarks.

Methods & System Design

Frameworks & Ecosystem

Tools

  • Document Extractor: Docling
    • PDF, DOCX, image, and OCR support
  • Math Calculator: NumExpr
    • Fast numerical expression evaluation, memory-efficient
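The calculator tool only needs to evaluate plain arithmetic expressions. A rough stdlib approximation of that role is sketched below; NumExpr itself is faster and array-aware, and this `safe_eval` helper is our own illustration, not part of the project.

```python
import ast
import operator

# Whitelisted arithmetic operators (a stand-in for NumExpr's expression
# evaluation; NumExpr additionally compiles and vectorizes expressions).
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate an arithmetic expression without exposing eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)

print(safe_eval("(3 + 4) * 2 - 10 / 5"))  # -> 12.0
```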

Evaluation

  • Dataset: GSM8K (subset of 100 problems)
  • Environment: Google Colab (NVIDIA T4, 15GB GPU, 12.7GB RAM)

Results (String Input w/ vs. w/o Tools)

| Model             | Tool Calling | Accuracy | P50 Latency | P99 Latency |
|-------------------|--------------|----------|-------------|-------------|
| Claude 3.5 Sonnet | Yes          | 0.95     | 4.08s       | 11.18s      |
| Claude 3.5 Sonnet | No           | 0.60     | 0.52s       | 1.29s       |
| Claude 3.7 Sonnet | Yes          | 0.92     | 5.45s       | 14.41s      |
| Claude 3.7 Sonnet | No           | 0.68     | 0.73s       | 1.53s       |
| Qwen 2.5-14B      | Yes          | 0.46     | 0.38s       | 5.34s       |
| Qwen 2.5-14B      | No           | 0.32     | 0.31s       | 0.55s       |
| Qwen 2.5-7B       | Yes          | 0.59     | 1.31s       | 6.60s       |
| Qwen 2.5-7B       | No           | 0.24     | 0.16s       | 0.43s       |
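Percentile latencies of the kind reported above can be computed from raw per-request timings with the standard library alone. The snippet below is a generic sketch with made-up numbers, not the project's measurements.

```python
import statistics

def percentile(samples: list[float], p: int) -> float:
    """p-th percentile via statistics.quantiles (inclusive method)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]

# Made-up latencies in seconds, purely for illustration.
latencies = [0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.1, 1.4, 2.0, 5.2]
print(f"P50 = {percentile(latencies, 50):.2f}s")
print(f"P99 = {percentile(latencies, 99):.2f}s")
```

P50 is robust to outliers, while P99 surfaces the long tail that tool-calling runs tend to produce.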

Overall: Tool calling significantly improves correctness, but at the cost of latency, likely driven by the increased token count incurred when invoking tools. While accuracy currently comes with a significant speed tradeoff, future work embedding symbolic math engines directly into the native architecture of LLMs could mitigate it.

About

Math Agent enhancing LLMs with symbolic tools to increase robustness and accuracy.
