GAIA2 and MultiAgentBench Fixes #30

cemde · 2026-02-11T16:34:34Z

Description

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
Updated relevant documentation in docs/ (if applicable)
Tag GitHub issue with this PR (if applicable)

Changelog

Added entry to CHANGELOG.md under [Unreleased] section
- Use Added section for new features
- Use Changed section for modifications to existing functionality
- Use Fixed section for bug fixes
- Use Removed section for deprecated/removed features
OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies
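
The core/interface rule above could be checked mechanically rather than by eye. The following is a minimal sketch, assuming the package is importable as `maseval` and that only absolute imports need to be caught; the helper names `find_forbidden_imports` and `check_package` are hypothetical and not part of maseval.

```python
import ast
from pathlib import Path


def _is_forbidden(name: str, forbidden: str) -> bool:
    # Match the forbidden package itself or any of its submodules.
    return name == forbidden or name.startswith(forbidden + ".")


def find_forbidden_imports(source: str, forbidden: str = "maseval.interface") -> list[str]:
    """Return the sorted forbidden module names imported by `source`.

    Only absolute imports are inspected; relative imports (level > 0)
    are skipped in this sketch.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. `import maseval.interface.foo`
            hits.extend(a.name for a in node.names if _is_forbidden(a.name, forbidden))
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # e.g. `from maseval.interface import bar` or `from maseval import interface`
            for alias in node.names:
                full = f"{node.module}.{alias.name}"
                if _is_forbidden(full, forbidden):
                    hits.append(full)
    return sorted(hits)


def check_package(root: str, forbidden: str = "maseval.interface") -> dict[str, list[str]]:
    """Map each offending file under `root` to the forbidden modules it imports."""
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = find_forbidden_imports(path.read_text(encoding="utf-8"), forbidden)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `check_package("maseval/core")` from the repository root would return an empty dict when the rule holds, which makes it easy to wrap in a test or a pre-commit hook.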

Additional Notes

github-actions · 2026-02-13T00:50:05Z

Coverage report


File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
maseval/benchmark/gaia2
__init__.py
data_loader.py					150-153, 165-203, 233-266
environment.py					85-86, 198-199, 228-230, 238, 249-253, 264-268
evaluator.py					162, 200, 213-216, 276, 295
gaia2.py					327-329, 347-352, 431, 448-450, 478, 482-483, 572, 582-583, 637, 760-800, 814-821, 828-829, 839-840, 866-867, 876, 885-889, 899-911, 920, 950-954, 969, 984, 1027
tool_wrapper.py					89-93, 117
maseval/benchmark/multiagentbench
_constants.py
data_loader.py					140-147, 333-335
environment.py					82, 109, 149-166
evaluator.py					250, 644
multiagentbench.py					485-486, 500-501
maseval/benchmark/multiagentbench/adapters
marble_adapter.py
maseval/benchmark/tau2
environment.py					157-159, 192-203, 225-246, 268-331, 342, 356-358
evaluator.py
tau2.py					544, 548, 550, 554, 557-560, 806-808, 816, 825-827, 830
maseval/benchmark/tau2/domains/telecom
db.py
user_models.py
user_tools.py					74, 77-78, 84, 165, 169, 176-183, 278, 307-308, 329, 708, 739-740, 831-832, 843-844, 919-920
Project Total

This report was generated by python-coverage-comment-action

cemde added 30 commits February 5, 2026 23:17

initial fix of werewolf

88aecde

added fixes for multiagentbench

1f2a0f1

added dependency overwrite for ARE due to pinned dependencies

9b6d62c

fixed gaia2 implementation

693c8db

updated pinning in multiagentbench

853d4b5

added optional dependency for multiagentbench

a14cda1

updated optional dependencies

9e58ba6

fixed multiagentbench tests

48bcac9

simplified installs with extras for all benchmarks

7ed2e83

[skip ci] updated lockfile

3d4c9d6

[skip ci] fixed dependency

ba4ed05

attempt at fix of dependency issues

aed1f57

another attempt to fix dependencies

5a3acf9

changed vendoring of multiagentbench to my own fork

9e73d96

fix bug in GAIA2

565db42

fixed multiagentbench fallbacks

927aeb5

fixed tools for gaia2

0af0f8c

fixed gaia2 bugs

155402c

more bug fixes

919eb03

updated agents file

3064626

condensed default instructions

678df7e

default example

1f946fa

updated agents file

0edb8c9

fixes file added.

27d17e3

added fixes to gaia2 that increase faithfulness to original implementation

ab8093a

added bug report for multiagentbench

e9eef21

updated gaia eval

b98bbec

fixed gaia2 evaluator

d97fdec

fixed bug in evaluators

98e3e6b

Merge remote-tracking branch 'origin/main' into fix-benchmarks

961e2a5

# Conflicts: # CHANGELOG.md # tests/test_benchmarks/test_multiagentbench/test_environment.py

cemde added 17 commits February 11, 2026 15:02

fixed benchmarks

84b5da8

added testing plan

95afa0c

fixed tests

86e94d7

fixed tests

6d0ef86

removed testing cache

4a3c153

fixed missing initialization bug for tau2 bench

a26b0a4

fixed typing errors

331badd

fix MultiAgentBench bug where results are not truncated properly, causing LLM context-length overflow

040d481

fixed tools in tau2

811a924

gaia2 typing fix

211c087

fixed data loading bug in gaia2

c092da1

fixed bug in tau2 implementation

fc66d36

fixed gaia2 docstring

de674ce

fixed testing of marble

b6af88f

fixed testing errors

d0e2b6c

small fix for tau2 implementation now printing tool results better into agent messages

21c7dc1

fixed tests for marble and pytest misconfiguration

0356b96

fixed Gaia2 evaluator config

27e1ecf
