Draft
Changes from all commits (53 commits)
88aecde
initial fix of werewolf
cemde Feb 5, 2026
1f2a0f1
added fixes for multiagentbench
cemde Feb 5, 2026
9b6d62c
added dependency override for ARE due to pinned dependencies
cemde Feb 5, 2026
693c8db
fixed gaia2 implementation
cemde Feb 5, 2026
853d4b5
updated pinning in multiagentbench
cemde Feb 6, 2026
a14cda1
added optional dependency for multiagentbench
cemde Feb 6, 2026
9e58ba6
updated optional dependencies
cemde Feb 6, 2026
48bcac9
fixed multiagentbench tests
cemde Feb 6, 2026
7ed2e83
simplified installs with extras for all benchmarks
cemde Feb 6, 2026
3d4c9d6
[skip ci] updated lockfile
cemde Feb 6, 2026
ba4ed05
[skip ci] fixed dependency
cemde Feb 6, 2026
aed1f57
attempt at fix of dependency issues
cemde Feb 6, 2026
5a3acf9
another attempt to fix dependencies
cemde Feb 6, 2026
9e73d96
changed vendoring of multiagentbench to my own fork
cemde Feb 6, 2026
565db42
fix bug in GAIA2
cemde Feb 6, 2026
927aeb5
fixed multiagentbench fallbacks
cemde Feb 6, 2026
0af0f8c
fixed tools for gaia2
cemde Feb 9, 2026
155402c
fixed gaia2 bugs
cemde Feb 9, 2026
919eb03
more bug fixes
cemde Feb 9, 2026
3064626
updated agents file
cemde Feb 10, 2026
678df7e
condensed default instructions
cemde Feb 10, 2026
1f946fa
default example
cemde Feb 10, 2026
0edb8c9
updated agents file
cemde Feb 10, 2026
27d17e3
fixes file added.
cemde Feb 10, 2026
ab8093a
added fixes to gaia2 that increase faithfulness to original implement…
cemde Feb 10, 2026
e9eef21
added bug report for multiagentbench
cemde Feb 10, 2026
b98bbec
updated gaia eval
cemde Feb 10, 2026
d97fdec
fixed gaia2 evaluator
cemde Feb 10, 2026
98e3e6b
fixed bug in evaluators
cemde Feb 11, 2026
961e2a5
Merge remote-tracking branch 'origin/main' into fix-benchmarks
cemde Feb 11, 2026
84b5da8
fixed benchmarks
cemde Feb 11, 2026
95afa0c
added testing plan
cemde Feb 11, 2026
86e94d7
fixed tests
cemde Feb 11, 2026
6d0ef86
fixed tests
cemde Feb 11, 2026
4a3c153
removed testing cache
cemde Feb 11, 2026
a26b0a4
fixed missing initialization bug for tau2 bench
cemde Feb 12, 2026
331badd
fixed typing errors
cemde Feb 12, 2026
040d481
fix MultiAgentBench bug where results are not truncated properly caus…
cemde Feb 12, 2026
811a924
fixed tools in tau2
cemde Feb 12, 2026
211c087
gaia2 typing fix
cemde Feb 12, 2026
c092da1
fixed data loading bug in gaia2
cemde Feb 12, 2026
fc66d36
fixed bug in tau2 implementation
cemde Feb 12, 2026
de674ce
fixed gaia2 docstring
cemde Feb 12, 2026
b6af88f
fixed testing of marble
cemde Feb 12, 2026
d0e2b6c
fixed testing errors
cemde Feb 12, 2026
21c7dc1
small fix for tau2 implementation now printing tool results better in…
cemde Feb 12, 2026
0356b96
fixed tests for marble and pytest misconfiguration
cemde Feb 13, 2026
27e1ecf
fixed Gaia2 evaluator config
cemde Feb 13, 2026
7880183
proposed fix for gaia2 notification management
cemde Feb 13, 2026
6b1c743
bug fix for gaia2 notification pulling
cemde Feb 13, 2026
1f61c25
[skip ci] fixed comment formatting
cemde Feb 13, 2026
5ee46aa
fixed stop token usage
cemde Feb 13, 2026
3991732
[skip ci] fixed gaia2 docs
cemde Feb 13, 2026
12 changes: 2 additions & 10 deletions .github/workflows/test.yml
@@ -45,10 +45,10 @@ jobs:
- name: Install dependencies
run: |
pip install uv
uv sync --group dev
uv sync --all-extras --group dev
- name: Run benchmark tests
run: |
uv run pytest -m benchmark -v
uv run pytest -m "benchmark and not (slow or live)" -v

test-all:
name: All Tests (With Optional Deps)
@@ -86,14 +86,6 @@
run: |
pip install uv
uv sync --all-extras --group dev
- name: Cache benchmark data
uses: actions/cache@v4
with:
path: |
maseval/benchmark/tau2/data/
maseval/benchmark/macs/data/
maseval/benchmark/macs/prompt_templates/
key: benchmark-data-${{ hashFiles('maseval/benchmark/tau2/data_loader.py', 'maseval/benchmark/macs/data_loader.py') }}
- name: Run slow tests
run: |
uv run pytest -m "slow and not credentialed" -v
47 changes: 47 additions & 0 deletions AGENTS.md
@@ -494,3 +494,50 @@ MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade
- Focus on getting it right, not keeping it the same

We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.

## Scientific Integrity

MASEval is a scientific library. Scientific integrity is paramount. **Never introduce defaults that could silently alter benchmark behavior or experimental outcomes.**

### The Boundary

**Guiding principle:** If a researcher would need to report a parameter in a paper's "Experimental Setup" section, **do not invent a default for it.**

**Acceptable (infrastructure/convenience):** `TaskQueue(limit=None)`, `Logger(verbose=False)`, `num_workers=1`, `print_results(color=True)` — these don't affect scientific results.

**Unacceptable (experimental parameters):** Temperature, seed, model version, prompt format, simulation duration, agent limits, dataset splits, scoring functions — these alter what's being measured.
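
As a hypothetical illustration (the names below are invented for this example, not MASEval APIs), the same signature can mix both kinds of parameters:

```python
# Hypothetical sketch of the boundary; all names here are illustrative.
def run_benchmark(
    tasks: list,
    model: str,             # experimental: no default; belongs in a paper's setup section
    temperature: float,     # experimental: no default
    num_workers: int = 1,   # infrastructure: safe default, does not change results
    verbose: bool = False,  # convenience: safe default
):
    ...
```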

### Reproducing Benchmarks

When integrating external benchmarks, match the source implementation exactly. Never invent fallback values.

```python
# BAD: Invented defaults
config = EnvironmentConfig(
    duration=getattr(scenario, "duration", 86400),  # Made-up fallback!
)
start_time = getattr(scenario, "start_time", None)  # Hides missing attributes

# GOOD: Pass through directly, let errors surface
config = EnvironmentConfig(
    duration=scenario.duration,  # Trust the source
)
start_time = scenario.start_time  # AttributeError if missing

# GOOD: Copy source defaults with documentation
# Default value copied from original_library/evaluator.py:L45
EVAL_TEMPERATURE = 0.7

class Evaluator:
    def run(self, temperature: Optional[float] = None):
        if temperature is None:
            temperature = EVAL_TEMPERATURE  # From source:L45

# Also good:
class Evaluator:
    # default temperature from source:L45
    def run(self, temperature: Optional[float] = 0.7):
        ...
```

**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
5 changes: 4 additions & 1 deletion BENCHMARKS.md
@@ -33,11 +33,14 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co

### Source and License

- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (upstream source of the original work)
- **Fork Used:** [https://github.com/cemde/MARBLE](https://github.com/cemde/MARBLE) (contains bug fixes for MASEval integration)
- **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
- **Code License:** MIT
- **Data License:** MIT

> **Note**: MASEval uses a fork with bug fixes. All credit for the original work goes to the MARBLE team (Haofei Yu et al.).

---

## 4. GAIA2
25 changes: 21 additions & 4 deletions CHANGELOG.md
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Fixed

- Fixed GAIA2 default agent failing on reasoning models (o1, o3, GPT-5) that reject `stop` and `temperature` parameters. Client-side stop-token truncation (matching ARE's reference implementation; see the sketch after this list) is now always applied, and `llm_args` values set to `None` are omitted from API calls (PR: #PR_NUMBER_PLACEHOLDER)
- Fixed GAIA2 multi-turn notification loop: `wait_for_notification()` no longer terminates the agent prematurely, enabling correct behavior for `time` and `adaptability` scenarios that require the agent to wait for simulation events and resume (PR: #PR_NUMBER_PLACEHOLDER)
- Added `Gaia2Environment.poll_notifications()` convenience method for custom agent implementations to drain the notification queue without needing ARE-internal imports (PR: #PR_NUMBER_PLACEHOLDER)
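
A minimal sketch of the client-side stop-token truncation described above (illustrative only; the actual helper in MASEval may differ in name and signature):

```python
def truncate_at_stop(text: str, stop_tokens: list[str]) -> str:
    """Cut a completion at the earliest stop token, mirroring a server-side `stop` parameter."""
    for token in stop_tokens:
        idx = text.find(token)
        if idx != -1:
            text = text[:idx]
    return text

# e.g. truncate_at_stop("Thought: step\nObservation: ...", ["Observation:"])
# -> "Thought: step\n"
```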

### Added

**Benchmarks**
@@ -16,11 +16,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
- Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace; see the usage sketch below (PR: #26)
- `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (e.g., switching from HuggingFace to OpenRouter) via `configure_model_ids(tasks, judge_engine_config=...)` (PR: #PR_NUMBER_PLACEHOLDER)
- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
- Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26)
- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
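
A hypothetical usage sketch tying the GAIA2 utilities above together (signatures are assumed from the entries, not verified against the released API):

```python
# Hypothetical sketch; names come from the changelog entries above.
from maseval.benchmark.gaia2 import (
    Gaia2JudgeEngineConfig,
    compute_gaia2_metrics,
    configure_model_ids,
    load_tasks,
)

tasks = load_tasks()  # scenarios from HuggingFace, pinned revision
tasks = configure_model_ids(
    tasks,
    judge_engine_config=Gaia2JudgeEngineConfig(...),  # e.g. route the judge via OpenRouter
)
# ... run the benchmark to obtain per-task results ...
metrics = compute_gaia2_metrics(results)  # GSR (Goal Success Rate) by capability type
```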

- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25)
- `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
- `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
- `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
@@ -52,14 +52,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
- Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (see the conftest sketch after this list) (PR: #29)
- Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
- Data integrity tests for Tau2 and MACS benchmarks validating download pipelines, file structures, and database content (PR: #29)
- Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29)
- HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks — no API keys needed (PR: #29)
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Real-data integration tests for GAIA2 (ARE environments, tools, evaluator pipeline) and MultiAgentBench (MARBLE data loading, environments, evaluation, pipeline smoke tests) (PR: #PR_NUMBER_PLACEHOLDER)
- CI jobs for slow tests (with benchmark data caching for Tau2, MACS, GAIA2, and MultiAgentBench) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
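
A minimal conftest sketch of the marker-implication idea (illustrative; not necessarily the project's exact hook):

```python
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # `credentialed` implies `live`, so `-m "not live"` stays fully offline
        if item.get_closest_marker("credentialed") and not item.get_closest_marker("live"):
            item.add_marker(pytest.mark.live)
```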

### Changed

**Benchmarks**

- MultiAgentBench: Full alignment with MARBLE paper — all 6 domains now fully supported end-to-end. Removed `web` and `worldsimulation` (not in paper, no task data). Added `werewolf` domain with config-based task loading and LLM evaluation. Added `minecraft` domain evaluation. Fixed `bargaining` environment mapping to use `WorldSimulationEnvironment`. Fixed `WerewolfEnv` constructor handling. Removed hard-coded minecraft infrastructure block. (PR: #PR_NUMBER_PLACEHOLDER)

**Core**

- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks: the same code path works whether seeding is enabled or disabled (see the sketch below). (PR: #27)
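
A minimal sketch of the described behavior (the `seed_generator` object and its `derive_seed()` method are taken from the entry above; surrounding code is assumed):

```python
import random

# The same code path works whether seeding is enabled or disabled:
seed = seed_generator.derive_seed("agent")  # int when seeded, None when seed=None
rng = random.Random(seed)  # random.Random(None) falls back to OS entropy
```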
@@ -89,6 +101,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

**Benchmarks**

- GAIA2: Fixed multi-turn scenario evaluation always failing due to missing intermediate judge calls. The evaluator now calls `judge(env)` for each intermediate turn before `judge.validate(env)`, matching ARE's intended evaluation flow. Single-turn scenarios were unaffected. (PR: #PR_NUMBER_PLACEHOLDER)
- GAIA2: Fixed data loader producing unusable tasks — `VALID_CAPABILITIES` included nonexistent HF configs (`agent2agent`, `noise`), `DEFAULT_CONFIG` referenced nonexistent `"validation"` config, `task.query` used fabricated `task_instruction` field (doesn't exist in ARE), and `oracle_events` were stored but never used. `load_tasks()` now correctly iterates HF capability configs with a pinned revision, leaves `query` empty (GAIA2 is event-driven), and omits oracle events from evaluation_data. (PR: #PR_NUMBER_PLACEHOLDER)
- Tau2: Fixed tool result serialization in `DefaultTau2Agent` — now uses `model_dump()` + `json.dumps()` (matching original tau2-bench; see the sketch after this list) instead of Python `str()`/`repr()`, which produced noisy formats with raw enum values, Pydantic repr strings, and `datetime.date(...)` literals that degraded agent accuracy (PR: #PR_NUMBER_PLACEHOLDER)
- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
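
A minimal sketch of the Tau2 serialization change described above (the surrounding agent code is assumed; only the `model_dump()` + `json.dumps()` combination comes from the entry):

```python
import json

# Before: noisy Python repr (raw enums, Pydantic reprs, datetime.date(...) literals)
content = str(tool_result)

# After: JSON via Pydantic, matching the original tau2-bench serialization
# (if the payload contains dates, model_dump(mode="json") keeps it JSON-safe)
content = json.dumps(tool_result.model_dump())
```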

### Removed