Draft
Changes from all commits (53 commits)
88aecde
initial fix of werewolf
cemde Feb 5, 2026
1f2a0f1
added fixes for multiagentbench
cemde Feb 5, 2026
9b6d62c
added dependency override for ARE due to pinned dependencies
cemde Feb 5, 2026
693c8db
fixed gaia2 implementation
cemde Feb 5, 2026
853d4b5
updated pinning in multiagentbench
cemde Feb 6, 2026
a14cda1
added optional dependency for multiagentbench
cemde Feb 6, 2026
9e58ba6
updated optional dependencies
cemde Feb 6, 2026
48bcac9
fixed multiagentbench tests
cemde Feb 6, 2026
7ed2e83
simplified installs with extras for all benchmarks
cemde Feb 6, 2026
3d4c9d6
[skip ci] updated lockfile
cemde Feb 6, 2026
ba4ed05
[skip ci] fixed dependency
cemde Feb 6, 2026
aed1f57
attempt at fix of dependency issues
cemde Feb 6, 2026
5a3acf9
another attempt to fix dependencies
cemde Feb 6, 2026
9e73d96
changed vendoring of multiagentbench to my own fork
cemde Feb 6, 2026
565db42
fix bug in GAIA2
cemde Feb 6, 2026
927aeb5
fixed multiagentbench fallbacks
cemde Feb 6, 2026
0af0f8c
fixed tools for gaia2
cemde Feb 9, 2026
155402c
fixed gaia2 bugs
cemde Feb 9, 2026
919eb03
more bug fixes
cemde Feb 9, 2026
3064626
updated agents file
cemde Feb 10, 2026
678df7e
condensed default instructions
cemde Feb 10, 2026
1f946fa
default example
cemde Feb 10, 2026
0edb8c9
updated agents file
cemde Feb 10, 2026
27d17e3
fixes file added.
cemde Feb 10, 2026
ab8093a
added fixes to gaia2 that increase faithfulness to original implement…
cemde Feb 10, 2026
e9eef21
added bug report for multiagentbench
cemde Feb 10, 2026
b98bbec
updated gaia eval
cemde Feb 10, 2026
d97fdec
fixed gaia2 evaluator
cemde Feb 10, 2026
98e3e6b
fixed bug in evaluators
cemde Feb 11, 2026
961e2a5
Merge remote-tracking branch 'origin/main' into fix-benchmarks
cemde Feb 11, 2026
84b5da8
fixed benchmarks
cemde Feb 11, 2026
95afa0c
added testing plan
cemde Feb 11, 2026
86e94d7
fixed tests
cemde Feb 11, 2026
6d0ef86
fixed tests
cemde Feb 11, 2026
4a3c153
removed testing cache
cemde Feb 11, 2026
a26b0a4
fixed missing initialization bug for tau2 bench
cemde Feb 12, 2026
331badd
fixed typing errors
cemde Feb 12, 2026
040d481
fix MultiAgentBench bug where results are not truncated properly caus…
cemde Feb 12, 2026
811a924
fixed tools in tau2
cemde Feb 12, 2026
211c087
gaia2 typing fix
cemde Feb 12, 2026
c092da1
fixed data loading bug in gaia2
cemde Feb 12, 2026
fc66d36
fixed bug in tau2 implementation
cemde Feb 12, 2026
de674ce
fixed gaia2 docstring
cemde Feb 12, 2026
b6af88f
fixed testing of marble
cemde Feb 12, 2026
d0e2b6c
fixed testing errors
cemde Feb 12, 2026
21c7dc1
small fix for tau2 implementation now printing tool results better in…
cemde Feb 12, 2026
0356b96
fixed tests for marble and pytest misconfiguration
cemde Feb 13, 2026
27e1ecf
fixed Gaia2 evaluator config
cemde Feb 13, 2026
7880183
proposed fix for gaia2 notification management
cemde Feb 13, 2026
6b1c743
bug fix for gaia2 notification pulling
cemde Feb 13, 2026
1f61c25
[skip ci] fixed comment formatting
cemde Feb 13, 2026
5ee46aa
fixed stop token usage
cemde Feb 13, 2026
3991732
[skip ci] fixed gaia2 docs
cemde Feb 13, 2026
12 changes: 2 additions & 10 deletions .github/workflows/test.yml
@@ -45,10 +45,10 @@ jobs:
- name: Install dependencies
run: |
pip install uv
uv sync --group dev
uv sync --all-extras --group dev
- name: Run benchmark tests
run: |
uv run pytest -m benchmark -v
uv run pytest -m "benchmark and not (slow or live)" -v

test-all:
name: All Tests (With Optional Deps)
@@ -86,14 +86,6 @@
run: |
pip install uv
uv sync --all-extras --group dev
- name: Cache benchmark data
uses: actions/cache@v4
with:
path: |
maseval/benchmark/tau2/data/
maseval/benchmark/macs/data/
maseval/benchmark/macs/prompt_templates/
key: benchmark-data-${{ hashFiles('maseval/benchmark/tau2/data_loader.py', 'maseval/benchmark/macs/data_loader.py') }}
- name: Run slow tests
run: |
uv run pytest -m "slow and not credentialed" -v
47 changes: 47 additions & 0 deletions AGENTS.md
@@ -494,3 +494,50 @@ MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade
- Focus on getting it right, not keeping it the same

We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.

## Scientific Integrity

MASEval is a scientific library. Scientific integrity is paramount. **Never introduce defaults that could silently alter benchmark behavior or experimental outcomes.**

### The Boundary

**Guiding principle:** If a researcher would need to report a parameter in a paper's "Experimental Setup" section, **do not invent a default for it.**

**Acceptable (infrastructure/convenience):** `TaskQueue(limit=None)`, `Logger(verbose=False)`, `num_workers=1`, `print_results(color=True)` — these don't affect scientific results.

**Unacceptable (experimental parameters):** Temperature, seed, model version, prompt format, simulation duration, agent limits, dataset splits, scoring functions — these alter what's being measured.
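
As a hypothetical illustration (the names below are invented for this example, not MASEval APIs), the same signature can mix both kinds of parameters:

```python
# Hypothetical sketch of the boundary; all names here are illustrative.
def run_benchmark(
    tasks: list,
    model: str,             # experimental: no default; belongs in a paper's setup section
    temperature: float,     # experimental: no default
    num_workers: int = 1,   # infrastructure: safe default, does not change results
    verbose: bool = False,  # convenience: safe default
):
    ...
```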

### Reproducing Benchmarks

When integrating external benchmarks, match the source implementation exactly. Never invent fallback values.

```python
# BAD: Invented defaults
config = EnvironmentConfig(
    duration=getattr(scenario, "duration", 86400),  # Made-up fallback!
)
start_time = getattr(scenario, "start_time", None)  # Hides missing attributes

# GOOD: Pass through directly, let errors surface
config = EnvironmentConfig(
    duration=scenario.duration,  # Trust the source
)
start_time = scenario.start_time  # AttributeError if missing

# GOOD: Copy source defaults with documentation
# Default value copied from original_library/evaluator.py:L45
EVAL_TEMPERATURE = 0.7

class Evaluator:
    def run(self, temperature: Optional[float] = None):
        if temperature is None:
            temperature = EVAL_TEMPERATURE  # From source:L45

# Also good:
class Evaluator:
    # default temperature from source:L45
    def run(self, temperature: Optional[float] = 0.7):
        ...
```

**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
5 changes: 4 additions & 1 deletion BENCHMARKS.md
@@ -33,11 +33,14 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co

### Source and License

- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (upstream source of the original work)
- **Fork Used:** [https://github.com/cemde/MARBLE](https://github.com/cemde/MARBLE) (contains bug fixes for MASEval integration)
- **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
- **Code License:** MIT
- **Data License:** MIT

> **Note**: MASEval uses a fork with bug fixes. All credit for the original work goes to the MARBLE team (Haofei Yu et al.).

---

## 4. GAIA2
25 changes: 21 additions & 4 deletions CHANGELOG.md
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Fixed

- Fixed GAIA2 default agent failing on reasoning models (o1, o3, GPT-5) that reject `stop` and `temperature` parameters. Client-side stop-token truncation (matching ARE's reference implementation; see the sketch after this list) is now always applied, and `llm_args` values set to `None` are omitted from API calls (PR: #PR_NUMBER_PLACEHOLDER)
- Fixed GAIA2 multi-turn notification loop: `wait_for_notification()` no longer terminates the agent prematurely, enabling correct behavior for `time` and `adaptability` scenarios that require the agent to wait for simulation events and resume (PR: #PR_NUMBER_PLACEHOLDER)
- Added `Gaia2Environment.poll_notifications()` convenience method for custom agent implementations to drain the notification queue without needing ARE-internal imports (PR: #PR_NUMBER_PLACEHOLDER)
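
A minimal sketch of the client-side stop-token truncation described above (illustrative only; the actual helper in MASEval may differ in name and signature):

```python
def truncate_at_stop(text: str, stop_tokens: list[str]) -> str:
    """Cut a completion at the earliest stop token, mirroring a server-side `stop` parameter."""
    for token in stop_tokens:
        idx = text.find(token)
        if idx != -1:
            text = text[:idx]
    return text

# e.g. truncate_at_stop("Thought: step\nObservation: ...", ["Observation:"])
# -> "Thought: step\n"
```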

### Added

**Benchmarks**
@@ -16,11 +16,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
- Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace; see the usage sketch below (PR: #26)
- `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (e.g., switching from HuggingFace to OpenRouter) via `configure_model_ids(tasks, judge_engine_config=...)` (PR: #PR_NUMBER_PLACEHOLDER)
- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
- Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26)
- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
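
A hypothetical usage sketch tying the GAIA2 utilities above together (signatures are assumed from the entries, not verified against the released API):

```python
# Hypothetical sketch; names come from the changelog entries above.
from maseval.benchmark.gaia2 import (
    Gaia2JudgeEngineConfig,
    compute_gaia2_metrics,
    configure_model_ids,
    load_tasks,
)

tasks = load_tasks()  # scenarios from HuggingFace, pinned revision
tasks = configure_model_ids(
    tasks,
    judge_engine_config=Gaia2JudgeEngineConfig(...),  # e.g. route the judge via OpenRouter
)
# ... run the benchmark to obtain per-task results ...
metrics = compute_gaia2_metrics(results)  # GSR (Goal Success Rate) by capability type
```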

- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25)
- `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
- `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
- `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
@@ -52,14 +52,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
- Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (see the conftest sketch after this list) (PR: #29)
- Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
- Data integrity tests for Tau2 and MACS benchmarks validating download pipelines, file structures, and database content (PR: #29)
- Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29)
- HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks — no API keys needed (PR: #29)
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Real-data integration tests for GAIA2 (ARE environments, tools, evaluator pipeline) and MultiAgentBench (MARBLE data loading, environments, evaluation, pipeline smoke tests) (PR: #PR_NUMBER_PLACEHOLDER)
- CI jobs for slow tests (with benchmark data caching for Tau2, MACS, GAIA2, and MultiAgentBench) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
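
A minimal conftest sketch of the marker-implication idea (illustrative; not necessarily the project's exact hook):

```python
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # `credentialed` implies `live`, so `-m "not live"` stays fully offline
        if item.get_closest_marker("credentialed") and not item.get_closest_marker("live"):
            item.add_marker(pytest.mark.live)
```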

### Changed

**Benchmarks**

- MultiAgentBench: Full alignment with MARBLE paper — all 6 domains now fully supported end-to-end. Removed `web` and `worldsimulation` (not in paper, no task data). Added `werewolf` domain with config-based task loading and LLM evaluation. Added `minecraft` domain evaluation. Fixed `bargaining` environment mapping to use `WorldSimulationEnvironment`. Fixed `WerewolfEnv` constructor handling. Removed hard-coded minecraft infrastructure block. (PR: #PR_NUMBER_PLACEHOLDER)

**Core**

- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks: the same code path works whether seeding is enabled or disabled (see the sketch below). (PR: #27)
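
A minimal sketch of the described behavior (the `seed_generator` object and its `derive_seed()` method are taken from the entry above; surrounding code is assumed):

```python
import random

# The same code path works whether seeding is enabled or disabled:
seed = seed_generator.derive_seed("agent")  # int when seeded, None when seed=None
rng = random.Random(seed)  # random.Random(None) falls back to OS entropy
```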
@@ -89,6 +101,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

**Benchmarks**

- GAIA2: Fixed multi-turn scenario evaluation always failing due to missing intermediate judge calls. The evaluator now calls `judge(env)` for each intermediate turn before `judge.validate(env)`, matching ARE's intended evaluation flow. Single-turn scenarios were unaffected. (PR: #PR_NUMBER_PLACEHOLDER)
- GAIA2: Fixed data loader producing unusable tasks — `VALID_CAPABILITIES` included nonexistent HF configs (`agent2agent`, `noise`), `DEFAULT_CONFIG` referenced nonexistent `"validation"` config, `task.query` used fabricated `task_instruction` field (doesn't exist in ARE), and `oracle_events` were stored but never used. `load_tasks()` now correctly iterates HF capability configs with a pinned revision, leaves `query` empty (GAIA2 is event-driven), and omits oracle events from evaluation_data. (PR: #PR_NUMBER_PLACEHOLDER)
- Tau2: Fixed tool result serialization in `DefaultTau2Agent` — now uses `model_dump()` + `json.dumps()` (matching original tau2-bench; see the sketch after this list) instead of Python `str()`/`repr()`, which produced noisy formats with raw enum values, Pydantic repr strings, and `datetime.date(...)` literals that degraded agent accuracy (PR: #PR_NUMBER_PLACEHOLDER)
- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
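
A minimal sketch of the Tau2 serialization change described above (the surrounding agent code is assumed; only the `model_dump()` + `json.dumps()` combination comes from the entry):

```python
import json

# Before: noisy Python repr (raw enums, Pydantic reprs, datetime.date(...) literals)
content = str(tool_result)

# After: JSON via Pydantic, matching the original tau2-bench serialization
# (if the payload contains dates, model_dump(mode="json") keeps it JSON-safe)
content = json.dumps(tool_result.model_dump())
```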

### Removed