Add MMLU benchmark with DISCO support #34

Open
arubique wants to merge 47 commits into parameterlab:main from arubique:add_disco_and_mmlu_rebased
@arubique arubique commented Feb 15, 2026

Description

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
  • Updated relevant documentation in docs/ (if applicable)
  • Tagged the GitHub issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR: #123)

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies
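The core/interface separation rule above can be checked mechanically. The following is a minimal sketch, not part of this PR or of MASEval's tooling: the helper name `forbidden_import_lines` is hypothetical, and it assumes the packages are imported as `maseval.core` and `maseval.interface`, matching the paths named in the checklist. It only inspects absolute imports; relative imports (`from ..interface import x`) are deliberately skipped and would need path-aware resolution.

```python
import ast


def forbidden_import_lines(source: str, banned: str = "maseval.interface") -> list[int]:
    """Return sorted line numbers of absolute imports that reach into `banned`.

    Hypothetical helper for illustration; relative imports are not resolved.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a.b.c  ->  candidate name "a.b.c"
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # from a.b import c  ->  candidates "a.b" and "a.b.c"
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        if any(n == banned or n.startswith(banned + ".") for n in names):
            hits.append(node.lineno)
    return sorted(hits)


example = (
    "import os\n"
    "from maseval.interface import foo\n"
    "from maseval import interface\n"
    "import maseval.core.x\n"
)
offending = forbidden_import_lines(example)
```

Running this over every file under `maseval/core/` and failing on a non-empty result would be one way to enforce the rule in CI.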

Additional Notes

- Add scaffold for MMLU evaluation on anchor points
- Add scaffold for DISCO prediction for MMLU
- Add lm_eval wrapper
- Fix infinite loop in AnchorPointsTaskQueue
- Save model predictions as intermediate outputs
- Estimate likelihood for MCQ
- Add DISCO dependencies
- Pin exact DISCO library versions for reproducibility
- Reproduce outputs with lm_eval_wrapper
- Temporarily update libs for reproducibility
- Sync MMLU argmaxes between MASeval and DISCO
- Achieve exact match with DISCO predictions for --lm_eval_batching
- Load DISCO predictor and transform from npz
- DISCO model is loadable from hf
- Store anchor points in hf
- Store predictor config in hf
- Enforce empty args for transform_path and anchor_points_path when using hf model
- Add flattened MMLU to hf
- Add flattened-MMLU and --use-full-prompt to enforced config for DISCO-MMLU
- Update readme for flattened-MMLU
- Update dataset metadata
- Minor readme update
- Add fields description
- Remove extra files
- Move HuggingFaceMMLUBenchmark to mmlu.py
- Move hardcoded values to constants
- Add lm-eval to pyproject.toml
- Align pyproject.toml with MasEval main
- Sync uv.lock with MasEval main
- Sync uv.lock with MasEval main vol. 2
- Fix typo in pyproject.toml
- Remove duplicates from pyproject.toml
- Add **kwargs to MMLUBenchmark init
- Update DISCO deps
- Add test for MMLU benchmark
- Sync pyproject.toml with MasEval main
- Update types to pass autotests
- Update changelog
- Add default value to --data_path
- Shorten example's readme
@arubique arubique force-pushed the add_disco_and_mmlu_rebased branch from 22e4b96 to 8b7343f Compare February 16, 2026 20:42
- Add args description to readme
- Add links to papers in readme
- Add link to lm-evaluation-harness
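For readers unfamiliar with the "Estimate likelihood for MCQ" and "Sync MMLU argmaxes" items above, here is a minimal sketch of the general technique: score each answer choice by the log-likelihood of its tokens and predict the argmax. This is not the code from this PR; the function names and the `normalize` option are assumptions for illustration, and the per-token log-probabilities are taken as given rather than computed by a model.

```python
import math
from typing import List, Sequence


def score_choices(choice_logprobs: Sequence[Sequence[float]], normalize: bool = False) -> List[float]:
    """Sum token log-probabilities per choice; optionally average by token count."""
    scores = []
    for token_logprobs in choice_logprobs:
        total = sum(token_logprobs)
        if normalize:
            # Length normalization (an assumption here, not confirmed by the PR)
            total /= max(len(token_logprobs), 1)
        scores.append(total)
    return scores


def pick_answer(choice_logprobs: Sequence[Sequence[float]], normalize: bool = False) -> int:
    """Return the index of the highest-scoring choice (first index on ties)."""
    scores = score_choices(choice_logprobs, normalize)
    return max(range(len(scores)), key=scores.__getitem__)


def choice_probabilities(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over choice scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical log-probabilities for choices A-D, one token each.
predicted = pick_answer([[-1.0], [-0.2], [-3.0], [-2.5]])
```

Whether scores are length-normalized changes the argmax when choices differ in token count, which is one reason two pipelines can disagree on predictions until their scoring is aligned.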