feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter by dingzhiqiang · Pull Request #1372 · areal-project/AReaL

dingzhiqiang · 2026-05-28T03:48:56Z

Summary

Switches the Bailing-MoE V2.5 family (BailingMoeV2_5ForCausalLM, BailingMoeLinearForCausalLM, BailingHybridForCausalLM) from mbridge-only to dual-bridge by adding a NVIDIA megatron-bridge MegatronModelBridge adapter alongside the existing mbridge one. The mbridge default behavior is unchanged; users opt into the new path via mcore.bridge_type=megatron-bridge.

The dual-bridge infrastructure (config field mcore.bridge_type, MegatronEngine._build_hf_mcore_bridge(), registry's megatron-bridge branch, megatron-bridge==0.4.0 declared in pyproject.toml) already exists on main.

This is the first of two split PRs (previously combined as #1362, now closed):

This PR-A: Bailing-MoE megatron-bridge adapter + shared cross-cutting infra.
PR-B (follow-up): GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash new model support. Depends on PR-A.

What's added

New adapter

areal/models/mcore/bailing_moe_megatron_bridge.py — a single MegatronModelBridge subclass registered for the three Bailing architectures. Handles per-layer heterogeneous Lightning + MLA attention dispatch via is_lightning_layer(layer_idx, group_size); per-layer enumeration in mapping_registry() because MegatronMappingRegistry's wildcard mappings cannot express the heterogeneous case. LightningQKVMapping subclass overrides hf_to_megatron / megatron_to_hf to permute fused QKV between HF [Q|K|V] and mcore [q0,k0,v0,...] layouts.

Registry / engine wiring

areal/models/mcore/registry.py — add _BAILING_ARCHITECTURES set and _is_bailing() helper; inject AReaL's heterogeneous Bailing layer spec into provider.transformer_layer_spec for the megatron-bridge path so the bridge's default uniform spec doesn't overwrite it.
areal/engine/megatron_engine.py — import bailing_moe_megatron_bridge unconditionally so the @register_bridge decorator fires.

Shared infrastructure

Applies to all current and future bridge adapters (not Bailing-specific), but lands here because PR-A is the first to need it:

Wrap mbridge top-level import and the mbridge-dependent bailing_moe_bridge module import in try/except so megatron-bridge-only deployments still load the engine. Raise a clear ImportError in _build_hf_mcore_bridge() when bridge_type=mbridge is requested but mbridge is unavailable.
Drop hard mbridge.core.Bridge type annotations in hf_load.py / hf_save.py; use TYPE_CHECKING in registry.py.
Replace mbridge.core.util.unwrap_model with a local _unwrap_model.
Lazy-import LLMBridge inside patch_bridge_for_tree_training.
Guard tests/test_estimate_num_params.py with pytest.importorskip("mbridge").

Docs

docs/en/best_practices/migrate_to_megatron_bridge.md — migration guide covering API differences, supported architectures, layer-spec injection pattern, and verification checklist.

Validation

The mbridge ↔ megatron-bridge numerical equivalence (matching starting logp / grad_norm / loss with same seed and config) has been validated on the internal branch this PR is ported from. All provider.<field> assignments are preserved as-is from the validated source, including the provider.rotary_percent = 1.0 MLA RoPE fix.

Test plan

pre-commit run --all-files (ruff lint + format, mdformat) — passing locally
CI smoke tests (regression for Qwen3 / existing Bailing mbridge path) — to run on PR
Manual: Bailing-MoE V2.5 5-step loss alignment between bridge_type: mbridge and bridge_type: megatron-bridge on same yaml

Not in scope

GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash model support → split into PR-B
VPP > 1 support for custom layer-spec injection (the lambda currently captures stage-0 specs; validated runs were VPP=1)
tests/test_megatron_bridge_smoke.py (internal smoke tests depend on local model paths)
Chinese (docs/zh/) translation of the migration guide

Switches the Bailing-MoE V2.5 family (BailingMoeV2_5ForCausalLM, BailingMoeLinearForCausalLM, BailingHybridForCausalLM) from mbridge-only to dual-bridge by adding a NVIDIA megatron-bridge MegatronModelBridge adapter alongside the existing mbridge one. The mbridge default behavior is unchanged; users opt into the new path via mcore.bridge_type=megatron-bridge. The dual-bridge infrastructure (config field mcore.bridge_type, MegatronEngine._build_hf_mcore_bridge(), registry's megatron-bridge branch, megatron-bridge==0.4.0 declared in pyproject.toml) already exists on main. New adapter: - areal/models/mcore/bailing_moe_megatron_bridge.py: a single MegatronModelBridge subclass registered for the three Bailing arches. Handles per-layer heterogeneous Lightning + MLA attention dispatch via is_lightning_layer(layer_idx, group_size); per-layer enumeration in mapping_registry() because MegatronMappingRegistry's wildcard mappings cannot express the heterogeneous case. LightningQKVMapping subclass overrides hf_to_megatron / megatron_to_hf to permute fused QKV between HF [Q|K|V] and mcore [q0,k0,v0,...] layouts. Registry / engine wiring: - areal/models/mcore/registry.py: add _BAILING_ARCHITECTURES set and _is_bailing() helper; inject AReaL's heterogeneous Bailing layer spec into provider.transformer_layer_spec for the megatron-bridge path so the bridge's default uniform spec doesn't overwrite it. - areal/engine/megatron_engine.py: import bailing_moe_megatron_bridge unconditionally so the @register_bridge decorator fires. Shared infrastructure (applies to all current and future bridge adapters, not Bailing-specific): - Wrap mbridge top-level import and the mbridge-dependent bailing_moe_bridge module import in try/except so megatron-bridge- only deployments still load the engine. Raise a clear ImportError in _build_hf_mcore_bridge() when bridge_type=mbridge is requested but mbridge is unavailable. - Drop hard mbridge.core.Bridge type annotations in hf_load.py / hf_save.py; use TYPE_CHECKING in registry.py. - Replace mbridge.core.util.unwrap_model with a local _unwrap_model. - Lazy-import LLMBridge inside patch_bridge_for_tree_training. - Guard tests/test_estimate_num_params.py with pytest.importorskip so collection succeeds when mbridge is missing. Docs: - docs/en/best_practices/migrate_to_megatron_bridge.md — migration guide covering API differences, supported architectures, layer-spec injection pattern, and verification checklist.

gemini-code-assist

Code Review

This pull request introduces support for the megatron-bridge backend as an alternative to mbridge for HuggingFace to Megatron-Core weight conversion and model creation, including a new BailingMoeV25Bridge adapter for BailingMoeV2.5 models. It also adds migration documentation and updates existing modules to make mbridge imports optional. The review feedback highlights several critical issues: a potential TypeError in BailingMoeV25Bridge due to the use of functools.partial with keyword arguments, an eager evaluation bug in getattr fallback logic that could raise an AttributeError, a static capture issue in registry.py that breaks virtual pipeline parallelism (VPP > 1), and a signature mismatch in the documentation's mapping_registry code example.

`MegatronModelBridge.mapping_registry()` is a no-arg method that accesses `self.hf_pretrained` internally; the doc sketch incorrectly showed it taking `hf_pretrained` as a parameter. Bring the example in line with the actual subclass signatures used in `bailing_moe_megatron_bridge.py:449` (and the upcoming `glm5_megatron_bridge.py:175` in PR-B).

dingzhiqiang requested review from PrometheusComing, garrett4wade, geshi001, nuzant, rchardx and sitabulaixizawaluduo as code owners May 28, 2026 03:48

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

Comment thread areal/models/mcore/bailing_moe_megatron_bridge.py

Comment thread areal/models/mcore/bailing_moe_megatron_bridge.py

Comment thread areal/models/mcore/registry.py

Comment thread docs/en/best_practices/migrate_to_megatron_bridge.md Outdated

dingzhiqiang mentioned this pull request May 28, 2026

feat(mcore): add GLM-5/DeepSeek-V3 model support (mbridge + megatron-bridge) #1373

Open

4 tasks

dingzhiqiang requested a review from fishcrap May 28, 2026 04:22

sitabulaixizawaluduo mentioned this pull request Jun 1, 2026

[Roadmap] 2026 H2 Milestones #1381

Open

16 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter#1372

feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter#1372
dingzhiqiang wants to merge 2 commits into
mainfrom
chucai.dzq/bailing-megatron-bridge

dingzhiqiang commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dingzhiqiang commented May 28, 2026

Summary

What's added

New adapter

Registry / engine wiring

Shared infrastructure

Docs

Validation

Test plan

Not in scope

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant