feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter #1372
dingzhiqiang wants to merge 2 commits
Code Review
This pull request introduces support for the megatron-bridge backend as an alternative to mbridge for HuggingFace to Megatron-Core weight conversion and model creation, including a new BailingMoeV25Bridge adapter for BailingMoeV2.5 models. It also adds migration documentation and updates existing modules to make mbridge imports optional. The review feedback highlights several critical issues: a potential TypeError in BailingMoeV25Bridge due to the use of functools.partial with keyword arguments, an eager evaluation bug in getattr fallback logic that could raise an AttributeError, a static capture issue in registry.py that breaks virtual pipeline parallelism (VPP > 1), and a signature mismatch in the documentation's mapping_registry code example.
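The eager-evaluation issue flagged above is a general Python pitfall: the default passed to `getattr` is an ordinary argument, so it is evaluated before the lookup happens. A minimal sketch follows; the `Config` class and the attribute names `num_layers` / `num_hidden_layers` are hypothetical and are not taken from the PR's code.

```python
class Config:
    """Hypothetical stand-in config that defines only one of two attribute names."""

    def __init__(self):
        self.num_layers = 4


def buggy_lookup(cfg):
    # The third argument is an ordinary expression, so cfg.num_hidden_layers
    # is evaluated before getattr() is even called. If that attribute is
    # missing, this raises AttributeError although num_layers exists.
    return getattr(cfg, "num_layers", cfg.num_hidden_layers)


def safe_lookup(cfg):
    # Probe each name with a None default so the fallback attribute is
    # only read when the primary one is absent.
    value = getattr(cfg, "num_layers", None)
    if value is None:
        value = getattr(cfg, "num_hidden_layers", None)
    return value
```

With this stand-in, `safe_lookup(Config())` returns the primary attribute, while `buggy_lookup(Config())` fails on the missing fallback even though the fallback was never needed.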
`MegatronModelBridge.mapping_registry()` is a no-arg method that accesses `self.hf_pretrained` internally; the doc sketch incorrectly showed it taking `hf_pretrained` as a parameter. Bring the example in line with the actual subclass signatures used in `bailing_moe_megatron_bridge.py:449` (and the upcoming `glm5_megatron_bridge.py:175` in PR-B).
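The signature point can be illustrated with a stand-in class hierarchy. This is only a sketch: `BridgeBase` and `ExampleBridge` are hypothetical, the constructor-based way `hf_pretrained` is attached is an assumption, and a plain dict and list replace the real pretrained-model and registry objects.

```python
class BridgeBase:
    """Hypothetical stand-in for a bridge base class (not the real MegatronModelBridge)."""

    def __init__(self, hf_pretrained):
        # Assumption: the bridge instance already holds the HF model handle
        # by the time mapping_registry() is called.
        self.hf_pretrained = hf_pretrained

    def mapping_registry(self):
        raise NotImplementedError


class ExampleBridge(BridgeBase):
    # No-arg method: the HF handle is read from self, not passed in.
    def mapping_registry(self):
        num_layers = self.hf_pretrained["num_hidden_layers"]
        # Per-layer enumeration; a real bridge would build mapping objects here.
        return [f"decoder.layers.{i}" for i in range(num_layers)]
```

A subclass that instead declares `mapping_registry(self, hf_pretrained)` would raise a `TypeError` when a caller invokes it without arguments, which is the mismatch the doc fix addresses.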
Summary
Switches the Bailing-MoE V2.5 family (`BailingMoeV2_5ForCausalLM`, `BailingMoeLinearForCausalLM`, `BailingHybridForCausalLM`) from `mbridge`-only to dual-bridge by adding an NVIDIA `megatron-bridge` `MegatronModelBridge` adapter alongside the existing `mbridge` one. The `mbridge` default behavior is unchanged; users opt into the new path via `mcore.bridge_type=megatron-bridge`.
The dual-bridge infrastructure (config field `mcore.bridge_type`, `MegatronEngine._build_hf_mcore_bridge()`, registry's megatron-bridge branch, `megatron-bridge==0.4.0` declared in `pyproject.toml`) already exists on `main`.
This is the first of two split PRs (previously combined as #1362, now closed):
What's added
New adapter
- `areal/models/mcore/bailing_moe_megatron_bridge.py`: a single `MegatronModelBridge` subclass registered for the three Bailing architectures. Handles per-layer heterogeneous Lightning + MLA attention dispatch via `is_lightning_layer(layer_idx, group_size)`; per-layer enumeration in `mapping_registry()` because `MegatronMappingRegistry`'s wildcard mappings cannot express the heterogeneous case. `LightningQKVMapping` subclass overrides `hf_to_megatron` / `megatron_to_hf` to permute fused QKV between HF `[Q|K|V]` and mcore `[q0,k0,v0,...]` layouts.
Registry / engine wiring
- `areal/models/mcore/registry.py`: add `_BAILING_ARCHITECTURES` set and `_is_bailing()` helper; inject AReaL's heterogeneous Bailing layer spec into `provider.transformer_layer_spec` for the megatron-bridge path so the bridge's default uniform spec doesn't overwrite it.
- `areal/engine/megatron_engine.py`: import `bailing_moe_megatron_bridge` unconditionally so the `@register_bridge` decorator fires.
Shared infrastructure
Applies to all current and future bridge adapters (not Bailing-specific), but lands here because PR-A is the first to need it:
- Wrap `mbridge` top-level import and the mbridge-dependent `bailing_moe_bridge` module import in `try/except` so megatron-bridge-only deployments still load the engine. Raise a clear `ImportError` in `_build_hf_mcore_bridge()` when `bridge_type=mbridge` is requested but mbridge is unavailable.
- Drop hard `mbridge.core.Bridge` type annotations in `hf_load.py` / `hf_save.py`; use `TYPE_CHECKING` in `registry.py`.
- Replace `mbridge.core.util.unwrap_model` with a local `_unwrap_model`.
- Lazy-import `LLMBridge` inside `patch_bridge_for_tree_training`.
- Guard `tests/test_estimate_num_params.py` with `pytest.importorskip("mbridge")`.
Docs
- `docs/en/best_practices/migrate_to_megatron_bridge.md`: migration guide covering API differences, supported architectures, layer-spec injection pattern, and verification checklist.
Validation
The mbridge ↔ megatron-bridge numerical equivalence (matching starting logp / grad_norm / loss with same seed and config) has been validated on the internal branch this PR is ported from. All `provider.<field>` assignments are preserved as-is from the validated source, including the `provider.rotary_percent = 1.0` MLA RoPE fix.
Test plan
- `pre-commit run --all-files` (ruff lint + format, mdformat), passing locally
- `bridge_type: mbridge` and `bridge_type: megatron-bridge` on same yaml
Not in scope
- `tests/test_megatron_bridge_smoke.py` (internal smoke tests depend on local model paths)
- (`docs/zh/`) translation of the migration guide
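For readers unfamiliar with the two fused-QKV layouts named under "New adapter", the permutation can be sketched on a list of weight rows. This is an illustrative sketch only, not the PR's `LightningQKVMapping`: the function names are hypothetical, and it assumes equal Q/K/V head counts, row-wise fusion, and no tensor-parallel sharding.

```python
def hf_to_mcore_rows(rows, num_heads, head_dim):
    """Reorder fused rows from [Q|K|V] to [q0,k0,v0,q1,k1,v1,...]."""
    block = num_heads * head_dim
    assert len(rows) == 3 * block, "expected Q, K and V blocks of equal size"
    q, k, v = rows[:block], rows[block:2 * block], rows[2 * block:]
    out = []
    for h in range(num_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        # Emit this head's q, k, v rows back to back.
        out.extend(q[s])
        out.extend(k[s])
        out.extend(v[s])
    return out


def mcore_to_hf_rows(rows, num_heads, head_dim):
    """Inverse permutation: [q0,k0,v0,...] back to [Q|K|V]."""
    assert len(rows) == 3 * num_heads * head_dim
    q, k, v = [], [], []
    for h in range(num_heads):
        base = 3 * h * head_dim
        q.extend(rows[base:base + head_dim])
        k.extend(rows[base + head_dim:base + 2 * head_dim])
        v.extend(rows[base + 2 * head_dim:base + 3 * head_dim])
    return q + k + v
```

The two functions are inverses of each other, so a round trip through both leaves the row order unchanged; that property is a cheap sanity check for any such mapping.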