Skip to content

feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter#1372

Open
dingzhiqiang wants to merge 2 commits into
mainfrom
chucai.dzq/bailing-megatron-bridge
Open

feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter#1372
dingzhiqiang wants to merge 2 commits into
mainfrom
chucai.dzq/bailing-megatron-bridge

Conversation

@dingzhiqiang
Copy link
Copy Markdown
Collaborator

Summary

Switches the Bailing-MoE V2.5 family (BailingMoeV2_5ForCausalLM, BailingMoeLinearForCausalLM, BailingHybridForCausalLM) from mbridge-only to dual-bridge by adding a NVIDIA megatron-bridge MegatronModelBridge adapter alongside the existing mbridge one. The mbridge default behavior is unchanged; users opt into the new path via mcore.bridge_type=megatron-bridge.

The dual-bridge infrastructure (config field mcore.bridge_type, MegatronEngine._build_hf_mcore_bridge(), registry's megatron-bridge branch, megatron-bridge==0.4.0 declared in pyproject.toml) already exists on main.

This is the first of two split PRs (previously combined as #1362, now closed):

  • This PR-A: Bailing-MoE megatron-bridge adapter + shared cross-cutting infra.
  • PR-B (follow-up): GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash new model support. Depends on PR-A.

What's added

New adapter

  • areal/models/mcore/bailing_moe_megatron_bridge.py — a single MegatronModelBridge subclass registered for the three Bailing architectures. Handles per-layer heterogeneous Lightning + MLA attention dispatch via is_lightning_layer(layer_idx, group_size); per-layer enumeration in mapping_registry() because MegatronMappingRegistry's wildcard mappings cannot express the heterogeneous case. LightningQKVMapping subclass overrides hf_to_megatron / megatron_to_hf to permute fused QKV between HF [Q|K|V] and mcore [q0,k0,v0,...] layouts.

Registry / engine wiring

  • areal/models/mcore/registry.py — add _BAILING_ARCHITECTURES set and _is_bailing() helper; inject AReaL's heterogeneous Bailing layer spec into provider.transformer_layer_spec for the megatron-bridge path so the bridge's default uniform spec doesn't overwrite it.
  • areal/engine/megatron_engine.py — import bailing_moe_megatron_bridge unconditionally so the @register_bridge decorator fires.

Shared infrastructure

Applies to all current and future bridge adapters (not Bailing-specific), but lands here because PR-A is the first to need it:

  • Wrap mbridge top-level import and the mbridge-dependent bailing_moe_bridge module import in try/except so megatron-bridge-only deployments still load the engine. Raise a clear ImportError in _build_hf_mcore_bridge() when bridge_type=mbridge is requested but mbridge is unavailable.
  • Drop hard mbridge.core.Bridge type annotations in hf_load.py / hf_save.py; use TYPE_CHECKING in registry.py.
  • Replace mbridge.core.util.unwrap_model with a local _unwrap_model.
  • Lazy-import LLMBridge inside patch_bridge_for_tree_training.
  • Guard tests/test_estimate_num_params.py with pytest.importorskip("mbridge").

Docs

  • docs/en/best_practices/migrate_to_megatron_bridge.md — migration guide covering API differences, supported architectures, layer-spec injection pattern, and verification checklist.

Validation

The mbridge ↔ megatron-bridge numerical equivalence (matching starting logp / grad_norm / loss with same seed and config) has been validated on the internal branch this PR is ported from. All provider.<field> assignments are preserved as-is from the validated source, including the provider.rotary_percent = 1.0 MLA RoPE fix.

Test plan

  • pre-commit run --all-files (ruff lint + format, mdformat) — passing locally
  • CI smoke tests (regression for Qwen3 / existing Bailing mbridge path) — to run on PR
  • Manual: Bailing-MoE V2.5 5-step loss alignment between bridge_type: mbridge and bridge_type: megatron-bridge on same yaml

Not in scope

  • GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash model support → split into PR-B
  • VPP > 1 support for custom layer-spec injection (the lambda currently captures stage-0 specs; validated runs were VPP=1)
  • tests/test_megatron_bridge_smoke.py (internal smoke tests depend on local model paths)
  • Chinese (docs/zh/) translation of the migration guide

Switches the Bailing-MoE V2.5 family (BailingMoeV2_5ForCausalLM,
BailingMoeLinearForCausalLM, BailingHybridForCausalLM) from
mbridge-only to dual-bridge by adding a NVIDIA megatron-bridge
MegatronModelBridge adapter alongside the existing mbridge one. The
mbridge default behavior is unchanged; users opt into the new path via
mcore.bridge_type=megatron-bridge.

The dual-bridge infrastructure (config field mcore.bridge_type,
MegatronEngine._build_hf_mcore_bridge(), registry's megatron-bridge
branch, megatron-bridge==0.4.0 declared in pyproject.toml) already
exists on main.

New adapter:
- areal/models/mcore/bailing_moe_megatron_bridge.py: a single
  MegatronModelBridge subclass registered for the three Bailing arches.
  Handles per-layer heterogeneous Lightning + MLA attention dispatch
  via is_lightning_layer(layer_idx, group_size); per-layer enumeration
  in mapping_registry() because MegatronMappingRegistry's wildcard
  mappings cannot express the heterogeneous case. LightningQKVMapping
  subclass overrides hf_to_megatron / megatron_to_hf to permute fused
  QKV between HF [Q|K|V] and mcore [q0,k0,v0,...] layouts.

Registry / engine wiring:
- areal/models/mcore/registry.py: add _BAILING_ARCHITECTURES set and
  _is_bailing() helper; inject AReaL's heterogeneous Bailing layer
  spec into provider.transformer_layer_spec for the megatron-bridge
  path so the bridge's default uniform spec doesn't overwrite it.
- areal/engine/megatron_engine.py: import bailing_moe_megatron_bridge
  unconditionally so the @register_bridge decorator fires.

Shared infrastructure (applies to all current and future bridge
adapters, not Bailing-specific):
- Wrap mbridge top-level import and the mbridge-dependent
  bailing_moe_bridge module import in try/except so megatron-bridge-
  only deployments still load the engine. Raise a clear ImportError
  in _build_hf_mcore_bridge() when bridge_type=mbridge is requested
  but mbridge is unavailable.
- Drop hard mbridge.core.Bridge type annotations in hf_load.py /
  hf_save.py; use TYPE_CHECKING in registry.py.
- Replace mbridge.core.util.unwrap_model with a local _unwrap_model.
- Lazy-import LLMBridge inside patch_bridge_for_tree_training.
- Guard tests/test_estimate_num_params.py with pytest.importorskip
  so collection succeeds when mbridge is missing.

Docs:
- docs/en/best_practices/migrate_to_megatron_bridge.md — migration
  guide covering API differences, supported architectures, layer-spec
  injection pattern, and verification checklist.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for the megatron-bridge backend as an alternative to mbridge for HuggingFace to Megatron-Core weight conversion and model creation, including a new BailingMoeV25Bridge adapter for BailingMoeV2.5 models. It also adds migration documentation and updates existing modules to make mbridge imports optional. The review feedback highlights several critical issues: a potential TypeError in BailingMoeV25Bridge due to the use of functools.partial with keyword arguments, an eager evaluation bug in getattr fallback logic that could raise an AttributeError, a static capture issue in registry.py that breaks virtual pipeline parallelism (VPP > 1), and a signature mismatch in the documentation's mapping_registry code example.

Comment thread areal/models/mcore/bailing_moe_megatron_bridge.py
Comment thread areal/models/mcore/bailing_moe_megatron_bridge.py
Comment thread areal/models/mcore/registry.py
Comment thread docs/en/best_practices/migrate_to_megatron_bridge.md Outdated
`MegatronModelBridge.mapping_registry()` is a no-arg method that
accesses `self.hf_pretrained` internally; the doc sketch incorrectly
showed it taking `hf_pretrained` as a parameter. Bring the example in
line with the actual subclass signatures used in
`bailing_moe_megatron_bridge.py:449` (and the upcoming
`glm5_megatron_bridge.py:175` in PR-B).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant