Trim PR #2590: revert CP/packing to main, keep R3 minimal (no THD pat… by zyzhou5 · Pull Request #2 · zyzhou5/RL

zyzhou5 · 2026-06-12T18:30:00Z

Revert CP change

…(no THD path, no gating)

Deletes the PR's zigzag->Transformer-Engine-THD context-parallel rewrite entirely; CP/packing uses main's original per-sequence zigzag, byte-for-byte. Only minimal R3 (router-replay) plumbing is kept. No second code path, no use_thd_cp gating.

Reverted ENTIRELY to main (diff vs main = 0; THD machinery deleted):
- distributed/model_utils.py (_get_thd_partitioned_indices, _get_packed_thd_tokens_on_this_cp_rank, allgather_packed_thd_cp_sharded_tensor, AllGatherPackedTHDCPTensor, from_parallel_logits_to_logprobs* edits)
- algorithms/loss/utils.py (_pack_input_ids THD), loss/wrapper.py (_packed_cp_cu_seqlens_padded)
- tests: test_model_utils.py, test_sequence_packing_fusion.py, test_train.py

data.py:
- _pack_sequences_for_megatron is BYTE-IDENTICAL to main (5-tuple, unchanged signature/body) -> all main-arity callers (test_sequence_packing_fusion, megatron_data_actors, sequence_packing_gradient_actor) work unchanged.
- routed_experts (+ R3-trace token_identity) CP-sharding moved to a SEPARATE additive helper _shard_routed_experts_for_cp, which rides the SAME per-seq _get_tokens_on_this_cp_rank(seq_dim=0) and derives per-seq padded boundaries from the packer's own cu_seqlens_padded (drift-free). arange(K) pad rows (mcore _validate_replay_tensor), no roll. process_microbatch calls the 5-tuple packer then the helper only when routed_experts is present.
- Deleted the THD packer trio (_pack_token_aligned_*, _slice_batch_for_megatron_context_parallel).

train.py: restored main TopkLogitsPostProcessor per-seq allgather; kept R3 forward plumbing.

Kept unchanged (R3): router_replay.py, r3_trace.py, worker R3 orchestration, setup R3 config, vLLM routed_experts recording, dataplane routed_experts schema/codec.
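
A minimal sketch of the per-sequence zigzag selection described above, written in plain Python rather than the repository's tensor code. The names zigzag_indices and shard_packed_rows are hypothetical; this is not the actual _shard_routed_experts_for_cp or _get_tokens_on_this_cp_rank implementation, and the flat list layout with one top-K row per token is an assumption. The idea it illustrates: each sequence is padded out to its own boundary from cu_seqlens_padded, split into 2*cp chunks, and rank r keeps chunk r plus chunk 2*cp-1-r. Running the same selection over input_ids and routed_experts keeps the two aligned token for token.

```python
def zigzag_indices(seq_len, cp_size, cp_rank):
    """Positions of one padded sequence kept by cp_rank under zigzag sharding."""
    assert seq_len % (2 * cp_size) == 0, "sequence must be 2*cp aligned"
    chunk = seq_len // (2 * cp_size)
    back = 2 * cp_size - 1 - cp_rank  # mirrored chunk from the tail
    return (list(range(cp_rank * chunk, (cp_rank + 1) * chunk))
            + list(range(back * chunk, (back + 1) * chunk)))


def shard_packed_rows(rows, cu_seqlens, cu_seqlens_padded, cp_size, cp_rank, pad_row):
    """Pad each packed sequence to its padded boundary, then zigzag-select.

    rows holds the unpadded packed tokens; the boundaries are cumulative offsets.
    """
    out = []
    for i in range(len(cu_seqlens) - 1):
        seq = list(rows[cu_seqlens[i]:cu_seqlens[i + 1]])
        padded_len = cu_seqlens_padded[i + 1] - cu_seqlens_padded[i]
        seq += [pad_row] * (padded_len - len(seq))
        out.extend(seq[j] for j in zigzag_indices(padded_len, cp_size, cp_rank))
    return out


# Two sequences of length 3 and 5, padded to 4 and 8 (cp=2, so multiples of 4).
cu_seqlens = [0, 3, 8]
cu_seqlens_padded = [0, 4, 12]
top_k = 2

input_ids = list(range(100, 108))
routed_experts = [[t % 4, (t + 1) % 4] for t in range(8)]  # made-up expert ids
pad_expert_row = list(range(top_k))  # arange(K)-style pad row, per the description

ids_rank0 = shard_packed_rows(input_ids, cu_seqlens, cu_seqlens_padded, 2, 0, pad_row=-1)
experts_rank0 = shard_packed_rows(
    routed_experts, cu_seqlens, cu_seqlens_padded, 2, 0, pad_row=pad_expert_row
)
```

Because both calls share the same boundaries and the same index function, every real token in ids_rank0 sits at the same position as its expert row in experts_rank0, which is the property the new CPU test is described as checking.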
Tests: dropped THD-packer tests; re-pointed 2 process_microbatch tests to the 5-tuple packer + the new helper; added real CPU test test_shard_routed_experts_for_cp_matches_input_ids_zigzag (cp1/cp2) asserting routed_experts zigzag selection matches input_ids + arange(K) padding.

Regime note: per-seq zigzag == THD for 2*cp-aligned packs (make_sequence_length_divisible_by=2*cp); validated configs (gate0 PP=1; committed PP=2/CP=2/TP=4 recipe) are in-regime. THD CP-correctness for non-2*cp whole-pack tails (PP>1/fp8) belongs in a separate CP PR.

Not GPU-validated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
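
To make the 2*cp alignment in the regime note concrete, here is a small sketch of how per-sequence padded boundaries could be derived when every sequence length is rounded up to a multiple of 2*cp. The helper names round_up and padded_boundaries are hypothetical, and this is only an illustration of the arithmetic, not the packer's actual code.

```python
from itertools import accumulate


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def padded_boundaries(seq_lens, cp_size):
    """Cumulative padded offsets with each sequence aligned to 2*cp_size."""
    align = 2 * cp_size
    padded = [round_up(n, align) for n in seq_lens]
    return [0] + list(accumulate(padded))


# With cp=2, lengths 3 and 5 are padded to 4 and 8.
boundaries = padded_boundaries([3, 5], cp_size=2)
```

Sequences whose lengths are already multiples of 2*cp need no padding, so their padded and unpadded boundaries coincide.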

zyzhou5 force-pushed the r3-router-replay-main-refresh branch 8 times, most recently from 1565fe0 to 1e7bb71 on June 15, 2026 18:22

zyzhou5 wants to merge 1 commit into r3-router-replay-main-refresh from zezhou/r3-only-revert-cp
