Correctly handle `ds_grad_is_ready` in ZeRO2 --------- Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The current code has the following issues: - `use_default_specs: false` doesn't work - Injection by the traditional pattern runs even when custom patterns are set - `mpu` needs to be passed to `deepspeed.initialize` (HF integration doesn't pass mpu) This PR fixes AutoTP setup to respect `use_default_specs: false` and disable the traditional injection path when custom patterns are enabled. Also, when `mpu` is not passed, we create a TP group in the initialization process. With these changes, the [related tests](https://github.com/deepspeedai/DeepSpeed/tree/master/tests/unit/model_parallelism) pass and [all AutoTP examples](https://github.com/tohtana/DeepSpeedExamples/tree/tohtana/custom_auto_tp/training/tensor_parallel) in DeepSpeedExamples work now ([PR](deepspeedai/DeepSpeedExamples#998)). --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
```python
if self.return_router_logits:
    logits = self._cached_router_logits
    self._cached_router_logits = None
```
Populate router logits when returning tuple output
When _detect_forward_contract sets return_router_logits=True for legacy MoE blocks (router_logits_capture_target == "moe_block"), _register_logit_hook is not installed and _cached_router_logits is never set. The forward path then returns (output, None) here, which breaks callers that expect actual router logits (e.g., OutputRecorder/z-loss paths that rely on the second return value). This only shows up for models using the MoE-block tuple contract, but in that case the logits are silently missing.
Current metaclasses for layers and parameters access annotations in a way that is incompatible with Python 3.14+. See: - [Python 3.14 release notes](https://docs.python.org/3/whatsnew/3.14.html) - [Porting annotations](https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-porting-annotations) - [PEP649](https://peps.python.org/pep-0649/) and [PEP749](https://peps.python.org/pep-0749/) This PR uses annotationlib from Python 3.14 onwards and keeps backward compatibility. Closes deepspeedai#7673; should unblock CF builds for py3.14 (conda-forge/deepspeed-feedstock#114). An open question: does DeepSpeed officially support 3.14 yet? Should we test it in CI? --------- Signed-off-by: Santi Villalba <sdvillal@gmail.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
## Bug `fractions.gcd` was deprecated in Python 3.5 and removed in Python 3.9. This causes an `AttributeError` on Python 3.9+. ## Fix Replaced `fractions.gcd` with `math.gcd` which is the standard replacement.
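The replacement is a one-liner; a minimal before/after sketch:

```python
import math

# Before (AttributeError on Python 3.9+, since fractions.gcd was removed):
#   import fractions
#   fractions.gcd(12, 18)

# After: math.gcd is the drop-in standard-library replacement.
print(math.gcd(12, 18))  # 6
```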
Fixes: deepspeedai#7837 ZeRO-0 + bf16 has two bugs in `engine.py`: 1. `FP16_UnfusedOptimizer` applies `dynamic_loss_scale` with `cur_scale=65536` but `engine.backward()` never scales the loss, so `step()` divides gradients by 65536 2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, causing gradient accumulation. Fix: disable loss scaling for bf16 and remove the `zero_optimization()` gate on `zero_grad`. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
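Bug 1 can be illustrated with plain arithmetic. This is a toy sketch, not `engine.py` code: dynamic loss scaling only works if the loss is multiplied by the scale before backward, so that the optimizer's division recovers the true gradient.

```python
cur_scale = 65536.0
true_grad = 0.5

# Correct dynamic loss scaling: loss (and hence grad) is scaled up,
# then step() divides the scale back out.
scaled_grad = true_grad * cur_scale
recovered = scaled_grad / cur_scale
assert recovered == true_grad

# The buggy bf16 path: backward() never scaled the loss...
unscaled_grad = true_grad
# ...yet step() still divides by cur_scale, silently shrinking gradients.
effective = unscaled_grad / cur_scale
print(effective)  # 7.62939453125e-06 -- 65536x too small
```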
…peedai#7840) Fixes deepspeedai#7835. On torch==2.10.0, importing DeepSpeed emitted deprecation warnings from import-time JIT-decorated helpers. This change updates the compatibility path to align with PyTorch guidance while keeping import clean. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
…ost-0.18.6 Update version post release
This PR addresses deepspeedai#7677 by flattening parameter tensors on the accelerators instead of the CPU during zero stage 1 and 2 initialization. This should alleviate CPU contention, with the caveat that the optimization is only used when there is enough VRAM to allocate a full copy of the parameter buffers. On 8 x H100s and an Intel Xeon Platinum 8480+, profiling the initialization of DeepSpeed on 32 layers of `Qwen3-30B` with Z2 gives the following: Old = ~382s New = ~130s ------------------------- If necessary, this optimization can be extended to allow a tiered system that trades off VRAM space with performance, which might look like the following: ``` if enough VRAM for 2x model_size: naive flatten else if enough VRAM for model_size / N: distributed flatten across N devices else: flatten on CPU ``` The distributed flatten would involve each device flattening a portion of the parameters and performing an all-gather to assemble the full flattened model. See deepspeedai#7677 for original discussion. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Signed-off-by: Kento Sugama <kentosugama@protonmail.ch> Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com> Signed-off-by: vensen <vensenmu@gmail.com> Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: nathon <leejianwoo@gmail.com> Co-authored-by: Vensen <vensenmu@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: jp <jsb10121249@gmail.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
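The tiered decision above can be sketched as a small selector. All names here (`choose_flatten_strategy`, the memory arguments) are illustrative, not DeepSpeed APIs; the thresholds mirror the pseudocode in the PR description.

```python
def choose_flatten_strategy(available_vram: int, model_size: int, n_devices: int) -> str:
    """Pick where to flatten parameter buffers based on free accelerator memory."""
    if available_vram >= 2 * model_size:
        # Enough room for the params plus a full flat copy.
        return "naive_flatten_on_device"
    if available_vram >= model_size // n_devices:
        # Each device flattens a shard, then an all-gather assembles the whole.
        return "distributed_flatten"
    return "flatten_on_cpu"

print(choose_flatten_strategy(80, 30, 8))  # naive_flatten_on_device
print(choose_flatten_strategy(20, 30, 8))  # distributed_flatten
print(choose_flatten_strategy(2, 30, 8))   # flatten_on_cpu
```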
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR enables shared memory communication in a single node for ARM hosts - deepspeedai#7625 <img width="908" height="108" alt="image" src="https://github.com/user-attachments/assets/a5d1a5c7-f28e-4129-9503-cc2b477993ac" /> --------- Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
Added a new news entry about DeepSpeed ZeRO++ support for LLM distillation work at LinkedIn.
## Summary Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed Inference V2. Closes deepspeedai#7453 ## Changes - New model implementation: `deepspeed/inference/v2/model_implementations/exaone4/` - `container.py`: Transformer and non-transformer parameter containers - `model.py`: Inference model with post-norm architecture and QK-Norm support - `policy.py`: Inference V2 policy - Register EXAONE 4.0 in `engine_factory.py` and `__init__.py` ## Key architectural differences from Mistral/Llama - **Post-norm**: RMSNorm is applied after attention/MLP outputs (not before), followed by residual addition - **QK-Norm**: Per-head RMSNorm applied to Q and K projections after the QKV linear layer - **Hybrid attention**: 32B model uses 3:1 sliding window/full attention ratio (via `layer_types` config) ## Supported models - [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B) (all full attention) - [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B) (hybrid sliding/full attention) Requires `transformers >= 4.54.0`. ## Related - Supersedes deepspeedai#7456 (draft, inactive for 6 months) --------- Signed-off-by: Bias92 <pewpewplay315@gmail.com>
deepspeedai#7846) Fixes deepspeedai#7843 On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in the code are not provided, e.g.: - `__ll2bfloat16_rn` - `__int2bfloat16_rn` - `__short2bfloat16_rn` - `__bfloat162uint_rn` This causes compilation errors on HIP platforms. This PR introduces fallback paths using functions available on the HIP platform, mirroring the [conversion util in csrc](https://github.com/deepspeedai/DeepSpeed/blob/2c362837b0ef906ea7e7506bab3a625faa945cdd/csrc/includes/conversion_utils.h#L351). The conversion paths are: - int/uint -> bf16: convert to float (or double for 64-bit), then to bf16. - bf16 -> int/uint: convert bf16 to float, then to the integer type. - float -> bf16: build from bf16 via supported HIP helpers. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`EvoformerAttnBuilder` returns instances of `Path` from `include_paths` which then cause failures in `OpBuilder.builder` when passing them to `strip_empty_entries` that calls `len` on them which isn't defined for `Path` instances: > TypeError: object of type 'PosixPath' has no len() Fixes regression introduced in deepspeedai#7760 cc @sdvillal Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
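The failure mode is easy to reproduce in isolation. This is a toy sketch of the failing pattern, not the actual `OpBuilder` code; `PurePosixPath` stands in for the `Path` objects the builder returned.

```python
from pathlib import PurePosixPath

def strip_empty_entries(args):
    # Mirrors the failing pattern: len() is called on every entry,
    # but Path objects do not define __len__.
    return [x for x in args if len(x) > 0]

paths = [PurePosixPath("/usr/include"), ""]
try:
    strip_empty_entries(paths)
except TypeError:
    print("TypeError: Path objects have no len()")

# Fix on the builder side: return plain strings from include_paths.
print(strip_empty_entries([str(p) for p in paths]))  # ['/usr/include']
```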
…edai#7832) deepspeedai#7817 added a test to verify that we throw an error when parameters are modified in `GatheredParameters` and `modifier_rank` is None. However, the PR just checks devices and doesn't detect modifications on parameters. This causes an [error](https://github.com/deepspeedai/DeepSpeed/actions/runs/21653729382/job/62424014222) in our full test run. This PR adds the detection of parameter modifications to properly throw an error. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
PR deepspeedai#7839 introduced a regression by changing `TestZeroStaticScale` from `assert optim.dynamic_loss_scale == False` to `assert optim.loss_scale_config.dynamic_loss_scale == False`. `loss_scale_config` is not part of the ZeRO optimizer (only non-ZeRO optimizers have it), while this test runs with ZeRO optimizers. With this fix, `TestZeroStaticScale` now passes for stages 1/2/3. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full test workflow passed though it is still flaky ([Success](https://github.com/deepspeedai/DeepSpeed/actions/runs/22269243373) / [Failure](https://github.com/deepspeedai/DeepSpeed/actions/runs/22266498530)). This PR schedules a nightly run of the full test. It runs only when there have been updates since the last successful run. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…pspeedai#7874) Fix links and menu items for the AutoTP doc Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Recent attempts at the nightly full test [kept failing](https://github.com/deepspeedai/DeepSpeed/actions/workflows/aws-torch-latest-full.yml). We added a fallback to an A100 node on the infra side. This PR detects the CUDA architecture and the number of GPUs and exports them as env vars. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…eepspeedai#7932) ## Summary - Remove the `# Copyright (c) Microsoft Corporation.` line from the license header template in both AGENTS.md and CLAUDE.md - The project license header should only contain the SPDX identifier and team attribution ## Test plan - [x] Verify AGENTS.md and CLAUDE.md no longer reference Microsoft Corporation copyright Signed-off-by: Zhipeng Wang <zwanga@wustl.edu> Co-authored-by: Zhipeng Wang <zwanga@wustl.edu> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
refactor(module_inject): consolidate duplicate transpose functions - Extract the duplicated `transpose` function into `deepspeed/module_inject/utils.py`. - Remove redundant `transpose` definitions from `policy.py` and `load_checkpoint.py`. - This resolves an existing `TODO (lekurile)` to consolidate the function across containers. --------- Signed-off-by: nathon-lee <leejianwoo@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full CI test [fails](https://github.com/deepspeedai/DeepSpeed/actions/runs/23735417401/job/69138729446) throwing "RuntimeError: Cannot re-initialize CUDA" because of tests for universal checkpoint and AutoTP. It happens because they run `torch.cuda.current_device()` under `pytest --forked`. As the tests only touch universal checkpoint metadata, that call is unnecessary. This PR skips constructor-time AutoTP materialization when `mp_group` is `None`. Partitioning still happens in real AutoTP usage, where an actual model-parallel group is given. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Removing the file used as the file-store while the process group is still active is invalid, as the file is still in use. If `reuse_dist_env` is `True`, the process group stays active and the processes will try reading from that file, waiting for it to exist. During shutdown (`destroy_process_group`) they wait for all threads to join, but at least one is still waiting for that file. This causes the process to hang until a PyTorch-internal timeout is reached, which is currently ~5 minutes. The solution is to create a unique file. I chose to put it in `tmpdir` and add a suffix to differentiate it. Note that `tmpdir` alone is not enough, as this method is already called once during fixture setup, so the directory is not clean when called again later in the test execution. CC @mrwyattii, who added this code in deepspeedai#3850 --------- Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
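The unique-file approach can be sketched as below. The helper name and suffix scheme are illustrative, not the actual test-fixture code; the point is that each call yields a distinct path inside `tmpdir`.

```python
import os
import tempfile
import uuid

def unique_file_store_path(tmpdir: str) -> str:
    """Return a per-call unique file-store path under tmpdir.

    tmpdir alone is not enough: the fixture may call this more than once,
    so a random suffix guarantees each process group gets a fresh file.
    """
    return os.path.join(tmpdir, f"dist_file_store_{uuid.uuid4().hex}")

with tempfile.TemporaryDirectory() as d:
    a, b = unique_file_store_path(d), unique_file_store_path(d)
    print(a != b)  # True: two calls never collide
```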
refactor(zero3): factor out defragment method to zero utils Resolves a TODO in the codebase by extracting the `defragment` logic out of `DeepSpeedZeroOptimizer_Stage3` and moving it to `deepspeed/runtime/zero/utils.py` as an independent utility function. This decouples the memory defragmentation logic from the core optimizer class and improves code maintainability and reusability. --------- Signed-off-by: nathon-lee <leejianwoo@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This way one can also register flash-attn-based kernels with SP --------- Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
deepspeedai#7948) …U accelerator The on-device flatten path (introduced in deepspeedai#7828) passes nn.Parameter objects with requires_grad=True to torch.cat(), creating a flat buffer with CatBackward0 grad_fn. Later, _unflatten_dense_tensors produces SplitBackward0 views that are assigned to model params. Inplace copy_() on these views during optimizer step raises: RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace. This especially affects CPU training where CPU_Accelerator.is_available() returns True and available_memory() returns system RAM, so the on-device path is always taken. Fix: add .detach() to the flattened buffer, matching the implicit detach behavior of the CPU-offload path (param.data.cpu() + .to(device)). Also rename flatten_on_gpu -> flatten_on_accelerator and replace GPU-specific terminology in comments/logs with accelerator-generic equivalents. --------- Signed-off-by: Guokai Ma <guokai.ma@intel.com> Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
There are 2 files with the same file stem resulting in conflicting Ninja rules as there will be 2 rules for `fp_quantize.o`. Simply rename the .cu file. Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
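The conflict arises because object files are named after the source stem, so `fp_quantize.cpp` and `fp_quantize.cu` both map to `fp_quantize.o`. A small sketch (illustrative paths, not the actual build code) of detecting such collisions:

```python
from collections import Counter
from pathlib import PurePosixPath

sources = [
    "csrc/fp_quantizer/fp_quantize.cpp",
    "csrc/fp_quantizer/fp_quantize.cu",
]

# Two sources sharing a stem would produce two Ninja rules for the same .o
stems = Counter(PurePosixPath(s).stem for s in sources)
conflicts = [stem for stem, n in stems.items() if n > 1]
print(conflicts)  # ['fp_quantize']
```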
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> (cherry picked from commit cc45af3)
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR fixes a bug introduced in [deepspeedai#6550](deepspeedai#6550), which was also pointed out in [this comment](deepspeedai#6550 (comment)). The issue is that gradients are only copied to CPU when micro_step_id=0. For micro_step_id > 0, the gradients were effectively dropped instead of being accumulated, which leads to an artificially smaller gradient norm. With this fix, gradients are copied and accumulated on every microstep, matching the expected behavior and restoring the correct gradient norm. The plot below shows the impact clearly: the previous implementation significantly underestimates the gradient norm compared to the fixed version. <img width="808" height="584" alt="grad_norms" src="https://github.com/user-attachments/assets/6a0d968c-88cc-4b69-b990-3e2aa1c892b0" /> Setup: SFT run using OpenRLHF with DeepSpeed. - OpenRLHF CPU-offloaded buggy baseline: gradients dropped for microstep > 0 - OpenRLHF CPU-offloaded fixed version: correct accumulation across all microsteps - OpenRLHF GPU, non-offloaded version: reference correct behavior - Verl (FSDP optimizer): additional reference baseline using PyTorch FSDP The fixed version matches non-offloaded DeepSpeed and FSDP, confirming correct gradient accumulation. Effect on loss: <img width="2943" height="1742" alt="loss_cpu_optimizer_comparison" src="https://github.com/user-attachments/assets/edf1dfd7-9b5f-46fe-b174-fcc57b36225c" /> --------- Signed-off-by: Alexis Limozin <alexis@limozin.net>
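The copy-vs-accumulate difference can be shown with a toy scalar gradient; this is an illustration of the described behavior, not the offload code itself:

```python
micro_grads = [0.1, 0.2, 0.3]

# Buggy path: the gradient is copied to CPU only when micro_step_id == 0,
# so contributions from later microsteps are silently dropped.
buggy = 0.0
for micro_step_id, g in enumerate(micro_grads):
    if micro_step_id == 0:
        buggy = g

# Fixed path: copy and accumulate on every microstep.
fixed = 0.0
for g in micro_grads:
    fixed += g

print(buggy, round(fixed, 10))  # 0.1 0.6 -- the buggy norm is artificially small
```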
This PR fixes a ZeRO 1/2 overlap-comm correctness issue. When comparing loss values, we found that only ZeRO2 shows NaN as the loss. - zero1: 11.201002 -> 11.165665 -> 11.213738 -> 11.121310 - zero2: 11.201002 -> 11.165665 -> nan - zero3: 11.201002 -> 11.165665 -> 11.204460 -> 11.121443 Here is what we found: In `allreduce_and_copy_with_multiple_ranks()` and `allreduce_and_copy()`, the reduction result and copied destination buffers were used on the reduction stream without recording that stream on the underlying storage, allowing the caching allocator to recycle that storage before the queued comm/copy work had completed. This could also impact ZeRO1, though we only encountered the issue with ZeRO2. This PR adds `record_stream` to ensure the buffer is not freed until the queued work is done. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
DeepCompile+Z3 didn't work with PyTorch v2.9/2.10 because: - PyTorch v2.9+ started enforcing stricter TorchDynamo parameter tensor-match guards. During DeepCompile tracing, some ZeRO-3 parameters were temporarily all-gathered, so Dynamo recorded full sizes such as 4096 - By the time guard evaluation ran, DeepSpeed had already released those params back to the normal ZeRO-3 partitioned representation, where `param.data` is `empty(0)`. That produced guard failures like `expected 4096, actual 0`. This PR resolves the issue by: - Keeping full-shape dummy tensors for symbolic tracing - Overriding guard size/stride metadata for ZeRO-3 params to the stable released representation instead of transient gathered sizes This PR also includes fixes for these bugs: - For v2.7 and v2.8, the compiled backward graph could hoist `end_backward` ahead of the real `reduce_grad` calls. - The selective unsharding pass could overcount the persistence memory budget. Note: DeepCompile is still incompatible with v2.11. It will be addressed by another PR. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`WarmupCosineLR` returned a singleton pre-start LR list even when the optimizer had multiple parameter groups. Because scheduler initialization applies LRs with `zip(param_groups, lrs)`, only group 0 was updated and later groups kept their base LR before the first optimizer step. The fix changes the pre-start scheduler outputs to match the multi-group contract by returning scalar `0.0` from `get_lr_ratio()` and a zero-filled LR list sized to `self.org_lrs`. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
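The zip-truncation behavior behind this bug is easy to demonstrate. A toy sketch (not the scheduler code itself): with a singleton LR list, only parameter group 0 is updated.

```python
param_groups = [{"lr": 0.1}, {"lr": 0.2}, {"lr": 0.3}]

def apply_lrs(groups, lrs):
    # zip stops at the shorter sequence, so a short lrs list silently
    # leaves the remaining groups untouched.
    for group, lr in zip(groups, lrs):
        group["lr"] = lr

apply_lrs(param_groups, [0.0])              # buggy pre-start output
print([g["lr"] for g in param_groups])      # [0.0, 0.2, 0.3] -- groups 1, 2 keep base LR

apply_lrs(param_groups, [0.0] * len(param_groups))  # fixed: list sized to the groups
print([g["lr"] for g in param_groups])      # [0.0, 0.0, 0.0]
```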
This PR enables the full test workflow to choose the PyTorch version. It resolves each version into matching `torch` / `torchvision` / `torchaudio` install versions. It also keeps nightly change detection schedule-only, so manual runs do not affect the daily baseline. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
…impl.cu (deepspeedai#7973) ## Summary - Fix `warning #68-D: integer conversion resulted in a change of sign` by using unsigned literal `1U` for all bit-shift expressions storing to unsigned types (`csrc/fp_quantizer/fp_quantize_impl.cu`) - Fix `warning #62-D: shift count is negative` by removing unused `mantisa_mask` dead code in `apply_dequantization` and `apply_selective_dequantization` The `_sign_mask` computation `1 << (_mantisa_bits + _exponent_bits)` in `apply_quantization` shifts a signed `int` literal by 31 bits (`_mantisa_bits=23, _exponent_bits=8`), which is undefined behavior in C++. Using `1U` makes the shift well-defined. For consistency and defensive programming, the same `1` → `1U` change is applied to all similar patterns in `round()`, `apply_dequantization`, and `apply_selective_dequantization`. The `mantisa_mask` variable in both dequantization functions was copy-pasted from the quantization function but **never used** in the dequantization code paths. Its initialization `mantisa_mask <<= (_mantisa_bits - q_mantisa_bits)` always produces a negative shift count because in these functions `_mantisa_bits` (quantized format, small: 1-7) < `q_mantisa_bits` (output format, large: 7 or 10). > **Note:** The issue suggested the template argument order in `launch_dequantization` / `launch_selective_dequantization` might be wrong, but analysis of the function body confirms the original order is correct — `quantized_bits = _mantisa_bits + _exponent_bits + 1` correctly computes the quantized format total bits, and `dst_mantisa << (q_mantisa_bits - _mantisa_bits)` correctly left-shifts the quantized mantissa into the output format's mantissa field. The warnings came solely from the unused dead code.
Fixes deepspeedai#7971 ## Before / After <details> <summary>Before (18+ warnings)</summary> ``` $ nvcc -c fp_quantize_impl.cu -DBF16_AVAILABLE --expt-relaxed-constexpr \ -gencode arch=compute_86,code=sm_86 -std=c++17 fp_quantize_impl.cu(82): warning #68-D: integer conversion resulted in a change of sign fp_quantize_impl.cu(244): warning #62-D: shift count is negative (x9, one per template instantiation) fp_quantize_impl.cu(426): warning #62-D: shift count is negative (x9, one per template instantiation) ``` The `mantisa_mask` variable causing the shift warnings is declared but never used in either dequantization function. </details> <details> <summary>After (0 warnings from this file)</summary> ``` $ nvcc -c fp_quantize_impl.cu -DBF16_AVAILABLE --expt-relaxed-constexpr \ -gencode arch=compute_86,code=sm_86 -std=c++17 2>&1 | grep -E '#62-D|#68-D' (no output — all warnings eliminated) ``` Compilation succeeds with exit code 0. Only unrelated `#821-D` warnings remain from `memory_access_utils.h`. </details> ## Changes ### `csrc/fp_quantizer/fp_quantize_impl.cu` 1. **Line 38, 40, 42** (`round()`) — `1` → `1U`: Consistent unsigned shifts in `mantisa_mask`, `offset`, and exponent overflow check. Not UB today (`1 << 23` fits in `int`), but prevents future issues and silences potential sign-conversion warnings. 2. **Line 82** (`apply_quantization`) — `1` → `1U`: Fix actual UB — `1 << 31` on signed `int` is undefined behavior. 3. **Line 237** (`apply_dequantization`) — `1` → `1U` in `_sign_mask`: Consistent with `apply_quantization`. Not UB with current template args (`1 << 7`), but defensive. 4. **Line 416** (`apply_selective_dequantization`) — `1` → `1U` in `_sign_mask`: Same as above. 5. **Lines 243-244** — Remove unused `mantisa_mask` in `apply_dequantization`: Copy-pasted from the quantization function but never referenced in the dequantization code path. 6. **Lines 425-426** — Remove unused `mantisa_mask` in `apply_selective_dequantization`: Same dead code as above. ## Test plan - [x] `nvcc` compilation with `-DBF16_AVAILABLE` — 0 `#62-D` / `#68-D` warnings (was 18+) - [x] Verified `mantisa_mask` (no underscore prefix) is unused in both dequantization functions by grepping all occurrences — only used in `apply_quantization` (line 142) and `round` (lines 38-42) - [x] Verified template parameter order in `launch_dequantization` and `launch_selective_dequantization` is correct by tracing all usages of `_mantisa_bits`, `_exponent_bits`, `q_mantisa_bits` in function bodies - [x] All `1` → `1U` changes are semantically identical for non-negative shift counts; the only behavioral fix is line 82 where `1 << 31` was UB Signed-off-by: Cursx <674760201@qq.com>
## Summary - Fix duplicate/wrong `-gencode=` flags in both JIT and non-JIT compilation paths (`op_builder/builder.py`) - Fix `TORCH_CUDA_ARCH_LIST` env-var restore logic in `OpBuilder.jit_load()` DeepSpeed's `compute_capability_args()` generates its own `-gencode` flags, but PyTorch (`load()` in JIT mode, `BuildExtension` in non-JIT mode) *also* reads `TORCH_CUDA_ARCH_LIST` and generates `-gencode` flags. This causes two problems: 1. **JIT mode**: `jit_load()` set `TORCH_CUDA_ARCH_LIST=""`, which PyTorch treats as *unset* and falls back to auto-detection — resulting in every flag appearing **twice**. 2. **Non-JIT mode**: subclasses that override `filter_ccs()` (e.g. `FPQuantizerBuilder`, `EvoformerAttnBuilder`) remove certain archs, but `BuildExtension` re-reads the **unfiltered** `TORCH_CUDA_ARCH_LIST` and adds them back — **undermining the filter**. The fix synchronises `TORCH_CUDA_ARCH_LIST` with the filtered arch list in `compute_capability_args()`, for both JIT and non-JIT paths. Fixes deepspeedai#7972 ## Before / After <details> <summary>Before (buggy behavior)</summary> **JIT mode** — `TORCH_CUDA_ARCH_LIST` cleared to `""`, PyTorch auto-detects and adds flags, DeepSpeed also adds the same flags: ``` nvcc ... -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 ``` Plus a spurious warning: ``` UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. ``` **Non-JIT mode** — `FPQuantizerBuilder.filter_ccs()` removes `< 8.0`, but `BuildExtension` re-adds them from the unfiltered env var: ``` # FPQuantizer compiled for sm_70 even though filter_ccs() removed it nvcc ... -gencode=arch=compute_80,code=sm_80 # from DeepSpeed (correct) ... -gencode=arch=compute_70,code=sm_70 # from BuildExtension (wrong!) 
``` </details> <details> <summary>After (fixed behavior)</summary> **JIT mode** — `TORCH_CUDA_ARCH_LIST` is set to the detected architectures, PyTorch generates flags once, no duplicates: ``` nvcc ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 ``` No spurious warning. Env var is properly restored/removed after build. **Non-JIT mode** — `TORCH_CUDA_ARCH_LIST` is updated to the filtered list. Each extension keeps its own `-gencode` flags, and `BuildExtension` reads the filtered env var: ``` # FPQuantizer: only sm_80+ as intended nvcc ... -gencode=arch=compute_80,code=sm_80 # from DeepSpeed ... -gencode=arch=compute_80,code=sm_80 # from BuildExtension (harmless dup) ``` > **Note:** in multi-builder `setup.py` builds, the last builder's filtered arch list wins for `TORCH_CUDA_ARCH_LIST`. This may cause harmless duplicates for some extensions, but will never reintroduce archs that any builder's `filter_ccs()` removed — a strict improvement over the current behavior where the unfiltered original is always used. </details> ## Changes - `op_builder/builder.py` - `CUDAOpBuilder.compute_capability_args()`: - Always sync `TORCH_CUDA_ARCH_LIST` with the filtered arch list - JIT mode: return `[]` (PyTorch generates flags via `load()`) - Non-JIT mode: return `-gencode` args as before (per-builder flags in `extra_compile_args`) - `OpBuilder.jit_load()`: simplified stash/restore — properly `del` the env var if it was not originally set --------- Signed-off-by: Cursx <674760201@qq.com>
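The stash/restore pattern fixed in `jit_load()` can be sketched in isolation. This is a hedged illustration of the general technique (the helper name is hypothetical), not DeepSpeed's actual builder code: the key point is deleting the variable afterwards when it was not originally set, rather than restoring it as `""`.

```python
import os

def with_arch_list(value, build):
    """Run build() with TORCH_CUDA_ARCH_LIST temporarily set to value."""
    key = "TORCH_CUDA_ARCH_LIST"
    stashed = os.environ.get(key)  # None means "was not set"
    os.environ[key] = value
    try:
        return build()
    finally:
        if stashed is None:
            del os.environ[key]      # properly unset, not restored as ""
        else:
            os.environ[key] = stashed

os.environ.pop("TORCH_CUDA_ARCH_LIST", None)
with_arch_list("8.0", lambda: None)
print("TORCH_CUDA_ARCH_LIST" in os.environ)  # False: no empty-string leftover
```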
Followup to deepspeedai#7973 for deepspeedai#7971 The naming of q_mantisa_bits and mantisa_bits was swapped. The invocation set: ``` q_mantisa_bits = mantisa _mantisa_bits = CONST_Q_MANTISA_BITS _exponent_bits = CONST_Q_EXPONENT_BITS ``` so correct them by swapping the names back. I noticed that the code needs a thorough review because multiple places look suspicious: ``` // Why the default args? They seem to not even be matching (16 != 3+4+1) int total_q_bits = 16, int q_mantisa_bits = 3, int q_exponent_bits = 4> // Why recompute if there is a total_q_bits template? constexpr int quantized_bits = q_mantisa_bits + q_exponent_bits + 1; // Likely wrong: total_q_bits < mantisa_bits --> negative bits? Likely caused by wrong naming constexpr int q_exponent_bits = total_q_bits - mantisa_bits - 1; // should likely use a `q_` prefix not `_` constexpr uint16_t _mantisa_mask = (1 << q_mantisa_bits) - 1; constexpr uint16_t _exponent_mask = ((1 << q_exponent_bits) - 1) << q_mantisa_bits; constexpr uint16_t _sign_mask = 1U << (q_mantisa_bits + q_exponent_bits); ``` cc @Cursx Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add AutoEP