Add AutoEP #5

Open
tohtana wants to merge 84 commits into tohtana/add_autoep_review from tohtana/add_autoep

Conversation

@tohtana (Owner) commented Feb 8, 2026

Add AutoEP
@codex

sfc-gh-truwase and others added 6 commits February 3, 2026 22:26
Correctly handle `ds_grad_is_ready` in ZeRO2

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The current code has the following issues:
- `use_default_specs: false` doesn't work
- Injection by the traditional pattern runs even when custom patterns
are set
- `mpu` needs to be passed to `deepspeed.initialize` (HF integration
doesn't pass mpu)

This PR fixes AutoTP setup to respect `use_default_specs: false` and
disable the traditional injection path when custom patterns are enabled.
Also, when `mpu` is not passed, we create a TP group in the
initialization process.


With these changes, the [related
tests](https://github.com/deepspeedai/DeepSpeed/tree/master/tests/unit/model_parallelism)
pass, and [all AutoTP
examples](https://github.com/tohtana/DeepSpeedExamples/tree/tohtana/custom_auto_tp/training/tensor_parallel)
in DeepSpeedExamples now work
([PR](deepspeedai/DeepSpeedExamples#998)).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana commented Feb 8, 2026

@codex

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

@tohtana tohtana changed the base branch from master to tohtana/add_autoep_review February 8, 2026 07:27
tohtana commented Feb 9, 2026

@codex

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd07c93a5e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on deepspeed/checkpoint/autoep_universal.py (outdated)
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana commented Feb 9, 2026

@codex

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cabfebcdca


Comment on lines +527 to +529
if self.return_router_logits:
    logits = self._cached_router_logits
    self._cached_router_logits = None

P2: Populate router logits when returning tuple output

When _detect_forward_contract sets return_router_logits=True for legacy MoE blocks (router_logits_capture_target == "moe_block"), _register_logit_hook is not installed and _cached_router_logits is never set. The forward path then returns (output, None) here, which breaks callers that expect actual router logits (e.g., OutputRecorder/z-loss paths that rely on the second return value). This only shows up for models using the MoE-block tuple contract, but in that case the logits are silently missing.


sdvillal and others added 16 commits February 9, 2026 16:22
Current metaclasses for layers and parameters access annotations in a
way that is incompatible with python 3.14+

See:
- [Python 3.14 release
notes](https://docs.python.org/3/whatsnew/3.14.html)
- [Porting
annotations](https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-porting-annotations)
- [PEP649](https://peps.python.org/pep-0649/) and
[PEP749](https://peps.python.org/pep-0749/)

This PR uses annotationlib from python 3.14 onwards and keeps backwards
compatibility.
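A version-gated shim along these lines (the helper name is illustrative, not the actual DeepSpeed code) keeps annotation access working on both sides of the 3.14 boundary:

```python
import sys

def class_annotations(cls):
    # On 3.14+, annotations are lazily evaluated (PEP 649/749) and
    # annotationlib is the supported access path.
    if sys.version_info >= (3, 14):
        import annotationlib
        return annotationlib.get_annotations(cls)
    # Older interpreters: plain attribute access still works.
    return dict(getattr(cls, "__annotations__", {}))
```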

closes deepspeedai#7673
should unblock CF builds for py3.14
conda-forge/deepspeed-feedstock#114

An open question: does DeepSpeed officially support 3.14 yet? Should we
test it in CIs?

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
## Bug
`fractions.gcd` was deprecated in Python 3.5 and removed in Python 3.9.
This causes an `AttributeError` on Python 3.9+.

## Fix
Replaced `fractions.gcd` with `math.gcd` which is the standard
replacement.
Fixes: deepspeedai#7837
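The replacement is a drop-in rename, for example:

```python
import math

# fractions.gcd(a, b) was removed in Python 3.9;
# math.gcd is the standard-library replacement.
assert math.gcd(48, 36) == 12
assert math.gcd(0, 5) == 5
```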

ZeRO-0 + bf16 has two bugs in `engine.py`: 
1. `FP16_UnfusedOptimizer` applies `dynamic_loss_scale` with
`cur_scale=65536` but `engine.backward()` never scales the loss, so
`step()` divides gradients by 65536
2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, causing
gradient accumulation.

Fix: disable loss scaling for bf16 and remove the `zero_optimization()`
gate on `zero_grad`.
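A minimal sketch of the first fix (function name and values are illustrative, not DeepSpeed's actual API):

```python
def initial_loss_scale(dtype: str) -> float:
    # bf16 shares fp32's exponent range, so gradients do not underflow
    # and no loss scaling is needed; applying a scale without a matching
    # scaled backward() would make step() divide gradients by the scale,
    # which is exactly the bug described above.
    return 1.0 if dtype == "bf16" else 65536.0
```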

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…peedai#7840)

Fixes deepspeedai#7835.

On torch==2.10.0, importing DeepSpeed emitted deprecation warnings from
import-time JIT-decorated helpers.
This change updates the compatibility path to align with PyTorch
guidance while keeping import clean.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
This PR addresses deepspeedai#7677 by flattening parameter tensors on the
accelerators instead of the CPU during zero stage 1 and 2
initialization. This should alleviate CPU contention, with the caveat
that the optimization is only used when there is enough VRAM to allocate
a full copy of the parameter buffers.

On 8 x H100s and an Intel Xeon Platinum 8480+, profiling the
initialization of DeepSpeed on 32 layers of `Qwen3-30B` with Z2 gives
the following:

Old = ~382s
New = ~130s

-------------------------

If necessary, this optimization can be extended to allow a tiered
system that trades off VRAM usage against performance, which might look
like the following:

```
if enough VRAM for 2x model_size:
    naive flatten
else if enough VRAM for model_size / N:
    distributed flatten across N devices
else:
    flatten on CPU
```

The distributed flatten would involve each device flattening a portion
of the parameters and performing an all-gather to assemble the full
flattened model. See deepspeedai#7677 for original discussion.
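The single-tier check that the PR implements can be sketched in plain Python (names are illustrative; the real code queries the accelerator for free memory):

```python
def choose_flatten_device(param_bytes: int, accel_free_bytes: int) -> str:
    # Flatten on the accelerator only when a full copy of the parameter
    # buffers fits in free device memory; otherwise keep the slower but
    # safe CPU flatten path.
    return "accelerator" if accel_free_bytes >= param_bytes else "cpu"
```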

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Kento Sugama <kentosugama@protonmail.ch>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: nathon <leejianwoo@gmail.com>
Co-authored-by: Vensen <vensenmu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: jp <jsb10121249@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR enables shared-memory communication on a single node for ARM hosts
- deepspeedai#7625

<img width="908" height="108" alt="image"
src="https://github.com/user-attachments/assets/a5d1a5c7-f28e-4129-9503-cc2b477993ac"
/>

---------

Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
Added a new news entry about DeepSpeed ZeRO++ support for LLM
distillation work at LinkedIn.
## Summary
Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed
Inference V2.

Closes deepspeedai#7453

## Changes
- New model implementation:
`deepspeed/inference/v2/model_implementations/exaone4/`
  - `container.py`: Transformer and non-transformer parameter containers
- `model.py`: Inference model with post-norm architecture and QK-Norm
support
  - `policy.py`: Inference V2 policy
- Register EXAONE 4.0 in `engine_factory.py` and `__init__.py`

## Key architectural differences from Mistral/Llama
- **Post-norm**: RMSNorm is applied after attention/MLP outputs (not
before), followed by residual addition
- **QK-Norm**: Per-head RMSNorm applied to Q and K projections after the
QKV linear layer
- **Hybrid attention**: 32B model uses 3:1 sliding window/full attention
ratio (via `layer_types` config)

## Supported models
- [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B)
(all full attention)
- [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
(hybrid sliding/full attention)

Requires `transformers >= 4.54.0`.

## Related
- Supersedes deepspeedai#7456 (draft, inactive for 6 months)

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
deepspeedai#7846)

Fixes deepspeedai#7843

On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in
the code are not provided, e.g.:
- `__ll2bfloat16_rn`
- `__int2bfloat16_rn`
- `__short2bfloat16_rn`
- `__bfloat162uint_rn`

This causes compilation errors on HIP platforms.

This PR introduces fallback paths using functions available on HIP
platform mirroring the [conversion util in
csrc](https://github.com/deepspeedai/DeepSpeed/blob/2c362837b0ef906ea7e7506bab3a625faa945cdd/csrc/includes/conversion_utils.h#L351).
The conversion paths are:

- int/uint -> bf16: convert to float (or double for 64-bit), then to
bf16.
- bf16 -> int/uint: convert bf16 to float, then to the integer type.
- float -> bf16: build from bf16 via supported HIP helpers.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`EvoformerAttnBuilder` returns `Path` instances from `include_paths`,
which then cause failures in `OpBuilder.builder` when they are passed to
`strip_empty_entries`, which calls `len` on them; `len` is not defined
for `Path` instances:
>   TypeError: object of type 'PosixPath' has no len()

Fixes regression introduced in deepspeedai#7760
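A defensive version of the filter (a sketch; the real `strip_empty_entries` lives on `OpBuilder`) normalizes each entry to `str` before checking its length:

```python
from pathlib import Path

def strip_empty_entries(entries):
    # str() makes the emptiness check safe for Path objects,
    # which define no __len__ and so crash a bare len(entry).
    return [e for e in entries if len(str(e)) > 0]
```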

cc @sdvillal

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
…edai#7832)

deepspeedai#7817 added a test to verify that we throw an error when parameters are
modified in `GatheredParameters` and `modifier_rank` is None. However,
the PR just checks devices and doesn't detect modifications on
parameters.
This causes an
[error](https://github.com/deepspeedai/DeepSpeed/actions/runs/21653729382/job/62424014222)
in our full test run.

This PR adds the detection of parameter modifications to properly throw
an error.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
PR deepspeedai#7839 introduced a regression by changing `TestZeroStaticScale` from
`assert optim.dynamic_loss_scale == False` to `assert
optim.loss_scale_config.dynamic_loss_scale == False`.
`loss_scale_config` is not part of the ZeRO optimizer (only non-ZeRO
optimizer have it), while this test runs with ZeRO optimizers.

With this fix, `TestZeroStaticScale` now passes for stages 1/2/3.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full test workflow passed, though it is still flaky
([Success](https://github.com/deepspeedai/DeepSpeed/actions/runs/22269243373)
/
[Failure](https://github.com/deepspeedai/DeepSpeed/actions/runs/22266498530))

This PR schedules a nightly run of the full test. It is launched only
when there have been updates since the last successful run.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…pspeedai#7874)

Fix links and menu items for the AutoTP doc

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana and others added 30 commits March 30, 2026 08:16
Recent attempts of the nightly full test [kept
failing](https://github.com/deepspeedai/DeepSpeed/actions/workflows/aws-torch-latest-full.yml).
We added a fallback to an A100 node on the infra side.
This PR detects the CUDA architecture and number of GPUs and sets them
to env vars.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…eepspeedai#7932)

## Summary
- Remove the `# Copyright (c) Microsoft Corporation.` line from the
license header template in both AGENTS.md and CLAUDE.md
- The project license header should only contain the SPDX identifier and
team attribution

## Test plan
- [x] Verify AGENTS.md and CLAUDE.md no longer reference Microsoft
Corporation copyright

Signed-off-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
refactor(module_inject): consolidate duplicate transpose functions

- Extract the duplicated `transpose` function into
`deepspeed/module_inject/utils.py`.
- Remove redundant `transpose` definitions from `policy.py` and
`load_checkpoint.py`.
- This resolves an existing `TODO (lekurile)` to consolidate the
function across containers.

---------

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full CI test
[fails](https://github.com/deepspeedai/DeepSpeed/actions/runs/23735417401/job/69138729446)
throwing "RuntimeError: Cannot re-initialize CUDA" because of tests for
universal checkpoint and AutoTP.

It happens because they run `torch.cuda.current_device()` under `pytest
--forked`. Since the tests only touch universal checkpoint metadata, the
call is unnecessary. This PR skips constructor-time AutoTP
materialization when `mp_group` is `None`.
Partitioning still happens in the real AutoTP usage where an actual
model-parallel group is given.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Removing the file used as the file store while the process group is
still active is invalid, as it is still in use.
If `reuse_dist_env` is `True`, the process group is still active and the
processes will try reading from that file, waiting for it to exist. In
the shutdown (`destroy_process_group`) they will wait for all threads to
join, but (at least) one is still waiting for that file. This causes
the process to hang until a PyTorch-internal timeout is reached, which
is currently ~5 minutes.

The solution is to create a unique file. I chose to put it in `tmpdir`
and add a suffix to differentiate it.

Note that `tmpdir` alone is not enough: this method is already called
once during fixture setup, so the directory is not clean when the method
is called again later in the test execution.
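The per-call unique path can be sketched as follows (the helper name is hypothetical):

```python
import os
import uuid

def unique_filestore_path(tmpdir: str) -> str:
    # A fresh suffix per call guarantees that a still-active process
    # group never shares its file store with a later test run.
    return os.path.join(tmpdir, f"filestore_{uuid.uuid4().hex}")
```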

CC @mrwyattii , author of deepspeedai#3850 adding this code

---------

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
refactor(zero3): factor out defragment method to zero utils

Resolves a TODO in the codebase by extracting the `defragment` logic out
of `DeepSpeedZeroOptimizer_Stage3` and moving it to
`deepspeed/runtime/zero/utils.py`
as an independent utility function. This decouples the memory
defragmentation logic
from the core optimizer class and improves code maintainability and
reusability.

---------

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This way one can also register flash-attn-based kernels with SP.

---------

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
deepspeedai#7948)

…U accelerator

The on-device flatten path (introduced in deepspeedai#7828) passes nn.Parameter
objects with requires_grad=True to torch.cat(), creating a flat buffer
with CatBackward0 grad_fn. Later, _unflatten_dense_tensors produces
SplitBackward0 views that are assigned to model params. Inplace copy_()
on these views during optimizer step raises:
RuntimeError: Output 0 of SplitBackward0 is a view and is being modified
inplace.

This especially affects CPU training where
CPU_Accelerator.is_available() returns True and available_memory()
returns system RAM, so the on-device path is always taken.

Fix: add .detach() to the flattened buffer, matching the implicit detach
behavior of the CPU-offload path (param.data.cpu() + .to(device)).

Also rename flatten_on_gpu -> flatten_on_accelerator and replace
GPU-specific terminology in comments/logs with accelerator-generic
equivalents.

---------

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
There are 2 files with the same file stem resulting in conflicting Ninja
rules as there will be 2 rules for `fp_quantize.o`.

Simply rename the .cu file.

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
(cherry picked from commit cc45af3)
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR fixes a bug introduced in
[deepspeedai#6550](deepspeedai#6550), which was
also pointed out in [this
comment](deepspeedai#6550 (comment)).

The issue is that gradients are only copied to CPU when micro_step_id=0.
For micro_step_id > 0, the gradients were effectively dropped instead of
being accumulated, which leads to an artificially smaller gradient norm.

With this fix, gradients are copied and accumulated on every microstep,
matching the expected behavior and restoring the correct gradient norm.
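The behavioral difference, reduced to plain Python (illustrative; the real code copies tensors into pinned CPU buffers):

```python
def accumulate_to_cpu(cpu_grads, new_grads, micro_step_id):
    # Fixed behavior: copy on the first microstep, accumulate on every
    # later one. The buggy version only handled micro_step_id == 0,
    # silently dropping all later microsteps' gradients.
    for i, g in enumerate(new_grads):
        if micro_step_id == 0:
            cpu_grads[i] = g
        else:
            cpu_grads[i] += g
    return cpu_grads
```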

The plot below shows the impact clearly: the previous implementation
significantly underestimates the gradient norm compared to the fixed
version.

<img width="808" height="584" alt="grad_norms"
src="https://github.com/user-attachments/assets/6a0d968c-88cc-4b69-b990-3e2aa1c892b0"
/>

Setup: SFT run using OpenRLHF with DeepSpeed.

- OpenRLHF CPU-offloaded buggy baseline: gradients dropped for microstep
> 0
- OpenRLHF CPU-offloaded fixed version: correct accumulation across all
microsteps
- OpenRLHF GPU, non-offloaded version: reference correct behavior
- Verl (FSDP optimizer): additional reference baseline using PyTorch
FSDP

The fixed version matches non-offloaded DeepSpeed and FSDP, confirming
correct gradient accumulation.

Effect on loss:
<img width="2943" height="1742" alt="loss_cpu_optimizer_comparison"
src="https://github.com/user-attachments/assets/edf1dfd7-9b5f-46fe-b174-fcc57b36225c"
/>

---------

Signed-off-by: Alexis Limozin <alexis@limozin.net>
This PR fixes a ZeRO 1/2 overlap-comm correctness issue. 

When comparing loss values, we found that only ZeRO2 shows nan as a
loss.

- zero1: 11.201002 -> 11.165665 -> 11.213738 -> 11.121310
- zero2: 11.201002 -> 11.165665 -> nan
- zero3: 11.201002 -> 11.165665 -> 11.204460 -> 11.121443

Here is what we found:
In `allreduce_and_copy_with_multiple_ranks()` and
`allreduce_and_copy()`, the reduction result and copied destination
buffers were used on the reduction stream without recording that stream
on the underlying storage, allowing the caching allocator to recycle
that storage before the queued comm/copy work had completed.
This could impact also ZeRO1 though we only encountered the issue with
ZeRO2.

This PR adds `record_stream` to ensure the buffer is not freed until the
queued work is done.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
DeepCompile+Z3 didn't work with PyTorch v2.9/2.10 because:
- PyTorch v2.9+ started enforcing stricter TorchDynamo parameter
tensor-match guards. During DeepCompile tracing, some ZeRO-3 parameters
were temporarily all-gathered, so Dynamo recorded full sizes such as
4096
- By the time guard evaluation ran, DeepSpeed had already released those
params back to the normal ZeRO-3 partitioned representation, where
`param.data` is `empty(0)`. That produced guard failures like `expected
4096, actual 0`.

This PR resolves the issue by:
- Keep full-shape dummy tensors for symbolic tracing
- Override guard size/stride metadata for ZeRO-3 params to the stable
released representation instead of transient gathered sizes

This PR also includes fixes for these bugs:
- For v2.7 and v2.8, the compiled backward graph could hoist
`end_backward` ahead of the real `reduce_grad` calls.
- The selective unsharding pass could overcount the persistence memory
budget.

Note: DeepCompile is still incompatible with v2.11. It will be addressed
by another PR.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`WarmupCosineLR` returned a singleton pre-start LR list even when the
optimizer had multiple parameter groups. Because scheduler
initialization applies LRs with `zip(param_groups, lrs)`, only group 0
was updated and later groups kept their base LR before the first
optimizer step.

The fix changes the pre-start scheduler outputs to match the multi-group
contract by returning scalar `0.0` from `get_lr_ratio()` and a
zero-filled LR list sized to `self.org_lrs`.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR enables the full test workflow to choose the PyTorch version. It
resolves each version into the matching `torch` / `torchvision` /
`torchaudio` install versions.
It also keeps nightly change detection schedule-only, so manual runs do
not affect the daily baseline.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
…impl.cu (deepspeedai#7973)

## Summary

- Fix `warning #68-D: integer conversion resulted in a change of sign`
by using unsigned literal `1U` for all bit-shift expressions storing to
unsigned types (`csrc/fp_quantizer/fp_quantize_impl.cu`)
- Fix `warning #62-D: shift count is negative` by removing unused
`mantisa_mask` dead code in `apply_dequantization` and
`apply_selective_dequantization`

The `_sign_mask` computation `1 << (_mantisa_bits + _exponent_bits)` in
`apply_quantization` shifts a signed `int` literal by 31 bits
(`_mantisa_bits=23, _exponent_bits=8`), which is undefined behavior in
C++. Using `1U` makes the shift well-defined. For consistency and
defensive programming, the same `1` → `1U` change is applied to all
similar patterns in `round()`, `apply_dequantization`, and
`apply_selective_dequantization`.

The `mantisa_mask` variable in both dequantization functions was
copy-pasted from the quantization function but **never used** in the
dequantization code paths. Its initialization `mantisa_mask <<=
(_mantisa_bits - q_mantisa_bits)` always produces a negative shift count
because in these functions `_mantisa_bits` (quantized format, small:
1-7) < `q_mantisa_bits` (output format, large: 7 or 10).

> **Note:** The issue suggested the template argument order in
`launch_dequantization` / `launch_selective_dequantization` might be
wrong, but analysis of the function body confirms the original order is
correct — `quantized_bits = _mantisa_bits + _exponent_bits + 1`
correctly computes the quantized format total bits, and `dst_mantisa <<
(q_mantisa_bits - _mantisa_bits)` correctly left-shifts the quantized
mantissa into the output format's mantissa field. The warnings came
solely from the unused dead code.

Fixes deepspeedai#7971

## Before / After

<details>
<summary>Before (18+ warnings)</summary>

```
$ nvcc -c fp_quantize_impl.cu -DBF16_AVAILABLE --expt-relaxed-constexpr \
       -gencode arch=compute_86,code=sm_86 -std=c++17

fp_quantize_impl.cu(82):  warning #68-D: integer conversion resulted in a change of sign
fp_quantize_impl.cu(244): warning #62-D: shift count is negative   (x9, one per template instantiation)
fp_quantize_impl.cu(426): warning #62-D: shift count is negative   (x9, one per template instantiation)
```

The `mantisa_mask` variable causing the shift warnings is declared but
never used in either dequantization function.

</details>

<details>
<summary>After (0 warnings from this file)</summary>

```
$ nvcc -c fp_quantize_impl.cu -DBF16_AVAILABLE --expt-relaxed-constexpr \
       -gencode arch=compute_86,code=sm_86 -std=c++17 2>&1 | grep -E '#62-D|#68-D'

(no output — all warnings eliminated)
```

Compilation succeeds with exit code 0. Only unrelated `#821-D` warnings
remain from `memory_access_utils.h`.

</details>

## Changes

### `csrc/fp_quantizer/fp_quantize_impl.cu`

1. **Line 38, 40, 42** (`round()`) — `1` → `1U`: Consistent unsigned
shifts in `mantisa_mask`, `offset`, and exponent overflow check. Not UB
today (`1 << 23` fits in `int`), but prevents future issues and silences
potential sign-conversion warnings.

2. **Line 82** (`apply_quantization`) — `1` → `1U`: Fix actual UB — `1
<< 31` on signed `int` is undefined behavior.

3. **Line 237** (`apply_dequantization`) — `1` → `1U` in `_sign_mask`:
Consistent with `apply_quantization`. Not UB with current template args
(`1 << 7`), but defensive.

4. **Line 416** (`apply_selective_dequantization`) — `1` → `1U` in
`_sign_mask`: Same as above.

5. **Lines 243-244** — Remove unused `mantisa_mask` in
`apply_dequantization`: Copy-pasted from the quantization function but
never referenced in the dequantization code path.

6. **Lines 425-426** — Remove unused `mantisa_mask` in
`apply_selective_dequantization`: Same dead code as above.

## Test plan

- [x] `nvcc` compilation with `-DBF16_AVAILABLE` — 0 `#62-D` / `#68-D`
warnings (was 18+)
- [x] Verified `mantisa_mask` (no underscore prefix) is unused in both
dequantization functions by grepping all occurrences — only used in
`apply_quantization` (line 142) and `round` (lines 38-42)
- [x] Verified template parameter order in `launch_dequantization` and
`launch_selective_dequantization` is correct by tracing all usages of
`_mantisa_bits`, `_exponent_bits`, `q_mantisa_bits` in function bodies
- [x] All `1` → `1U` changes are semantically identical for non-negative
shift counts; the only behavioral fix is line 82 where `1 << 31` was UB

Signed-off-by: Cursx <674760201@qq.com>
## Summary

- Fix duplicate/wrong `-gencode=` flags in both JIT and non-JIT
compilation paths (`op_builder/builder.py`)
- Fix `TORCH_CUDA_ARCH_LIST` env-var restore logic in
`OpBuilder.jit_load()`

DeepSpeed's `compute_capability_args()` generates its own `-gencode`
flags, but PyTorch (`load()` in JIT mode, `BuildExtension` in non-JIT
mode) *also* reads `TORCH_CUDA_ARCH_LIST` and generates `-gencode`
flags. This causes two problems:

1. **JIT mode**: `jit_load()` set `TORCH_CUDA_ARCH_LIST=""`, which
PyTorch treats as *unset* and falls back to auto-detection — resulting
in every flag appearing **twice**.
2. **Non-JIT mode**: subclasses that override `filter_ccs()` (e.g.
`FPQuantizerBuilder`, `EvoformerAttnBuilder`) remove certain archs, but
`BuildExtension` re-reads the **unfiltered** `TORCH_CUDA_ARCH_LIST` and
adds them back — **undermining the filter**.

The fix synchronises `TORCH_CUDA_ARCH_LIST` with the filtered arch list
in `compute_capability_args()`, for both JIT and non-JIT paths.

Fixes deepspeedai#7972

## Before / After

<details>
<summary>Before (buggy behavior)</summary>

**JIT mode** — `TORCH_CUDA_ARCH_LIST` cleared to `""`, PyTorch
auto-detects and adds flags, DeepSpeed also adds the same flags:

```
nvcc ... -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80
     ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80
```

Plus a spurious warning:
```
UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
```

**Non-JIT mode** — `FPQuantizerBuilder.filter_ccs()` removes `< 8.0`,
but `BuildExtension` re-adds them from the unfiltered env var:

```
# FPQuantizer compiled for sm_70 even though filter_ccs() removed it
nvcc ... -gencode=arch=compute_80,code=sm_80   # from DeepSpeed (correct)
     ... -gencode=arch=compute_70,code=sm_70   # from BuildExtension (wrong!)
```

</details>

<details>
<summary>After (fixed behavior)</summary>

**JIT mode** — `TORCH_CUDA_ARCH_LIST` is set to the detected
architectures, PyTorch generates flags once, no duplicates:

```
nvcc ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80
```

No spurious warning. Env var is properly restored/removed after build.

**Non-JIT mode** — `TORCH_CUDA_ARCH_LIST` is updated to the filtered
list. Each extension keeps its own `-gencode` flags, and
`BuildExtension` reads the filtered env var:

```
# FPQuantizer: only sm_80+ as intended
nvcc ... -gencode=arch=compute_80,code=sm_80   # from DeepSpeed
     ... -gencode=arch=compute_80,code=sm_80   # from BuildExtension (harmless dup)
```

> **Note:** in multi-builder `setup.py` builds, the last builder's
filtered arch list wins for `TORCH_CUDA_ARCH_LIST`. This may cause
harmless duplicates for some extensions, but will never reintroduce
archs that any builder's `filter_ccs()` removed — a strict improvement
over the current behavior where the unfiltered original is always used.

</details>

## Changes

- `op_builder/builder.py`
  - `CUDAOpBuilder.compute_capability_args()`:
    - Always sync `TORCH_CUDA_ARCH_LIST` with the filtered arch list
    - JIT mode: return `[]` (PyTorch generates flags via `load()`)
- Non-JIT mode: return `-gencode` args as before (per-builder flags in
`extra_compile_args`)
- `OpBuilder.jit_load()`: simplified stash/restore — properly `del` the
env var if it was not originally set
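The stash/restore pattern can be sketched as a context manager (illustrative; the actual code inlines this logic in `jit_load`):

```python
import os
from contextlib import contextmanager

@contextmanager
def env_override(key, value):
    # Restore the previous value on exit; if the variable was not set
    # before, delete it rather than leaving behind an empty string,
    # which PyTorch would treat as "unset" and auto-detect against.
    sentinel = object()
    old = os.environ.get(key, sentinel)
    os.environ[key] = value
    try:
        yield
    finally:
        if old is sentinel:
            del os.environ[key]
        else:
            os.environ[key] = old
```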

---------

Signed-off-by: Cursx <674760201@qq.com>
Followup to deepspeedai#7973 for deepspeedai#7971

The naming of q_mantisa_bits and mantisa_bits was swapped. The
invocation set:

```
q_mantisa_bits = mantisa
_mantisa_bits = CONST_Q_MANTISA_BITS
_exponent_bits = CONST_Q_EXPONENT_BITS
```

so correct them by swapping the names back.

I noticed that the code needs a thorough review because multiple places
look suspicious:
```
	// Why the default args? They seem to not even be matching (16 != 3+4+1)
          int total_q_bits = 16,
          int q_mantisa_bits = 3,
          int q_exponent_bits = 4>

	// Why recompute if there is a total_q_bits template?
    constexpr int quantized_bits = q_mantisa_bits + q_exponent_bits + 1;
    // Likely wrong: total_q_bits < mantisa_bits --> negative bits? Likely caused by wrong naming
    constexpr int q_exponent_bits = total_q_bits - mantisa_bits - 1;

    // should likely use a `q_` prefix not `_`
    constexpr uint16_t _mantisa_mask = (1 << q_mantisa_bits) - 1;
    constexpr uint16_t _exponent_mask = ((1 << q_exponent_bits) - 1) << q_mantisa_bits;
    constexpr uint16_t _sign_mask = 1U << (q_mantisa_bits + q_exponent_bits);
```

cc @Cursx

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>