NVIDIA-NeMo · HollowMan6 · Jun 29, 2026
@@ -1,6 +1,6 @@
 # Quantization-Aware RL (QARL)
 
-Quantization-Aware RL (QARL) integrates [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/Model-Optimizer) into the NeMo RL training loop, enabling quantization-aware training and generation for both GRPO and on-policy distillation workflows. QARL automatically quantizes a standard model at initialization, maintains quantizer state (amax values) throughout training, and transfers quantized state to vLLM during weight refit. By default, vLLM generation uses fake-quantized modules. For NVFP4 W4A16 rollout experiments, NeMo RL can instead stream packed real-quant ModelOpt NVFP4 weights into vLLM.
+Quantization-Aware RL (QARL) integrates [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/Model-Optimizer) into the NeMo RL training loop, enabling quantization-aware training and generation for both GRPO and on-policy distillation workflows. QARL automatically quantizes a standard model at initialization, maintains quantizer state (amax values) throughout training, and transfers quantized state to vLLM during weight refit. By default, vLLM generation uses fake-quantized modules. For NVFP4 W4A16 rollout experiments, NeMo RL can instead stream packed real-quant ModelOpt NVFP4 weights and scales into vLLM.
 
 ## Overview
 
@@ -9,7 +9,7 @@ In a standard NeMo RL loop, model weights are trained in full precision and refi
 There are two vLLM rollout modes:
 
 - **Fake-quant rollout**: vLLM receives folded full-precision weights and runs fake-quantized layers. This is the default when `policy.generation.quant_cfg` is set.
-- **Real-quant rollout**: vLLM is initialized with ModelOpt NVFP4 kernels and receives packed NVFP4 weights plus scale tensors during every refit. Enable this with `policy.generation.real_quant: true`.
+- **Real-quant rollout**: vLLM is initialized with ModelOpt NVFP4 kernels and receives packed NVFP4 weights plus scale tensors during every refit. Enable this with `policy.generation.real_quant: true`. W4A16 real-quant rollout supports dense ModelOpt NVFP4 layers and fused MoE weights exported through Megatron-Bridge.
 
 See [Verified Configurations](#verified-configurations) for the workflow + recipe combinations that have been empirically validated, and [Supported Quantization Formats](#supported-quantization-formats) for the full set of available formats. W4A4 (`NVFP4_DEFAULT_CFG`) converges for on-policy distillation but has been observed to have convergence issues on GRPO; W4A16 (NVFP4 weights, native-dtype activations) works for GRPO.
 
@@ -26,8 +26,9 @@ The following workflow + quantization recipe combinations have been validated en
 | QA-Distillation | W4A4 | `examples/modelopt/quant_configs/nano3_nvfp4_default.yaml` | ✅ Converges | `examples/modelopt/qa_distillation_nano3_megatron.yaml` |
 | QA-GRPO | W4A16 | `NVFP4_MLP_WEIGHT_ONLY_CFG` | ✅ Smoke tested on MoE | `examples/modelopt/qa_grpo_qwen3_30ba3b_megatron.yaml` |
 | QA-GRPO real quantization rollout | W4A16 | `examples/modelopt/quant_configs/nvfp4_a16_mlp_only.yaml` with `policy.generation.real_quant: true` | ✅ Converges | `examples/configs/recipes/llm/grpo-qwen2.5-0.5b-dapo-1n8g-megatron-qa-nvfp4-w4a16.yaml` |
+| QA-GRPO real quantization rollout | W4A16 | `examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml` with `policy.generation.real_quant: true` and `policy.generation.real_quant_ignore: NANO3_NVFP4_IGNORE` | ✅ Converges (tested on hybrid MoE/Mamba) | `examples/configs/recipes/llm/grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real.yaml` |
 
-The `nvfp4_a16.yaml` custom YAML enables NVFP4 e2m1 weight quantization (with dynamic e4m3 micro-block scales) and leaves activations unquantized; weights are still exercised through both Megatron training and vLLM generation. The `nvfp4_a16_mlp_only.yaml` recipe restricts W4A16 to MLP weights for real-quant rollout. The `nvfp4_w4a8_fp8.yaml` recipe uses the same NVFP4 weight format and enables FP8 e4m3 input activation fake quantization.
+The `nvfp4_a16.yaml` custom YAML enables NVFP4 e2m1 weight quantization (with dynamic e4m3 micro-block scales) and leaves activations unquantized; weights are still exercised through both Megatron training and vLLM generation. The `nvfp4_a16_mlp_only.yaml` recipe restricts W4A16 to MLP weights for real-quant rollout. The Nano3 `nano3_nvfp4_weightonly.yaml` recipe applies the same W4A16 weight-only format to the supported MLP/MoE weights while keeping Nano3-sensitive Mamba, attention, gate/router, shared-expert, norm, and selected layer paths in BF16 through the `NANO3_NVFP4_IGNORE` profile. The `nvfp4_w4a8_fp8.yaml` and `nano3_nvfp4_w4a8_fp8.yaml` recipes use the same NVFP4 weight format and enable FP8 e4m3 input activation fake quantization.
 
 ## ModelOpt Layer Spec Toggle
 
@@ -92,7 +93,7 @@ sbatch \
 
 Real-quant rollout is intended for checking the deployment-style vLLM path during RL, not only the fake-quant training path. With `policy.generation.real_quant: true`, the Megatron policy worker exports ModelOpt QAT weights as packed NVFP4 tensors during refit, and the vLLM worker loads them into ModelOpt NVFP4 layers. This exercises vLLM's real FP4 kernel path during rollout while the policy training worker remains a QAT model.
 
-This path is validated for W4A16.
+This path is validated for W4A16. Dense models can use the default real-quant ignore profile. Hybrid MoE/Mamba models such as Nano3 should use a model-specific ignore profile so unsupported or numerically sensitive paths stay in BF16.
 
 ### Minimal Configuration
 
@@ -108,12 +109,34 @@ policy:
     real_quant: true
 ```
 
+For Nano3 W4A16 real-quant rollout, use the Nano3 weight-only recipe and the named ignore profile:
+
+```yaml
+policy:
+  quant_cfg: examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml
+
+  generation:
+    backend: vllm
+    quant_cfg: examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml
+    real_quant: true
+    real_quant_ignore: NANO3_NVFP4_IGNORE
+    vllm_cfg:
+      gpu_memory_utilization: 0.35
+      enable_prefix_caching: false
+```
+
 The ready-to-run 1-node DAPO smoke recipe is:
 
 ```text
 examples/configs/recipes/llm/grpo-qwen2.5-0.5b-dapo-1n8g-megatron-qa-nvfp4-w4a16.yaml
 ```
 
+The ready-to-run Nano3 4-node x 4-GPU smoke recipe is:
+
+```text
+examples/configs/recipes/llm/grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real.yaml
+```
+
 Use the matching BF16 recipe as the baseline:
 
 ```text
@@ -130,6 +153,14 @@ uv run --extra mcore --extra modelopt --extra vllm \
   --config examples/configs/recipes/llm/grpo-qwen2.5-0.5b-dapo-1n8g-megatron-qa-nvfp4-w4a16.yaml
 ```
 
+For Nano3:
+
+```bash
+uv run --extra mcore --extra modelopt --extra vllm \
+  examples/run_grpo.py \
+  --config examples/configs/recipes/llm/grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real.yaml
+```
+
 For a BF16 comparison run:
 
 ```bash
@@ -164,7 +195,8 @@ For long runs on queues with short wall times, enable periodic checkpointing and
 A healthy W4A16 real-rollout run should include these lines or equivalent vLLM logs:
 
 ```text
-quantization=modelopt_fp4
+quantization=modelopt
+Detected ModelOpt NVFP4 checkpoint
 Using NvFp4LinearBackend.MARLIN for NVFP4 GEMM
 MegatronQuantPolicyWorker[rank=0]: Packed ... groups of tensors
 ```
@@ -183,10 +215,11 @@ For an initial sanity check, compare the first `Generation KL Error` with the BF
 
 | Symptom | Likely Cause | Action |
 |---|---|---|
-| vLLM does not log `quantization=modelopt_fp4` | `policy.generation.real_quant` is not set or generation is not using vLLM | Check the YAML under `policy.generation` |
+| vLLM does not log `quantization=modelopt` | `policy.generation.real_quant` is not set or generation is not using vLLM | Check the YAML under `policy.generation` |
 | `Using rollout logprobs` appears | The run is bypassing policy/reference logprob computation | Do not use rollout logprobs for real-quant validation |
 | First-step W4A16 `Generation KL Error` is much higher than BF16 | Stale converted Megatron checkpoint or refit/export mismatch | Clear checkpoints and rerun; confirm packed tensors are streamed |
 | `negative scales` warning appears | Invalid or stale NVFP4 scale tensors reached vLLM | Clear checkpoints and verify `nvfp4_a16_mlp_only.yaml` is used for both policy and generation |
+| Nano3 first-step KL is high while dense W4A16 is healthy | Nano3-sensitive paths were quantized or the vLLM ignore set does not match the policy recipe | Use `nano3_nvfp4_weightonly.yaml` for policy and generation, and set `policy.generation.real_quant_ignore: NANO3_NVFP4_IGNORE` |
 | CUDA invalid argument during refit or generation | vLLM consumed malformed packed tensors or stale IPC state | Restart from a fresh job and inspect the first real-quant refit logs |
 
 ## Fake-Quant NVFP4 Rollout (W4A8)
@@ -258,7 +291,7 @@ Generation-specific parameters are added under `policy.generation`:
 |---|---|
 | `quant_cfg` | Quantization config used by the vLLM generation worker. For QARL, this should normally match `policy.quant_cfg`. |
 | `real_quant` | When `true`, vLLM uses ModelOpt NVFP4 real kernels and receives packed quantized weights during refit. When unset or `false`, vLLM uses fake-quantized generation. |
-| `real_quant_ignore` | Optional list of vLLM parameter name patterns that should stay in native dtype during real-quant rollout. If omitted, NeMo RL uses the default ModelOpt NVFP4 ignore set for sensitive layers such as attention and output heads. |
+| `real_quant_ignore` | Optional list of vLLM parameter name patterns, or a named profile, selecting the parameters that stay in native dtype during real-quant rollout. If omitted, NeMo RL uses the default ModelOpt NVFP4 ignore set for sensitive layers such as attention and output heads. Use `NANO3_NVFP4_IGNORE` for Nano3 hybrid MoE/Mamba W4A16 real-quant rollout. |
 
 ## Megatron Checkpoint Directory
 
@@ -311,7 +344,7 @@ uv run --extra mcore --extra modelopt \
 
 - **Generation**: Currently only vLLM is supported for generation.
 - **DTensor backend**: Quantization support for the DTensor policy worker is not yet implemented.
-- **Real-quant rollout**: W4A16 real rollout is supported for dense vLLM ModelOpt NVFP4 layers.
+- **Real-quant rollout**: W4A16 real rollout is supported for dense and fused-MoE vLLM ModelOpt NVFP4 layers. Hybrid MoE/Mamba recipes should keep unsupported or sensitive non-MLP paths in BF16 via `real_quant_ignore`.
 - **W4A8 rollout**: W4A8 is supported through fake-quant rollout.
 - **Input quantization**: Only per-tensor input (activation) quantization is supported.
 - **Model support**: Dense Transformer, MoE (Mixture of Experts), and hybrid MoE/Mamba models are supported on the Megatron policy + vLLM generation path when Megatron-Bridge and ModelOpt support the model architecture and quantization recipe. MoE/Mamba support is currently covered by smoke-tested example configs rather than broad convergence guarantees.
@@ -0,0 +1,41 @@
+defaults: ../../../../examples/modelopt/qa_grpo_nano3_megatron.yaml
+grpo:
+  max_num_steps: 1
+  val_period: 0
+checkpointing:
+  checkpoint_dir: results/grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real
+policy:
+  model_name: /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_ci/artifacts/model/nvidia_nvidia-nemotron-3-nano-30b-a3b-bf16/hf/hf-cbd3fa9_orig
+  tokenizer:
+    name: ${policy.model_name}
+  quant_calib_size: 16
+  quant_sequence_length: 1024
+  generation:
+    real_quant: true
+    real_quant_ignore: NANO3_NVFP4_IGNORE
+    vllm_cfg:
+      gpu_memory_utilization: 0.35
+      enable_prefix_caching: false
+    vllm_kwargs:
+      tokenizer: ${policy.tokenizer.name}
+data:
+  max_input_seq_length: 1024
+  train:
+    dataset_name: DAPOMath17K
+  default:
+    prompt_file: null
+env:
+  dapo:
+    num_workers: 2
+  math:
+    num_workers: 2
+    math_verify_impl: dapo_math_verify
+logger:
+  log_dir: logs/grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real
+  wandb:
+    name: grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real
+  tensorboard:
+    log_dir: tb_logs-grpo-nanov3-30ba3b-4n4g-megatron-qa-nvfp4-w4a16-real
+cluster:
+  gpus_per_node: 4
+  num_nodes: 4
@@ -80,6 +80,8 @@ quantize:
       enable: false
     - quantizer_name: '*router*'
       enable: false
+    - quantizer_name: '*.gate.*'
+      enable: false
     - quantizer_name: '*mlp.gate.*'
       enable: false
     - quantizer_name: '*mlp.shared_expert_gate.*'
@@ -88,13 +90,29 @@ quantize:
       enable: false
     - quantizer_name: '*mixer.conv1d*'
       enable: false
+    - quantizer_name: '*.mixer.in_proj.*'
+      enable: false
+    - quantizer_name: '*.mixer.out_proj.*'
+      enable: false
+    - quantizer_name: '*.shared_expert.*'
+      enable: false
+    - quantizer_name: '*.shared_experts.*'
+      enable: false
+    - quantizer_name: '*.norm.*'
+      enable: false
     - quantizer_name: '*output_layer*'
       enable: false
     - quantizer_name: 'output.*'
       enable: false
     # Nano3-specific BF16 layers: attention projections and explicit
     # pre-attention layers 4/11/18/25/32/41.
-    - quantizer_name: '*.[q|k|v|o]_proj.*'
+    - quantizer_name: '*.q_proj.*'
+      enable: false
+    - quantizer_name: '*.k_proj.*'
+      enable: false
+    - quantizer_name: '*.v_proj.*'
+      enable: false
+    - quantizer_name: '*.o_proj.*'
       enable: false
     - quantizer_name: '*.qkv_proj.*'
       enable: false

@@ -0,0 +1,98 @@
+# Nano3 (NemotronH) NVFP4 W4A8 custom quantization recipe.
+#
+# Defines W4A8: NVFP4 (e2m1, dynamic e4m3 micro-block scales) on weights
+# and FP8 e4m3 on layer-input activations.
+#
+# This mirrors the Nano3 exclusions used by the W4A16/W4A4 recipes so Mamba
+# projections, attention projections, shared experts, gates/routers, norms, and
+# explicit BF16 layers stay out of the fake-quant path.
+metadata:
+  recipe_type: ptq
+  description: Nano3 NVFP4 weights with FP8 input activations for W4A8 QAT.
+quantize:
+  algorithm: max
+  quant_cfg:
+    - quantizer_name: '*'
+      enable: false
+    - quantizer_name: '*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+    - quantizer_name: '*input_quantizer'
+      enable: true
+      cfg:
+        num_bits: e4m3
+    - parent_class: 'nn.BatchNorm1d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.BatchNorm2d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.BatchNorm3d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.LeakyReLU'
+      quantizer_name: '*'
+      enable: false
+    - quantizer_name: '*lm_head*'
+      enable: false
+    - quantizer_name: '*proj_out.*'
+      enable: false
+    - quantizer_name: '*block_sparse_moe.gate*'
+      enable: false
+    - quantizer_name: '*router*'
+      enable: false
+    - quantizer_name: '*.gate.*'
+      enable: false
+    - quantizer_name: '*mlp.gate.*'
+      enable: false
+    - quantizer_name: '*mlp.shared_expert_gate.*'
+      enable: false
+    - quantizer_name: '*linear_attn.conv1d*'
+      enable: false
+    - quantizer_name: '*mixer.conv1d*'
+      enable: false
+    - quantizer_name: '*.mixer.in_proj.*'
+      enable: false
+    - quantizer_name: '*.mixer.out_proj.*'
+      enable: false
+    - quantizer_name: '*.shared_expert.*'
+      enable: false
+    - quantizer_name: '*.shared_experts.*'
+      enable: false
+    - quantizer_name: '*.norm.*'
+      enable: false
+    - quantizer_name: '*output_layer*'
+      enable: false
+    - quantizer_name: 'output.*'
+      enable: false
+    - quantizer_name: '*.q_proj.*'
+      enable: false
+    - quantizer_name: '*.k_proj.*'
+      enable: false
+    - quantizer_name: '*.v_proj.*'
+      enable: false
+    - quantizer_name: '*.o_proj.*'
+      enable: false
+    - quantizer_name: '*.qkv_proj.*'
+      enable: false
+    - quantizer_name: '*.linear_proj.*'
+      enable: false
+    - quantizer_name: '*.linear_qkv.*'
+      enable: false
+    - quantizer_name: '*.layers.4.*'
+      enable: false
+    - quantizer_name: '*.layers.11.*'
+      enable: false
+    - quantizer_name: '*.layers.18.*'
+      enable: false
+    - quantizer_name: '*.layers.25.*'
+      enable: false
+    - quantizer_name: '*.layers.32.*'
+      enable: false
+    - quantizer_name: '*.layers.41.*'
+      enable: false
@@ -71,6 +71,8 @@ quantize:
       enable: false
     - quantizer_name: '*router*'
       enable: false
+    - quantizer_name: '*.gate.*'
+      enable: false
     - quantizer_name: '*mlp.gate.*'
       enable: false
     - quantizer_name: '*mlp.shared_expert_gate.*'
@@ -79,13 +81,29 @@ quantize:
       enable: false
     - quantizer_name: '*mixer.conv1d*'
       enable: false
+    - quantizer_name: '*.mixer.in_proj.*'
+      enable: false
+    - quantizer_name: '*.mixer.out_proj.*'
+      enable: false
+    - quantizer_name: '*.shared_expert.*'
+      enable: false
+    - quantizer_name: '*.shared_experts.*'
+      enable: false
+    - quantizer_name: '*.norm.*'
+      enable: false
     - quantizer_name: '*output_layer*'
       enable: false
     - quantizer_name: 'output.*'
       enable: false
     # Nano3-specific BF16 layers: attention projections and explicit
     # pre-attention layers 4/11/18/25/32/41.
-    - quantizer_name: '*.[q|k|v|o]_proj.*'
+    - quantizer_name: '*.q_proj.*'
+      enable: false
+    - quantizer_name: '*.k_proj.*'
+      enable: false
+    - quantizer_name: '*.v_proj.*'
+      enable: false
+    - quantizer_name: '*.o_proj.*'
       enable: false
     - quantizer_name: '*.qkv_proj.*'
       enable: false
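
The `quantizer_name` rules in the recipes above can be read as an ordered list of glob patterns. The following is a minimal sketch of that reading, assuming the patterns follow Python `fnmatch` semantics and that a later matching rule overrides an earlier one. Both points are assumptions made for illustration, not a statement of ModelOpt's internals. The rule list is a trimmed, hypothetical subset of the Nano3 weight-only recipe, and the quantizer names are invented.

```python
from fnmatch import fnmatchcase

# Hypothetical, trimmed subset of the Nano3 weight-only rules:
# (glob pattern, enable flag), in recipe order.
RULES = [
    ("*", False),                 # start with everything disabled
    ("*weight_quantizer", True),  # enable weight quantizers
    ("*router*", False),          # keep routers unquantized
    ("*.q_proj.*", False),        # keep attention projections unquantized
    ("*.k_proj.*", False),
    ("*.v_proj.*", False),
    ("*.o_proj.*", False),
    ("*.layers.4.*", False),      # keep an explicit layer unquantized
]


def is_enabled(quantizer_name, rules=RULES):
    """Return the enable flag of the last rule whose pattern matches.

    Assumes ordered, last-match-wins evaluation (an illustrative assumption).
    """
    enabled = False
    for pattern, flag in rules:
        if fnmatchcase(quantizer_name, pattern):
            enabled = flag
    return enabled


if __name__ == "__main__":
    # Invented quantizer names, used only to exercise the rules.
    for name in [
        "decoder.layers.7.mlp.experts.0.up_proj.weight_quantizer",
        "decoder.layers.7.self_attn.q_proj.weight_quantizer",
        "decoder.layers.4.mlp.experts.0.up_proj.weight_quantizer",
        "decoder.layers.7.mlp.experts.0.up_proj.input_quantizer",
    ]:
        print(name, "->", is_enabled(name))
```

Under `fnmatch` semantics, a bracket expression such as `[q|k|v|o]` is a character class in which `|` is a literal character rather than alternation. Listing `q_proj`, `k_proj`, `v_proj`, and `o_proj` as separate rules avoids depending on how a particular matcher interprets brackets.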