Conversation
Pull request overview
Refactors AutoRound toward a new “context + compressor + algorithm” architecture, introducing new compressors_new/ and context/ modules and updating scheme parsing/export helpers to support the new flow.
Changes:
- Added new context singletons (ModelContext, CompressContext) and a new compressors_new implementation path.
- Expanded scheme parsing to reconcile bits/data_type and support user overrides + AutoScheme integration.
- Added new calibration utilities and algorithm scaffolding for quantization backends (AutoRound/RTN).
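The bits/data_type reconciliation mentioned above can be illustrated with a small sketch. The function name and exact behavior here are assumptions for illustration, not the PR's actual parsing helper:

```python
import re


def reconcile_bits_and_data_type(bits, data_type):
    """Infer missing bits from a data_type string such as 'int4' or 'mx_fp8',
    and check consistency when both are given (illustrative helper only)."""
    match = re.search(r"(\d+)$", data_type or "")
    inferred = int(match.group(1)) if match else None
    if bits is None:
        if inferred is None:
            raise ValueError(f"cannot infer bits from data_type={data_type!r}")
        return inferred, data_type
    if inferred is not None and inferred != bits:
        raise ValueError(f"bits={bits} conflicts with data_type={data_type!r}")
    return bits, data_type
```

The real scheme parser in auto_round/schemes.py also handles user overrides and AutoScheme integration, which this sketch omits.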
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| auto_round/utils/model.py | Avoids runtime import cycles via TYPE_CHECKING for QuantizationScheme. |
| auto_round/schemes.py | Adds scheme override + parsing helpers and bits/dtype reconciliation. |
| auto_round/formats.py | Switches divisibility checks to global supported-layer constants. |
| auto_round/context/model_context.py | Introduces model lifecycle/loading + AMP setup and forward-hook management. |
| auto_round/context/compress_context.py | Introduces device/device_map and memory-usage knobs as shared context. |
| auto_round/context/base.py | Adds simple singleton context base. |
| auto_round/context/__init__.py | Package init for new context module. |
| auto_round/compressors_new/utils.py | New utility module (layer config, gguf mapping, caching helpers, forward helpers). |
| auto_round/compressors_new/shard_writer.py | New shard-based saver with optional safetensors support. |
| auto_round/compressors_new/config.py | Introduces extra/legacy config dataclasses for the new compressor path. |
| auto_round/compressors_new/base.py | New “BaseCompressor” implementation wiring contexts, formats, caching, quant loop. |
| auto_round/compressors_new/__init__.py | Package init for compressors_new. |
| auto_round/compressors/utils.py | Extends legacy layer-config resolution to include safetensors-only tensors and skip missing modules. |
| auto_round/calibration/utils.py | Adds helpers for “early stop” caching and input reshaping for block tuning. |
| auto_round/calibration/__init__.py | Package init for calibration. |
| auto_round/algorithms/quantization/rtn/rtn.py | Adds placeholder RTN quantization module file. |
| auto_round/algorithms/quantization/rtn/config.py | Adds RTN algorithm config stub. |
| auto_round/algorithms/quantization/rtn/__init__.py | Package init for RTN quantization. |
| auto_round/algorithms/quantization/base.py | Adds base quantization class stub. |
| auto_round/algorithms/quantization/auto_round/quantize.py | Adds new AutoRound quantizer implementation (algorithm object). |
| auto_round/algorithms/quantization/auto_round/config.py | Adds new AutoRound algorithm config. |
| auto_round/algorithms/quantization/auto_round/__init__.py | Package init for AutoRound quantization algorithm. |
| auto_round/algorithms/quantization/__init__.py | Package init for quantization algorithms. |
| auto_round/algorithms/base.py | Adds base algorithm stub. |
| auto_round/algorithms/alg_config.py | Adds base algorithm config stub. |
| auto_round/algorithms/__init__.py | Package init for algorithms. |
If there is already an algorithm folder, what is the purpose of the compressor folder?
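For context, the "simple singleton context base" listed for auto_round/context/base.py can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code:

```python
class SingletonContext:
    """Minimal singleton base: each subclass gets exactly one shared instance.

    A per-class registry keeps ModelContext and CompressContext distinct
    while still guaranteeing one instance of each.
    """

    _instances: dict = {}

    def __new__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__new__(cls)
        return cls._instances[cls]


class ModelContext(SingletonContext):
    pass  # real class would own model lifecycle / loading / AMP setup


class CompressContext(SingletonContext):
    pass  # real class would own device_map and memory-usage knobs
```

Repeated constructor calls return the same object, so any module can reach the shared state without passing contexts through every call chain.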
…uo/new_ar_arch
Signed-off-by: n1ck-guo <heng.guo@intel.com>
for more information, see https://pre-commit.ci
    enable_norm_bias_tuning (bool): Whether to enable fast norm/layer_bias tuning
    """

_alg_cls = "SignRoundQuantizer"
Is there a better way to map these two? Would it be better to provide a clear function that developers are required to implement?
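One way to realize the "clear function that developers are required to implement" suggested here is an abstract hook instead of the `_alg_cls` string attribute. The class names below are illustrative stand-ins, not the PR's real classes:

```python
from abc import ABC, abstractmethod


class BaseAlgorithmConfig(ABC):
    """Config base that forces each subclass to state its quantizer class
    explicitly, instead of relying on a magic string attribute."""

    @abstractmethod
    def quantizer_cls(self) -> type:
        """Return the quantizer class this config drives."""


class SignRoundQuantizer:
    """Stand-in for the real quantizer implementation."""


class SignRoundConfig(BaseAlgorithmConfig):
    def quantizer_cls(self) -> type:
        return SignRoundQuantizer
```

With `abstractmethod`, forgetting the mapping fails loudly at instantiation time rather than at a later string lookup.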
wenhuach21
left a comment
Thank you very much for the great effort!
dynamic_max_gap: int = -1,
enable_quanted_input: bool = True,
optimizer: str = None,
enable_adam: bool = False,
As Adam is decoupled, could we remove this argument from the config?
# Subclasses that support diffusion models should override this with the
# appropriate output key mapping, e.g.:
# DIFFUSION_OUTPUT_CONFIGS = {"FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"]}
DIFFUSION_OUTPUT_CONFIGS: dict = {}
This argument should be added to the AutoRound interface instead of this one.
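The override pattern referenced in the quoted excerpt looks roughly like the sketch below. `BaseQuantizer` and the `output_keys` helper are illustrative; `FluxTransformerBlock` and its key list come from the excerpt itself:

```python
class BaseQuantizer:
    # Empty by default; subclasses that support diffusion models override it.
    DIFFUSION_OUTPUT_CONFIGS: dict = {}

    def output_keys(self, block_name: str) -> list:
        """Look up which block outputs to track for a given block type."""
        return self.DIFFUSION_OUTPUT_CONFIGS.get(block_name, [])


class DiffusionQuantizer(BaseQuantizer):
    DIFFUSION_OUTPUT_CONFIGS = {
        "FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"],
    }
```

The class attribute acts as declarative per-architecture configuration, which is also why the reviewer suggests surfacing it on the AutoRound interface rather than on one quantizer subclass.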
@property
def amp_dtype(self):
    import torch
AMP is only used by tuning algorithms, so it's better to refine this. No need to do so in this PR.
    return getattr(self.model_context, "amp_dtype", torch.float32)

def _register_act_max_hook(self, model):
We should provide an interface to support customized hooks and should not register act_max_hook by default, since it is not required by most algorithms.
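A sketch of the kind of opt-in hook interface this comment suggests. All names here are hypothetical; the point is that nothing is registered by default, and only algorithms that need activation-max statistics add a hook factory:

```python
class HookManager:
    """Opt-in hook registry: empty by default, so algorithms that do not
    need act-max tracking pay nothing."""

    def __init__(self):
        self._hook_factories = []

    def add_hook(self, factory):
        """Register factory(module) -> hook callable, or None to skip
        modules the factory does not care about."""
        self._hook_factories.append(factory)

    def attach(self, modules):
        """Build (module, hook) pairs for every registered factory."""
        hooks = []
        for module in modules:
            for factory in self._hook_factories:
                hook = factory(module)
                if hook is not None:
                    hooks.append((module, hook))
        return hooks
```

In a real integration the hooks would be registered as torch forward hooks; this sketch only shows the registration/selection shape.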
@torch.inference_mode()
def _quantize_embedding_layer(self):
    """Quantizes embedding layers in the model according to the configuration.
To align with other functions, this one should be changed to _quantize_embedding_layer(self, layer), and it should also be designed to be overridden by subclasses. If that's difficult, feel free to support it in the future.
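A minimal sketch of the per-layer, override-friendly shape suggested here. The driver method and return values are illustrative, not the PR's actual code:

```python
class BaseCompressor:
    def _quantize_embedding_layer(self, layer):
        """Per-layer hook with the suggested (self, layer) signature;
        subclasses override this to customize one layer at a time."""
        return ("default", layer)

    def quantize_embeddings(self, layers):
        # The driver owns the iteration, so subclasses only need to
        # override the single-layer hook above.
        return [self._quantize_embedding_layer(layer) for layer in layers]


class MyCompressor(BaseCompressor):
    def _quantize_embedding_layer(self, layer):
        return ("custom", layer)
```

Splitting iteration from per-layer work is what lets subclasses override behavior without re-implementing the whole loop.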
    output keys. Subclasses override ``DIFFUSION_OUTPUT_CONFIGS`` to add
    support for new diffusion architectures.
    """
    output = defaultdict(list)
I prefer to move this one to utils and decouple the quantizer from model types.
This PR will not make any further feature changes. I will collect all relevant comments and address them in future PRs.
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
…uo/new_ar_arch
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…ntext init

- _hardware_setup: apply act-quantize/alg-ext guard before compile_func, matching _resolve_block_forward() and old-arch behavior. On HPU, where enable_torch_compile stays True for FP8_STATIC, this avoids creating a compiled graph that wastes ~264 MB of HPU memory.
- ModelContext.__init__: gc.collect + malloc_trim after model/tokenizer loading to reclaim C heap fragmentation (~96 MB).

Signed-off-by: n1ck-guo <heng.guo@intel.com>
…init reorder

- Add _force_trim_malloc() in device.py that unconditionally calls malloc_trim(0), bypassing the counter-based throttle in _maybe_trim_malloc(), which was skipping critical lifecycle trim points.
- ClearMemory HPU path: replace _maybe_trim_malloc() with _force_trim_malloc() so heap pages are reclaimed before each MemoryMonitor RSS sample, preventing inflated peak_ram readings.
- ModelContext._load_model: add gc.collect + _force_trim_malloc before llm_load_model to reclaim temporary HTTP/config objects from is_mllm_model/is_diffusion_model/AutoConfig.from_pretrained calls.
- ModelContext.__init__: use _force_trim_malloc at the end so the trim actually fires (previously _maybe_trim_malloc was a no-op at counter=1).
- BaseCompressor.__init__: reorder context creation so ModelContext (large model allocation) is created before CompressContext (small), matching the old-arch allocation order to reduce heap fragmentation.
- BaseCompressor.post_init: add gc.collect + _force_trim_malloc after the five init phases to start the quantize loop from a tighter baseline.
- CalibCompressor.quantize: use _force_trim_malloc at loop start.
xin3he
left a comment
LGTM, please get the approval from Wenhua and Liang.
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…C, and dataloader cleanup

- Defer ShardWriter creation from post_init to save_quantized (or _adjust_immediate_packing for immediate-save flows) to avoid heap fragmentation from parameter iteration during initialization.
- Add gc.collect + _force_trim_malloc between Phase 4 (layer config) and Phase 5 (hardware setup) to compact the heap before compile setup.
- Release the calibration dataloader after cache_inter_data completes to free tokenized sample tensors earlier.
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Is the API still like this? If so, please change it.
lkk12014402
left a comment
LGTM, please fix the CI issues
Azure Pipelines: Successfully started running 6 pipeline(s). 1 pipeline(s) require an authorized user to comment /azp run to run.
Signed-off-by: n1ck-guo <heng.guo@intel.com>
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines: Successfully started running 1 pipeline(s).

Azure Pipelines: Successfully started running 6 pipeline(s). 1 pipeline(s) require an authorized user to comment /azp run to run.
…uo/new_ar_arch
- Replace need_calibration with data_type parameter throughout the transform pipeline
- Add data_type-aware block_size defaults (mx_fp -> 32, nv_fp -> 16)
- Disable the triton kernel path for NV_FP data types
- Expand ROTATION_SUPPORTED_SCHEMES to include MXFP8, MXFP4, NVFP4
- Simplify patch functions: delegate to the original _qdq_weight/_qdq_act
- Use QModuleBase instead of MXQuantLinearBase for target type detection
- Add orig_dtype preservation in input transform hooks
- Remove check_supported_schemes from the compressor entry point
- Remove the precision param from the weight transform build (keep it for input transform)
Azure Pipelines: Successfully started running 6 pipeline(s). 1 pipeline(s) require an authorized user to comment /azp run to run.

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines: Successfully started running 1 pipeline(s).
Description
Main entry point responsible for orchestrating the workflow, invoking different algorithms, and handling model persistence. It supports block-wise or layer-wise quantization strategies. Primary subclasses include TuneCompressor and ZeroShotCompressor.
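The class relationship described above can be sketched structurally as follows; this is an illustration of the hierarchy only, not the real implementation:

```python
class BaseCompressor:
    """Main entry point: orchestrates the workflow, invokes algorithms,
    and handles model persistence (structural sketch)."""

    def quantize(self):
        # Concrete strategies implement the actual quantization loop.
        raise NotImplementedError


class TuneCompressor(BaseCompressor):
    """Block-wise tuning strategy, per the description above."""


class ZeroShotCompressor(BaseCompressor):
    """Layer-wise, zero-shot strategy, per the description above."""
```

Keeping orchestration and persistence in the base class means each strategy subclass only has to supply its quantization loop.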
Usage of the new API:
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting