Conversation
Pull request overview
Refactors AutoRound toward a new “context + compressor + algorithm” architecture, introducing new compressors_new/ and context/ modules and updating scheme parsing/export helpers to support the new flow.
Changes:
- Added new context singletons (ModelContext, CompressContext) and a new compressors_new implementation path.
- Expanded scheme parsing to reconcile bits/data_type and support user overrides + AutoScheme integration.
- Added new calibration utilities and algorithm scaffolding for quantization backends (AutoRound/RTN).
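The bits/data_type reconciliation mentioned above can be illustrated with a small sketch. The function name and exact behavior here are assumptions for illustration, not the PR's actual parsing helper:

```python
import re


def reconcile_bits_and_data_type(bits, data_type):
    """Infer missing bits from a data_type string such as 'int4' or 'mx_fp8',
    and check consistency when both are given (illustrative helper only)."""
    match = re.search(r"(\d+)$", data_type or "")
    inferred = int(match.group(1)) if match else None
    if bits is None:
        if inferred is None:
            raise ValueError(f"cannot infer bits from data_type={data_type!r}")
        return inferred, data_type
    if inferred is not None and inferred != bits:
        raise ValueError(f"bits={bits} conflicts with data_type={data_type!r}")
    return bits, data_type
```

The real scheme parser in auto_round/schemes.py also handles user overrides and AutoScheme integration, which this sketch omits.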
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| auto_round/utils/model.py | Avoids runtime import cycles via TYPE_CHECKING for QuantizationScheme. |
| auto_round/schemes.py | Adds scheme override + parsing helpers and bits/dtype reconciliation. |
| auto_round/formats.py | Switches divisibility checks to global supported-layer constants. |
| auto_round/context/model_context.py | Introduces model lifecycle/loading + AMP setup and forward-hook management. |
| auto_round/context/compress_context.py | Introduces device/device_map and memory-usage knobs as shared context. |
| auto_round/context/base.py | Adds simple singleton context base. |
| auto_round/context/__init__.py | Package init for new context module. |
| auto_round/compressors_new/utils.py | New utility module (layer config, gguf mapping, caching helpers, forward helpers). |
| auto_round/compressors_new/shard_writer.py | New shard-based saver with optional safetensors support. |
| auto_round/compressors_new/config.py | Introduces extra/legacy config dataclasses for the new compressor path. |
| auto_round/compressors_new/base.py | New “BaseCompressor” implementation wiring contexts, formats, caching, quant loop. |
| auto_round/compressors_new/__init__.py | Package init for compressors_new. |
| auto_round/compressors/utils.py | Extends legacy layer-config resolution to include safetensors-only tensors and skip missing modules. |
| auto_round/calibration/utils.py | Adds helpers for “early stop” caching and input reshaping for block tuning. |
| auto_round/calibration/__init__.py | Package init for calibration. |
| auto_round/algorithms/quantization/rtn/rtn.py | Adds placeholder RTN quantization module file. |
| auto_round/algorithms/quantization/rtn/config.py | Adds RTN algorithm config stub. |
| auto_round/algorithms/quantization/rtn/__init__.py | Package init for RTN quantization. |
| auto_round/algorithms/quantization/base.py | Adds base quantization class stub. |
| auto_round/algorithms/quantization/auto_round/quantize.py | Adds new AutoRound quantizer implementation (algorithm object). |
| auto_round/algorithms/quantization/auto_round/config.py | Adds new AutoRound algorithm config. |
| auto_round/algorithms/quantization/auto_round/__init__.py | Package init for AutoRound quantization algorithm. |
| auto_round/algorithms/quantization/__init__.py | Package init for quantization algorithms. |
| auto_round/algorithms/base.py | Adds base algorithm stub. |
| auto_round/algorithms/alg_config.py | Adds base algorithm config stub. |
| auto_round/algorithms/__init__.py | Package init for algorithms. |
If there is already an algorithm folder, what is the purpose of the compressor folder?
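For context, the "simple singleton context base" listed for auto_round/context/base.py can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code:

```python
class SingletonContext:
    """Minimal singleton base: each subclass gets exactly one shared instance.

    A per-class registry keeps ModelContext and CompressContext distinct
    while still guaranteeing one instance of each.
    """

    _instances: dict = {}

    def __new__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__new__(cls)
        return cls._instances[cls]


class ModelContext(SingletonContext):
    pass  # real class would own model lifecycle / loading / AMP setup


class CompressContext(SingletonContext):
    pass  # real class would own device_map and memory-usage knobs
```

Repeated constructor calls return the same object, so any module can reach the shared state without passing contexts through every call chain.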
…uo/new_ar_arch
Signed-off-by: n1ck-guo <heng.guo@intel.com>
for more information, see https://pre-commit.ci
    enable_norm_bias_tuning (bool): Whether to enable fast norm/layer_bias tuning
    """

_alg_cls = "SignRoundQuantizer"
Is there a better way to map these two? Would it be better to provide a clear function that developers are required to implement?
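One way to realize the "clear function that developers are required to implement" suggested here is an abstract hook instead of the `_alg_cls` string attribute. The class names below are illustrative stand-ins, not the PR's real classes:

```python
from abc import ABC, abstractmethod


class BaseAlgorithmConfig(ABC):
    """Config base that forces each subclass to state its quantizer class
    explicitly, instead of relying on a magic string attribute."""

    @abstractmethod
    def quantizer_cls(self) -> type:
        """Return the quantizer class this config drives."""


class SignRoundQuantizer:
    """Stand-in for the real quantizer implementation."""


class SignRoundConfig(BaseAlgorithmConfig):
    def quantizer_cls(self) -> type:
        return SignRoundQuantizer
```

With `abstractmethod`, forgetting the mapping fails loudly at instantiation time rather than at a later string lookup.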
wenhuach21
left a comment
Thank you very much for the great effort!
dynamic_max_gap: int = -1,
enable_quanted_input: bool = True,
optimizer: str = None,
enable_adam: bool = False,
As Adam is decoupled, could we remove this argument from the config?
# Subclasses that support diffusion models should override this with the
# appropriate output key mapping, e.g.:
# DIFFUSION_OUTPUT_CONFIGS = {"FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"]}
DIFFUSION_OUTPUT_CONFIGS: dict = {}
This argument should be added to the AutoRound interface instead of this one.
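The override pattern referenced in the quoted excerpt looks roughly like the sketch below. `BaseQuantizer` and the `output_keys` helper are illustrative; `FluxTransformerBlock` and its key list come from the excerpt itself:

```python
class BaseQuantizer:
    # Empty by default; subclasses that support diffusion models override it.
    DIFFUSION_OUTPUT_CONFIGS: dict = {}

    def output_keys(self, block_name: str) -> list:
        """Look up which block outputs to track for a given block type."""
        return self.DIFFUSION_OUTPUT_CONFIGS.get(block_name, [])


class DiffusionQuantizer(BaseQuantizer):
    DIFFUSION_OUTPUT_CONFIGS = {
        "FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"],
    }
```

The class attribute acts as declarative per-architecture configuration, which is also why the reviewer suggests surfacing it on the AutoRound interface rather than on one quantizer subclass.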
@property
def amp_dtype(self):
    import torch
AMP is only used by tuning algorithms, so it's better to refine this. No need to do so in this PR.
    return getattr(self.model_context, "amp_dtype", torch.float32)

def _register_act_max_hook(self, model):
We should provide an interface to support customized hooks and should not register act_max_hook by default, since it is not required by most algorithms.
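A sketch of the kind of opt-in hook interface this comment suggests. All names here are hypothetical; the point is that nothing is registered by default, and only algorithms that need activation-max statistics add a hook factory:

```python
class HookManager:
    """Opt-in hook registry: empty by default, so algorithms that do not
    need act-max tracking pay nothing."""

    def __init__(self):
        self._hook_factories = []

    def add_hook(self, factory):
        """Register factory(module) -> hook callable, or None to skip
        modules the factory does not care about."""
        self._hook_factories.append(factory)

    def attach(self, modules):
        """Build (module, hook) pairs for every registered factory."""
        hooks = []
        for module in modules:
            for factory in self._hook_factories:
                hook = factory(module)
                if hook is not None:
                    hooks.append((module, hook))
        return hooks
```

In a real integration the hooks would be registered as torch forward hooks; this sketch only shows the registration/selection shape.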
@torch.inference_mode()
def _quantize_embedding_layer(self):
    """Quantizes embedding layers in the model according to the configuration.
To align with other functions, this one should be changed to _quantize_embedding_layer(self, layer), and it should also be designed to be overridden by subclasses. If that's difficult, feel free to support it in the future.
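A minimal sketch of the per-layer, override-friendly shape suggested here. The driver method and return values are illustrative, not the PR's actual code:

```python
class BaseCompressor:
    def _quantize_embedding_layer(self, layer):
        """Per-layer hook with the suggested (self, layer) signature;
        subclasses override this to customize one layer at a time."""
        return ("default", layer)

    def quantize_embeddings(self, layers):
        # The driver owns the iteration, so subclasses only need to
        # override the single-layer hook above.
        return [self._quantize_embedding_layer(layer) for layer in layers]


class MyCompressor(BaseCompressor):
    def _quantize_embedding_layer(self, layer):
        return ("custom", layer)
```

Splitting iteration from per-layer work is what lets subclasses override behavior without re-implementing the whole loop.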
    output keys. Subclasses override ``DIFFUSION_OUTPUT_CONFIGS`` to add
    support for new diffusion architectures.
    """
    output = defaultdict(list)
I prefer to move this one to utils and decouple the quantizer from model types.
This PR will not make any further feature changes. I will collect all relevant comments and address them in future PRs.
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
…uo/new_ar_arch
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…ntext init

- _hardware_setup: apply act-quantize/alg-ext guard before compile_func, matching _resolve_block_forward() and old-arch behavior. On HPU, where enable_torch_compile stays True for FP8_STATIC, this avoids creating a compiled graph that wastes ~264 MB of HPU memory.
- ModelContext.__init__: gc.collect + malloc_trim after model/tokenizer loading to reclaim C heap fragmentation (~96 MB).

Signed-off-by: n1ck-guo <heng.guo@intel.com>
…init reorder

- Add _force_trim_malloc() in device.py that unconditionally calls malloc_trim(0), bypassing the counter-based throttle in _maybe_trim_malloc(), which was skipping critical lifecycle trim points.
- ClearMemory HPU path: replace _maybe_trim_malloc() with _force_trim_malloc() so heap pages are reclaimed before each MemoryMonitor RSS sample, preventing inflated peak_ram readings.
- ModelContext._load_model: add gc.collect + _force_trim_malloc before llm_load_model to reclaim temporary HTTP/config objects from is_mllm_model/is_diffusion_model/AutoConfig.from_pretrained calls.
- ModelContext.__init__: use _force_trim_malloc at the end so the trim actually fires (previously _maybe_trim_malloc was a no-op at counter=1).
- BaseCompressor.__init__: reorder context creation so ModelContext (large model allocation) is created before CompressContext (small), matching the old-arch allocation order to reduce heap fragmentation.
- BaseCompressor.post_init: add gc.collect + _force_trim_malloc after the five init phases to start the quantize loop from a tighter baseline.
- CalibCompressor.quantize: use _force_trim_malloc at loop start.
xin3he
left a comment
LGTM, please get the approval from Wenhua and Liang.
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…C, and dataloader cleanup

- Defer ShardWriter creation from post_init to save_quantized (or _adjust_immediate_packing for immediate-save flows) to avoid heap fragmentation from parameter iteration during initialization.
- Add gc.collect + _force_trim_malloc between Phase 4 (layer config) and Phase 5 (hardware setup) to compact the heap before compile setup.
- Release the calibration dataloader after cache_inter_data completes to free tokenized sample tensors earlier.
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Is the API still like this? If so, please change it.
lkk12014402
left a comment
LGTM, please fix the CI issues
Azure Pipelines: Successfully started running 6 pipeline(s). 1 pipeline(s) require an authorized user to comment /azp run to run.
Signed-off-by: n1ck-guo <heng.guo@intel.com>
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines: Successfully started running 1 pipeline(s).

Azure Pipelines: Successfully started running 6 pipeline(s). 1 pipeline(s) require an authorized user to comment /azp run to run.
…uo/new_ar_arch
- Replace need_calibration with data_type parameter throughout the transform pipeline
- Add data_type-aware block_size defaults (mx_fp -> 32, nv_fp -> 16)
- Disable the triton kernel path for NV_FP data types
- Expand ROTATION_SUPPORTED_SCHEMES to include MXFP8, MXFP4, NVFP4
- Simplify patch functions: delegate to the original _qdq_weight/_qdq_act
- Use QModuleBase instead of MXQuantLinearBase for target type detection
- Add orig_dtype preservation in input transform hooks
- Remove check_supported_schemes from the compressor entry point
- Remove the precision param from the weight transform build (keep it for input transform)
Azure Pipelines: Successfully started running 6 pipeline(s). 1 pipeline(s) require an authorized user to comment /azp run to run.

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines: Successfully started running 1 pipeline(s).
Description
Main entry point responsible for orchestrating the workflow, invoking different algorithms, and handling model persistence. It supports block-wise or layer-wise quantization strategies. Primary subclasses include TuneCompressor and ZeroShotCompressor.
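The class relationship described above can be sketched structurally as follows; this is an illustration of the hierarchy only, not the real implementation:

```python
class BaseCompressor:
    """Main entry point: orchestrates the workflow, invokes algorithms,
    and handles model persistence (structural sketch)."""

    def quantize(self):
        # Concrete strategies implement the actual quantization loop.
        raise NotImplementedError


class TuneCompressor(BaseCompressor):
    """Block-wise tuning strategy, per the description above."""


class ZeroShotCompressor(BaseCompressor):
    """Layer-wise, zero-shot strategy, per the description above."""
```

Keeping orchestration and persistence in the base class means each strategy subclass only has to supply its quantization loop.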
Usage of the new API:
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting