
new architecture for auto_round #1542

Open
n1ck-guo wants to merge 82 commits into main from hengguo/new_ar_arch

Conversation

@n1ck-guo
Contributor

@n1ck-guo n1ck-guo commented Mar 13, 2026

Description

  • Compressor:
    Main entry point responsible for orchestrating the workflow, invoking different algorithms, and handling model persistence. Supports block-wise or layer-wise quantization strategies. Primary subclasses include TuneCompressor and ZeroShotCompressor.
  • Calibration: Handles the calibration process (Work in Progress)
  • Context: Manages shared configurations and model states throughout the quantization pipeline, providing centralized control to prevent cross-module dependencies
    • ModelContext: Handles model loading and tracks model states and relevant configurations
    • CompressContext: Stores shared compression settings such as low_cpu_mem_usage, enable_torch_compile, etc.
  • Algorithms: Concrete quantization and weight transformation implementations
    • Quantization: Various quantization algorithms, including AutoRound, RTN, OptRTN, etc.
    • Transform: Weight transformation algorithms such as Hadamard transform
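The shared-context design above (ModelContext, CompressContext) can be sketched as a per-subclass singleton. This is a minimal illustrative sketch, not the actual auto_round API: the class and attribute names below are assumptions loosely based on the description of `auto_round/context/base.py` and `CompressContext` in this PR.

```python
class SingletonContext:
    """Illustrative base: each subclass holds exactly one shared instance."""

    _instance = None

    def __new__(cls, *args, **kwargs):
        # Look in cls.__dict__ so each subclass gets its own singleton
        # rather than sharing one instance across all contexts.
        if cls.__dict__.get("_instance") is None:
            cls._instance = super().__new__(cls)
        return cls._instance


class CompressContext(SingletonContext):
    """Hypothetical shared compression settings (names are illustrative)."""

    def __init__(self, low_cpu_mem_usage=False, enable_torch_compile=False):
        if getattr(self, "_initialized", False):
            return  # keep the first configuration; later constructions are no-ops
        self.low_cpu_mem_usage = low_cpu_mem_usage
        self.enable_torch_compile = enable_torch_compile
        self._initialized = True


# Any module constructing the context sees the same shared state.
ctx_a = CompressContext(low_cpu_mem_usage=True)
ctx_b = CompressContext()
```

This kind of centralized context is what lets the compressor, calibration, and algorithm modules read shared settings without importing each other directly.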

Usage of the new API:

```python
from auto_round.algorithms.rotation import HadamardConfig
from auto_round.compressor_new import AutoRound
# Import path assumed; the original snippet omitted the AutoRoundConfig import.
from auto_round.algorithms.quantization.auto_round import AutoRoundConfig

quant_cfg = AutoRoundConfig(bits=4, group_size=128, iters=200)
had_cfg_1 = HadamardConfig(hadamard_type="hadamard", block_size=32)
had_cfg_2 = HadamardConfig(hadamard_type="random_hadamard", block_size=64, random_seed=True)

compressor = AutoRound(
    alg_configs=[quant_cfg, had_cfg_1, had_cfg_2],
    model="facebook/opt-125m",
    scheme="MXFP4",
    format="auto_round",
)

model, layer_config = compressor.quantize_and_save(
    output_dir="./output",
)
```

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: n1ck-guo <heng.guo@intel.com>
Contributor

Copilot AI left a comment


Pull request overview

Refactors AutoRound toward a new “context + compressor + algorithm” architecture, introducing new compressors_new/ and context/ modules and updating scheme parsing/export helpers to support the new flow.

Changes:

  • Added new context singletons (ModelContext, CompressContext) and a new compressors_new implementation path.
  • Expanded scheme parsing to reconcile bits/data_type and support user overrides + AutoScheme integration.
  • Added new calibration utilities and algorithm scaffolding for quantization backends (AutoRound/RTN).

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 18 comments.

Summary per file:

| File | Description |
| --- | --- |
| auto_round/utils/model.py | Avoids runtime import cycles via TYPE_CHECKING for QuantizationScheme. |
| auto_round/schemes.py | Adds scheme override + parsing helpers and bits/dtype reconciliation. |
| auto_round/formats.py | Switches divisibility checks to global supported-layer constants. |
| auto_round/context/model_context.py | Introduces model lifecycle/loading + AMP setup and forward-hook management. |
| auto_round/context/compress_context.py | Introduces device/device_map and memory-usage knobs as shared context. |
| auto_round/context/base.py | Adds simple singleton context base. |
| auto_round/context/__init__.py | Package init for the new context module. |
| auto_round/compressors_new/utils.py | New utility module (layer config, gguf mapping, caching helpers, forward helpers). |
| auto_round/compressors_new/shard_writer.py | New shard-based saver with optional safetensors support. |
| auto_round/compressors_new/config.py | Introduces extra/legacy config dataclasses for the new compressor path. |
| auto_round/compressors_new/base.py | New "BaseCompressor" implementation wiring contexts, formats, caching, and the quant loop. |
| auto_round/compressors_new/__init__.py | Package init for compressors_new. |
| auto_round/compressors/utils.py | Extends legacy layer-config resolution to include safetensors-only tensors and skip missing modules. |
| auto_round/calibration/utils.py | Adds helpers for "early stop" caching and input reshaping for block tuning. |
| auto_round/calibration/__init__.py | Package init for calibration. |
| auto_round/algorithms/quantization/rtn/rtn.py | Adds placeholder RTN quantization module file. |
| auto_round/algorithms/quantization/rtn/config.py | Adds RTN algorithm config stub. |
| auto_round/algorithms/quantization/rtn/__init__.py | Package init for RTN quantization. |
| auto_round/algorithms/quantization/base.py | Adds base quantization class stub. |
| auto_round/algorithms/quantization/auto_round/quantize.py | Adds new AutoRound quantizer implementation (algorithm object). |
| auto_round/algorithms/quantization/auto_round/config.py | Adds new AutoRound algorithm config. |
| auto_round/algorithms/quantization/auto_round/__init__.py | Package init for the AutoRound quantization algorithm. |
| auto_round/algorithms/quantization/__init__.py | Package init for quantization algorithms. |
| auto_round/algorithms/base.py | Adds base algorithm stub. |
| auto_round/algorithms/alg_config.py | Adds base algorithm config stub. |
| auto_round/algorithms/__init__.py | Package init for algorithms. |

@wenhuach21
Contributor

If there is already an algorithm folder, what is the purpose of the compressor folder?

@n1ck-guo n1ck-guo requested review from WeiweiZhang1 and yiliu30 and removed request for xin3he March 13, 2026 05:31
@chensuyue chensuyue added this to the 0.12.0 milestone Mar 16, 2026
n1ck-guo and others added 3 commits March 17, 2026 17:02
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
```python
enable_norm_bias_tuning (bool): Whether to enable fast norm/layer_bias tuning
"""

_alg_cls = "SignRoundQuantizer"
```
Contributor


Is there a better way to map these two? Would it be better to provide a clear function that developers are required to implement?

Contributor

@wenhuach21 wenhuach21 left a comment


Thank you very much for the great effort!

```python
dynamic_max_gap: int = -1,
enable_quanted_input: bool = True,
optimizer: str = None,
enable_adam: bool = False,
```
Contributor


As Adam is decoupled, could we remove this argument from the config?

```python
# Subclasses that support diffusion models should override this with the
# appropriate output key mapping, e.g.:
# DIFFUSION_OUTPUT_CONFIGS = {"FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"]}
DIFFUSION_OUTPUT_CONFIGS: dict = {}
```
Contributor


This argument should be added to the AutoRound interface instead of this one.


```python
@property
def amp_dtype(self):
    import torch
```
Contributor


AMP is only for tuning algorithms, so it's better to refine it; no need to do that in this PR.


```python
return getattr(self.model_context, "amp_dtype", torch.float32)

def _register_act_max_hook(self, model):
```
Contributor

@wenhuach21 wenhuach21 Apr 10, 2026


We should provide an interface to support customized hooks and should not register act_max_hook by default, which is not required by most algorithms.


```python
@torch.inference_mode()
def _quantize_embedding_layer(self):
    """Quantizes embedding layers in the model according to the configuration.
```
Contributor

@wenhuach21 wenhuach21 Apr 10, 2026


To align this function with the other functions, it should be changed to _quantize_embedding_layer(self, layer), and it should also be designed to be overridden by subclasses. If that's difficult, feel free to support it in the future.

```python
output keys. Subclasses override ``DIFFUSION_OUTPUT_CONFIGS`` to add
support for new diffusion architectures.
"""
output = defaultdict(list)
```
Contributor


I'd prefer to move this one to utils and decouple the quantizer from model types.

@n1ck-guo
Contributor Author

This PR will not make any further feature changes. I will collect all relevant comments and address them in future PRs.

@n1ck-guo
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…ntext init

- _hardware_setup: apply act-quantize/alg-ext guard before compile_func,
  matching _resolve_block_forward() and old-arch behavior.  On HPU where
  enable_torch_compile stays True for FP8_STATIC, this avoids creating
  a compiled graph that wastes ~264 MB of HPU memory.
- ModelContext.__init__: gc.collect + malloc_trim after model/tokenizer
  loading to reclaim C heap fragmentation (~96 MB).

Signed-off-by: n1ck-guo <heng.guo@intel.com>
…init reorder

- Add _force_trim_malloc() in device.py that unconditionally calls
  malloc_trim(0), bypassing the counter-based throttle in
  _maybe_trim_malloc() which was skipping critical lifecycle trim points

- ClearMemory HPU path: replace _maybe_trim_malloc() with
  _force_trim_malloc() so heap pages are reclaimed before each
  MemoryMonitor RSS sample, preventing inflated peak_ram readings

- ModelContext._load_model: add gc.collect + _force_trim_malloc before
  llm_load_model to reclaim temporary HTTP/config objects from
  is_mllm_model/is_diffusion_model/AutoConfig.from_pretrained calls

- ModelContext.__init__: use _force_trim_malloc at end so the trim
  actually fires (previously _maybe_trim_malloc was a no-op at counter=1)

- BaseCompressor.__init__: reorder context creation so ModelContext
  (large model allocation) is created before CompressContext (small),
  matching OLD arch allocation order to reduce heap fragmentation

- BaseCompressor.post_init: add gc.collect + _force_trim_malloc after
  the five init phases to start quantize loop from tighter baseline

- CalibCompressor.quantize: use _force_trim_malloc at loop start
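The `_force_trim_malloc` helper described above can be sketched with `ctypes` against glibc's `malloc_trim(3)`. This is an illustrative sketch only (the real implementation lives in auto_round's device.py and differs in details such as the counter-based throttle it bypasses); `malloc_trim` is glibc-specific, so the sketch fails soft on other platforms.

```python
import ctypes
import ctypes.util
import gc


def force_trim_malloc() -> bool:
    """Return free heap pages to the OS via glibc malloc_trim(0).

    Returns True if the trim call was made, False when glibc's
    malloc_trim is unavailable (e.g. macOS, musl).
    """
    gc.collect()  # drop Python-level garbage before trimming the C heap
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        libc.malloc_trim(0)  # 0 = trim as much trailing free space as possible
        return True
    except (OSError, AttributeError):
        return False
```

Calling this unconditionally at lifecycle boundaries (after model load, before RSS sampling) is what prevents the inflated peak_ram readings mentioned in the commit message.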
Contributor

@xin3he xin3he left a comment


LGTM, please get the approval from Wenhua and Liang.

Signed-off-by: n1ck-guo <heng.guo@intel.com>
…C, and dataloader cleanup

- Defer ShardWriter creation from post_init to save_quantized (or
  _adjust_immediate_packing for immediate-save flows) to avoid heap
  fragmentation from parameter iteration during initialization
- Add gc.collect + _force_trim_malloc between Phase 4 (layer config)
  and Phase 5 (hardware setup) to compact heap before compile setup
- Release calibration dataloader after cache_inter_data completes to
  free tokenized sample tensors earlier
@n1ck-guo
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@wenhuach21
Contributor

wenhuach21 commented Apr 14, 2026

Usage of the new API:

```python
from auto_round.algorithms.rotation import HadamardConfig

quant_cfg = AutoRoundConfig(bits=4, group_size=128, iters=200)
had_cfg_1 = HadamardConfig(hadamard_type="hadamard", block_size=32)
had_cfg_2 = HadamardConfig(hadamard_type="random_hadamard", block_size=64, random_seed=True)

compressor = Compressor(
    config=[quant_cfg, had_cfg_1, had_cfg_2],
    model="facebook/opt-125m",
    scheme="MXFP4",
    format="auto_round",
)

model, layer_config = compressor.quantize_and_save(
    output_dir="./output",
)
```

1. Is the API still like this? If so, please change it.
2. Ask Xuehao to help run the release test for this API.

Contributor

@lkk12014402 lkk12014402 left a comment


LGTM, please fix the CI issues

Signed-off-by: n1ck-guo <heng.guo@intel.com>
@azure-pipelines

Azure Pipelines:
Successfully started running 6 pipeline(s).
1 pipeline(s) require an authorized user to comment /azp run to run.

Signed-off-by: n1ck-guo <heng.guo@intel.com>
@n1ck-guo
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

Signed-off-by: n1ck-guo <heng.guo@intel.com>
@azure-pipelines

Azure Pipelines:
Successfully started running 6 pipeline(s).
1 pipeline(s) require an authorized user to comment /azp run to run.

- Replace need_calibration with data_type parameter throughout transform pipeline
- Add data_type-aware block_size defaults (mx_fp->32, nv_fp->16)
- Disable triton kernel path for NV_FP data types
- Expand ROTATION_SUPPORTED_SCHEMES to include MXFP8, MXFP4, NVFP4
- Simplify patch functions: delegate to original _qdq_weight/_qdq_act
- Use QModuleBase instead of MXQuantLinearBase for target type detection
- Add orig_dtype preservation in input transform hooks
- Remove check_supported_schemes from compressor entry point
- Remove precision param from weight transform build (keep for input transform)
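The data_type-aware block_size defaults mentioned above (mx_fp -> 32, nv_fp -> 16) could look roughly like the following. The helper name and fallback value are illustrative assumptions, not the actual auto_round code; only the two prefix-to-size mappings come from the commit message.

```python
# Hypothetical helper mirroring the commit's block_size defaults.
_DEFAULT_BLOCK_SIZES = {
    "mx_fp": 32,  # MX formats use 32-element blocks
    "nv_fp": 16,  # NV formats use 16-element blocks
}


def default_block_size(data_type: str, fallback: int = 32) -> int:
    """Pick a block size from the data_type prefix, falling back otherwise."""
    for prefix, size in _DEFAULT_BLOCK_SIZES.items():
        if data_type.startswith(prefix):
            return size
    return fallback
```

Keying the default off the data_type prefix is what lets the transform pipeline drop the old need_calibration parameter while still choosing sensible per-format block sizes.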
@azure-pipelines

Azure Pipelines:
Successfully started running 6 pipeline(s).
1 pipeline(s) require an authorized user to comment /azp run to run.

@n1ck-guo
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).


Labels

api/new, engineering ready (only add when the PR is ready to merge)


8 participants