From fa846e1d5744613e2e2f269556097e49894b0a58 Mon Sep 17 00:00:00 2001 From: Jiyuan Liu Date: Tue, 17 Mar 2026 20:18:08 +0800 Subject: [PATCH 1/5] Enhance API documentation for batch compile and encrypt compile functions by adding missing parameters and deprecating `hardware_cut_opt`. Update examples to reflect the latest API usage. --- docs/toolchain/manual_5_nef.md | 41 ++++++++++++++++++++++++++++++---- 1 file changed, 37 insertions(+), 4 deletions(-) diff --git a/docs/toolchain/manual_5_nef.md b/docs/toolchain/manual_5_nef.md index 1f9c165..37a1fda 100644 --- a/docs/toolchain/manual_5_nef.md +++ b/docs/toolchain/manual_5_nef.md @@ -8,7 +8,16 @@ Batch compile turns multiple models into a single binary file. We have two APIs ```python #[API] -ktc.compile(model_list, output_dir="/data1/kneron_flow", dedicated_output_buffer=True, weight_compress=False) +ktc.compile( + model_list, + output_dir="/data1/kneron_flow", + dedicated_output_buffer=True, + weight_compress=False, + flatbuffer=True, + compiler_tiling="default", + weight_bandwidth=None, + dma_bandwidth=None, +) ``` Compile the models and generate the nef file. The nef path will be returned. @@ -19,12 +28,31 @@ Args: * output_dir (str, optional): output directory. Defaults to "/data1/kneron_flow". * dedicated_output_buffer (bool, optional): dedicated output buffer. Defaults to True. * weight_compress (bool, optional): compress weight to slightly reduce the binary file size. Defaults to False. -* hardware_cut_opt (bool, optional): optimize the hardware memory usage while processing large inputs. This option might cause the compiling time increase. Currently, only available for 720. Defaults to False. +* hardware_cut_opt (bool, optional): DEPRECATED. Use `compiler_tiling="deep_search"` instead. If True and `compiler_tiling` is `"default"`, `compiler_tiling` will be treated as `"deep_search"`. Defaults to False. * flatbuffer (bool, optional): enable new flatbuffer mode for 720. Defaults to True. 
+* compiler_tiling (str, optional): choose from `"default"`, `"deep_search"`, or `"partial_graph_search"`. Ignored when a model provides its own compiler config json. KDP520 always uses `"default"`. Defaults to `"default"`. +* weight_bandwidth: weight bandwidth in gbps. Defaults to None to use the platform default for the IP evaluator. +* dma_bandwidth: dma bandwidth in gbps. Defaults to None to use the platform default for the IP evaluator. ```python #[API] -ktc.encrypt_compile(model_list, output_dir="/data1/kneron_flow", dedicated_output_buffer=True, mode=None, key="", key_file="", encryption_efuse_key="", weight_compress=False) +ktc.encrypt_compile( + model_list, + output_dir="/data1/kneron_flow", + dedicated_output_buffer=True, + mode=None, + key="", + key_file="", + encryption_efuse_key="", + weight_compress=False, + hardware_cut_opt=False, + flatbuffer=True, + debug=False, + compiler_tiling="default", + weight_bandwidth=None, + dma_bandwidth=None, + dma_bandwidthh=None, +) ``` Compile the models, generate an encrypted nef file. The nef path will be returned. @@ -39,8 +67,13 @@ Args: * key_file (str, optional): key file path. Required in mode 1. Defaults to "". * encryption_efuse_key (str, optional): a hex code. Required in mode 2 and optional in mode 1. Defaults to "". * weight_compress (bool, optional): compress weight to slightly reduce the binary file size. Defaults to False. -* hardware_cut_opt (bool, optional): optimize the hardware memory usage while processing large inputs. This option might cause the compiling time increase. Currently, only available for 720. Defaults to False. +* hardware_cut_opt (bool, optional): DEPRECATED. Use `compiler_tiling="deep_search"` instead. If True and `compiler_tiling` is `"default"`, `compiler_tiling` will be treated as `"deep_search"`. Defaults to False. * flatbuffer (bool, optional): enable new flatbuffer mode for 720. Defaults to True. 
+* compiler_tiling (str, optional): choose from `"default"`, `"deep_search"`, or `"partial_graph_search"`. Ignored when a model provides its own compiler config json. KDP520 always uses `"default"`. Defaults to `"default"`. +* weight_bandwidth: weight bandwidth in gbps. Defaults to None to use the platform default for the IP evaluator. +* dma_bandwidth: dma bandwidth in gbps. Defaults to None to use the platform default for the IP evaluator. + +If you previously used `hardware_cut_opt=True`, use `compiler_tiling="deep_search"` instead. We would start with single model first. From 50af92dd69f25494a1925afa54ceba2d3f7770ef Mon Sep 17 00:00:00 2001 From: Jiyuan Liu Date: Thu, 19 Mar 2026 16:06:45 +0800 Subject: [PATCH 2/5] Add update log for toolchain v0.32.1 release. --- docs/toolchain/appendix/app_flow_manual.md | 2 +- docs/toolchain/appendix/history.md | 6 ++++++ docs/toolchain/manual_1_overview.md | 8 +++++++- docs/toolchain/manual_3_onnx.md | 3 ++- 4 files changed, 16 insertions(+), 3 deletions(-) diff --git a/docs/toolchain/appendix/app_flow_manual.md b/docs/toolchain/appendix/app_flow_manual.md index 1085c86..b12914c 100644 --- a/docs/toolchain/appendix/app_flow_manual.md +++ b/docs/toolchain/appendix/app_flow_manual.md @@ -1,4 +1,4 @@ -# Kneron End to End Simulator v0.32.0 +# Kneron End to End Simulator v0.32.1 This project allows users to perform image inference using Kneron's built in simulator. We encourage users to use simply use the kneron_inference function to perform the tests on your inputs. diff --git a/docs/toolchain/appendix/history.md b/docs/toolchain/appendix/history.md index 50fc85c..7e971b9 100644 --- a/docs/toolchain/appendix/history.md +++ b/docs/toolchain/appendix/history.md @@ -24,6 +24,12 @@ ## Toolchain Change log +* **[v0.32.1]** + * Add `dma_bandwidth` and `weight_bandwidth` to IP evaluator arguments. + * Change `hardware_cut_opt` to `compiler_tiling` to keep consistent with other toolchain apis. 
+ * Update evaluator to raise warning when meeting unsupported operator instead of error. + * Update ktc to clean up more intermediate files generated during the flow. + * Fix the evaluator bug using wrong 730 frequency. * **[v0.32.0]** * Add Einsum defusion in kneronnxopt. * Support Cast to int64 in knerex and compiler. diff --git a/docs/toolchain/manual_1_overview.md b/docs/toolchain/manual_1_overview.md index 14ed2fd..8c62678 100644 --- a/docs/toolchain/manual_1_overview.md +++ b/docs/toolchain/manual_1_overview.md @@ -5,7 +5,7 @@ # 1. Toolchain Overview **2026-03** -**Toolchain v0.32.0** +**Toolchain v0.32.1** ## 1.1. Introduction @@ -19,6 +19,12 @@ In this document, you'll learn: 3. How to utilize the tools through Python API. **Major changes of the current version** +* **[v0.32.1]** + * Add `dma_bandwidth` and `weight_bandwidth` to IP evaluator arguments. + * Change `hardware_cut_opt` to `compiler_tiling` to keep consistent with other toolchain apis. + * Update evaluator to raise warning when meeting unsupported operator instead of error. + * Update ktc to clean up more intermediate files generated during the flow. + * Fix the evaluator bug using wrong 730 frequency. * **[v0.32.0]** * Add Einsum defusion in kneronnxopt. * Support Cast to int64 in knerex and compiler. diff --git a/docs/toolchain/manual_3_onnx.md b/docs/toolchain/manual_3_onnx.md index 64e8402..4b25949 100644 --- a/docs/toolchain/manual_3_onnx.md +++ b/docs/toolchain/manual_3_onnx.md @@ -300,7 +300,7 @@ You can use `-o` or `--optimizer-only` to only run the optimization step without You can use `-h` or `--help` to see all the options. 
``` -usage: python -m ktc.opt_and_eval [-h] [-e] [-E EVALUATOR_REPORT_PATH] [-o] [-O OPTIMIZED_PATH] [--deep-search] {520,720,530,630,730} path +usage: python -m ktc.opt_and_eval [-h] [-P] [-e] [-E EVALUATOR_REPORT_PATH] [-o] [-O OPTIMIZED_PATH] [--deep-search] {520,720,530,630,730} path Optimize ONNX model and run IP Evaluator @@ -318,4 +318,5 @@ optional arguments: -O OPTIMIZED_PATH, --optimized-path OPTIMIZED_PATH Path to save the optimized ONNX model. --deep-search Use deep search for optimization, which may take longer but can yield better performance. + -P, --print Print the evaluation result in the terminal. ``` From 92f0ede4087d444801e30fd38b9c05196217cec5 Mon Sep 17 00:00:00 2001 From: Jiyuan Liu Date: Thu, 19 Mar 2026 16:17:26 +0800 Subject: [PATCH 3/5] Fix typos and indent. --- docs/toolchain/manual_3_onnx.md | 2 +- docs/toolchain/manual_5_nef.md | 1 - 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/toolchain/manual_3_onnx.md b/docs/toolchain/manual_3_onnx.md index 4b25949..e048195 100644 --- a/docs/toolchain/manual_3_onnx.md +++ b/docs/toolchain/manual_3_onnx.md @@ -318,5 +318,5 @@ optional arguments: -O OPTIMIZED_PATH, --optimized-path OPTIMIZED_PATH Path to save the optimized ONNX model. --deep-search Use deep search for optimization, which may take longer but can yield better performance. - -P, --print Print the evaluation result in the terminal. + -P, --print Print the evaluation result in the terminal. ``` diff --git a/docs/toolchain/manual_5_nef.md b/docs/toolchain/manual_5_nef.md index 37a1fda..b756e22 100644 --- a/docs/toolchain/manual_5_nef.md +++ b/docs/toolchain/manual_5_nef.md @@ -51,7 +51,6 @@ ktc.encrypt_compile( compiler_tiling="default", weight_bandwidth=None, dma_bandwidth=None, - dma_bandwidthh=None, ) ``` From 7bb1a0207944fcabec325538960d708544ffb4db Mon Sep 17 00:00:00 2001 From: Jiyuan Liu Date: Thu, 19 Mar 2026 16:33:56 +0800 Subject: [PATCH 4/5] Update kneronnxopt document. 
---
 docs/toolchain/appendix/kneronnxopt.md | 56 +++++++++++++++++++++-----
 docs/toolchain/manual_3_onnx.md | 16 ++++++--
 2 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/docs/toolchain/appendix/kneronnxopt.md b/docs/toolchain/appendix/kneronnxopt.md
index 47d40bf..b3fff2e 100644
--- a/docs/toolchain/appendix/kneronnxopt.md
+++ b/docs/toolchain/appendix/kneronnxopt.md
@@ -1,6 +1,6 @@
 # Kneronnxopt
 
-Kneronnxopt is the ONNX optimizer project for kneron hardware platforms. Its purpose is to provide shapes for all the tensors as well as accelerate the inference and compiling process. Currently, we support ONNX up to opset 18.
+Kneronnxopt is the ONNX optimizer project for Kneron hardware platforms. It prepares tensor shapes and optimizes graph structures to improve the inference and compilation flow. Currently, it supports ONNX opset 8 to 18.
 
 ## 1. Preparation
 
@@ -12,24 +12,60 @@ conda activate onnx1.13
 
 ## 2. Usage
 
-The tool is under `/workspace/libs/kneronnxopt`. You can use the following command to run the tool:
+### 2.1. Standard model optimization
+
+Use module execution for standard ONNX models:
 
 ```bash
-python /workspace/libs/kneronnxopt/kneronnxopt/optimize.py -o
+python -m kneronnxopt.optimize <input_path> -o <output_path>
 ```
 
-It also has the following optional arguments:
+Optional arguments:
 
 * `-h, --help`: Show this help message and exit.
* `--log`: Set log level (default: INFO). Available log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
-* `--duplicate-shared-weight`: Duplicate shared weights in the model. Default is False.
-* `--skip-check`: Skip the onnxruntime check or not. Enabling this flag can speed up the script, but also introcduce risks for future model deployment.
+* `--duplicate-shared-weights`: Level at which to duplicate shared weights. `0`: no duplication, `1`: duplicate only when required by the compiler, `2`: always duplicate. Default is `1`.
+* `--skip-check`: Skip the onnxruntime check.
Enabling this flag can speed up the script, but also introduces risks for future model deployment.
 * `--overwrite-input-shapes`: Overwrite the input shape. The format is "input_name:dim0,dim1,...,dimN" or simply "dim0,dim1,...,dimN" when there is only one input, for example, "data:1,3,224,224" or "1,3,224,224". Note: you might want to use some visualization tools like netron to make sure what the input name and dimension ordering (NCHW or NHWC) is.
+* `--skip-fuse-qkv`: Skip the `fuse_qkv` optimization.
+* `--clear-descriptions`: Clear all descriptions in the graph.
+* `--clear-shapes`: Clear all existing shapes in the graph except input shapes.
+* `--opt-matmul`: Optimize MatMul operators for the Kneron compiler.
+* `--replace-avgpool-with-conv`: Replace AveragePool with depthwise Conv when possible to avoid CPU nodes.
+* `--replace-dilated-conv`: Replace dilated Conv patterns when possible.
+* `--defuse-gaps`: Defuse GAP patterns when possible.
 
-## 3. Notes
+Notes:
+
+* If `-o` is not provided, the output path defaults to `<input_name>_optimized.onnx`.
+
+### 2.2. Large model optimization (>2 GiB)
+
+For large ONNX models, use the large-model module entry:
+
+```bash
+python -m kneronnxopt.large_model_fast_proc <input_path> -o <output_path>
+```
+
+Optional arguments:
 
-This tool is still under development. If you have any questions, please feel free to contact us.
+* `-h, --help`: Show this help message and exit.
+* `--log`: Set log level (default: INFO). Available log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
+* `--overwrite-input-shapes`: Overwrite input shapes for simplify and shape inference.
+* `--skip-fuse-qkv`: Skip the `fuse_qkv` optimization.
+* `--onnxtool`: Use `onnx-tool` for shape inference. This is useful when shapes cannot be inferred by the default pass. However, this tool may clip off some nodes, so use with caution and always check the output model.
+
+### 2.3.
Help command + +To inspect full and current options from the tool directly: + +```bash +python -m kneronnxopt.optimize -h +python -m kneronnxopt.large_model_fast_proc -h +``` + +## 3. Notes -This tool automatically update the model opset to 18. This process has no good way to reverse. Please use other tools is you do not want to upgrade your model opset. +This appendix focuses on console usage. For Python API usage, please refer to [3.1.2 ONNX Optimization](../manual_3_onnx.md#312-onnx-optimization). -If you want to cut the model, please use `onnx.utils.extract_model` from ONNX. Please check \ No newline at end of file +If you want to cut the model, please use `onnx.utils.extract_model` from ONNX. Please check diff --git a/docs/toolchain/manual_3_onnx.md b/docs/toolchain/manual_3_onnx.md index e048195..2ed22ea 100644 --- a/docs/toolchain/manual_3_onnx.md +++ b/docs/toolchain/manual_3_onnx.md @@ -27,10 +27,15 @@ kneronnxopt.optimize( duplicate_shared_weights=1, skip_check=False, overwrite_input_shapes=None, + convert_f16=True, skipped_optimizers=None, skip_fuse_qkv=False, clear_descriptions=False, opt_matmul=False, + clear_shapes=False, + replace_avgpool_with_conv=False, + replace_dilated_conv=False, + defuse_gaps=False, ): ``` @@ -42,10 +47,15 @@ Args: * duplicate_shared_weights (int, optional): by what level, duplicate shared weight. 0-no duplication, 1-duplicate shared weights only when kneron compiler not support, 2-duplicate shared weights always. Default is 1. * skip_check (bool): skip the final check or not. * overwrite_input_shapes (List\[str\]): overwrite the input shape. The format is "input_name:dim0,dim1,...,dimN" or simply "dim0,dim1,...,dimN" when there is only one input, for example, "data:1,3,224,224" or "1,3,224,224". Note: you might want to use some visualization tools like netron to make sure what the input name and dimension ordering (NCHW or NHWC) is. -* skipped_optimizers (list): skip the onnx optimizers. Check onnx document for details. 
Default is None.
+* convert_f16 (bool): convert f16 initializers and constants to f32 or not. Default is True.
+* skipped_optimizers (list): skip selected optimizers. Check the onnx-simplifier documentation for details. Default is None.
 * skip_fuse_qkv (bool): skip the fuse_qkv optimization or not. By default, fuse_qkv is enabled.
 * clear_descriptions (bool): clear all descriptions in the graph. By default, descriptions are not cleared.
 * opt_matmul (bool): optimize matmul operators for specific kneron compiler. By default, this option is not set.
+* clear_shapes (bool): clear all existing shapes in the graph except for input shapes. By default, shapes are not cleared.
+* replace_avgpool_with_conv (bool): replace AveragePool with depthwise Conv when possible to avoid CPU nodes. By default, this option is not set.
+* replace_dilated_conv (bool): replace dilated Conv patterns when possible. By default, this option is not set.
+* defuse_gaps (bool): defuse GAP patterns when possible. By default, this option is not set.
 
 Suppose we have a onnx object, here is the example python code:
 
 ```python
 import kneronnxopt
 optimized_m = kneronnxopt.optimize(input_m, skip_fuse_qkv=True)
 ```
 
-In this line of python code, `kneronnxopt.optimize` is the function that takes an onnx object and optimize it. The return value `result_m` is the converted onnx object.
+In this line of python code, `kneronnxopt.optimize` is the function that takes an onnx object and optimizes it. The return value `optimized_m` is the optimized onnx object.
 
 The previous `onnx2onnx_flow` API is also available in the `onnx1.13` environment. It is a wrapper of the `kneronnxopt.optimize` API. But not all the previous options are available in the `onnx1.13` environment. We recommend you to use the `kneronnxopt.optimize` API instead of the `onnx2onnx_flow` API.
@@ -78,7 +88,7 @@ By the way, to save the model, you can use the following function from the onnx
 onnx.save(optimized_m, '/data1/optimized.onnx')
 ```
 
-We also provide a command line tool for both model optimization and evaluation. Please check FAQ 3.4.4 for details.
+For kneronnxopt console usage, please check [Kneronnxopt](appendix/kneronnxopt.md). We also provide a command line tool for both model optimization and evaluation. Please check FAQ 3.4.4 for details.
 
 ### 3.1.3. ONNX Editing

From 4f5c4bcf6d122bbe2d47e43f2fb508ba6269f2e3 Mon Sep 17 00:00:00 2001
From: Jiyuan Liu
Date: Thu, 19 Mar 2026 16:46:17 +0800
Subject: [PATCH 5/5] Deprecate `hardware_cut_opt` in favor of `compiler_tiling`
 and update relevant documentation

---
 docs/toolchain/appendix/history.md | 2 +-
 docs/toolchain/manual_1_overview.md | 2 +-
 docs/toolchain/manual_5_nef.md | 2 --
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/docs/toolchain/appendix/history.md b/docs/toolchain/appendix/history.md
index 7e971b9..b482592 100644
--- a/docs/toolchain/appendix/history.md
+++ b/docs/toolchain/appendix/history.md
@@ -26,7 +26,7 @@
 
 * **[v0.32.1]**
   * Add `dma_bandwidth` and `weight_bandwidth` to IP evaluator arguments.
-  * Change `hardware_cut_opt` to `compiler_tiling` to keep consistent with other toolchain apis.
+  * Replace `hardware_cut_opt` with `compiler_tiling` to stay consistent with other toolchain APIs. `hardware_cut_opt` is now deprecated and will be removed in a future version. Please use `compiler_tiling` instead.
   * Update evaluator to raise warning when meeting unsupported operator instead of error.
   * Update ktc to clean up more intermediate files generated during the flow.
   * Fix the evaluator bug using wrong 730 frequency.
diff --git a/docs/toolchain/manual_1_overview.md b/docs/toolchain/manual_1_overview.md
index 8c62678..2a7adf1 100644
--- a/docs/toolchain/manual_1_overview.md
+++ b/docs/toolchain/manual_1_overview.md
@@ -21,7 +21,7 @@ In this document, you'll learn:
 **Major changes of the current version**
 * **[v0.32.1]**
   * Add `dma_bandwidth` and `weight_bandwidth` to IP evaluator arguments.
-  * Change `hardware_cut_opt` to `compiler_tiling` to keep consistent with other toolchain apis.
+  * Replace `hardware_cut_opt` with `compiler_tiling` to stay consistent with other toolchain APIs. `hardware_cut_opt` is now deprecated and will be removed in a future version. Please use `compiler_tiling` instead.
   * Update evaluator to raise warning when meeting unsupported operator instead of error.
   * Update ktc to clean up more intermediate files generated during the flow.
   * Fix the evaluator bug using wrong 730 frequency.
diff --git a/docs/toolchain/manual_5_nef.md b/docs/toolchain/manual_5_nef.md
index b756e22..9ce225f 100644
--- a/docs/toolchain/manual_5_nef.md
+++ b/docs/toolchain/manual_5_nef.md
@@ -45,9 +45,7 @@ ktc.encrypt_compile(
     key_file="",
     encryption_efuse_key="",
     weight_compress=False,
-    hardware_cut_opt=False,
     flatbuffer=True,
-    debug=False,
     compiler_tiling="default",
     weight_bandwidth=None,
     dma_bandwidth=None,
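Reviewer note: the deprecation rule these patches document ("If `hardware_cut_opt` is True and `compiler_tiling` is `\"default\"`, `compiler_tiling` will be treated as `\"deep_search\"`") can be sketched in plain Python. The helper name `resolve_compiler_tiling` is hypothetical and not part of the ktc API; it only mirrors the documented mapping:

```python
def resolve_compiler_tiling(compiler_tiling="default", hardware_cut_opt=False):
    """Map the deprecated hardware_cut_opt flag onto compiler_tiling.

    Mirrors the documented rule: when hardware_cut_opt is True while
    compiler_tiling is still "default", treat it as "deep_search".
    Any explicit non-default compiler_tiling wins over the legacy flag.
    """
    valid = {"default", "deep_search", "partial_graph_search"}
    if compiler_tiling not in valid:
        raise ValueError(f"unknown compiler_tiling: {compiler_tiling!r}")
    if hardware_cut_opt and compiler_tiling == "default":
        return "deep_search"
    return compiler_tiling
```

Callers migrating from `hardware_cut_opt=True` would thus end up on `compiler_tiling="deep_search"`, which is exactly the replacement the changelog entries recommend.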