2 changes: 1 addition & 1 deletion docs/toolchain/appendix/app_flow_manual.md
@@ -1,4 +1,4 @@
# Kneron End to End Simulator v0.32.0
# Kneron End to End Simulator v0.32.1

This project allows users to perform image inference using Kneron's built-in simulator. We encourage users to simply use the `kneron_inference` function to run the tests on their inputs.

6 changes: 6 additions & 0 deletions docs/toolchain/appendix/history.md
@@ -24,6 +24,12 @@

## Toolchain Change log

* **[v0.32.1]**
* Add `dma_bandwidth` and `weight_bandwidth` to IP evaluator arguments.
* Replace `hardware_cut_opt` with `compiler_tiling` to stay consistent with other toolchain APIs. `hardware_cut_opt` is now deprecated and will be removed in a future version; please use `compiler_tiling` instead.
* Update the evaluator to raise a warning instead of an error when it encounters an unsupported operator.
* Update ktc to clean up more intermediate files generated during the flow.
* Fix an evaluator bug that used the wrong 730 frequency.
* **[v0.32.0]**
* Add Einsum defusion in kneronnxopt.
* Support Cast to int64 in knerex and compiler.
56 changes: 46 additions & 10 deletions docs/toolchain/appendix/kneronnxopt.md
@@ -1,6 +1,6 @@
# Kneronnxopt

Kneronnxopt is the ONNX optimizer project for kneron hardware platforms. Its purpose is to provide shapes for all the tensors as well as accelerate the inference and compiling process. Currently, we support ONNX up to opset 18.
Kneronnxopt is the ONNX optimizer project for Kneron hardware platforms. It prepares tensor shapes and optimizes graph structures to improve inference and compilation flow. Currently, it supports ONNX opset 8 to 18.

## 1. Preparation

@@ -12,24 +12,60 @@ conda activate onnx1.13

## 2. Usage

The tool is under `/workspace/libs/kneronnxopt`. You can use the following command to run the tool:
### 2.1. Standard model optimization

Use module execution for standard ONNX models:

```bash
python /workspace/libs/kneronnxopt/kneronnxopt/optimize.py -o <output_onnx_model> <input_onnx_model>
python -m kneronnxopt.optimize <input_onnx_model> -o <output_onnx_model>
```

It also has the following optional arguments:
Optional arguments:

* `-h, --help`: Show this help message and exit.
* `--log`: Set log level (default: INFO). Available log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
* `--duplicate-shared-weight`: Duplicate shared weights in the model. Default is False.
* `--skip-check`: Skip the onnxruntime check or not. Enabling this flag can speed up the script, but also introcduce risks for future model deployment.
* `--duplicate-shared-weights`: By what level to duplicate shared weights. `0`: no duplication, `1`: duplicate only when required by compiler, `2`: always duplicate. Default is `1`.
* `--skip-check`: Skip the onnxruntime check. Enabling this flag can speed up the script, but also introduces risks for future model deployment.
* `--overwrite-input-shapes`: Overwrite the input shape. The format is "input_name:dim0,dim1,...,dimN", or simply "dim0,dim1,...,dimN" when there is only one input, for example, "data:1,3,224,224" or "1,3,224,224". Note: you may want to use a visualization tool such as Netron to confirm the input name and dimension ordering (NCHW or NHWC).
* `--skip-fuse-qkv`: Skip the `fuse_qkv` optimization.
* `--clear-descriptions`: Clear all descriptions in the graph.
* `--clear-shapes`: Clear all existing shapes in the graph except input shapes.
* `--opt-matmul`: Optimize MatMul operators for Kneron compiler.
* `--replace-avgpool-with-conv`: Replace AveragePool with depthwise Conv when possible to avoid CPU nodes.
* `--replace-dilated-conv`: Replace dilated Conv patterns when possible.
* `--defuse-gaps`: Defuse GAP patterns when possible.
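The shape-override format accepted by `--overwrite-input-shapes` can be illustrated with a standalone parser (our own sketch; kneronnxopt's internal parsing may differ):

```python
def parse_shape_overrides(specs):
    """Parse "input_name:d0,d1,...,dN" or bare "d0,d1,...,dN" specs."""
    overrides = {}
    for spec in specs:
        name, sep, dims = spec.rpartition(":")
        # The bare form has no ":"; rpartition then leaves name empty,
        # which we map to None (the single unnamed input).
        overrides[name or None] = [int(d) for d in dims.split(",")]
    return overrides

parse_shape_overrides(["data:1,3,224,224"])  # {'data': [1, 3, 224, 224]}
```

Multiple named specs can be passed at once, e.g. `["a:1,2", "b:3,4"]`, one per model input.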

## 3. Notes
Notes:

* If `-o` is not provided, output defaults to `<input>_optimized.onnx`.

### 2.2. Large model optimization (>2 GiB)

For large ONNX models, use the large-model module entry:

```bash
python -m kneronnxopt.large_model_fast_proc <input_onnx_model> -o <output_onnx_model>
```

Optional arguments:

This tool is still under development. If you have any questions, please feel free to contact us.
* `-h, --help`: Show this help message and exit.
* `--log`: Set log level (default: INFO). Available log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
* `--overwrite-input-shapes`: Overwrite input shapes for simplify and shape inference.
* `--skip-fuse-qkv`: Skip the `fuse_qkv` optimization.
* `--onnxtool`: Use `onnx-tool` for shape inference. This is useful when shapes cannot be inferred by the default pass. However, this tool may clip off some nodes, so use with caution and always check the output model.
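The 2 GiB threshold comes from protobuf's serialized-message size limit, which is why oversized models need this separate entry point. A minimal dispatch sketch (the helper name is ours, not part of kneronnxopt):

```python
import os

TWO_GIB = 2 * 1024 ** 3  # protobuf's serialized-message size limit

def pick_entry_module(onnx_path):
    """Pick the optimizer entry point from the model's on-disk size.

    Heuristic only: models stored with external data can exceed the
    2 GiB protobuf limit even when the .onnx file itself is small.
    """
    if os.path.getsize(onnx_path) >= TWO_GIB:
        return "kneronnxopt.large_model_fast_proc"
    return "kneronnxopt.optimize"
```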

### 2.3. Help command

To inspect full and current options from the tool directly:

```bash
python -m kneronnxopt.optimize -h
python -m kneronnxopt.large_model_fast_proc -h
```

## 3. Notes

This tool automatically updates the model opset to 18. This process cannot easily be reversed. Please use other tools if you do not want to upgrade your model opset.
This appendix focuses on console usage. For Python API usage, please refer to [3.1.2 ONNX Optimization](../manual_3_onnx.md#312-onnx-optimization).

If you want to cut the model, please use `onnx.utils.extract_model` from ONNX. Please check <https://onnx.ai/onnx/api/utils.html>
8 changes: 7 additions & 1 deletion docs/toolchain/manual_1_overview.md
@@ -5,7 +5,7 @@
# 1. Toolchain Overview

**2026-03**
**Toolchain v0.32.0**
**Toolchain v0.32.1**

## 1.1. Introduction

@@ -19,6 +19,12 @@ In this document, you'll learn:
3. How to utilize the tools through Python API.

**Major changes of the current version**
* **[v0.32.1]**
* Add `dma_bandwidth` and `weight_bandwidth` to IP evaluator arguments.
* Replace `hardware_cut_opt` with `compiler_tiling` to stay consistent with other toolchain APIs. `hardware_cut_opt` is now deprecated and will be removed in a future version; please use `compiler_tiling` instead.
* Update the evaluator to raise a warning instead of an error when it encounters an unsupported operator.
* Update ktc to clean up more intermediate files generated during the flow.
* Fix an evaluator bug that used the wrong 730 frequency.
* **[v0.32.0]**
* Add Einsum defusion in kneronnxopt.
* Support Cast to int64 in knerex and compiler.
19 changes: 15 additions & 4 deletions docs/toolchain/manual_3_onnx.md
@@ -27,10 +27,15 @@ kneronnxopt.optimize(
duplicate_shared_weights=1,
skip_check=False,
overwrite_input_shapes=None,
convert_f16=True,
skipped_optimizers=None,
skip_fuse_qkv=False,
clear_descriptions=False,
opt_matmul=False,
clear_shapes=False,
replace_avgpool_with_conv=False,
replace_dilated_conv=False,
defuse_gaps=False,
):
```

@@ -42,10 +47,15 @@ Args:
* duplicate_shared_weights (int, optional): level of shared-weight duplication. 0: no duplication; 1: duplicate shared weights only when the Kneron compiler does not support sharing; 2: always duplicate shared weights. Default is 1.
* skip_check (bool): skip the final check or not.
* overwrite_input_shapes (List\[str\]): overwrite the input shape. The format is "input_name:dim0,dim1,...,dimN", or simply "dim0,dim1,...,dimN" when there is only one input, for example, "data:1,3,224,224" or "1,3,224,224". Note: you may want to use a visualization tool such as Netron to confirm the input name and dimension ordering (NCHW or NHWC).
* skipped_optimizers (list): skip the onnx optimizers. Check onnx document for details. Default is None.
* convert_f16 (bool): convert f16 initializers and constants to f32 or not. Default is True.
* skipped_optimizers (list): skip selected optimizers. Check onnx-simplifier documents for details. Default is None.
* skip_fuse_qkv (bool): skip the fuse_qkv optimization or not. By default, fuse_qkv is enabled.
* clear_descriptions (bool): clear all descriptions in the graph. By default, descriptions are not cleared.
* opt_matmul (bool): optimize MatMul operators for the Kneron compiler. By default, this option is not set.
* clear_shapes (bool): clear all existing shapes in the graph except for input shapes. By default, shapes are not cleared.
* replace_avgpool_with_conv (bool): replace AveragePool with depthwise Conv when possible to avoid CPU nodes. By default, this option is not set.
* replace_dilated_conv (bool): replace dilated Conv patterns when possible. By default, this option is not set.
* defuse_gaps (bool): defuse GAP patterns when possible. By default, this option is not set.

Suppose we have an ONNX object; here is example Python code:

@@ -54,7 +64,7 @@ import kneronnxopt
optimized_m = kneronnxopt.optimize(input_m, skip_fuse_qkv=True)
```

In this line of python code, `kneronnxopt.optimize` is the function that takes an onnx object and optimize it. The return value `result_m` is the converted onnx object.
In this line of Python code, `kneronnxopt.optimize` is the function that takes an ONNX object and optimizes it. The return value `optimized_m` is the optimized ONNX object.

The previous `onnx2onnx_flow` API is also available in the `onnx1.13` environment as a wrapper of the `kneronnxopt.optimize` API, but not all of its previous options are supported there. We recommend using the `kneronnxopt.optimize` API instead of the `onnx2onnx_flow` API.

@@ -78,7 +88,7 @@ By the way, to save the model, you can use the following function from the onnx
onnx.save(optimized_m, '/data1/optimized.onnx')
```

We also provide a command line tool for both model optimization and evaluation. Please check FAQ 3.4.4 for details.
For kneronnxopt console usage, please check [Kneronnxopt](appendix/kneronnxopt.md). We also provide a command line tool for both model optimization and evaluation. Please check FAQ 3.4.4 for details.

### 3.1.3. ONNX Editing

@@ -300,7 +310,7 @@ You can use `-o` or `--optimizer-only` to only run the optimization step without
You can use `-h` or `--help` to see all the options.

```
usage: python -m ktc.opt_and_eval [-h] [-e] [-E EVALUATOR_REPORT_PATH] [-o] [-O OPTIMIZED_PATH] [--deep-search] {520,720,530,630,730} path
usage: python -m ktc.opt_and_eval [-h] [-P] [-e] [-E EVALUATOR_REPORT_PATH] [-o] [-O OPTIMIZED_PATH] [--deep-search] {520,720,530,630,730} path

Optimize ONNX model and run IP Evaluator

@@ -318,4 +328,5 @@ optional arguments:
-O OPTIMIZED_PATH, --optimized-path OPTIMIZED_PATH
Path to save the optimized ONNX model.
--deep-search Use deep search for optimization, which may take longer but can yield better performance.
-P, --print Print the evaluation result in the terminal.
```
38 changes: 34 additions & 4 deletions docs/toolchain/manual_5_nef.md
@@ -8,7 +8,16 @@ Batch compile turns multiple models into a single binary file. We have two APIs

```python
#[API]
ktc.compile(model_list, output_dir="/data1/kneron_flow", dedicated_output_buffer=True, weight_compress=False)
ktc.compile(
model_list,
output_dir="/data1/kneron_flow",
dedicated_output_buffer=True,
weight_compress=False,
flatbuffer=True,
compiler_tiling="default",
weight_bandwidth=None,
dma_bandwidth=None,
)
```

Compile the models and generate the nef file. The nef path will be returned.
@@ -19,12 +28,28 @@ Args:
* output_dir (str, optional): output directory. Defaults to "/data1/kneron_flow".
* dedicated_output_buffer (bool, optional): dedicated output buffer. Defaults to True.
* weight_compress (bool, optional): compress weight to slightly reduce the binary file size. Defaults to False.
* hardware_cut_opt (bool, optional): optimize the hardware memory usage while processing large inputs. This option might cause the compiling time increase. Currently, only available for 720. Defaults to False.
* hardware_cut_opt (bool, optional): DEPRECATED. Use `compiler_tiling="deep_search"` instead. If True and `compiler_tiling` is `"default"`, `compiler_tiling` will be treated as `"deep_search"`. Defaults to False.
* flatbuffer (bool, optional): enable new flatbuffer mode for 720. Defaults to True.
* compiler_tiling (str, optional): choose from `"default"`, `"deep_search"`, or `"partial_graph_search"`. Ignored when a model provides its own compiler config json. KDP520 always uses `"default"`. Defaults to `"default"`.
* weight_bandwidth: weight bandwidth in Gbps. Defaults to None, which uses the platform default for the IP evaluator.
* dma_bandwidth: DMA bandwidth in Gbps. Defaults to None, which uses the platform default for the IP evaluator.
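The interaction between the deprecated `hardware_cut_opt` flag and `compiler_tiling` described above can be sketched as a small resolver (a hypothetical helper mirroring the documented rule, not the actual ktc code):

```python
import warnings

def resolve_tiling(compiler_tiling="default", hardware_cut_opt=False):
    """Mirror the deprecation rule: the old flag only upgrades "default"."""
    if hardware_cut_opt:
        warnings.warn(
            "hardware_cut_opt is deprecated; use compiler_tiling='deep_search'",
            DeprecationWarning,
        )
        # An explicit non-default compiler_tiling wins over the old flag.
        if compiler_tiling == "default":
            return "deep_search"
    return compiler_tiling
```

For example, `resolve_tiling(hardware_cut_opt=True)` yields `"deep_search"`, while `resolve_tiling("partial_graph_search", True)` keeps `"partial_graph_search"`.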

```python
#[API]
ktc.encrypt_compile(model_list, output_dir="/data1/kneron_flow", dedicated_output_buffer=True, mode=None, key="", key_file="", encryption_efuse_key="", weight_compress=False)
ktc.encrypt_compile(
model_list,
output_dir="/data1/kneron_flow",
dedicated_output_buffer=True,
mode=None,
key="",
key_file="",
encryption_efuse_key="",
weight_compress=False,
flatbuffer=True,
compiler_tiling="default",
weight_bandwidth=None,
dma_bandwidth=None,
)
```

Compile the models and generate an encrypted nef file. The nef path will be returned.
@@ -39,8 +64,13 @@ Args:
* key_file (str, optional): key file path. Required in mode 1. Defaults to "".
* encryption_efuse_key (str, optional): a hex code. Required in mode 2 and optional in mode 1. Defaults to "".
* weight_compress (bool, optional): compress weight to slightly reduce the binary file size. Defaults to False.
* hardware_cut_opt (bool, optional): optimize the hardware memory usage while processing large inputs. This option might cause the compiling time increase. Currently, only available for 720. Defaults to False.
* hardware_cut_opt (bool, optional): DEPRECATED. Use `compiler_tiling="deep_search"` instead. If True and `compiler_tiling` is `"default"`, `compiler_tiling` will be treated as `"deep_search"`. Defaults to False.
* flatbuffer (bool, optional): enable new flatbuffer mode for 720. Defaults to True.
* compiler_tiling (str, optional): choose from `"default"`, `"deep_search"`, or `"partial_graph_search"`. Ignored when a model provides its own compiler config json. KDP520 always uses `"default"`. Defaults to `"default"`.
* weight_bandwidth: weight bandwidth in Gbps. Defaults to None, which uses the platform default for the IP evaluator.
* dma_bandwidth: DMA bandwidth in Gbps. Defaults to None, which uses the platform default for the IP evaluator.

If you previously used `hardware_cut_opt=True`, use `compiler_tiling="deep_search"` instead.

We will start with a single model first.
