From cf52e6cb2dff23226078df3d9f4cdf15c7659f34 Mon Sep 17 00:00:00 2001
From: Jiyuan Liu
Date: Fri, 6 Mar 2026 15:27:51 +0800
Subject: [PATCH 1/2] Update for toolchain release v0.32.0.

---
 docs/toolchain/appendix/app_flow_manual.md |  2 +-
 docs/toolchain/appendix/fx_report.md       | 80 ++++++++++++----------
 docs/toolchain/appendix/history.md         | 12 ++++
 docs/toolchain/manual_1_overview.md        | 28 ++++----
 docs/toolchain/manual_4_bie.md             |  6 +-
 5 files changed, 72 insertions(+), 56 deletions(-)

diff --git a/docs/toolchain/appendix/app_flow_manual.md b/docs/toolchain/appendix/app_flow_manual.md
index cd4716a..1085c86 100644
--- a/docs/toolchain/appendix/app_flow_manual.md
+++ b/docs/toolchain/appendix/app_flow_manual.md
@@ -1,4 +1,4 @@
-# Kneron End to End Simulator v0.31.1
+# Kneron End to End Simulator v0.32.0
 
 This project allows users to perform image inference using Kneron's built in simulator. We encourage users to use simply use the kneron_inference function to perform the tests on your inputs.
 
diff --git a/docs/toolchain/appendix/fx_report.md b/docs/toolchain/appendix/fx_report.md
index 970dc65..cc5f7d7 100644
--- a/docs/toolchain/appendix/fx_report.md
+++ b/docs/toolchain/appendix/fx_report.md
@@ -28,28 +28,29 @@ The summary will show the IP evaluator information. Below are some examples of r

Figure 4. Summary for platform 730, mode 2 (with fixed-point model generated and snr check.)

-| **name** | **explaination** | **availability** |
-|-------------------------|--------------------------------------------------------------------------------|----------------------------------|
-| **docker_version** | the version of the toolchain docker for this report | |
-| **comments** | extra information | |
-| **input bitwidth** | customer set input bitwidth: int8 or int16 | |
-| **output bitwidth** | customer set output bitwidth: int8 or int16 | |
-| **datapath bitwidth** | customer set data bitwidth (or activation bitwidth): int8 or int16 | |
-| **weight bitwidth** | customer set weight bitwidth: int8 or int16 or int4. int4 only for certain HW. | |
-| **fps** | estimated frame per second. | |
-| **ITC** | estimated inference time. | |
-| **RDMA bandwidth** | set effective peak RDMA bandwidth based on HW | |
-| **WDMA bandwidth** | set effective peak WDMA bandwidth based on HW | |
-| **GETW bandwidth** | set effective peak weight loading bandwidth based on HW | |
-| **RV** | Total data load (except weight load) from DDR in one inference | |
-| **WV** | Total data write to DDR in one inference | |
-| **cpu node** | CPU node in model will be listed here | if any cpu node exists |
-| **SNR(dB)** | The snr of fix point model inferenced results. 
-| mode 2 and 3 |
-| **btm_dynasty_path** | path to inferenced results | mode 2 and 3 |
-| **btm** | check the bit-true-match between dynasty and csim inference | mode 2 and 3 |
-| **bie** | generated bie file (fix point model) for dynasty inference | mode 1/2/3 |
-| **nef** | generated nef file (fix point model) for csim / dongle inference | mode 1/2/3 |
-| **gen fx model report** | file name of this report | |
+| **name** | **explanation** | **availability** |
+| ----------------------- | ------------------------------------------------------------------------------ | ---------------------- |
+| **docker_version** | the version of the toolchain docker for this report | |
+| **comments** | extra information | |
+| **input bitwidth** | customer-set input bitwidth: int8 or int16 | |
+| **output bitwidth** | customer-set output bitwidth: int8 or int16 | |
+| **datapath bitwidth** | customer-set data bitwidth (or activation bitwidth): int8 or int16 | |
+| **weight bitwidth** | customer-set weight bitwidth: int8, int16 or int4. int4 only for certain HW. | |
+| **fps** | estimated frames per second. | |
+| **ITC** | estimated inference time. | |
+| **RDMA bandwidth** | effective peak RDMA bandwidth, set based on HW | |
+| **WDMA bandwidth** | effective peak WDMA bandwidth, set based on HW | |
+| **GETW bandwidth** | effective peak weight loading bandwidth, set based on HW | |
+| **RV** | Total data loaded (excluding weight load) from DDR in one inference | |
+| **WV** | Total data written to DDR in one inference | |
+| **cpu node** | CPU nodes in the model will be listed here | if any cpu node exists |
+| **SNR(dB)** | The SNR of the fixed-point model inference results. 
+| mode 2 and 3 |
+| **btm_dynasty_path** | path to inference results | mode 2 and 3 |
+| **btm** | check the bit-true-match between dynasty and csim inference | mode 2 and 3 |
+| **bie** | generated bie file (fixed-point model) for dynasty inference | mode 1/2/3 |
+| **nef** | generated nef file (fixed-point model) for csim / dongle inference | mode 1/2/3 |
+| **backend node graph** | the graph after node fusion and decomposition, with backend node information. | |
+| **gen fx model report** | file name of this report | |
@@ -75,20 +76,23 @@ The summary will show the IP evaluator information. Below are some examples of r

Figure 8. Node details for platform 730, mode 2 (with fixed-point model generated and SNR check).

-| **column** | **explanation** | **availability** |
-|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
-| **node** | model operation node name after node fusion and decomposition | |
-| **SNR** | SNR score between fixed-point model and original model (per layer) | every layer for mode 3 and only output layer for mode 2 |
-| **node origin** | corresponding operation node name in original onnx before node fusion and decomposition | |
-| **type** | NPU / FUSED / CPU | |
-| **node backend** | corresponding backend node name | |
-| **CMD_node_idx** | index of command node | below info not available for 520 |
-| **bw in / bw out / bw weight** | input / output / weight bitwidth for this node | mode 1 / 2 / 3 |
-| **MAC_cycle** | MAC engine runtime cycle number for this backend node. | |
-| **MAC_runtime(ms)** | MAC engine runtime for this backend node. | |
-| **RDMA_amount(Byte)** | RDMA amount for this backend node. | |
-| **WDMA_amount(Byte)** | WDMA amount for this backend node. | |
-| **Weight_amount(Byte)** | weight amount for this backend node. | |
-| **runtime(ms)** | operator runtime. | |
-| **in_fmt / out_fmt** | input/output data formats. If only one input/output or multiple inputs/outputs with same format, the only format will be shown. If multiple formats for this node, then the details will be listed as “FORMAT1:IN1,IN2 \ FORMAT2:IN3”. 
-| |
+| **column** | **explanation** | **availability** |
+| ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| **node** | model operation node name after node fusion and decomposition | |
+| **SNR** | SNR score between fixed-point model and original model (per layer) | every layer for mode 3 and only output layer for mode 2 |
+| **node origin** | corresponding operation node name in original onnx before node fusion and decomposition | |
+| **type** | NPU / FUSED / CPU | |
+| **node backend** | corresponding backend node name | |
+| **CMD_node_idx** | index of command node | below info not available for 520 |
+| **bw in / bw out / bw weight** | input / output / weight bitwidth for this node | mode 1 / 2 / 3 |
+| **MAC_cycle** | MAC engine runtime cycle number for this backend node. | |
+| **MAC_runtime(ms)** | MAC engine runtime for this backend node. | |
+| **RDMA_amount(Byte)** | RDMA amount for this backend node. | |
+| **WDMA_amount(Byte)** | WDMA amount for this backend node. | |
+| **Weight_amount(Byte)** | weight amount for this backend node. | |
+| **runtime(ms)** | operator runtime. It's the total runtime including CFUNC, PFUNC, and SYNC. | |
+| **CFUNC_runtime(ms)** | CFUNC runtime. | |
+| **PFUNC_runtime(ms)** | PFUNC runtime. | |
+| **SYNC_runtime(ms)** | SYNC runtime. | |
+| **in_fmt / out_fmt** | input/output data formats. If only one input/output or multiple inputs/outputs with same format, the only format will be shown. If multiple formats for this node, then the details will be listed as “FORMAT1:IN1,IN2 \ FORMAT2:IN3”. 
+| |
diff --git a/docs/toolchain/appendix/history.md b/docs/toolchain/appendix/history.md
index db1f8da..649f4e9 100644
--- a/docs/toolchain/appendix/history.md
+++ b/docs/toolchain/appendix/history.md
@@ -24,6 +24,18 @@
 
 ## Toolchain Change log
 
+* **[v0.32.0]**
+  * Add Einsum defusion in kneronnxopt.
+  * Support Cast to int64 in knerex and compiler.
+  * Support HardSwish, TopK and Split nodes in knerex and compiler.
+  * Update the regression flow log printing. Print success logs separately from errors to avoid confusion.
+  * Update IP evaluator for DMA with small length.
+  * Fix the kneronnxopt bug in `replace_Gather_with_Slice`.
+  * Fix the knerex bug: node Concat channel mismatch.
+  * Fix the dynasty float bug in InstanceNorm pad edge mode.
+  * Fix knerex/compiler bug in CPU node settings for the Resize node.
+  * Verify opset18 operator validity for knerex and compiler.
+  * Reduce memory usage (especially for large models) for compiler.
 * **[v0.31.1]**
   * Add `const_in_bitwidth_mode` option for quantization. The default is int16. Unless the customer particularly desires to increase the speed, it can be changed to int8
   * Update analyzer exception log.
diff --git a/docs/toolchain/manual_1_overview.md b/docs/toolchain/manual_1_overview.md
index bf66af7..3e9f52d 100644
--- a/docs/toolchain/manual_1_overview.md
+++ b/docs/toolchain/manual_1_overview.md
@@ -4,8 +4,8 @@
 
 # 1. Toolchain Overview
 
-**2025 Nov**
-**Toolchain v0.31.1**
+**2026 Mar**
+**Toolchain v0.32.0**
 
 ## 1.1. Introduction
 
@@ -19,18 +19,18 @@ In this document, you'll learn:
 3. How to utilize the tools through Python API.
 
 **Major changes of the current version**
-* **[v0.31.1]**
-  * Add `const_in_bitwidth_mode` option for quantization. The default is int16. Unless the customer particularly desires to increase the speed, it can be changed to int8
-  * Update analyzer exception log.
-  * Update kneronnxopt to set expanding dilated Conv to False by default. 
-  * Update kneronnxopt to diable fusing BatchNormalization into Conv by default.
-  * Update compiler for the deep search memory estimation algorithm.
-  * Update compiler to extend the timeout for deep search.
-  * Update compiler to change expt/log/softmax to 16b.
-  * Fix the ktc bug in some default output path names.
-  * Fix the kneronnxopt bug in duplicating shared weights.
-  * Fix the compiler bug in broadcasting.
-  * Fix the compiler bug that Concat channel axis not supported.
+* **[v0.32.0]**
+  * Add Einsum defusion in kneronnxopt.
+  * Support Cast to int64 in knerex and compiler.
+  * Support HardSwish, TopK and Split nodes in knerex and compiler.
+  * Update the regression flow log printing. Print success logs separately from errors to avoid confusion.
+  * Update IP evaluator for DMA with small length.
+  * Fix the kneronnxopt bug in `replace_Gather_with_Slice`.
+  * Fix the knerex bug: node Concat channel mismatch.
+  * Fix the dynasty float bug in InstanceNorm pad edge mode.
+  * Fix knerex/compiler bug in CPU node settings for the Resize node.
+  * Verify opset18 operator validity for knerex and compiler.
+  * Reduce memory usage (especially for large models) for compiler.
 
 ## 1.2. Workflow Overview
 
diff --git a/docs/toolchain/manual_4_bie.md b/docs/toolchain/manual_4_bie.md
index 7e94182..ced1c84 100644
--- a/docs/toolchain/manual_4_bie.md
+++ b/docs/toolchain/manual_4_bie.md
@@ -53,9 +53,9 @@ Args:
 * optimize (int, optional): level of featuremap optimization, which is a search runtime optimization based on partial graph comparison. Worse performance than deep search. It is recommended to enable when model is huge or search runtime is long. 0-4, the larger number, the better model performance, but takes longer. Defaults to 0.
   * 0: the knerex generated quantization model.
   * 1: bias adjust parallel, no featuremap cut improvement.
-  * 2: bias adjust parallel, with featuremap cut improvement. 
+  * 2: bias adjust parallel, with featuremap cut improvement (same as `compiler_tiling="deep_search"`).
   * 3: bias adjust sequential, no featuremap cut improvement. SLOW!
-  * 4: bias adjust sequential, with featuremap cut improvement. SLOW!
+  * 4: bias adjust sequential, with featuremap cut improvement (same as `compiler_tiling="deep_search"`). SLOW!
 
 Please also note that this step would be very time-consuming since it analysis the model with every input data you provide. 
 
@@ -71,7 +71,7 @@ input_images = [preprocess("/workspace/examples/mobilenetv2/images/" + image_nam
 input_mapping = {"images": input_images}
 
 # Quantization with only deep_search enabled.
-bie_path = km.analysis(input_mapping, threads = 4, fm_cut='deep_search')
+bie_path = km.analysis(input_mapping, threads = 4, compiler_tiling='deep_search')
 ```
 
 Since toolchain v0.21.0, the analysis step also generates a detailed report in html format. You can find it under
From bdf7b51f72df70a5844f04101458a08deb94a8ea Mon Sep 17 00:00:00 2001
From: Jiyuan Liu
Date: Fri, 6 Mar 2026 15:32:35 +0800
Subject: [PATCH 2/2] Fix typos.

---
 docs/toolchain/manual_1_overview.md |  2 +-
 docs/toolchain/manual_4_bie.md      | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/toolchain/manual_1_overview.md b/docs/toolchain/manual_1_overview.md
index 3e9f52d..7dbcd60 100644
--- a/docs/toolchain/manual_1_overview.md
+++ b/docs/toolchain/manual_1_overview.md
@@ -4,7 +4,7 @@
 
 # 1. Toolchain Overview
 
-**2026 Mar**
+**2026-03**
 **Toolchain v0.32.0**
 
 ## 1.1. Introduction
diff --git a/docs/toolchain/manual_4_bie.md b/docs/toolchain/manual_4_bie.md
index ced1c84..4afa099 100644
--- a/docs/toolchain/manual_4_bie.md
+++ b/docs/toolchain/manual_4_bie.md
@@ -1,6 +1,6 @@
 # 4. BIE Workflow
-As mentioned briefly in the previous section, the bie file is the model file which is usually generated after quantization. It is encrpyted and not available for visuanlization. 
+As mentioned briefly in the previous section, the bie file is the model file which is usually generated after quantization. It is encrypted and not available for visualization.
 
 In this chapter, we would go through the steps of quantization.
 
 ## 4.1. Quantization
@@ -22,7 +22,7 @@ Args:
 * output_dir (str, optional): path to the output directory. Defaults to "/data1/kneron_flow"".
 * threads (int, optional): multithread setting. Defaults to 4.
 * quantize_mode (str, optional): quantize_mode setting. Currently support default and post_sigmoid. Defaults to "default".
-* datapath_range_method (str, optional): could be 'mmse' or 'percentage. mmse: use snr-based-range method. percentage: use arbitary percentage. Default to 'percentage'.
+* datapath_range_method (str, optional): could be 'mmse' or 'percentage'. mmse: use snr-based-range method. percentage: use arbitrary percentage. Defaults to 'percentage'.
 * percentile (float, optional): used under 'mmse' mode. The range to search. The larger the value, the larger the search range, the better the performance but the longer the simulation time. Defaults to 0.001,
 * outlier_factor (float, optional): used under 'mmse' mode. The factor applied on outliers. For example, if clamping data is sensitive to your model, set outlier_factor to 2 or higher. Higher outlier_factor will reduce outlier removal by increasing range. Defaults to 1.0.
 * percentage (float, optional): used under 'percentage' mode. Suggest to set value between 0.999 and 1.0. Use 1.0 for detection models. **Must be smaller than or equal to percentage_16b.** Defaults to 0.999.
@@ -59,7 +59,7 @@ Args:
 
 Please also note that this step would be very time-consuming since it analysis the model with every input data you provide. 
 
-Here as a simple example, we only use four input image as exmaple and run it with the `ktc.ModelConfig` object `km` created in section 3.2:
+Here as a simple example, we only use four input images and run the analysis with the `ktc.ModelConfig` object `km` created in section 3.2:
 
 ```python
 # Preprocess images as the quantization inputs. The preprocess function is defined in the previous section.
@@ -122,7 +122,7 @@ Raises:
 
 ## 4.3. FAQ
 
-### 4.3.1. What if the E2E simulator results of floating-point and fixed-point lost too match accuracy?
+### 4.3.1. What if the E2E simulator results of floating-point and fixed-point lose too much accuracy?
 
 Please try the following solutions:
 
@@ -148,7 +148,7 @@ Please consider replace the unsupported nodes with other nodes.
 
 **Causes**:
 
-This error can be caused by many differenct reasons. Here are the possible reasons:
+This error can have many different causes. Here are the possible reasons:
 
 1. The most common ones are that the input image number is too large, the thread number is too large and the model is too large which causes the FP analyser killed by the system.
 2. The path in the configuration file is invalid. Thus, the updater failed to load it.
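A note on the `SNR(dB)` column that the fx_report changes above document: the report does not spell out the exact SNR formula, but a common definition (an assumption here, not taken from the toolchain) compares the fixed-point inference output against the floating-point reference as 10·log10 of signal power over quantization-error power. A minimal, self-contained sketch:

```python
import math


def snr_db(reference, quantized):
    """Illustrative SNR in dB between float reference outputs and a
    fixed-point approximation: 10 * log10(signal_power / noise_power).

    NOTE: this is an assumed, textbook definition -- not necessarily the
    exact formula the toolchain's fx report uses.
    """
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise_power == 0.0:
        return float("inf")  # bit-exact match, e.g. when btm passes
    return 10.0 * math.log10(signal_power / noise_power)


# Toy example: snap a small "activation" vector to a coarse 1/16 grid to
# mimic quantization error, then measure how much signal survives.
ref = [0.12, -0.53, 0.98, -0.27, 0.44]
fx = [round(v * 16) / 16 for v in ref]
print(f"{snr_db(ref, fx):.1f} dB")
```

Higher is better; comparing per-layer figures (mode 3) against the output-layer-only figure (mode 2) can help localize which layer loses precision.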