aws-neuron · ymwangg · Jun 2, 2026 · Jun 3, 2026
diff --git a/nkigen-lite/README.md b/nkigen-lite/README.md
@@ -0,0 +1,7 @@
+# nkigen-lite
+
+Lightweight IR-based kernel generation backend for NKIPy.
+
+Provides a tensor-level IR (`tensor_ir`) and tile-level NKI IR (`nki_ir`) with
+lowering passes to convert high-level tensor operations into NeuronCore-native
+tile operations.
diff --git a/nkigen-lite/docs/floor_divide_precision.md b/nkigen-lite/docs/floor_divide_precision.md
@@ -0,0 +1,136 @@
+# Floor-Divide Precision on NeuronCore
+
+This document explains the precision strategy used by nkigen-lite for
+`floor_divide` and `mod` operations, how it was derived from neuronx-cc's
+behavior, and the remaining known issue.
+
+## Background
+
+NeuronCore hardware has no native division or floor instruction. Division
+is implemented as `a * reciprocal(b)` where `reciprocal` is a NISA scalar
+engine instruction with ~23-bit precision. Near an integer quotient, that
+rounding error can flip the result of `floor`. For example:
+
+- `1.0 / 0.6238614321` may produce `1.60292005` instead of the true `1.60292008`
+- `0.6238625646 * 1.60292005` may produce `0.999995` instead of `1.0000018`
+- `floor(0.999995)` gives `0` instead of the correct `1`
+
+A naive `floor(a * reciprocal(b))` implementation therefore produces wrong
+results for approximately 0.003% of elements: those where `a/b` lands within
+1 ULP of an exact integer.
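
The failure mode can be reproduced off-device with a toy model. The sketch below assumes a hypothetical `approx_reciprocal` that clears the low mantissa bits of an exact float32 reciprocal. That helper and its `drop_bits` parameter are illustrative stand-ins for limited reciprocal precision, not a model of the actual NISA instruction.

```python
import numpy as np


def approx_reciprocal(b, drop_bits=4):
    """Hypothetical low-precision reciprocal (assumption, not the NISA op):
    the exact float32 1/b with its lowest `drop_bits` mantissa bits cleared."""
    b = np.atleast_1d(np.asarray(b, dtype=np.float32))
    exact = np.float32(1.0) / b
    mask = np.uint32((0xFFFFFFFF << drop_bits) & 0xFFFFFFFF)
    return (exact.view(np.uint32) & mask).view(np.float32)


def naive_floor_divide(a, b):
    """floor(a * reciprocal(b)) with no verification step."""
    a = np.atleast_1d(np.asarray(a, dtype=np.float32))
    return np.floor(a * approx_reciprocal(b))


b = np.float32(3.0)
a = np.float32(3.0) + np.float32(2.0 ** -20)  # slightly above 3, so a / b > 1

naive = naive_floor_divide(a, b)      # product lands just below 1.0
reference = np.floor_divide(a, b)     # NumPy's result for the same inputs
print("naive:", naive[0], "reference:", reference)
```

The truncated reciprocal pulls the product just under 1.0, so the naive path floors to the integer below the one NumPy reports.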
+
+## neuronx-cc's Strategy (from BIR inspection)
+
+We examined the BIR (Backend IR) generated by neuronx-cc for HLO's
+`floor_divide` operation by compiling with `SaveTemps` and inspecting the
+generated `bir.json`:
+
+```
+[0-1] Load a, b                     — DMA from HBM to SBUF
+[2]   Reciprocal(b)                  — approximate 1/b
+[3]   TensorTensor(a, 1/b)           — q ≈ a/b (approximate quotient)
+[4]   GenericCopy(q) f32→f32         — copy for floor computation
+[5]   GenericCopy(q) f32→i32         — truncate to integer (trunc)
+[6]   TensorTensor(b, trunc)         — back-multiply: b * trunc
+[7]   TensorScalarPtr(logical_xor)   — sign bit comparison (uint8)
+[8]   TensorScalarPtr(mult, add)     — conditional correction (int32)
+[9-11] TensorTensor                  — final result assembly
+[12]  Save                           — DMA to HBM
+```
+
+Key insight: neuronx-cc does NOT use Newton-Raphson to refine the
+reciprocal. Instead, it uses a **divide-then-verify-and-correct** strategy:
+
+1. Compute approximate quotient via reciprocal
+2. Truncate to integer via i32 cast
+3. Back-multiply to verify: `remainder = a - b * trunc_q`
+4. Correct based on sign of remainder vs sign of divisor
+
+## nkigen-lite's Implementation
+
+We implement the same strategy in the decompose pass
+(`tensor_ir/passes/decompose.py`), expressed as tensor IR operations:
+
+```
+floor_divide(a, b):
+  q_approx = a * reciprocal(b)          # approximate quotient
+  q = floor(q_approx)                    # integer part (via i32 cast)
+  rem = a - b * q                        # back-verify
+
+  # Correction 1: quotient was too high (remainder has wrong sign)
+  corr_down = max(0, -(sign(rem) * sign(b)))
+
+  # Correction 2: quotient was too low (|remainder| exceeds |divisor|)
+  corr_up = max(0, sign(|rem| - |b|))
+
+  result = q - corr_down + corr_up
+```
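
As a rough illustration, the pseudocode above can be transcribed into NumPy. This is a sketch under stated assumptions rather than the decompose pass itself: `approx_reciprocal` is a hypothetical stand-in that truncates mantissa bits, and `np.floor` replaces the i32-cast lowering.

```python
import numpy as np


def approx_reciprocal(b, drop_bits=4):
    # Hypothetical stand-in for a limited-precision reciprocal (assumption):
    # exact float32 1/b with the lowest mantissa bits cleared.
    exact = np.float32(1.0) / b
    mask = np.uint32((0xFFFFFFFF << drop_bits) & 0xFFFFFFFF)
    return (exact.view(np.uint32) & mask).view(np.float32)


def floor_divide_corrected(a, b):
    a = np.atleast_1d(np.asarray(a, dtype=np.float32))
    b = np.atleast_1d(np.asarray(b, dtype=np.float32))
    zero = np.float32(0.0)

    q = np.floor(a * approx_reciprocal(b))   # approximate integer quotient
    rem = a - b * q                          # back-verify

    # Quotient too high: remainder and divisor have opposite signs.
    corr_down = np.maximum(zero, -(np.sign(rem) * np.sign(b)))
    # Quotient too low: |remainder| exceeds |divisor|.
    corr_up = np.maximum(zero, np.sign(np.abs(rem) - np.abs(b)))
    return q - corr_down + corr_up


a = np.array([3.0 + 2.0 ** -20, -(3.0 + 2.0 ** -20), 7.5, -7.5], dtype=np.float32)
b = np.array([3.0, 3.0, 2.0, 2.0], dtype=np.float32)
result = floor_divide_corrected(a, b)
print(result, np.floor_divide(a, b))
```

The first element exercises `corr_up` and the second exercises `corr_down`; the last two need no correction. Under this toy reciprocal model the exactly divisible case `a == b` is not repaired, because `sign(|rem| - |b|)` is 0 there. Whether the hardware reciprocal can produce that case is outside what the sketch can show.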
+
+The `floor` operation itself is lowered to NISA as:
+
+```
+floor(x):
+  trunc_i32 = tensor_copy(x)            # f32 → i32 truncates toward zero
+  trunc_f   = tensor_copy(trunc_i32)    # i32 → f32 back to float
+  diff      = x - trunc_f               # fractional residual
+  correction = relu(-sign(diff))        # 1 when x < trunc (negative frac)
+  result    = trunc_f - correction      # subtract 1 for negative fracs
+```
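
The same lowering can be sketched in NumPy, where `astype(np.int32)` plays the role of the truncating `tensor_copy`. The function name `floor_via_trunc` is hypothetical, and the sketch only holds for inputs whose magnitude fits in an int32.

```python
import numpy as np


def floor_via_trunc(x):
    """floor(x) built from a truncating f32 -> i32 cast plus a correction.
    Assumes |x| < 2**31 so the int32 cast does not overflow."""
    x = np.asarray(x, dtype=np.float32)
    trunc_i32 = x.astype(np.int32)             # truncates toward zero
    trunc_f = trunc_i32.astype(np.float32)     # back to float
    diff = x - trunc_f                         # fractional residual
    # relu(-sign(diff)): 1 only when x is below its truncated value.
    correction = np.maximum(np.float32(0.0), -np.sign(diff))
    return trunc_f - correction


x = np.array([-2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5], dtype=np.float32)
floored = floor_via_trunc(x)
print(floored)
```

Only negative non-integers pick up the correction, since those are the inputs where truncation toward zero rounds up.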
+
+## Verification
+
+The correction strategy was checked against numpy's `floor_divide` in four
+settings. The first three are exact; the fourth shows the residual mismatch
+described under Remaining Issue:
+
+1. **Tensor IR interpreter**: bitwise-exact match
+2. **NKI IR interpreter**: zero mismatches on 256×256 random arrays
+3. **Hardware execution** (via `nb.compile_and_execute`): zero mismatches
+4. **Hardware execution** (via Spike `DeviceKernel`): 0-1 mismatches per 65536 elements
+
+## Remaining Issue
+
+When running through nkipy's full pipeline (compile via
+`nki.compiler.kernel_builder` then execute via Spike's `DeviceKernel` in
+the same process), 1 out of 65536 elements can produce wrong results.
+
+### Root Cause
+
+The issue is a **nanobind shared-state conflict** between
+`nki.compiler.kernel_builder` and `spike._spike` when both are loaded in
+the same Python process. At import time, warnings appear:
+
+```
+RuntimeWarning: nanobind: type 'TensorMetadata' was already registered!
+RuntimeWarning: nanobind: type 'Spike' was already registered!
+```
+
+This corrupts internal native state that affects execution correctness for
+numerically-sensitive instruction sequences. When compilation and execution
+run in separate processes, the issue does not occur.
+
+### Evidence
+
+| Execution method | Result (1 is correct) | Process isolation |
+|:----------------|:------:|:-----------------:|
+| `nb.compile_and_execute()` | 1 ✓ | Single process (no Spike) |
+| `nb.compile_kernel()` + `compiled.execute()` | 1 ✓ | Single process (no Spike) |
+| Separate compile process + separate Spike process | 1 ✓ | Isolated |
+| nkipy pipeline (compile + Spike in same process) | 0 ✗ | Shared |
+
+### Workaround
+
+Run nkigen-lite compilation in a subprocess to isolate it from the Spike
+execution runtime. In the evidence above, process isolation removes the
+mismatch, and this approach requires no changes to Spike or NKI.
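
One way to structure that isolation is to launch a fresh interpreter for the compile step and pass data over stdin/stdout. The sketch below is a generic pattern only: `CHILD_CODE`, `compile_isolated`, and the JSON "artifact" are hypothetical placeholders, and the child does not call any real nkigen-lite or NKI API.

```python
import json
import subprocess
import sys

# Code run in the child interpreter. A real child would import the compiler
# here, so its native extensions never load in the parent process.
# The "artifact" below is a placeholder, not a real compilation result.
CHILD_CODE = """
import json, sys
spec = json.loads(sys.stdin.read())
artifact = {"kernel": spec["op"], "status": "compiled"}
json.dump(artifact, sys.stdout)
"""


def compile_isolated(spec: dict) -> dict:
    """Run the (placeholder) compile step in a separate Python process."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=json.dumps(spec),
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)


artifact = compile_isolated({"op": "floor_divide"})
print(artifact)
```

Because the child is a separate process, any module-level native state it registers disappears when it exits, and the parent is free to import the execution runtime on its own.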
+
+### Reproducer
+
+See `nkigen-lite/tests/spike_floor_divide_bug.py` for a self-contained
+reproducer script.
+
+## Related Operations
+
+- **mod(a, b)**: decomposed as `a - b * floor_divide(a, b)`, inherits the
+  same precision characteristics.
+- **ceil(x)**: decomposed as `neg(floor(neg(x)))`, uses the same floor
+  lowering.
+- **power(a, b)**: decomposed as `exp(b * log(a))`, which is defined only for
+  `a > 0`, since NISA's `POW` arith op only supports scalar exponents.
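
The three identities in this list can be checked with NumPy. The helper names are hypothetical, and the sketch uses NumPy's own `floor_divide` and `floor`, so it illustrates the decompositions rather than the hardware precision path.

```python
import numpy as np


def mod_via_floor_divide(a, b):
    # mod(a, b) = a - b * floor_divide(a, b)
    return a - b * np.floor_divide(a, b)


def ceil_via_floor(x):
    # ceil(x) = neg(floor(neg(x)))
    return -np.floor(-x)


def power_via_exp_log(a, b):
    # power(a, b) = exp(b * log(a)); only meaningful for a > 0
    return np.exp(b * np.log(a))


a = np.array([7.5, -7.5, 5.0, 0.25])
b = np.array([2.0, 2.0, 3.0, 0.5])
pos = np.array([0.5, 2.0, 3.0, 10.0])  # strictly positive bases

mod_result = mod_via_floor_divide(a, b)
ceil_result = ceil_via_floor(a)
pow_result = power_via_exp_log(pos, b)
print(mod_result, ceil_result, pow_result)
```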