Draft
45 commits
ac1ba42
feat: add nkigen-lite as a standalone IR-based kernel generation backend
ymwangg Jun 2, 2026
934f863
feat: integrate nkigen-lite as a third backend for nkipy
ymwangg Jun 3, 2026
1ed3441
test: run shared test suites with both hlo and nkigen-lite backends
ymwangg Jun 3, 2026
1d93cbb
fix: support keepdims=False reduce and scalar outputs in nkigen-lite
ymwangg Jun 3, 2026
b4b6aa3
fix: improve nkigen-lite op coverage and input validation
ymwangg Jun 3, 2026
69c7f1b
fix: add expand_dims validation and skip HLO-specific error tests
ymwangg Jun 3, 2026
2643fc9
feat: support cross-lane MIN reduction via negate-max-negate
ymwangg Jun 3, 2026
176ee95
test: use on_device_test in alias tests for backend-agnostic execution
ymwangg Jun 3, 2026
b4e23cc
fix: handle scalar broadcast and f16 tensor_scalar_arith
ymwangg Jun 3, 2026
82feeec
feat: add native bitwise ops (AND, OR, XOR) via tensor_tensor_bitvec
ymwangg Jun 3, 2026
770b679
feat: add comparison and scalar bitvec primitives; rewrite floor
ymwangg Jun 3, 2026
8cdd829
feat: support strided slicing and fix numpy array in dynamic_update_s…
ymwangg Jun 4, 2026
a36c7d0
feat: add float8_e4m3 (IEEE) dtype support
ymwangg Jun 4, 2026
d7f3102
fix: set NEURON_RT_VISIBLE_CORES for xdist parallel test isolation
ymwangg Jun 4, 2026
652a484
feat: add comparison ops and where lowering for nkigen-lite
ymwangg Jun 4, 2026
f5ec24b
feat: add cos decomposition via sin(x + π/2)
ymwangg Jun 4, 2026
1defcad
feat: add dot op via composed_impl routing to matmul
ymwangg Jun 4, 2026
d5d2aad
feat: add arctan, bitwise_not, logical_and, constant for nkigen-lite
ymwangg Jun 5, 2026
ae6bff2
fix: correct np.take semantics for nkigen-lite
ymwangg Jun 5, 2026
bbc2a76
feat: add collective ops (all_reduce/gather/reduce_scatter/all_to_all…
ymwangg Jun 5, 2026
b90918d
fix: correct all_gather/reduce_scatter collective dim for nkigen-lite
ymwangg Jun 5, 2026
72774dc
fix: correct floor_divide/mod at exact-integer quotients for nkigen-lite
ymwangg Jun 5, 2026
ea53687
feat: support float8_e4m3fn on nkigen-lite backend
ymwangg Jun 23, 2026
ab68be6
feat: add iota primitive to nkigen-lite tensor_ir
ymwangg Jun 23, 2026
a9e5886
feat: add tril/triu/diag/trace for nkigen-lite
ymwangg Jun 24, 2026
75c7dae
feat: add pad/flip/roll/tile/diff for nkigen-lite
ymwangg Jun 24, 2026
79289c7
feat: add argmax/argmin for nkigen-lite
ymwangg Jun 24, 2026
b9d4c0a
feat: support non-uniform constants + cumsum for nkigen-lite
ymwangg Jun 24, 2026
26cdc0b
feat: add conv2d/conv3d for nkigen-lite via im2col
ymwangg Jun 24, 2026
3412eb2
feat: add repeat + split-with-indices for nkigen-lite
ymwangg Jun 24, 2026
e771269
fix: range-reduce sin/cos arguments for nkigen-lite
ymwangg Jun 24, 2026
9198d74
feat: add topk for nkigen-lite via iterative max-extraction
ymwangg Jun 24, 2026
7e8662c
feat: use hardware max8 primitive for nkigen-lite topk values
ymwangg Jun 24, 2026
eaf0650
feat: topk via canonical max8 + match_replace8 scan (k>8 support)
ymwangg Jun 24, 2026
f6ec7f1
test: fix nkigen-lite xdist failures (core isolation + spike import c…
ymwangg Jun 24, 2026
a1f10a3
test: skip slow/hanging conv2d and conv3d tests on nkigen-lite
ymwangg Jun 24, 2026
e3b5c93
feat: add gather_along_axis + take_along_axis for nkigen-lite
ymwangg Jun 25, 2026
8ed7979
feat: support dynamic (traced) indices in nkigen-lite take
ymwangg Jun 25, 2026
0eec074
feat: scatter family + row-gather for nkigen-lite via indirect DMA
ymwangg Jun 25, 2026
82fbd6b
fix: reshape lowering for boundary-crossing tiles; lift take partitio…
ymwangg Jun 25, 2026
fd79077
feat: support diff prepend/append, non-uniform dynamic_update_slice, …
ymwangg Jun 25, 2026
ecaba84
perf: emit strided slices as strided-DMA descriptors instead of per-e…
ymwangg Jun 25, 2026
2f8f706
perf: fast common-prefix reshape path (free-dim view, no scratch roun…
ymwangg Jun 25, 2026
350cdca
docs: document transpose lowering performance cliff and fix approach
ymwangg Jun 25, 2026
146d745
perf: collapse adjacent in-order axes in transpose lowering (8x on Qw…
ymwangg Jun 26, 2026
7 changes: 7 additions & 0 deletions nkigen-lite/README.md
@@ -0,0 +1,7 @@
# nkigen-lite

Lightweight IR-based kernel generation backend for NKIPy.

Provides a tensor-level IR (`tensor_ir`) and tile-level NKI IR (`nki_ir`) with
lowering passes to convert high-level tensor operations into NeuronCore-native
tile operations.
136 changes: 136 additions & 0 deletions nkigen-lite/docs/floor_divide_precision.md
@@ -0,0 +1,136 @@
# Floor-Divide Precision on NeuronCore

This document explains the precision strategy used by nkigen-lite for
`floor_divide` and `mod` operations, how it was derived from neuronx-cc's
behavior, and the remaining known issue.

## Background

NeuronCore hardware has no native division or floor instruction. Division
is implemented as `a * reciprocal(b)` where `reciprocal` is a NISA scalar
engine instruction with ~23-bit precision. This means:

- `1.0 / 0.6238614321` may produce `1.60292005` instead of the true `1.60292008`
- `0.6238625646 * 1.60292005` may produce `0.999995` instead of `1.0000018`
- `floor(0.999995)` gives `0` instead of the correct `1`

A naive `floor(a * reciprocal(b))` implementation produces wrong results
for approximately 0.003% of elements, namely those where `a/b` lands within
1 ULP of an exact integer.
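
The failure mode can be reproduced without NeuronCore hardware. The sketch below is a NumPy float32 model, not the NISA `reciprocal` instruction: it assumes a correctly rounded IEEE reciprocal and still finds divisors where `b * (1/b)` rounds just below 1, so the floor lands on 0. The helper name `naive_floor_divide` is illustrative only.

```python
import numpy as np

def naive_floor_divide(a, b):
    """floor(a * (1/b)) in float32; a model of the naive lowering."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    recip = np.float32(1.0) / b   # correctly rounded here, unlike the hardware
    return np.floor(a * recip)

# Divide each integer 1..999 by itself; the exact answer is always 1.
vals = np.arange(1, 1000, dtype=np.float32)
wrong = vals[naive_floor_divide(vals, vals) != 1.0]

# 61 is one such divisor: 61 * float32(1/61) rounds to 1 - 2**-24.
print(len(wrong), "of", len(vals), "self-divisions do not floor to 1")
```

Because even a correctly rounded reciprocal shows the problem, a more precise reciprocal alone cannot remove it; some form of verification after the floor is needed.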

## neuronx-cc's Strategy (from BIR inspection)

We examined the BIR (Backend IR) generated by neuronx-cc for HLO's
`floor_divide` operation by compiling with `SaveTemps` and inspecting the
generated `bir.json`:

```
[0-1] Load a, b — DMA from HBM to SBUF
[2] Reciprocal(b) — approximate 1/b
[3] TensorTensor(a, 1/b) — q ≈ a/b (approximate quotient)
[4] GenericCopy(q) f32→f32 — copy for floor computation
[5] GenericCopy(q) f32→i32 — truncate to integer (trunc)
[6] TensorTensor(b, trunc) — back-multiply: b * trunc
[7] TensorScalarPtr(logical_xor) — sign bit comparison (uint8)
[8] TensorScalarPtr(mult, add) — conditional correction (int32)
[9-11] TensorTensor — final result assembly
[12] Save — DMA to HBM
```

Key insight: neuronx-cc does NOT use Newton-Raphson to refine the
reciprocal. Instead, it uses a **divide-then-verify-and-correct** strategy:

1. Compute approximate quotient via reciprocal
2. Truncate to integer via i32 cast
3. Back-multiply to verify: `remainder = a - b * trunc_q`
4. Correct based on sign of remainder vs sign of divisor

## nkigen-lite's Implementation

We implement the same strategy in the decompose pass
(`tensor_ir/passes/decompose.py`), expressed as tensor IR operations:

```
floor_divide(a, b):
    q_approx = a * reciprocal(b)   # approximate quotient
    q = floor(q_approx)            # integer part (via i32 cast)
    rem = a - b * q                # back-verify

    # Correction 1: quotient was too high (remainder has wrong sign)
    corr_down = max(0, -(sign(rem) * sign(b)))

    # Correction 2: quotient was too low (|remainder| exceeds |divisor|)
    corr_up = max(0, sign(|rem| - |b|))

    result = q - corr_down + corr_up
```
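
The pseudocode above can be modeled in NumPy to experiment with the correction terms. This is a sketch under stated assumptions, not the code in `decompose.py`: it uses a correctly rounded float32 reciprocal in place of the hardware instruction, and the `inclusive` flag is a hypothetical variant added for illustration.

```python
import numpy as np

def floor_divide_corrected(a, b, inclusive=False):
    """NumPy float32 model of divide-then-verify-and-correct.

    inclusive=False follows the pseudocode literally: corr_up fires only
    when |rem| > |b|. inclusive=True is an assumed variant that also fires
    when |rem| == |b|.
    """
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    q = np.floor(a * (np.float32(1.0) / b))   # approximate quotient, floored
    rem = a - b * q                            # back-verify
    # Quotient too high: remainder and divisor have opposite signs.
    corr_down = np.maximum(0, -(np.sign(rem) * np.sign(b)))
    # Quotient too low: compare |rem| against |b| (strictly or inclusively).
    excess = np.abs(rem) - np.abs(b)
    if inclusive:
        corr_up = (excess >= 0).astype(np.float32)
    else:
        corr_up = np.maximum(0, np.sign(excess))
    return q - corr_down + corr_up

print(floor_divide_corrected(-7.0, 2.0))                  # matches -7 // 2
print(floor_divide_corrected(61.0, 61.0))                 # literal form
print(floor_divide_corrected(61.0, 61.0, inclusive=True)) # assumed variant
```

In this model the literal form leaves `61 // 61` at 0, because the floored quotient is 0 and the remainder then equals the divisor exactly, so `sign(|rem| - |b|)` is 0. The inclusive variant returns 1. Whether the real pass needs the inclusive comparison depends on the actual reciprocal behavior and is not established by this sketch.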

The `floor` operation itself is lowered to NISA as:

```
floor(x):
    trunc_i32 = tensor_copy(x)          # f32 → i32 truncates toward zero
    trunc_f = tensor_copy(trunc_i32)    # i32 → f32 back to float
    diff = x - trunc_f                  # fractional residual
    correction = relu(-sign(diff))      # 1 when x < trunc (negative frac)
    result = trunc_f - correction       # subtract 1 for negative fracs
```
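
The same truncate-and-correct floor can be checked against `np.floor`. The sketch below assumes NumPy's `astype(np.int32)` stands in for the f32 → i32 `tensor_copy`; both truncate toward zero for values that fit in int32. The function name `floor_via_trunc` is illustrative.

```python
import numpy as np

def floor_via_trunc(x):
    """floor(x) built from an int32 round trip plus a sign-based correction."""
    x = np.asarray(x, dtype=np.float32)
    trunc_f = x.astype(np.int32).astype(np.float32)  # truncate toward zero
    diff = x - trunc_f                               # fractional residual
    correction = np.maximum(0, -np.sign(diff))       # relu(-sign(diff))
    return trunc_f - correction                      # subtract 1 if x < trunc

x = np.array([-2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5], dtype=np.float32)
print(floor_via_trunc(x))
print(np.floor(x))
```

This sketch is only meaningful while `x` fits in the int32 range, since the round trip through int32 cannot represent larger magnitudes.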

## Verification

The correction strategy was checked at four levels. The first three matched
exactly; only the Spike path showed residual mismatches:

1. **Tensor IR interpreter**: bitwise-exact match with numpy's `floor_divide`
2. **NKI IR interpreter**: zero mismatches on 256×256 random arrays
3. **Hardware execution** (via `nb.compile_and_execute`): zero mismatches
4. **Hardware execution** (via Spike `DeviceKernel`): 0-1 mismatches per 65536
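
A mismatch count like the ones listed above can be produced with a small harness. The sketch below is an assumption about how such a check might look, not the project's test code: `count_mismatches`, the value ranges, and the seed are all illustrative, and it compares by value rather than bit pattern.

```python
import numpy as np

def count_mismatches(impl, shape=(256, 256), seed=0):
    """Count elements where impl(a, b) differs from np.floor_divide(a, b)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(-100.0, 100.0, size=shape).astype(np.float32)
    # Keep divisors away from zero; the sign is chosen at random.
    magnitude = rng.uniform(0.5, 4.0, size=shape)
    sign = rng.choice([-1.0, 1.0], size=shape)
    b = (magnitude * sign).astype(np.float32)
    expected = np.floor_divide(a, b)
    actual = np.asarray(impl(a, b), dtype=np.float32)
    return int(np.count_nonzero(actual != expected))

# The reference implementation trivially agrees with itself.
print(count_mismatches(np.floor_divide))
```

Any callable taking two float32 arrays can be passed as `impl`, so the same harness can wrap an interpreter or a device execution path.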

## Remaining Issue

When running through nkipy's full pipeline (compile via
`nki.compiler.kernel_builder` then execute via Spike's `DeviceKernel` in
the same process), 1 out of 65536 elements can produce wrong results.

### Root Cause

The issue is a **nanobind shared-state conflict** between
`nki.compiler.kernel_builder` and `spike._spike` when both are loaded in
the same Python process. At import time, warnings appear:

```
RuntimeWarning: nanobind: type 'TensorMetadata' was already registered!
RuntimeWarning: nanobind: type 'Spike' was already registered!
```

This corrupts internal native state that affects execution correctness for
numerically-sensitive instruction sequences. When compilation and execution
run in separate processes, the issue does not occur.
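
One way to detect the conflict early is to record warnings while the native modules are imported. The sketch below assumes the messages are routed through Python's `warnings` module, as the `RuntimeWarning:` prefix suggests. `collect_nanobind_conflicts` and `_fake_import` are hypothetical names, and the fake import only simulates the quoted message.

```python
import re
import warnings

# Pattern taken from the warning text quoted above.
NANOBIND_DUP = re.compile(r"nanobind: type '.*' was already registered")

def collect_nanobind_conflicts(import_fn):
    """Run import_fn() and return any duplicate-registration warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        import_fn()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, RuntimeWarning)
        and NANOBIND_DUP.search(str(w.message))
    ]

def _fake_import():
    # Stand-in for importing both native modules; emits the quoted message.
    warnings.warn("nanobind: type 'Spike' was already registered!", RuntimeWarning)

conflicts = collect_nanobind_conflicts(_fake_import)
print(conflicts)
```

In a real check, `import_fn` would perform the two imports in question, and a non-empty result could be turned into a test failure or a fallback to process isolation.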

### Evidence

| Execution method | Result (expected 1) | Process isolation |
|:----------------|:------:|:-----------------:|
| `nb.compile_and_execute()` | 1 ✓ | Single process (no Spike) |
| `nb.compile_kernel()` + `compiled.execute()` | 1 ✓ | Single process (no Spike) |
| Separate compile process + separate Spike process | 1 ✓ | Isolated |
| nkipy pipeline (compile + Spike in same process) | 0 ✗ | Shared |

### Workaround

Run nkigen-lite compilation in a subprocess to isolate it from the Spike
execution runtime. This approach would fully resolve the issue without
requiring changes to Spike or NKI.
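
A minimal sketch of the isolation pattern, using only the standard library, is shown below. The child script is a placeholder: a real one would import the compiler and write the artifact, which is not shown here. `compile_in_subprocess`, the JSON request format, and the `kernel.neff` path are assumptions for illustration.

```python
import json
import os
import subprocess
import sys

# Hypothetical child program. A real one would import the compiler here and
# write the compiled artifact to request["output_path"].
CHILD_SCRIPT = """
import json, os, sys
request = json.load(sys.stdin)
json.dump({"artifact": request["output_path"], "pid": os.getpid()}, sys.stdout)
"""

def compile_in_subprocess(output_path, timeout=600.0):
    """Run the compile step in a fresh interpreter and return its JSON reply."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=json.dumps({"output_path": output_path}),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child fails
    )
    return json.loads(proc.stdout)

result = compile_in_subprocess("kernel.neff")
print(result["artifact"], "built in pid", result["pid"])
```

Because the child is a separate interpreter, native modules loaded during compilation never share a process with the execution runtime in the parent.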

### Reproducer

See `nkigen-lite/tests/spike_floor_divide_bug.py` for a self-contained
reproducer script.

## Related Operations

- **mod(a, b)**: decomposed as `a - b * floor_divide(a, b)`, so it inherits
the same precision characteristics.
- **ceil(x)**: decomposed as `neg(floor(neg(x)))`, uses the same floor
lowering.
- **power(a, b)**: decomposed as `exp(b * log(a))` since NISA's `POW`
arith op only supports scalar exponents.
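
The three identities can be sanity-checked in NumPy. This sketch only confirms the algebra on a few float32 values; it says nothing about how nkigen-lite lowers these ops. The function names are illustrative, and the `power` check is restricted to positive bases because `log(a)` is undefined otherwise (how the backend treats non-positive bases is not stated here).

```python
import numpy as np

def mod_via_floor_divide(a, b):
    return a - b * np.floor_divide(a, b)

def ceil_via_floor(x):
    return -np.floor(-x)

def power_via_exp_log(a, b):
    # Only valid for a > 0.
    return np.exp(b * np.log(a))

a = np.array([7.0, -7.0, 7.5, -7.5], dtype=np.float32)
b = np.array([2.0, 2.0, -2.0, -2.0], dtype=np.float32)

print(mod_via_floor_divide(a, b), np.mod(a, b))
print(ceil_via_floor(a), np.ceil(a))
print(power_via_exp_log(np.abs(a), b), np.power(np.abs(a), b))
```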