
feat: naive attention kernel#51

Merged
tetsuo-cpp merged 5 commits into canon from flash-attention on Feb 22, 2026

Conversation

@tetsuo-cpp
Owner

Summary

  • Add NVIDIA libdevice bitcode to lib/Bitcode/ and link it in the NVVM pipeline for math intrinsics support
  • Add naive scaled dot-product attention kernel with causal masking and numerically stable softmax
  • Add FileCheck pipeline test and end-to-end GPU test with NumPy reference validation
  • Tighten Vast.ai offer search with reliability and bandwidth filters

Closes #44

@tetsuo-cpp changed the title from "feat: naive flash attention kernel" to "feat: naive attention kernel" on Feb 22, 2026
Owner Author

@tetsuo-cpp tetsuo-cpp left a comment


Code Review

I traced the stack state through the full kernel; the logic is correct:

  • Dot product: Q[row,:] . K[t,:] via a loop, scaled by 1/sqrt(head_dim)
  • Causal mask: t > row → -1e30
  • Softmax: numerically stable (max subtraction before exp), serial reductions by thread 0 ✓
  • V accumulation: O[row,t] = sum_j attn[j] * V[j,t]
  • Stack is clean at exit
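The per-row logic above can be sketched as a NumPy reference (a hypothetical helper mirroring the review's bullets, not the kernel's actual Forth source):

```python
import numpy as np

def naive_causal_attention(q, k, v):
    """Illustrative reference for the kernel's per-row logic:
    scaled dot product, causal mask, stable softmax, V accumulation."""
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros_like(v)
    for row in range(seq_len):
        # Dot product: Q[row,:] . K[t,:] via a loop, scaled by 1/sqrt(head_dim)
        scores = np.array([np.dot(q[row], k[t]) * scale for t in range(seq_len)])
        # Causal mask: future positions (t > row) get -1e30
        scores[np.arange(seq_len) > row] = -1e30
        # Numerically stable softmax: subtract the max before exponentiating
        scores -= scores.max()
        attn = np.exp(scores)
        attn /= attn.sum()
        # V accumulation: O[row,t] = sum_j attn[j] * V[j,t]
        out[row] = attn @ v
    return out
```

Each row only ever attends to positions at or before itself, so row 0 reduces to copying V[0].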

Issues

1. Vendored binary: lib/Bitcode/libdevice.10.bc (484 KB)

This is NVIDIA's libdevice checked directly into git (no LFS). Concerns:

  • Repo bloat: once a binary blob is committed it stays in git history even if later deleted, so every clone pays the cost permanently.
  • Licensing: libdevice is distributed under the NVIDIA EULA. Redistributing it may require attribution or may not be permitted.
  • Alternative: Download from the CUDA toolkit at build time, use Git LFS, or locate it at the system CUDA install path (/usr/local/cuda/nvvm/libdevice/libdevice.10.bc).

2. Hardcoded libdevice path via compile definition

target_compile_definitions(obj.MLIRConversionPasses PRIVATE
  WARPFORTH_LIBDEVICE_PATH="${WARPFORTH_LIBDEVICE_PATH}")

The path is baked in at compile time. If the build directory moves or the binary is installed elsewhere, the pipeline will fail. Consider a fallback chain: env var → CUDA toolkit path → bundled path.
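A minimal sketch of such a fallback chain (the name find_libdevice and the exact candidate order are illustrative; the real resolution would live in the C++ conversion pass):

```python
import os
from pathlib import Path

def find_libdevice(compiled_in_path=None):
    """Resolve libdevice.10.bc: env override -> system CUDA install ->
    build-time default. Raises if no candidate exists on disk."""
    candidates = [
        os.environ.get("WARPFORTH_LIBDEVICE_PATH"),        # 1. env var override
        "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc",  # 2. CUDA toolkit path
        compiled_in_path,                                  # 3. baked-in fallback
    ]
    for c in candidates:
        if c and Path(c).is_file():
            return c
    raise FileNotFoundError(
        "libdevice.10.bc not found; set WARPFORTH_LIBDEVICE_PATH")
```

This keeps the compile-time definition as a last resort rather than the only option, so a relocated build directory no longer breaks the pipeline.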

3. Variable shadowing in test assert (minor)

assert result == [pytest.approx(v) for v in expected]

The loop variable v shadows the outer v numpy array (line 616). Harmless, but a linter would flag it; rename it to e or x.
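A minimal sketch of the rename (the values here are illustrative; in the real test, result and expected come from the kernel run and the NumPy reference):

```python
import pytest

# Illustrative stand-ins for the test's actual values.
expected = [1.0, 2.0, 3.0]
result = [1.0000001, 2.0, 3.0]

# Loop variable renamed to `e`, so it no longer shadows the outer `v` array.
assert result == [pytest.approx(e) for e in expected]
```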

Non-blocking observations

  • Duplicate kernel source between test/Pipeline/attention.forth and inline in test_kernels.py — could drift apart, but acceptable since the tests serve different purposes (FileCheck vs runtime).
  • Serial reductions (thread 0 loops) are correct for "naive" — parallel reductions would be the natural follow-up.
  • FileCheck test is minimal (only checks gpu.binary @warpforth_module) — reasonable as a pipeline smoke test.

Overall the kernel and test logic look solid. The main concern is the vendored libdevice binary.

@tetsuo-cpp tetsuo-cpp merged commit c6d27e5 into canon Feb 22, 2026
1 check passed
@tetsuo-cpp tetsuo-cpp deleted the flash-attention branch February 22, 2026 08:39
