
feat: naive attention kernel#51

Merged
tetsuo-cpp merged 5 commits into canon from flash-attention on Feb 22, 2026

Conversation

@tetsuo-cpp
Owner

Summary

  • Add NVIDIA libdevice bitcode to lib/Bitcode/ and link it in the NVVM pipeline for math intrinsics support
  • Add naive scaled dot-product attention kernel with causal masking and numerically stable softmax
  • Add FileCheck pipeline test and end-to-end GPU test with NumPy reference validation
  • Tighten Vast.ai offer search with reliability and bandwidth filters

Closes #44

@tetsuo-cpp changed the title from "feat: naive flash attention kernel" to "feat: naive attention kernel" on Feb 22, 2026
Owner Author

@tetsuo-cpp tetsuo-cpp left a comment


Code Review

I traced the stack state through the full kernel; the logic is correct:

  • Dot product: Q[row,:] . K[t,:] via a loop, scaled by 1/sqrt(head_dim)
  • Causal mask: t > row → -1e30
  • Softmax: numerically stable (max subtraction before exp), serial reductions by thread 0 ✓
  • V accumulation: O[row,t] = sum_j attn[j] * V[j,t]
  • Stack is clean at exit
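The per-row logic above can be sketched as a NumPy reference (a hypothetical helper mirroring the review's bullets, not the kernel's actual Forth source):

```python
import numpy as np

def naive_causal_attention(q, k, v):
    """Illustrative reference for the kernel's per-row logic:
    scaled dot product, causal mask, stable softmax, V accumulation."""
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros_like(v)
    for row in range(seq_len):
        # Dot product: Q[row,:] . K[t,:] via a loop, scaled by 1/sqrt(head_dim)
        scores = np.array([np.dot(q[row], k[t]) * scale for t in range(seq_len)])
        # Causal mask: future positions (t > row) get -1e30
        scores[np.arange(seq_len) > row] = -1e30
        # Numerically stable softmax: subtract the max before exponentiating
        scores -= scores.max()
        attn = np.exp(scores)
        attn /= attn.sum()
        # V accumulation: O[row,t] = sum_j attn[j] * V[j,t]
        out[row] = attn @ v
    return out
```

Each row only ever attends to positions at or before itself, so row 0 reduces to copying V[0].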

Issues

1. Vendored binary: lib/Bitcode/libdevice.10.bc (484 KB)

This is NVIDIA's libdevice checked directly into git (no LFS). Concerns:

  • Repo bloat: once a binary blob is committed it stays in git history even if later deleted, so every clone pays the cost permanently.
  • Licensing: libdevice is distributed under the NVIDIA EULA. Redistributing it may require attribution or may not be permitted.
  • Alternative: Download from the CUDA toolkit at build time, use Git LFS, or locate it at the system CUDA install path (/usr/local/cuda/nvvm/libdevice/libdevice.10.bc).

2. Hardcoded libdevice path via compile definition

target_compile_definitions(obj.MLIRConversionPasses PRIVATE
  WARPFORTH_LIBDEVICE_PATH="${WARPFORTH_LIBDEVICE_PATH}")

The path is baked in at compile time. If the build directory moves or the binary is installed elsewhere, the pipeline will fail. Consider a fallback chain: env var → CUDA toolkit path → bundled path.
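A minimal sketch of such a fallback chain (the name find_libdevice and the exact candidate order are illustrative; the real resolution would live in the C++ conversion pass):

```python
import os
from pathlib import Path

def find_libdevice(compiled_in_path=None):
    """Resolve libdevice.10.bc: env override -> system CUDA install ->
    build-time default. Raises if no candidate exists on disk."""
    candidates = [
        os.environ.get("WARPFORTH_LIBDEVICE_PATH"),        # 1. env var override
        "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc",  # 2. CUDA toolkit path
        compiled_in_path,                                  # 3. baked-in fallback
    ]
    for c in candidates:
        if c and Path(c).is_file():
            return c
    raise FileNotFoundError(
        "libdevice.10.bc not found; set WARPFORTH_LIBDEVICE_PATH")
```

This keeps the compile-time definition as a last resort rather than the only option, so a relocated build directory no longer breaks the pipeline.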

3. Variable shadowing in test assert (minor)

assert result == [pytest.approx(v) for v in expected]

The loop variable v shadows the outer v numpy array (line 616). Harmless, but a linter would flag it; rename it to e or x.
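A minimal sketch of the rename (the values here are illustrative; in the real test, result and expected come from the kernel run and the NumPy reference):

```python
import pytest

# Illustrative stand-ins for the test's actual values.
expected = [1.0, 2.0, 3.0]
result = [1.0000001, 2.0, 3.0]

# Loop variable renamed to `e`, so it no longer shadows the outer `v` array.
assert result == [pytest.approx(e) for e in expected]
```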

Non-blocking observations

  • Duplicate kernel source between test/Pipeline/attention.forth and inline in test_kernels.py — could drift apart, but acceptable since the tests serve different purposes (FileCheck vs runtime).
  • Serial reductions (thread 0 loops) are correct for "naive" — parallel reductions would be the natural follow-up.
  • FileCheck test is minimal (only checks gpu.binary @warpforth_module) — reasonable as a pipeline smoke test.

Overall the kernel and test logic look solid. The main concern is the vendored libdevice binary.

@tetsuo-cpp tetsuo-cpp merged commit c6d27e5 into canon Feb 22, 2026
1 check passed
@tetsuo-cpp tetsuo-cpp deleted the flash-attention branch February 22, 2026 08:39
