diff --git a/CLAUDE.md b/CLAUDE.md index e034783..48c2601 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -87,7 +87,7 @@ uv run ruff format gpu_test/ - **Stack Type**: `!forth.stack` - untyped stack, programmer ensures type safety - **Operations**: All take stack as input and produce stack as output (except `forth.stack`) -- **Supported Words**: literals (integer `42` and float `3.14`), `DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL`, `+ - * / MOD`, `F+ F- F* F/` (float arithmetic), `FEXP FSQRT FLOG FABS FNEG` (float math intrinsics), `FMAX FMIN` (float min/max), `AND OR XOR NOT LSHIFT RSHIFT`, `= < > <> <= >= 0=`, `F= F< F> F<> F<= F>=` (float comparison), `S>F F>S` (int/float conversion), `@ !` (global memory), `F@ F!` (float global memory), `S@ S!` (shared memory), `SF@ SF!` (float shared memory), `I8@ I8! SI8@ SI8!` (i8 memory), `I16@ I16! SI16@ SI16!` (i16 memory), `I32@ I32! SI32@ SI32!` (i32 memory), `HF@ HF! SHF@ SHF!` (f16 memory), `BF@ BF! SBF@ SBF!` (bf16 memory), `F32@ F32! SF32@ SF32!` (f32 memory), `CELLS`, `IF ELSE THEN`, `BEGIN UNTIL`, `BEGIN WHILE REPEAT`, `DO LOOP +LOOP I J K`, `LEAVE UNLOOP EXIT`, `{ a b -- }` (local variables in word definitions), `TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID` (GPU indexing). +- **Supported Words**: literals (integer `42` and float `3.14`), `DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL`, `+ - * / MOD`, `F+ F- F* F/` (float arithmetic), `FEXP FSQRT FLOG FABS FNEG` (float math intrinsics), `FMAX FMIN` (float min/max), `AND OR XOR NOT LSHIFT RSHIFT`, `= < > <> <= >= 0=`, `F= F< F> F<> F<= F>=` (float comparison), `S>F F>S` (int/float conversion), `@ !` (global memory), `F@ F!` (float global memory), `S@ S!` (shared memory), `SF@ SF!` (float shared memory), `I8@ I8! SI8@ SI8!` (i8 memory), `I16@ I16! SI16@ SI16!` (i16 memory), `I32@ I32! SI32@ SI32!` (i32 memory), `HF@ HF! SHF@ SHF!` (f16 memory), `BF@ BF! SBF@ SBF!` (bf16 memory), `F32@ F32! 
SF32@ SF32!` (f32 memory), `CELLS`, `IF ELSE THEN`, `BEGIN UNTIL`, `BEGIN WHILE REPEAT`, `DO LOOP +LOOP I J K`, `LEAVE UNLOOP EXIT`, `{ a b -- }` (local variables in word definitions), `TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID` (GPU indexing), `BARRIER` (thread block synchronization). - **Float Literals**: Numbers containing `.` or `e`/`E` are parsed as f64 (e.g. `3.14`, `-2.0`, `1.0e-5`, `1e3`). Stored on the stack as i64 bit patterns; F-prefixed words perform bitcast before/after operations. - **Kernel Parameters**: Declared in the `\!` header. `\! kernel <name>` is required and must appear first. `\! param <name> i64[<N>]` becomes a `memref<Nxi64>` argument; `\! param <name> i64` becomes an `i64` argument. `\! param <name> f64[<N>]` becomes a `memref<Nxf64>` argument; `\! param <name> f64` becomes an `f64` argument (bitcast to i64 when pushed to stack). Using a param name in code emits `forth.param_ref` (arrays push address; scalars push value). - **Shared Memory**: `\! shared <name> i64[<N>]` or `\! shared <name> f64[<N>]` declares GPU shared (workgroup) memory. Emits a tagged `memref.alloca` at kernel entry; ForthToGPU converts it to a `gpu.func` workgroup attribution. Using the shared name in code pushes its base address onto the stack. Use `S@`/`S!` for i64 or `SF@`/`SF!` for f64 shared accesses. Cannot be referenced inside word definitions. diff --git a/README.md b/README.md new file mode 100644 index 0000000..8ac9021 --- /dev/null +++ b/README.md @@ -0,0 +1,115 @@ +# WarpForth + +An MLIR-based Forth compiler for programming GPU kernels. WarpForth defines a custom MLIR dialect for Forth stack operations and lowers through a pipeline of passes to PTX assembly. 
+ +## Dependencies + +- LLVM/MLIR +- CMake +- C++17 compiler +- CUDA toolkit (for GPU execution) +- [uv](https://github.com/astral-sh/uv) (for Python test tooling) + +## Building + +```bash +# Configure +cmake -B build -G Ninja \ + -DMLIR_DIR=/path/to/llvm/lib/cmake/mlir \ + -DLLVM_DIR=/path/to/llvm/lib/cmake/llvm + +# Build +cmake --build build +``` + +## Quick Start + +Write a naive integer matrix multiply kernel (M=2, N=3, K=4, one thread per output element): + +```forth +\! kernel main +\! param A i64[8] +\! param B i64[12] +\! param C i64[6] + +\ One thread computes C[row, col] where gid = row*N + col. +GLOBAL-ID +DUP 3 / +SWAP 3 MOD +0 +4 0 DO + 2 PICK + I SWAP 4 * + + CELLS A + @ + I 3 * 3 PICK + CELLS B + @ + * + +LOOP +2 PICK 3 * 2 PICK + +CELLS C + ! +``` + +Compile to PTX: + +```bash +./build/bin/warpforthc matmul.forth -o matmul.ptx +``` + +Test on a GPU (A is 2x4 row-major, B is 4x3 row-major, C is 2x3 output): + +```bash +./build/bin/warpforth-runner matmul.ptx \ + --param 'i64[]:1,2,3,4,5,6,7,8' \ + --param 'i64[]:1,2,3,4,5,6,7,8,9,10,11,12' \ + --param 'i64[]:0,0,0,0,0,0' \ + --grid 6,1,1 --block 1,1,1 \ + --output-param 2 --output-count 6 +``` + +## Toolchain + +| Tool | Description | +|------|-------------| +| `warpforthc` | Compiles Forth source to PTX | +| `warpforth-translate` | Translates from Forth source to MLIR and MLIR to PTX assembly | +| `warpforth-opt` | Runs individual MLIR passes or entire pipeline | +| `warpforth-runner` | Executes PTX kernels on a GPU for testing | + +These tools can be composed for debugging or inspecting intermediate stages: + +```bash +./build/bin/warpforth-translate --forth-to-mlir kernel.forth | \ + ./build/bin/warpforth-opt --warpforth-pipeline | \ + ./build/bin/warpforth-translate --mlir-to-ptx +``` + +## Language Reference + +WarpForth supports stack operations, integer and float arithmetic, control flow, global and shared memory access, reduced-width memory types, user-defined words with local variables, 
and GPU-specific operations. + +See [docs/language.md](docs/language.md) for the full language reference. + +## Architecture + +WarpForth compiles Forth through a series of MLIR dialect lowerings, each replacing higher-level abstractions with lower-level ones until the program is expressed entirely in LLVM IR and can be handed to the NVPTX backend. + +| Stage | Pass | Description | +|-------|-------------|-------------| +| **Parsing** | `warpforth-translate --forth-to-mlir` | Parses Forth source into the `forth` dialect. The kernel is represented as a series of stack ops on an abstract `!forth.stack` type. | +| **Stack lowering** | `warpforth-opt --convert-forth-to-memref` | The abstract `!forth.stack` type is materialized as a `memref<256xi64>` buffer and `index` pair. Stack ops become explicit loads, stores, and pointer arithmetic. | +| **GPU wrapping** | `warpforth-opt --convert-forth-to-gpu` | Functions are wrapped in a `gpu.module`, the kernel entry point is marked as a `gpu.kernel` and GPU intrinsic words are lowered to `gpu` ops. | +| **NVVM/LLVM lowering** | Standard MLIR passes | GPU→NVVM, math→LLVM intrinsics and NVVM→LLVM. | +| **Code generation** | `warpforth-translate --mlir-to-ptx` | The GPU module is serialized to PTX assembly via LLVM's NVPTX backend. | + +## Demo + +The `demo/` directory contains a GPT-2 text generation demo that routes scaled dot-product attention through a WarpForth-compiled kernel. See [demo/README.md](demo/README.md) for setup instructions. + +## Testing + +```bash +# Run the LIT test suite +cmake --build build --target check-warpforth + +# Run end-to-end GPU tests (requires Vast.ai API key) +VASTAI_API_KEY=xxx uv run pytest -v -m gpu +``` diff --git a/demo/README.md b/demo/README.md index 7da17ba..f1a7224 100644 --- a/demo/README.md +++ b/demo/README.md @@ -18,7 +18,7 @@ A pre-compiled `attention.ptx` is included in this directory. 
## Step 2: Upload to GPU Instance ```bash -scp -r demo/ demo/gpt2_generate.py root@HOST:/workspace +scp -r demo/ root@HOST:/workspace ``` ## Step 3: Install Dependencies (Remote) @@ -30,7 +30,7 @@ pip install pycuda transformers ## Step 4: Generate Text (Remote) ```bash -python /workspace/gpt2_generate.py --ptx /workspace/attention.ptx --prompt "The meaning of life is" +python /workspace/demo/gpt2_generate.py --ptx /workspace/demo/attention.ptx --prompt "The meaning of life is" ``` | Flag | Default | Description | diff --git a/docs/language.md b/docs/language.md new file mode 100644 index 0000000..b151bc5 --- /dev/null +++ b/docs/language.md @@ -0,0 +1,310 @@ +# WarpForth Language Reference + +## Kernel Header + +Every WarpForth program begins with header directives prefixed by `\!`. The header is required because Forth's stack-passing convention doesn't provide typed parameter declarations for the kernel interface. It's also used to capture information about the kernel that is better specified in a declarative manner rather than with Forth semantics, such as what shared memory buffers are used by the kernel. + +### Kernel Declaration + +```forth +\! kernel main +``` + +Required. Must appear first. Names the GPU kernel entry point. + +### Parameters + +```forth +\! param DATA i64[256] \ array of 256 i64 → memref<256xi64> +\! param N i64 \ scalar i64 +\! param WEIGHTS f64[128] \ array of 128 f64 → memref<128xf64> +\! param SCALE f64 \ scalar f64 (bitcast to i64 on stack) +``` + +- Array parameters become `memref` arguments. Using the name as a word pushes the base address. +- Scalar parameters become value arguments. Using the name as a word pushes the value. +- `f64` scalars are bitcast to i64 when pushed to the stack; use `F`-prefixed words to operate on them. + +### Shared Memory + +```forth +\! shared SCRATCH i64[64] +\! shared SCORES f64[1024] +``` + +Declares GPU shared memory. Using the name as a word pushes its base address. 
Access with `S@`/`S!` (i64) or `SF@`/`SF!` (f64). Cannot be referenced inside word definitions. + +## Literals + +### Integer Literals + +Plain numbers are parsed as i64: + +```forth +42 -1 0 255 +``` + +### Float Literals + +Numbers containing `.` or `e`/`E` are parsed as f64 and stored on the stack as i64 bit patterns: + +```forth +3.14 -2.0 1.0e-5 1e3 +``` + +Use `F`-prefixed words (`F+`, `F*`, etc.) to operate on float values. + +## Stack Operations + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `DUP` | `( a -- a a )` | Duplicate top | +| `DROP` | `( a -- )` | Discard top | +| `SWAP` | `( a b -- b a )` | Swap top two | +| `OVER` | `( a b -- a b a )` | Copy second to top | +| `ROT` | `( a b c -- b c a )` | Rotate third to top | +| `NIP` | `( a b -- b )` | Drop second | +| `TUCK` | `( a b -- b a b )` | Copy top below second | +| `PICK` | `( xn ... x0 n -- xn ... x0 xn )` | Copy nth item to top | +| `ROLL` | `( xn ... x0 n -- xn-1 ... x0 xn )` | Move nth item to top | + +## Arithmetic + +### Integer Arithmetic + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `+` | `( a b -- a+b )` | Add | +| `-` | `( a b -- a-b )` | Subtract | +| `*` | `( a b -- a*b )` | Multiply | +| `/` | `( a b -- a/b )` | Divide | +| `MOD` | `( a b -- a%b )` | Modulo | + +### Float Arithmetic + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `F+` | `( a b -- a+b )` | Float add | +| `F-` | `( a b -- a-b )` | Float subtract | +| `F*` | `( a b -- a*b )` | Float multiply | +| `F/` | `( a b -- a/b )` | Float divide | + +### Float Math Intrinsics + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `FEXP` | `( a -- exp(a) )` | Exponential | +| `FSQRT` | `( a -- sqrt(a) )` | Square root | +| `FLOG` | `( a -- log(a) )` | Natural logarithm | +| `FABS` | `( a -- |a| )` | Absolute value | +| `FNEG` | `( a -- -a )` | Negate | +| `FMAX` | `( a b -- max(a,b) )` | Maximum | +| 
`FMIN` | `( a b -- min(a,b) )` | Minimum | + +## Bitwise Operations + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `AND` | `( a b -- a&b )` | Bitwise AND | +| `OR` | `( a b -- a\|b )` | Bitwise OR | +| `XOR` | `( a b -- a^b )` | Bitwise XOR | +| `NOT` | `( a -- ~a )` | Bitwise NOT | +| `LSHIFT` | `( a n -- a<<n )` | Left shift | +| `RSHIFT` | `( a n -- a>>n )` | Right shift | + +## Comparison + +### Integer Comparison + +All comparisons push 1 (true) or 0 (false). + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `=` | `( a b -- flag )` | Equal | +| `<` | `( a b -- flag )` | Less than | +| `>` | `( a b -- flag )` | Greater than | +| `<>` | `( a b -- flag )` | Not equal | +| `<=` | `( a b -- flag )` | Less or equal | +| `>=` | `( a b -- flag )` | Greater or equal | +| `0=` | `( a -- flag )` | Equal to zero | + +### Float Comparison + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `F=` | `( a b -- flag )` | Float equal | +| `F<` | `( a b -- flag )` | Float less than | +| `F>` | `( a b -- flag )` | Float greater than | +| `F<>` | `( a b -- flag )` | Float not equal | +| `F<=` | `( a b -- flag )` | Float less or equal | +| `F>=` | `( a b -- flag )` | Float greater or equal | + +## Type Conversion + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `S>F` | `( n -- f )` | Integer to float (i64 → f64 bit pattern) | +| `F>S` | `( f -- n )` | Float to integer (f64 bit pattern → i64) | + +## Memory Access + +### Address Arithmetic + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `CELLS` | `( n -- n*8 )` | Convert cell index to byte offset (8 bytes per cell) | + +### Global Memory (i64) + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `@` | `( addr -- value )` | Load i64 from global memory | +| `!` | `( value addr -- )` | Store i64 to global memory | + +### Global Memory (f64) + +| Word | Stack Effect | Description | 
+|------|-------------|-------------| +| `F@` | `( addr -- value )` | Load f64 from global memory (as i64 bit pattern) | +| `F!` | `( value addr -- )` | Store f64 to global memory (from i64 bit pattern) | + +### Shared Memory (i64) + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `S@` | `( addr -- value )` | Load i64 from shared memory | +| `S!` | `( value addr -- )` | Store i64 to shared memory | + +### Shared Memory (f64) + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `SF@` | `( addr -- value )` | Load f64 from shared memory (as i64 bit pattern) | +| `SF!` | `( value addr -- )` | Store f64 to shared memory (from i64 bit pattern) | + +### Reduced-Width Memory + +These words load/store narrower types, converting to/from the stack's native i64. + +**Integer types** — load sign-extends to i64, store truncates from i64: + +| Word | Width | Memory | Description | +|------|-------|--------|-------------| +| `I8@` / `I8!` | 8-bit | Global | Load/store i8 | +| `SI8@` / `SI8!` | 8-bit | Shared | Load/store i8 (shared) | +| `I16@` / `I16!` | 16-bit | Global | Load/store i16 | +| `SI16@` / `SI16!` | 16-bit | Shared | Load/store i16 (shared) | +| `I32@` / `I32!` | 32-bit | Global | Load/store i32 | +| `SI32@` / `SI32!` | 32-bit | Shared | Load/store i32 (shared) | + +**Float types** — load extends to f64 then bitcasts to i64, store bitcasts i64 to f64 then truncates: + +| Word | Width | Memory | Description | +|------|-------|--------|-------------| +| `HF@` / `HF!` | 16-bit | Global | Load/store f16 | +| `SHF@` / `SHF!` | 16-bit | Shared | Load/store f16 (shared) | +| `BF@` / `BF!` | 16-bit | Global | Load/store bf16 | +| `SBF@` / `SBF!` | 16-bit | Shared | Load/store bf16 (shared) | +| `F32@` / `F32!` | 32-bit | Global | Load/store f32 | +| `SF32@` / `SF32!` | 32-bit | Shared | Load/store f32 (shared) | + +## Control Flow + +### Conditionals + +```forth +condition IF + \ executed when condition is nonzero 
+THEN + +condition IF + \ true branch +ELSE + \ false branch +THEN +``` + +### Post-Test Loop + +```forth +BEGIN + \ loop body +condition UNTIL \ exits when condition is nonzero +``` + +### Pre-Test Loop + +```forth +BEGIN condition WHILE + \ loop body +REPEAT +``` + +### Counted Loop + +```forth +limit start DO + \ loop body — I is the current index +LOOP + +limit start DO + \ loop body +n +LOOP \ increment index by n instead of 1 +``` + +| Word | Description | +|------|-------------| +| `I` | Current loop index (innermost loop) | +| `J` | Index of next outer loop | +| `K` | Index of second outer loop | +| `LEAVE` | Exit the innermost loop immediately | +| `UNLOOP` | Discard loop parameters before `EXIT` | +| `EXIT` | Return from the current word | + +## User-Defined Words + +```forth +: square DUP * ; +: add3 + + ; +``` + +### Local Variables + +```forth +: dot-product { a-addr b-addr n -- } + 0 + n 0 DO + I CELLS a-addr + @ + I CELLS b-addr + @ + * + + LOOP +; +``` + +`{ name1 name2 ... -- }` at the start of a word definition binds read-only locals. Values are popped from the stack in reverse name order. Locals work across all control flow structures and map directly to GPU registers. 
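The binding order described above (values popped in reverse name order, so the last name takes the top of the stack) can be modelled on the host. This is a minimal Python sketch, not part of WarpForth: `bind_locals` and `dot_product` are hypothetical names, and memory is assumed to be a plain list indexed by cell, so the `CELLS` byte scaling is left out.

```python
def bind_locals(stack, names):
    """Pop one value per name, last name first, and return a name -> value dict."""
    values = {}
    for name in reversed(names):
        values[name] = stack.pop()
    return values


def dot_product(stack, memory):
    """Host-side model of the `dot-product` word (cell-indexed memory is an assumption)."""
    loc = bind_locals(stack, ["a-addr", "b-addr", "n"])
    acc = 0
    for i in range(loc["n"]):  # models: n 0 DO ... LOOP
        acc += memory[loc["a-addr"] + i] * memory[loc["b-addr"] + i]
    stack.append(acc)


memory = [1, 2, 3, 4, 5, 6]  # a occupies cells 0..2, b occupies cells 3..5
stack = [0, 3, 3]            # pushed in source order: a-addr b-addr n
dot_product(stack, memory)
print(stack)                 # [32]
```

The caller pushes arguments in the same left-to-right order as the names in `{ ... -- }`; `n` is on top, so it is bound first.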
+ +## GPU Operations + +### Thread and Block Indexing + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `TID-X` | `( -- id )` | Thread index in X dimension | +| `TID-Y` | `( -- id )` | Thread index in Y dimension | +| `TID-Z` | `( -- id )` | Thread index in Z dimension | +| `BID-X` | `( -- id )` | Block index in X dimension | +| `BID-Y` | `( -- id )` | Block index in Y dimension | +| `BID-Z` | `( -- id )` | Block index in Z dimension | +| `BDIM-X` | `( -- dim )` | Block dimension in X | +| `BDIM-Y` | `( -- dim )` | Block dimension in Y | +| `BDIM-Z` | `( -- dim )` | Block dimension in Z | +| `GDIM-X` | `( -- dim )` | Grid dimension in X | +| `GDIM-Y` | `( -- dim )` | Grid dimension in Y | +| `GDIM-Z` | `( -- dim )` | Grid dimension in Z | +| `GLOBAL-ID` | `( -- id )` | `BID-X * BDIM-X + TID-X` | + +### Synchronization + +| Word | Stack Effect | Description | +|------|-------------|-------------| +| `BARRIER` | `( -- )` | Thread block barrier (`__syncthreads`) |
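As a rough host-side check of the `GLOBAL-ID` formula, the Python sketch below reproduces the index mapping used by the README Quick Start kernel (one thread per output element, `--grid 6,1,1 --block 1,1,1`). The helper names `global_id` and `matmul_element` are hypothetical and only model the arithmetic; nothing here runs on a GPU or calls WarpForth tooling.

```python
M, N, K = 2, 3, 4                   # shapes from the Quick Start example
A = [1, 2, 3, 4, 5, 6, 7, 8]        # 2x4, row-major
B = list(range(1, 13))              # 4x3, row-major


def global_id(bid_x, bdim_x, tid_x):
    """GLOBAL-ID as defined in the table: BID-X * BDIM-X + TID-X."""
    return bid_x * bdim_x + tid_x


def matmul_element(gid):
    """Value one thread would compute for C[row, col], with gid = row*N + col."""
    row, col = divmod(gid, N)
    return sum(A[row * K + i] * B[i * N + col] for i in range(K))


# Six blocks of one thread each, so TID-X is always 0 and BDIM-X is 1.
C = [matmul_element(global_id(bid, 1, 0)) for bid in range(M * N)]
print(C)  # [70, 80, 90, 158, 184, 210]
```

With larger blocks the same formula applies; for example, block 1 with 4 threads per block gives thread 2 the id `1 * 4 + 2 = 6`.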