2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -87,7 +87,7 @@ uv run ruff format gpu_test/

- **Stack Type**: `!forth.stack` - untyped stack, programmer ensures type safety
- **Operations**: All take stack as input and produce stack as output (except `forth.stack`)
-- **Supported Words**: literals (integer `42` and float `3.14`), `DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL`, `+ - * / MOD`, `F+ F- F* F/` (float arithmetic), `FEXP FSQRT FLOG FABS FNEG` (float math intrinsics), `FMAX FMIN` (float min/max), `AND OR XOR NOT LSHIFT RSHIFT`, `= < > <> <= >= 0=`, `F= F< F> F<> F<= F>=` (float comparison), `S>F F>S` (int/float conversion), `@ !` (global memory), `F@ F!` (float global memory), `S@ S!` (shared memory), `SF@ SF!` (float shared memory), `I8@ I8! SI8@ SI8!` (i8 memory), `I16@ I16! SI16@ SI16!` (i16 memory), `I32@ I32! SI32@ SI32!` (i32 memory), `HF@ HF! SHF@ SHF!` (f16 memory), `BF@ BF! SBF@ SBF!` (bf16 memory), `F32@ F32! SF32@ SF32!` (f32 memory), `CELLS`, `IF ELSE THEN`, `BEGIN UNTIL`, `BEGIN WHILE REPEAT`, `DO LOOP +LOOP I J K`, `LEAVE UNLOOP EXIT`, `{ a b -- }` (local variables in word definitions), `TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID` (GPU indexing).
+- **Supported Words**: literals (integer `42` and float `3.14`), `DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL`, `+ - * / MOD`, `F+ F- F* F/` (float arithmetic), `FEXP FSQRT FLOG FABS FNEG` (float math intrinsics), `FMAX FMIN` (float min/max), `AND OR XOR NOT LSHIFT RSHIFT`, `= < > <> <= >= 0=`, `F= F< F> F<> F<= F>=` (float comparison), `S>F F>S` (int/float conversion), `@ !` (global memory), `F@ F!` (float global memory), `S@ S!` (shared memory), `SF@ SF!` (float shared memory), `I8@ I8! SI8@ SI8!` (i8 memory), `I16@ I16! SI16@ SI16!` (i16 memory), `I32@ I32! SI32@ SI32!` (i32 memory), `HF@ HF! SHF@ SHF!` (f16 memory), `BF@ BF! SBF@ SBF!` (bf16 memory), `F32@ F32! SF32@ SF32!` (f32 memory), `CELLS`, `IF ELSE THEN`, `BEGIN UNTIL`, `BEGIN WHILE REPEAT`, `DO LOOP +LOOP I J K`, `LEAVE UNLOOP EXIT`, `{ a b -- }` (local variables in word definitions), `TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID` (GPU indexing), `BARRIER` (thread block synchronization).
- **Float Literals**: Numbers containing `.` or `e`/`E` are parsed as f64 (e.g. `3.14`, `-2.0`, `1.0e-5`, `1e3`). Stored on the stack as i64 bit patterns; F-prefixed words perform bitcast before/after operations.
- **Kernel Parameters**: Declared in the `\!` header. `\! kernel <name>` is required and must appear first. `\! param <name> i64[<N>]` becomes a `memref<Nxi64>` argument; `\! param <name> i64` becomes an `i64` argument. `\! param <name> f64[<N>]` becomes a `memref<Nxf64>` argument; `\! param <name> f64` becomes an `f64` argument (bitcast to i64 when pushed to stack). Using a param name in code emits `forth.param_ref` (arrays push address; scalars push value).
- **Shared Memory**: `\! shared <name> i64[<N>]` or `\! shared <name> f64[<N>]` declares GPU shared (workgroup) memory. Emits a tagged `memref.alloca` at kernel entry; ForthToGPU converts it to a `gpu.func` workgroup attribution. Using the shared name in code pushes its base address onto the stack. Use `S@`/`S!` for i64 or `SF@`/`SF!` for f64 shared accesses. Cannot be referenced inside word definitions.
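
The float-literal convention above (f64 values carried on the stack as i64 bit patterns, with F-prefixed words bitcasting around each operation) can be illustrated with a small host-side sketch. This is only a Python model of the idea, not compiler code; the helper names `f64_to_i64_bits`, `i64_bits_to_f64`, and `f_add` are hypothetical.

```python
import struct


def f64_to_i64_bits(x: float) -> int:
    """Reinterpret an f64 as a signed 64-bit integer (bitcast, no numeric conversion)."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def i64_bits_to_f64(bits: int) -> float:
    """Reinterpret a signed 64-bit integer as an f64 (the inverse bitcast)."""
    return struct.unpack("<d", struct.pack("<q", bits))[0]


def f_add(a_bits: int, b_bits: int) -> int:
    """Hypothetical model of an F-prefixed word such as F+:
    bitcast both operands to f64, add, then bitcast the result back to i64."""
    return f64_to_i64_bits(i64_bits_to_f64(a_bits) + i64_bits_to_f64(b_bits))


# Two float literals as they would sit on the i64 stack.
a = f64_to_i64_bits(3.14)
b = f64_to_i64_bits(-2.0)

# The stack slot holds the raw bit pattern, not a rounded integer.
print(hex(f64_to_i64_bits(1.0)))        # 0x3ff0000000000000
print(i64_bits_to_f64(f_add(a, b)))     # same value as 3.14 + (-2.0) on the host
```

The point of the model is that applying an integer word such as `+` to these slots would add bit patterns rather than values, which is why the float words exist as a separate family.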
115 changes: 115 additions & 0 deletions README.md
@@ -0,0 +1,115 @@
# WarpForth

An MLIR-based Forth compiler for programming GPU kernels. WarpForth defines a custom MLIR dialect for Forth stack operations and lowers through a pipeline of passes to PTX assembly.

## Dependencies

- LLVM/MLIR
- CMake
- C++17 compiler
- CUDA toolkit (for GPU execution)
- [uv](https://github.com/astral-sh/uv) (for Python test tooling)

## Building

```bash
# Configure
cmake -B build -G Ninja \
  -DMLIR_DIR=/path/to/llvm/lib/cmake/mlir \
  -DLLVM_DIR=/path/to/llvm/lib/cmake/llvm

# Build
cmake --build build
```

## Quick Start

Write a naive integer matrix multiply kernel (M=2, N=3, K=4, one thread per output element) and save it as `matmul.forth`:

```forth
\! kernel main
\! param A i64[8]
\! param B i64[12]
\! param C i64[6]

\ One thread computes C[row, col] where gid = row*N + col.
GLOBAL-ID
DUP 3 /
SWAP 3 MOD
0
4 0 DO
2 PICK
I SWAP 4 * +
CELLS A + @
I 3 * 3 PICK + CELLS B + @
* +
LOOP
2 PICK 3 * 2 PICK +
CELLS C + !
```

Compile to PTX:

```bash
./build/bin/warpforthc matmul.forth -o matmul.ptx
```

Test on a GPU (A is 2x4 row-major, B is 4x3 row-major, C is 2x3 output):

```bash
./build/bin/warpforth-runner matmul.ptx \
  --param 'i64[]:1,2,3,4,5,6,7,8' \
  --param 'i64[]:1,2,3,4,5,6,7,8,9,10,11,12' \
  --param 'i64[]:0,0,0,0,0,0' \
  --grid 6,1,1 --block 1,1,1 \
  --output-param 2 --output-count 6
```
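
To check the values that `C` should hold after this run, a host-side reference can mirror the kernel's indexing with the same inputs. This is a minimal Python sketch; `matmul_reference` is a hypothetical helper, not part of the toolchain, and it says nothing about how `warpforth-runner` formats its output.

```python
# Reference model of the Quick Start kernel: one simulated "thread" per output element.
M, N, K = 2, 3, 4


def matmul_reference(a, b):
    """Compute C = A x B on flat row-major lists, following the kernel's index math."""
    c = [0] * (M * N)
    for gid in range(M * N):           # one iteration per thread (--grid 6,1,1 --block 1,1,1)
        row, col = divmod(gid, N)      # DUP 3 /   and   SWAP 3 MOD
        acc = 0
        for i in range(K):             # 4 0 DO ... LOOP
            acc += a[row * K + i] * b[i * N + col]
        c[row * N + col] = acc         # CELLS C + !
    return c


A = [1, 2, 3, 4, 5, 6, 7, 8]                     # 2x4, row-major
B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]      # 4x3, row-major
print(matmul_reference(A, B))  # [70, 80, 90, 158, 184, 210]
```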

## Toolchain

| Tool | Description |
|------|-------------|
| `warpforthc` | Compiles Forth source to PTX |
| `warpforth-translate` | Translates Forth source to MLIR, and MLIR to PTX assembly |
| `warpforth-opt` | Runs individual MLIR passes or the entire pipeline |
| `warpforth-runner` | Executes PTX kernels on a GPU for testing |

These tools can be composed for debugging or inspecting intermediate stages:

```bash
./build/bin/warpforth-translate --forth-to-mlir kernel.forth | \
./build/bin/warpforth-opt --warpforth-pipeline | \
./build/bin/warpforth-translate --mlir-to-ptx
```
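
When the goal is to inspect each intermediate stage rather than only the final PTX, the same three commands can be driven from a script that keeps every stage's output. The sketch below uses only the flags shown above; the helper names `pipeline_commands` and `run_stages` are hypothetical, and it assumes each tool reads standard input when no input file is given, as the shell pipe implies.

```python
import subprocess


def pipeline_commands(source, bin_dir="./build/bin"):
    """Return (stage label, argv) pairs matching the shell pipeline above."""
    return [
        ("forth dialect", [f"{bin_dir}/warpforth-translate", "--forth-to-mlir", str(source)]),
        ("lowered MLIR", [f"{bin_dir}/warpforth-opt", "--warpforth-pipeline"]),
        ("PTX", [f"{bin_dir}/warpforth-translate", "--mlir-to-ptx"]),
    ]


def run_stages(source, bin_dir="./build/bin"):
    """Run each stage, feeding the previous stage's stdout in as stdin.

    Returns a dict mapping stage label to that stage's textual output.
    Raises subprocess.CalledProcessError if any tool exits with an error.
    """
    outputs = {}
    data = None  # the first stage reads the source file named on its command line
    for label, argv in pipeline_commands(source, bin_dir):
        result = subprocess.run(argv, input=data, capture_output=True, text=True, check=True)
        data = result.stdout
        outputs[label] = data
    return outputs


# Show the commands without running them (running requires a built toolchain).
for label, argv in pipeline_commands("kernel.forth"):
    print(f"{label}: {' '.join(argv)}")
```

Calling `run_stages("kernel.forth")` on a machine with the tools built would return the text of all three stages, so a failing lowering can be narrowed down to the stage whose output first looks wrong.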

## Language Reference

WarpForth supports stack operations, integer and float arithmetic, control flow, global and shared memory access, reduced-width memory types, user-defined words with local variables, and GPU-specific operations.

See [docs/language.md](docs/language.md) for the full language reference.

## Architecture

WarpForth compiles Forth through a series of MLIR dialect lowerings, each replacing higher-level abstractions with lower-level ones until the program is expressed entirely in LLVM IR and can be handed to the NVPTX backend.

| Stage | Pass | Description |
|-------|-------------|-------------|
| **Parsing** | `warpforth-translate --forth-to-mlir` | Parses Forth source into the `forth` dialect. The kernel is represented as a series of stack ops on an abstract `!forth.stack` type. |
| **Stack lowering** | `warpforth-opt --convert-forth-to-memref` | The abstract `!forth.stack` type is materialized as a `memref<256xi64>` buffer and `index` pair. Stack ops become explicit loads, stores, and pointer arithmetic. |
| **GPU wrapping** | `warpforth-opt --convert-forth-to-gpu` | Functions are wrapped in a `gpu.module`, the kernel entry point is marked as a `gpu.kernel`, and GPU intrinsic words are lowered to `gpu` ops. |
| **NVVM/LLVM lowering** | Standard MLIR passes | GPU→NVVM, math→LLVM intrinsics and NVVM→LLVM. |
| **Code generation** | `warpforth-translate --mlir-to-ptx` | The GPU module is serialized to PTX assembly via LLVM's NVPTX backend. |

## Demo

The `demo/` directory contains a GPT-2 text generation demo that routes scaled dot-product attention through a WarpForth-compiled kernel. See [demo/README.md](demo/README.md) for setup instructions.

## Testing

```bash
# Run the LIT test suite
cmake --build build --target check-warpforth

# Run end-to-end GPU tests (requires Vast.ai API key)
VASTAI_API_KEY=xxx uv run pytest -v -m gpu
```
4 changes: 2 additions & 2 deletions demo/README.md
@@ -18,7 +18,7 @@ A pre-compiled `attention.ptx` is included in this directory.
## Step 2: Upload to GPU Instance

```bash
-scp -r demo/ demo/gpt2_generate.py root@HOST:/workspace
+scp -r demo/ root@HOST:/workspace
```

## Step 3: Install Dependencies (Remote)
@@ -30,7 +30,7 @@ pip install pycuda transformers
## Step 4: Generate Text (Remote)

```bash
-python /workspace/gpt2_generate.py --ptx /workspace/attention.ptx --prompt "The meaning of life is"
+python /workspace/demo/gpt2_generate.py --ptx /workspace/demo/attention.ptx --prompt "The meaning of life is"
```

| Flag | Default | Description |