Replace warpforth-runner C++ CLI with Python cuda-python + JSON protocol #50

## Motivation

The current `warpforth-runner` is a single-file C++ program that:
- Must be SCP'd to the remote GPU host and compiled with `nvcc` before use
- Uses CLI arguments as an ad-hoc protocol (`--param i64[]:1,2,3 --output-param 0 --output-count 3`)
- Outputs bare CSV to stdout (`1,2,3`)
- Only supports reading back a single output parameter per invocation
- Has no structured error reporting — CUDA errors go to stderr as free-form text

This makes the test harness fragile (string building/parsing in `conftest.py`) and hard to extend.

## Proposal

Replace `warpforth-runner` with a Python script using [`cuda-python`](https://nvidia.github.io/cuda-python/) that communicates via JSON on stdin/stdout.

### Input (JSON on stdin)

```json
{
  "ptx_base64": "Ly8gUFRYIHNvdXJjZQ==",
  "kernel": "main",
  "grid": [1, 1, 1],
  "block": [64, 1, 1],
  "params": [
    { "type": "i64[]", "values": [1, 2, 3, 0, 0] },
    { "type": "f64",   "value": 3.14 },
    { "type": "i64[]", "values": [0, 0, 0, 0, 0] }
  ],
  "outputs": [
    { "param": 0, "count": 3 },
    { "param": 2 }
  ]
}
```

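- **`ptx_base64`**: PTX source encoded as base64 (avoids JSON escaping issues with special characters in PTX)
- **`kernel`**: name of the kernel entry point to launch
- **`grid`** / **`block`**: launch dimensions, three integers each
- **`params`**: kernel parameters in order, each an array (`values`) or a scalar (`value`) with an explicit type, matching the current `--param` semantics
- **`outputs`**: list of params to read back (supports multiple), each referenced by its index into `params`, with an optional `count`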
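As a rough illustration of the harness side, the request above could be assembled with nothing but the standard library. The helper name `build_request` is hypothetical and not part of the existing `conftest.py`; the field names are taken from the example payload.

```python
import base64
import json


def build_request(ptx_source, kernel, grid, block, params, outputs):
    """Assemble the stdin payload as a JSON string (hypothetical helper)."""
    return json.dumps({
        # PTX is base64-encoded so it survives JSON without escaping.
        "ptx_base64": base64.b64encode(ptx_source.encode("utf-8")).decode("ascii"),
        "kernel": kernel,
        "grid": list(grid),
        "block": list(block),
        "params": params,
        "outputs": outputs,
    })


request = build_request(
    "// PTX source",
    kernel="main",
    grid=(1, 1, 1),
    block=(64, 1, 1),
    params=[
        {"type": "i64[]", "values": [1, 2, 3, 0, 0]},
        {"type": "f64", "value": 3.14},
        {"type": "i64[]", "values": [0, 0, 0, 0, 0]},
    ],
    outputs=[{"param": 0, "count": 3}, {"param": 2}],
)
```

The resulting string is what the harness would write to the runner's stdin.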

### Output (JSON on stdout)

Success:
```json
{
  "status": "ok",
  "outputs": [
    { "param": 0, "type": "i64[]", "values": [10, 20, 30] },
    { "param": 2, "type": "i64[]", "values": [42, 43, 44] }
  ]
}
```

Error:
```json
{
  "status": "error",
  "error": "CUDA_ERROR_INVALID_PTX: the provided PTX was invalid"
}
```
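On the consuming side, a minimal sketch of how the harness might decode either response shape. `RunnerError` and `parse_response` are hypothetical names; only the `status`, `error`, and `outputs` fields come from the examples above.

```python
import json


class RunnerError(RuntimeError):
    """Hypothetical exception raised when the runner reports status "error"."""


def parse_response(stdout_text):
    """Return {param index: values} on success, raise RunnerError otherwise."""
    response = json.loads(stdout_text)
    if response.get("status") != "ok":
        raise RunnerError(response.get("error", "unknown runner error"))
    return {out["param"]: out["values"] for out in response["outputs"]}


ok_text = json.dumps({
    "status": "ok",
    "outputs": [
        {"param": 0, "type": "i64[]", "values": [10, 20, 30]},
        {"param": 2, "type": "i64[]", "values": [42, 43, 44]},
    ],
})
results = parse_response(ok_text)
```

Keying the result by param index lets a test assert on several buffers from one invocation.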

### Benefits

- **No remote compilation**: `pip install cuda-python` on the GPU host replaces compiling a C++ file with `nvcc`
- **Structured protocol**: JSON is self-describing, validatable, and trivially extensible
- **Multiple outputs**: read back any number of params in one invocation
- **Inline PTX**: the kernel source is embedded in the JSON payload — no separate file upload
- **Better errors**: structured `status`/`error` fields instead of parsing stderr
- **Simpler test harness**: `conftest.py` builds/parses dicts instead of string-building CLI args and parsing CSV

### Migration

1. Add `warpforth-runner.py` as a new tool (could live in `gpu_test/` or `tools/`)
2. Update `conftest.py` `KernelRunner` to construct JSON and parse JSON responses
3. Remove the `nvcc` compilation step from the remote setup
4. Drop `tools/warpforth-runner/` (the C++ version) once the Python version is proven
5. Add `cuda-python` to the project's Python dependencies
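For step 2, the transport part of `KernelRunner` might reduce to something like the sketch below. It is an assumption about how the harness would invoke the runner, not the existing implementation: `run_kernel` is a hypothetical name, and a stand-in subprocess replaces the real runner so the snippet runs without a GPU.

```python
import json
import subprocess
import sys


def run_kernel(command, request):
    """Send `request` (a dict) to the runner on stdin and decode its JSON reply."""
    proc = subprocess.run(
        command,
        input=json.dumps(request),
        capture_output=True,
        text=True,
        check=False,
    )
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        # The runner died before emitting JSON; fold stderr into the same shape.
        message = proc.stderr.strip() or "runner produced no JSON"
        return {"status": "error", "error": message}


# Stand-in for the real runner: reads the request, replies with an empty result.
FAKE_RUNNER = [
    sys.executable,
    "-c",
    "import json, sys; json.load(sys.stdin); "
    "print(json.dumps({'status': 'ok', 'outputs': []}))",
]
reply = run_kernel(FAKE_RUNNER, {"kernel": "main"})
```

In the real harness, `command` would presumably be the remote invocation of `warpforth-runner.py`; the exact form is left open here.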
