Closed

46 commits
a40ed4c
Add qd.stream_parallel() context manager for implicit stream parallelism
hughperkins Mar 11, 2026
aa2fa2a
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Mar 12, 2026
be7ad92
Clear stream_parallel_group_id in ForLoopDecoratorRecorder::reset()
hughperkins Mar 12, 2026
ce83281
Reject nested stream_parallel blocks
hughperkins Mar 12, 2026
880abc7
Document stream_parallel launcher design: per-launch streams, shared …
hughperkins Mar 12, 2026
065a3b7
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Mar 12, 2026
cfc6f39
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins Apr 20, 2026
e9ce144
Apply clang-format
hughperkins Apr 20, 2026
007b050
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins Apr 24, 2026
91ca883
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
8cd793c
[Doc] Add stream_parallel() section to streams user guide
hughperkins Apr 28, 2026
e880d07
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
ad720bb
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
6351215
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
470912f
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins May 1, 2026
3b0ba29
Restore deleted comments, fix docstring wrapping, fix per-task adstac…
hughperkins May 1, 2026
1c62eae
Fix clang-format line break in AMDGPU kernel launcher
hughperkins May 1, 2026
e55c84f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
216f7d5
Address Claude review: reject stream_parallel in @qd.func, use non-bl…
hughperkins May 1, 2026
49dc5af
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
d7836e3
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
74604f2
Allow docstrings in stream_parallel kernels, merge base branch updates
hughperkins May 1, 2026
b83b65d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
0c552cd
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
212aeb9
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins May 1, 2026
226c7c5
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
1f471b3
Fix AMDGPU stream flag comment: HIP_STREAM_NON_BLOCKING not CU_STREAM…
hughperkins May 1, 2026
6919fee
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
7b4e2a4
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
88f1bf7
Add stream_parallel_group_id to QD_STMT_DEF_FIELDS for cache key corr…
hughperkins May 1, 2026
ca560b6
Fix clang-format: multi-line QD_STMT_DEF_FIELDS for RangeForStmt and …
hughperkins May 1, 2026
158c8fb
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins May 1, 2026
388a797
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
df0b03a
Fix stream_parallel identity check failing on dual-import-path builds
hughperkins May 1, 2026
acff351
Remove unused ASTResolver import from ast_transformer.py
hughperkins May 1, 2026
70eb471
Fix import sorting in ast_transformer.py
hughperkins May 1, 2026
ebd5e11
Add AST-level fallback for stream_parallel detection
hughperkins May 1, 2026
a6c3852
Add diagnostic info to stream_parallel exclusivity error message
hughperkins May 1, 2026
04e18ba
Merge hp/streams-quadrantsic-2-amdgpu-cpu: resolve streams.md conflict
hughperkins May 1, 2026
3af5bc8
Apply black formatting to function_def_transformer.py
hughperkins May 1, 2026
55b71fb
Merge hp/streams-quadrantsic-2-amdgpu-cpu: integrate adstack bound_ex…
hughperkins May 2, 2026
dbb055c
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins May 2, 2026
af4a306
Skip coverage probes in stream_parallel exclusivity check; restore de…
hughperkins May 2, 2026
c50d034
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 2, 2026
824cabf
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins May 2, 2026
24bc67d
Merge hp/streams-quadrantsic-2-amdgpu-cpu: integrate adstack post-red…
hughperkins May 3, 2026
67 changes: 46 additions & 21 deletions docs/source/user_guide/streams.md
@@ -1,20 +1,22 @@
# Streams

Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. By creating explicit streams, you can run independent kernels concurrently and control synchronization with events.
Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. With streams, you can run multiple top-level for loops in parallel.

## Supported platforms

| Backend | Streams | Events | Notes |
|---------|---------|--------|-------|
| CUDA | Yes | Yes | Full concurrent execution |
| AMDGPU | Yes | Yes | Full concurrent execution (requires ROCm >= 5.4) |
| CPU | No-op | No-op | `qd_stream` is silently ignored, kernels run serially |
| Metal | No-op | No-op | `qd_stream` is silently ignored, kernels run serially |
| Vulkan | No-op | No-op | `qd_stream` is silently ignored, kernels run serially |
| Backend | Supported |
|---------|-----------|
| CUDA | Yes |
| AMDGPU | Yes |
| CPU | No-op |
| Metal | No-op |
| Vulkan | No-op |

On backends without native stream support, `create_stream()` and `create_event()` return objects with handle `0`. All stream/event operations become no-ops and kernels run serially. Code written with streams is portable across all backends in the sense that it will run without modifications, but serially.
On backends without native stream support, stream operations are no-ops and for loops run serially. Code using streams is portable across all backends: it will run without modifications, but serially.

## Creating and using streams
## Stream parallelism

Inside a `@qd.kernel`, each `with qd.stream_parallel():` block runs on its own GPU stream.

```python
import quadrants as qd
@@ -24,17 +26,40 @@ qd.init(arch=qd.cuda)
N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

@qd.kernel
def fill_a():
    for i in range(N):
        a[i] = 1.0
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def fill_b():
def combine():
    for i in range(N):
        b[i] = 2.0
        c[i] = a[i] + b[i]

compute_ab() # the two stream_parallel blocks run concurrently
combine() # runs after compute_ab() returns — a[] and b[] are ready
```

Consecutive `with qd.stream_parallel():` blocks run concurrently. Multiple for loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.
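As an analogy only (plain Python threads, not the real GPU launcher), the execution model described above can be mimicked like this: each `stream_parallel` block becomes one serial worker task, consecutive blocks run concurrently, and the "kernel" joins all of them before returning.

```python
from concurrent.futures import ThreadPoolExecutor

# Analogy: each stream_parallel block is one serial task; consecutive
# blocks run concurrently; the "kernel" joins everything before returning.
def block_a(a):
    for i in range(len(a)):   # loops inside one block stay serial
        a[i] = 1.0

def block_b(b):
    for j in range(len(b)):
        b[j] = 2.0

def compute_ab(a, b):
    with ThreadPoolExecutor() as pool:       # one worker per block
        futures = [pool.submit(block_a, a), pool.submit(block_b, b)]
        for f in futures:
            f.result()                       # synchronize before "returning"

a, b = [0.0] * 4, [0.0] * 4
compute_ab(a, b)
print(a, b)  # [1.0, 1.0, 1.0, 1.0] [2.0, 2.0, 2.0, 2.0]
```

Once `compute_ab` returns, both arrays are fully written, mirroring the guarantee that all streams are synchronized before the kernel returns.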

### Restrictions

- All top-level statements in a kernel must be either all `stream_parallel` blocks or all regular statements. Mixing the two at the top level is a compile-time error.
- Nesting `stream_parallel` blocks is not supported.

## Explicit streams

For cases that require manual control — such as launching separate kernels on different streams or interoperating with PyTorch — you can create and manage streams directly.

### Creating and using streams

```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@@ -50,7 +75,7 @@

Pass `qd_stream=` to any kernel call to launch it on that stream. Kernels on different streams may execute concurrently. Call `synchronize()` to block until all work on a stream completes.

## Events
### Events

Events let you express dependencies between streams without full synchronization.

@@ -84,7 +109,7 @@ s2.destroy()

`e.record(stream)` captures the point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.

## Context managers
### Context managers

Streams and events support `with` blocks for automatic cleanup:

@@ -95,11 +120,11 @@ with qd.create_stream() as s:
# s.destroy() called automatically
```

## PyTorch interop (CUDA)
### PyTorch interop (CUDA)

When mixing Quadrants kernels with PyTorch operations on CUDA, both frameworks must use the same stream to avoid race conditions. Without explicit stream management, Quadrants and PyTorch may launch work on different streams with no ordering guarantees, leading to intermittent data corruption.

### Running Quadrants kernels on PyTorch's stream
#### Running Quadrants kernels on PyTorch's stream

```python
import torch
@@ -115,7 +140,7 @@ apply_actions_kernel(qd_stream=stream)

Wrap PyTorch's raw `CUstream` pointer in a Quadrants `Stream` object. Do **not** call `destroy()` on this wrapper — PyTorch owns the underlying stream.

### Running PyTorch operations on a Quadrants stream
#### Running PyTorch operations on a Quadrants stream

```python
qd_stream = qd.create_stream()
@@ -136,4 +161,4 @@
- **Not compatible with graphs.** Do not pass `qd_stream` to a kernel decorated with `graph=True`.
- **Not compatible with autodiff.** Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context.
- **`qd.sync()` only waits on the default stream.** It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- **No automatic synchronization.** You are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input.
- **No automatic synchronization with explicit streams.** When using explicit streams, you are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input. `stream_parallel` handles this automatically.
32 changes: 29 additions & 3 deletions python/quadrants/lang/ast/ast_transformer.py
@@ -119,7 +119,11 @@ def build_AnnAssign(ctx: ASTTransformerFuncContext, node: ast.AnnAssign):

@staticmethod
def build_assign_annotated(
ctx: ASTTransformerFuncContext, target: ast.Name, value, is_static_assign: bool, annotation: Type
ctx: ASTTransformerFuncContext,
target: ast.Name,
value,
is_static_assign: bool,
annotation: Type,
):
"""Build an annotated assignment like this: target: annotation = value.

@@ -165,7 +169,10 @@ def build_Assign(ctx: ASTTransformerFuncContext, node: ast.Assign) -> None:

@staticmethod
def build_assign_unpack(
ctx: ASTTransformerFuncContext, node_target: list | ast.Tuple, values, is_static_assign: bool
ctx: ASTTransformerFuncContext,
node_target: list | ast.Tuple,
values,
is_static_assign: bool,
):
"""Build the unpack assignments like this: (target1, target2) = (value1, value2).
The function should be called only if the node target is a tuple.
@@ -591,7 +598,8 @@ def build_Return(ctx: ASTTransformerFuncContext, node: ast.Return) -> None:
else:
raise QuadrantsSyntaxError("The return type is not supported now!")
ctx.ast_builder.create_kernel_exprgroup_return(
expr.make_expr_group(return_exprs), _qd_core.DebugInfo(ctx.get_pos_info(node))
expr.make_expr_group(return_exprs),
_qd_core.DebugInfo(ctx.get_pos_info(node)),
)
else:
ctx.return_data = node.value.ptr
@@ -1520,6 +1528,24 @@ def build_Continue(ctx: ASTTransformerFuncContext, node: ast.Continue) -> None:
ctx.ast_builder.insert_continue_stmt(_qd_core.DebugInfo(ctx.get_pos_info(node)))
return None

@staticmethod
def build_With(ctx: ASTTransformerFuncContext, node: ast.With) -> None:
    if len(node.items) != 1:
        raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports a single context manager")
    item = node.items[0]
Comment on lines +1533 to +1535
Contributor
Same. Not clear what items is.

    if item.optional_vars is not None:
        raise QuadrantsSyntaxError("'with ... as ...' is not supported in Quadrants kernels")
    if not isinstance(item.context_expr, ast.Call):
        raise QuadrantsSyntaxError("'with' in Quadrants kernels requires a call expression")
    if not FunctionDefTransformer._is_stream_parallel_with(node, ctx.global_vars):
        raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports qd.stream_parallel()")
Comment on lines +1538 to +1541
🟡 qd.stream_parallel() context manager accepts but silently drops user-supplied args/kwargs. with qd.stream_parallel(123, num_streams=4): compiles and runs the same as the no-arg form because build_With (ast_transformer.py:1540-1543) checks the call resolves to stream_parallel but never inspects item.context_expr.args / .keywords, and the runtime contextmanager is never invoked (so the TypeError its zero-parameter signature would raise never fires). Fix: after the resolve_to check, raise QuadrantsSyntaxError if item.context_expr.args or item.context_expr.keywords is non-empty — a one-line addition mirroring the other validation checks already in build_With.

Extended reasoning...

What the bug is

build_With at python/quadrants/lang/ast/ast_transformer.py:1533-1547 performs five structural checks on a with statement: single context manager, no as binding, the context expression must be an ast.Call, the call must resolve to qd.stream_parallel, and the enclosing function must be a kernel. After those pass it dispatches begin_stream_parallel() / end_stream_parallel() on the C++ ASTBuilder. It never inspects item.context_expr.args or item.context_expr.keywords.

Meanwhile stream_parallel in python/quadrants/lang/stream.py:131-138 is declared as a zero-parameter @contextmanager:

@contextmanager
def stream_parallel():
    yield

So the only legal call shape is qd.stream_parallel(). Anywhere else in Python, qd.stream_parallel(123, num_streams=4) would raise TypeError: stream_parallel() takes 0 positional arguments but 1 was given at runtime — but inside a kernel the AST transformer intercepts the call at compile time, replaces it with begin_stream_parallel() / end_stream_parallel() IR calls, and the contextmanager body never executes. The arguments are simply never evaluated.

Step-by-step proof

@qd.kernel
def k():
    with qd.stream_parallel(123, num_streams=4):   # silently accepted
        for i in range(N):
            a[i] = 1.0
    with qd.stream_parallel():                     # also accepted
        for j in range(N):
            b[j] = 2.0
  1. build_FunctionDef runs _validate_stream_parallel_exclusivity. _is_stream_parallel_with returns True for both with statements (it only checks single context manager, that context_expr is an ast.Call, and that the func resolves to stream_parallel — args/keywords are not inspected). Validation passes.
  2. build_stmts walks each ast.With. build_With checks len(node.items) == 1 ✓, optional_vars is None ✓, isinstance(item.context_expr, ast.Call) ✓, ASTResolver.resolve_to(item.context_expr.func, stream_parallel, ...) ✓, ctx.is_kernel ✓. It then calls ctx.ast_builder.begin_stream_parallel(), recurses into the body, and calls ctx.ast_builder.end_stream_parallel().
  3. item.context_expr.args == [ast.Constant(123)] and item.context_expr.keywords == [ast.keyword(arg='num_streams', value=ast.Constant(4))] are present in the AST but never read.
  4. The kernel compiles cleanly and runs identically to the no-arg form. The user gets stream-parallel execution as if they had written qd.stream_parallel(); num_streams=4 did nothing.

Why nothing else catches it

The contextmanager body never executes at Python runtime (the AST transformer rewrites the call before Func.__call__ would otherwise dispatch into Python land), so the natural TypeError Python would raise for too-many-arguments never fires. _validate_stream_parallel_exclusivity shares the _is_stream_parallel_with helper with build_With and inherits the same blind spot. There is no test in tests/python/test_streams.py that exercises a misshapen stream_parallel(...) call.

Impact

User-facing UX papercut. A user who reasonably guesses at an API like num_streams=N, priority=..., or a positional stream-count argument gets no compile-time error and no runtime error — their argument is silently a no-op. They may well believe their argument is in effect and chase a phantom bug elsewhere. No silent miscompilation, no incorrect runtime behavior, no perf regression — the kernel still does what stream_parallel() with no args would do.

Fix

One-line addition in build_With after the resolve_to check, consistent with the surrounding validation style:

if item.context_expr.args or item.context_expr.keywords:
    raise QuadrantsSyntaxError("qd.stream_parallel() takes no arguments")

Three independent verifiers reviewed this and confirmed it is correct, real, and a real (if minor) UX gap; all three suggested nit severity given there is no correctness or performance impact.
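The proposed one-line check can be exercised standalone with the stdlib `ast` module. This is a hypothetical sketch (the helper name and the plain `SyntaxError` are illustrative; the real fix would live inside `build_With` and raise `QuadrantsSyntaxError`):

```python
import ast

def reject_stream_parallel_args(source: str) -> None:
    """Sketch of the proposed fix: raise if a `with ...stream_parallel(...)`
    context-manager call carries any positional or keyword arguments."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.With):
            continue
        call = node.items[0].context_expr
        if (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and call.func.attr == "stream_parallel"
                and (call.args or call.keywords)):
            raise SyntaxError("qd.stream_parallel() takes no arguments")

# No-arg form passes; the argful form is rejected at "compile" time.
reject_stream_parallel_args("with qd.stream_parallel():\n    pass")
try:
    reject_stream_parallel_args("with qd.stream_parallel(123, num_streams=4):\n    pass")
except SyntaxError as e:
    print(e)  # qd.stream_parallel() takes no arguments
```

The check inspects only the AST, mirroring how `build_With` already validates call shape before the runtime context manager would ever run.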

Still applies — build_With (ast_transformer.py:1533-1549) still does not inspect item.context_expr.args or item.context_expr.keywords, so with qd.stream_parallel(123, num_streams=4): continues to compile and run identically to the no-arg form. One-line fix after the resolve_to check, mirroring the surrounding validation style: if item.context_expr.args or item.context_expr.keywords: raise QuadrantsSyntaxError("qd.stream_parallel() takes no arguments").

Still applies — build_With (ast_transformer.py:1531-1547) still does not inspect item.context_expr.args or item.context_expr.keywords. with qd.stream_parallel(123, num_streams=4): continues to compile and run identically to the no-arg form. One-line fix after the _is_stream_parallel_with check, mirroring the surrounding validation style: if item.context_expr.args or item.context_expr.keywords: raise QuadrantsSyntaxError("qd.stream_parallel() takes no arguments").

    if not ctx.is_kernel:
        raise QuadrantsSyntaxError("qd.stream_parallel() can only be used inside @qd.kernel, not @qd.func")
    ctx.ast_builder.begin_stream_parallel()
    build_stmts(ctx, node.body)
Comment on lines +1531 to +1545
🔴 with qd.stream_parallel(): placed inside an if/while/for body inside a kernel compiles cleanly and silently runs serially on the default stream — _validate_stream_parallel_exclusivity only walks top-level node.body and never recurses into nested container statements, while build_With accepts the with-block at any nesting level. The for-loops inside become part of an enclosing IfStmt/WhileStmt/RangeForStmt that the offloader bundles into a serial OffloadedStmt with stream_parallel_group_id=0, so the user gets no concurrency and no error. Fix: either have build_With reject non-root nesting, or extend _validate_stream_parallel_exclusivity to recursively reject any nested stream_parallel it finds — both are one-helper changes consistent with the validators stated 'all top-level statements' contract.

Extended reasoning...

What the bug is

FunctionDefTransformer._validate_stream_parallel_exclusivity (function_def_transformer.py:467-477) iterates only node.body (the top-level kernel statements). _is_stream_parallel_with matches isinstance(stmt, ast.With) only — ast.If / ast.For / ast.While / ast.With (other than the stream_parallel one itself) all return False and are not descended into. Meanwhile build_With (ast_transformer.py:1533-1547) only checks (a) single context manager, (b) no as, (c) call expression, (d) resolves to stream_parallel, (e) ctx.is_kernel. There is no check that the with-block sits at the kernel root.

Result: a with qd.stream_parallel(): placed inside any non-root container body is silently accepted, stamps non-zero stream_parallel_group_id on its inner for-loops, and then loses all concurrency at offload time.

Step-by-step proof

@qd.kernel
def k(cond: qd.i32):
    if cond:
        with qd.stream_parallel():
            for i in range(N):
                a[i] = 1.0
            for j in range(N):
                b[j] = 2.0
  1. build_FunctionDef runs _validate_stream_parallel_exclusivity(node.body, ...). node.body = [ast.If]. _is_stream_parallel_with(ast.If) returns False. has_sp = False → validation returns OK.
  2. build_If opens begin_frontend_if, then build_stmts walks the if-body.
  3. The if-body contains an ast.With. build_With runs unconditionally:
    • begin_stream_parallel() → current_stream_parallel_group_id_ = 1
    • build_stmts builds the two for-loops; begin_frontend_range_for (frontend_ir.cpp:1395) writes for_loop_dec_.config.stream_parallel_group_id = 1 onto the FrontendForStmt.
    • end_stream_parallel() resets to 0.
  4. Lowered IR: kernel root contains a single IfStmt; inside its true_statements are two RangeForStmts carrying stream_parallel_group_id=1 (lower_ast.cpp:294 propagates).
  5. Offloader::run (offload.cpp:90-158) walks root-block statements. The IfStmt fails the RangeForStmt / StructForStmt / MeshForStmt casts at lines 93/162/186, falls into the else at line 156, and the entire IfStmt (with its embedded RangeForStmts) is moved into pending_serial_statements — a serial OffloadedStmt with stream_parallel_group_id=0 (default-init at statements.h:1357).
  6. Codegen at codegen_cuda.cpp:641 / codegen_amdgpu.cpp:354 copies stmt->stream_parallel_group_id (= 0) onto current_task->stream_parallel_group_id.
  7. Launcher at runtime/cuda/kernel_launcher.cpp:60 sees task.stream_parallel_group_id == 0 and takes the default-stream branch.

The two for-loops execute serially on the default stream — exactly the behavior stream_parallel was used to avoid. No error, no warning.

Same trigger on other containers

Identical silent-serialization manifests for any non-root container:

  • while cond: with qd.stream_parallel(): for i in range(N): ... — the WhileStmt is the top-level statement and gets bundled serial.
  • for i in range(N): with qd.stream_parallel(): for j in range(M): ... — outer RangeForStmt is at root; the inner stream_parallel for-loops are part of the outer's body and never become their own offloaded tasks.

Why nothing else catches it

  • _validate_stream_parallel_exclusivity only walks node.body (top-level). It never recurses into ast.If/ast.For/ast.While/ast.With bodies.
  • build_With checks the call site is stream_parallel and ctx.is_kernel, but not the structural location.
  • begin_stream_parallel only rejects the nested-stream_parallel-within-stream_parallel case (frontend_ir.h:1027-1029).

The validator's stated contract ('all top-level statements must be stream_parallel blocks if any are') was designed to catch this class of misuse, but its implementation only enforces the rule at depth 0.

Severity rationale

Marking normal. The validator was specifically introduced in this PR to give clear compile-time errors for misuse patterns that would otherwise be silently confusing. A new user who reasonably writes if use_parallel: with qd.stream_parallel(): ... or for outer in range(M): with qd.stream_parallel(): ... (a natural attempt at conditional parallelism) gets no error and no parallelism — the exact UX failure mode the validator was supposed to prevent. One could note that for-loops nested in non-static control flow would never be parallel offloaded tasks anyway, so the user's underlying intent is doomed regardless; but the validator's job is to catch the misuse with a clear error, and it doesn't.

Fix

Two reasonable options, both one helper function:

(a) Have build_With raise QuadrantsSyntaxError when called from anything other than the kernel-root build_stmts invocation. Could be tracked with a depth counter on the AST context, or by passing a 'top-level' flag from FunctionDefTransformer.build_FunctionDef.

(b) Strengthen _validate_stream_parallel_exclusivity to recursively walk container statements (ast.If / ast.For / ast.While / ast.With bodies) and reject any non-root stream_parallel it finds.

Either brings the runtime behavior into line with the validator's documented contract.
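Option (b) can be prototyped outside the transformer with a recursive stdlib `ast` walk. This is a hypothetical sketch (helper names and the plain `SyntaxError` are illustrative; the real validator would reuse `_is_stream_parallel_with` and raise `QuadrantsSyntaxError`):

```python
import ast

def _is_stream_parallel_with(stmt: ast.stmt) -> bool:
    # Simplified matcher: `with <obj>.stream_parallel():` with one item.
    if not (isinstance(stmt, ast.With) and len(stmt.items) == 1):
        return False
    call = stmt.items[0].context_expr
    return (isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)
            and call.func.attr == "stream_parallel")

def validate_stream_parallel_placement(kernel_body: list) -> None:
    """Reject any stream_parallel block found below the kernel root."""
    def walk(stmts, top_level):
        for stmt in stmts:
            if _is_stream_parallel_with(stmt):
                if not top_level:
                    raise SyntaxError(
                        "qd.stream_parallel() must be a top-level kernel statement")
                walk(stmt.body, False)  # nested stream_parallel is also rejected
            else:
                for field in ("body", "orelse", "finalbody"):
                    walk(getattr(stmt, field, []), False)
    walk(kernel_body, True)

src = """
def k(cond):
    if cond:
        with qd.stream_parallel():
            for i in range(8):
                pass
"""
try:
    validate_stream_parallel_placement(ast.parse(src).body[0].body)
except SyntaxError as e:
    print(e)  # qd.stream_parallel() must be a top-level kernel statement
```

Because the walk recurses through `if`/`for`/`while`/`try` bodies, the `if cond:` wrapper from the proof above now produces a compile-time error instead of silent serialization.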

    ctx.ast_builder.end_stream_parallel()
Comment on lines +1544 to +1546
🔴 Non-for statements (e.g. a[0] = 1.0, qd.deactivate(snode, [k]), counter[None] = 0) placed directly inside with qd.stream_parallel(): silently race with sibling for-loops. The non-for statement is bundled into a serial OffloadedStmt with default stream_parallel_group_id=0 and dispatched on active_stream, while the sibling for-loop runs on a fresh CU_STREAM_NON_BLOCKING per-group stream that has no event handoff back to the active stream — so the for-loop can begin reading what the assignment is supposed to write before the assignment retires. Fix: in build_With, walk node.body and reject anything that is not a for-loop (mirroring the existing kernel-body validator), or stamp the active group_id onto every emitted root_block statement while inside a stream_parallel block.

Extended reasoning...

Bug

build_With (python/quadrants/lang/ast/ast_transformer.py:1546-1548) does not introspect node.body and does not open a new IR scope — begin_stream_parallel only flips a counter on the ASTBuilder. So statements inside the with body are inserted directly into the kernel root_block. The frontend group-id stamping the PR added lives only in the four begin_frontend_*_for methods (frontend_ir.cpp:1395/1409/1423/1439), which means non-for statements (FrontendAssignStmt, FrontendSNodeOpStmt, etc.) never carry a stream_parallel_group_id — that field does not exist on those statement types.

In Offloader::run (quadrants/transforms/offload.cpp:90-158), only RangeForStmt / StructForStmt / MeshForStmt become standalone OffloadedStmts; everything else falls into the else at line 155-157 and is moved into pending_serial_statements — a serial OffloadedStmt constructed with default-initialized stream_parallel_group_id=0 (statements.h:1370). Sibling RangeForStmt at line 129 propagates group_id correctly. Codegen at codegen_cuda.cpp:641 / codegen_amdgpu.cpp:354 copies that 0 onto the OffloadedTask, and the launcher at runtime/cuda/kernel_launcher.cpp:55-95 (AMDGPU twin) takes the default-stream branch for group_id=0 and the per-group stream branch for the for-loop.

The two streams are independent: the per-group stream is created with CU_STREAM_NON_BLOCKING (line 80), which does NOT implicit-sync with the legacy NULL stream nor with arbitrary user-created streams. The launcher records no event from active_stream and inserts no stream_wait_event on the new stream — so the for-loop on s_K can begin executing before the serial assignment on active_stream finishes.

## Distinct from existing PR-timeline bugs

- Bug #11 (for-with-break): trigger requires a break inside the for-loop, which causes lower_ast to emit AllocaStmt+WhileStmt at root_block. No assignment-style trigger.
- Bug #6 (strictly_serialized): trigger is a top-level RangeForStmt with strictly_serialized=true that fails the !s->strictly_serialized predicate at offload.cpp:93. The cast succeeds; the predicate fails. Different upstream path.
- Bug #13 (non-static if/while wrapping for-loop): trigger is an IfStmt/WhileStmt at root_block whose body contains a for-loop. The cast at line 93 fails because of TYPE (IfStmt/WhileStmt), and the BUNDLE drops the inner for's group_id. Bug 13's proposed fix is to recursively scan the bundle for an inner for-loop and propagate that for-loop's group_id onto the bundle. That fix does not help here: the offending statement is itself the bundle entry (Assignment, SNodeOp, etc.) — there IS NO inner for-loop to read group_id from.
- Bug #16 (qd.deactivate gc tasks): trigger is qd.deactivate INSIDE a for-loop, producing gc auxiliary tasks via insert_gc. This bug is qd.deactivate (or any non-for statement) at the with-body level, NOT inside a for-loop.

The shared root cause across these bugs is that pending_serial_statements always defaults to stream_parallel_group_id=0, but the user-reachable trigger here (plain non-for statement directly in the with-body) is not covered by any of those bugs' fix proposals.

## Step-by-step proof

```python
@qd.kernel
def k():
    with qd.stream_parallel():
        a[0] = 1.0  # FrontendAssignmentStmt at root_block, NO group_id
        for i in range(N):
            b[i] = a[0] * 2  # range_for, group_id=1, reads a[0]
```

1. _validate_stream_parallel_exclusivity (function_def_transformer.py:472) walks node.body == [ast.With] — single with qd.stream_parallel():, all top-level entries match. Validation passes.
2. build_With (ast_transformer.py:1533-1548) calls begin_stream_parallel() (counter→1), then build_stmts(ctx, node.body) which walks [ast.Assign, ast.For] at the SAME scope as kernel root. build_Assign emits a FrontendAssignmentStmt directly into root_block — no group_id stamping. build_For reaches begin_frontend_range_for which DOES stamp stream_parallel_group_id=1 onto the FrontendForStmt.
3. After lowering: root_block = [FrontendAssignment, RangeForStmt(group_id=1)]. The FrontendAssignment has no stream_parallel_group_id field at all.
4. Offloader::run iterates root_block. The FrontendAssignment fails every for-loop cast → falls into the else at offload.cpp:155-157 → moved into pending_serial_statements. RangeForStmt hits offload.cpp:93; assemble_serial_statements flushes the serial OffloadedStmt (group_id=0) into root_block, then constructs a fresh range_for OffloadedStmt with group_id=1.
5. Final OffloadedTask list: [serial(group=0, [Assignment]), range_for(group=1, [for-body])].
6. Launcher walk:
   - i=0, group=0 → default-stream branch, launches serial on active_stream (async).
   - i=1, group=1 → enters else branch. Creates s_1 with CU_STREAM_NON_BLOCKING (line 80). Sets stream to s_1, launches range_for on s_1, syncs, destroys.
7. Race: s_1 has no implicit dependency on active_stream (NON_BLOCKING semantics), and the launcher inserts no event handoff. The range_for on s_1 can begin reading a[0] before the serial task on active_stream finishes writing 1.0.

## Reachable user patterns

- a[0] = some_value before a for-loop (initialize accumulator, then iterate)
- counter[None] = 0 (global atomic-store followed by parallel reduction)
- qd.deactivate(snode, [k]) (plain snode op without a wrapping if/while)
- qd.append(snode, [...], val)
- A function call like qd.sync() or any custom @qd.func that performs a global store

In each case the user is inside with qd.stream_parallel(): and reasonably expects safe sequencing. The validator's contract ('all top-level statements... must be stream_parallel blocks') is enforced at the kernel-body level but not at the with-body level. tests/python/test_streams.py only exercises with-bodies that contain only for-loops, so this regression is untested.

## Fix

**(a) Conservative, one-liner**: in build_With, after the stream_parallel call resolves, walk node.body and reject anything that is not an ast.For. Symmetric with the kernel-body validator already introduced in this PR; the docs already imply only for-loops are intended inside with qd.stream_parallel():.

**(b) Structural**: when emitting any root_block statement while current_stream_parallel_group_id_ != 0, propagate the group_id onto a new field on the statement (or split bundles in the offloader by surrounding group_id). More invasive, but preserves the legality of mixing assignments and for-loops in a single with block.
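Fix (a) can likewise be sketched standalone with the stdlib `ast` module. This is a hypothetical sketch (helper name and plain `SyntaxError` are illustrative; in the transformer this check would run in build_With over node.body and raise QuadrantsSyntaxError):

```python
import ast

def validate_stream_parallel_body(with_node: ast.With) -> None:
    """Reject anything other than for-loops (after an optional docstring)
    directly inside a stream_parallel block."""
    body = list(with_node.body)
    # The PR allows docstrings in stream_parallel kernels, so tolerate a
    # leading string constant here as well (an assumption of this sketch).
    if (body and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)):
        body = body[1:]
    for stmt in body:
        if not isinstance(stmt, ast.For):
            raise SyntaxError(
                "only for-loops are allowed inside qd.stream_parallel(), "
                f"found {type(stmt).__name__}")

src = (
    "with qd.stream_parallel():\n"
    "    a[0] = 1.0\n"            # plain assignment: would race
    "    for i in range(8):\n"
    "        b[i] = a[0] * 2\n"
)
with_node = ast.parse(src).body[0]
try:
    validate_stream_parallel_body(with_node)
except SyntaxError as e:
    print(e)  # only for-loops are allowed inside qd.stream_parallel(), found Assign
```

This turns the racy assignment from the proof above into a compile-time error rather than a silent default-stream launch.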

    return None

@staticmethod
def build_Pass(ctx: ASTTransformerFuncContext, node: ast.Pass) -> None:
    return None