[Perf] Streams 1-4 #410
Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.
Mirrors the CUDA stream implementation for HIP: adds a stream_ member to AMDGPUContext, adds stream_destroy/stream_wait_event/malloc_async/mem_free_async to the HIP driver functions, and adds AMDGPU branches in all Program stream/event methods. Converts the AMDGPU kernel launcher to use async memory operations through the active stream. The CPU backend returns 0 handles (no-op).
Introduces stream_parallel() for running top-level for-loop blocks on separate GPU streams. The AST transformer maps 'with qd.stream_parallel()' blocks to stream-parallel group IDs, which propagate through IR lowering and offloading to the CUDA/AMDGPU kernel launchers. Each unique group ID gets its own stream at launch time. Includes validation that all top-level kernel statements must be stream_parallel blocks (no mixing), and offline cache key support.
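The group-ID mapping and the no-mixing validation described above can be sketched in pure Python with the `ast` module. This is an illustrative sketch, not the actual transformer: `is_stream_parallel_with` and `assign_group_ids` are hypothetical helper names, and the real detection is symbol-resolved rather than purely name-based.

```python
import ast

def is_stream_parallel_with(stmt):
    # Heuristic, name-based detection of `with qd.stream_parallel():`
    # (the real transformer resolves the symbol; see the review notes below).
    if not isinstance(stmt, ast.With) or len(stmt.items) != 1:
        return False
    call = stmt.items[0].context_expr
    if not isinstance(call, ast.Call):
        return False
    func = call.func
    if isinstance(func, ast.Attribute) and func.attr == "stream_parallel":
        return True
    if isinstance(func, ast.Name) and func.id == "stream_parallel":
        return True
    return False

def assign_group_ids(kernel_source):
    # Returns {lineno: group_id} for top-level stream_parallel blocks and
    # enforces the validation described above: top-level statements are
    # either all stream_parallel blocks or none (no mixing).
    body = ast.parse(kernel_source).body[0].body  # statements of the kernel def
    flags = [is_stream_parallel_with(s) for s in body]
    if any(flags) and not all(flags):
        raise SyntaxError(
            "cannot mix stream_parallel blocks with other top-level statements")
    # Group IDs start at 1; 0 is reserved for the default stream.
    return {s.lineno: i + 1 for i, s in enumerate(body) if flags[i]}

src = """
def k():
    with qd.stream_parallel():
        pass
    with qd.stream_parallel():
        pass
"""
print(sorted(assign_group_ids(src).values()))  # [1, 2]
```

Each distinct group ID then maps to its own stream at launch time, so the two blocks above would run on two streams.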
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
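The weakref and context-manager items above can be illustrated with a minimal sketch. The class bodies here are simplified stand-ins, not the actual Stream/Program bindings:

```python
import weakref

class Program:
    # Stand-in for the runtime Program that owns stream handles.
    def __init__(self):
        self.destroyed = []

    def destroy_stream(self, handle):
        self.destroyed.append(handle)

class Stream:
    def __init__(self, prog, handle):
        # Hold the owning Program weakly so __del__ never touches a
        # half-torn-down object during interpreter shutdown.
        self._prog_ref = weakref.ref(prog)
        self.handle = handle

    def __del__(self):
        prog = self._prog_ref()  # None if the Program is already gone
        if prog is not None and self.handle != 0:
            prog.destroy_stream(self.handle)

    # Context-manager support, as added in this commit.
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

prog = Program()
with Stream(prog, handle=42) as s:
    pass
del s
print(prog.destroyed)  # [42]
```

If the Program dies first, a surviving Stream's `__del__` dereferences the weakref, gets `None`, and becomes a safe no-op instead of raising during shutdown.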
…quadrantsic-2-amdgpu-cpu
Batch the device_result_buffer free into the stream pipeline before the sync barrier, matching the CUDA kernel launcher's ordering for consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they are properly ordered on the active stream, matching the CUDA launcher.
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in CI under contention, and print actual timings for diagnostics.
…ead_local Mirror the CUDA fixes: guard stream_synchronize against handle==0 to avoid unintentional default stream sync, and make AMDGPUContext::stream_ thread_local for thread-safety.
…adrantsic-3-stream-parallel
# Conflicts:
#   python/quadrants/lang/stream.py
Prevents stale group IDs from leaking if insert_for is called after a path that set a non-zero stream_parallel_group_id, matching the reset pattern of all other ForLoopConfig fields.
Add an error check in begin_stream_parallel() to prevent nesting, which would produce undefined group ID semantics.
…context safety Add comments explaining that streams are created/destroyed per launch (stream pooling as future optimization), and that RuntimeContext sharing across concurrent streams is safe because kernels only read from it.
This reverts commit 60d015b.
…adrantsic-3-stream-parallel
…adrantsic-3-stream-parallel
Replace per-launch stream_create/stream_destroy with acquire_stream/release_stream on CUDAContext and AMDGPUContext. Streams are cached in a pool and reused across invocations, avoiding the driver-level overhead of stream creation (~5-50us) on every kernel launch in hot loops.
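A minimal sketch of the acquire/release pooling follows. The `StreamPool` class and integer handles are illustrative; the real pool lives on CUDAContext/AMDGPUContext in C++ and hands out driver stream handles:

```python
import itertools

class StreamPool:
    # Sketch of acquire/release pooling, assuming a driver-level
    # create_stream call costs far more than a list push/pop.
    def __init__(self, create_stream):
        self._create = create_stream
        self._free = []       # streams ready for reuse
        self._created = 0

    def acquire(self):
        if self._free:
            return self._free.pop()   # reuse: no driver call
        self._created += 1
        return self._create()

    def release(self, stream):
        self._free.append(stream)

    def destroy_all(self, destroy_stream):
        # Must run at context shutdown so pooled streams don't leak
        # (the reviewer's "concern 2" discussed above).
        while self._free:
            destroy_stream(self._free.pop())

counter = itertools.count(1)
pool = StreamPool(create_stream=lambda: next(counter))

for _ in range(5):                    # e.g. 5 launches of a 2-group kernel
    a, b = pool.acquire(), pool.acquire()
    pool.release(a); pool.release(b)

print(pool._created)  # 2 — created once, reused on every later launch
```

The steady-state pool size equals the maximum number of distinct stream_parallel groups seen in any one launch, which is why repeated invocations create no new streams.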
…ms-quadrantsic-4-stream-pool
Calls a stream_parallel kernel 5 times in a loop to verify that pooled streams are correctly reused with correct results each iteration.
Review from Opus 4.6: What it does: Introduces a stream pool in both CUDAContext and AMDGPUContext so that GPU streams are reused across kernel launches rather than created and destroyed every time.
Concerns / suggestions:
Overall, it's a solid, focused change. The main actionable item is making sure pooled streams get destroyed when the context shuts down.
Response from the original Opus that wrote this PR:
I think I might get some opinions from a fresh Opus.
The reviewer Opus said the following when I asked "what are your thoughts on concern 1?": Honestly, on reflection it's probably overstated. The number of streams acquired per launch is determined by the number of distinct stream_parallel groups in a kernel — that's
So both Opuses concur that the only concern is the destructor, concern 2.
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 74% | 180,353,363,411,505,517 |
| 🟢 tests/python/test_streams.py | 100% | |

Diff coverage: 86% · Overall: 74% · 42 lines, 6 missing
Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
…quadrantsic-2-amdgpu-cpu
…ucer ctx param Base branch added &ctx parameter to ensure_per_task_float_heap_post_reducer, moved cap_blocks outside allocas scope, and added safety comments in both CUDA and AMDGPU kernel launchers. Integrated into prepare_task lambda. Co-authored-by: Cursor <cursoragent@cursor.com>
…ms-quadrantsic-4-stream-pool
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fee0868fc
```cpp
for (auto &[sid, s] : stream_by_id) {
  CUDADriver::get_instance().stream_synchronize(s);
}
for (auto &[sid, s] : stream_by_id) {
  CUDAContext::get_instance().release_stream(s);
```
Restore and release stream pool resources on launch failure
The stream-parallel path does not guard cleanup with try/finally, so if cuda_module->launch(...) or stream_synchronize(...) throws (e.g., invalid launch config or device assert), execution exits before restoring active_stream and before returning acquired streams to the pool. After that first failure, later kernels in the same thread can run on a stale stream and leaked pooled streams accumulate, which makes subsequent launches behave unpredictably.
```cpp
for (auto &[sid, s] : stream_by_id) {
  AMDGPUDriver::get_instance().stream_synchronize(s);
}
for (auto &[sid, s] : stream_by_id) {
  AMDGPUContext::get_instance().release_stream(s);
```
Guard AMDGPU stream-parallel cleanup with exception safety
This AMDGPU stream-parallel section has the same exception-safety gap as CUDA: if a launch or stream_synchronize fails, control flow skips releasing stream_by_id entries and skips restoring active_stream. That leaves thread-local stream state corrupted for future kernels and leaks pooled HIP streams after any runtime launch error.
```python
if prog is None:
    try:
        return impl.get_runtime().prog
    except Exception:
```
Avoid destroying stale handles through a new runtime Program
When the owning Program has been collected (e.g., after qd.reset()), _destroy_prog() falls back to impl.get_runtime().prog and destroys the old handle through whichever backend is currently active. If the arch changed, this can either leak the original GPU stream/event (CPU fallback no-op) or call destroy on an incompatible backend handle, which can raise driver errors during cleanup.
Wrap the launch+synchronize section in try/catch so that acquired streams are returned to the pool and active_stream is restored even when a launch or stream_synchronize throws. Co-authored-by: Cursor <cursoragent@cursor.com>
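The shape of that fix can be sketched in Python, with `finally` playing the role of the C++ catch-and-release-then-rethrow cleanup. All names here (`Pool`, `launch_stream_parallel`, the `ctx` dict) are hypothetical stand-ins for the C++ launcher state:

```python
class Pool:
    # Toy stand-in for the CUDAContext/AMDGPUContext stream pool.
    def __init__(self):
        self.free = [10, 11, 12]   # pre-created fake stream handles
        self.out = set()           # handles currently acquired

    def acquire(self):
        s = self.free.pop()
        self.out.add(s)
        return s

    def release(self, s):
        self.out.discard(s)
        self.free.append(s)

def launch_stream_parallel(tasks, pool, ctx):
    # Exception-safe stream-parallel launch sequence.
    stream_by_id = {}
    saved_active = ctx["active_stream"]
    try:
        for task in tasks:
            gid = task["group_id"]
            if gid not in stream_by_id:
                stream_by_id[gid] = pool.acquire()
            ctx["active_stream"] = stream_by_id[gid]
            task["launch"]()   # may throw (invalid config, device assert, ...)
        # ... stream_synchronize(s) for each stream would go here ...
    finally:
        # Runs on success *and* failure: streams go back to the pool and
        # the thread-local active_stream is restored, so later kernels
        # never see a stale stream or a drained pool.
        for s in stream_by_id.values():
            pool.release(s)
        ctx["active_stream"] = saved_active
```

Without the `finally`, the failure path identified by the review would leave `active_stream` pointing at a released stream and strand the acquired handles outside the pool.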
@codex review
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 74% | 180,353,363,411,505,517 |
| 🟢 tests/python/test_streams.py | 100% | |

Diff coverage: 86% · Overall: 67% · 42 lines, 6 missing
💡 Codex Review
Reviewed commit: ac4b825074
```cpp
offloaded_struct_for->is_bit_vectorized = for_stmt->is_bit_vectorized;
offloaded_struct_for->num_cpu_threads = std::min(for_stmt->num_cpu_threads, config.cpu_max_num_threads);
offloaded_struct_for->mem_access_opt = mem_access_opt;
offloaded_struct_for->stream_parallel_group_id = for_stmt->stream_parallel_group_id;
```
Propagate stream group to struct_for prelude offloads
This change only tags the final struct_for offload with stream_parallel_group_id, but the generated prelude tasks (clear_list/listgen in the same function) remain at group 0. In the new CUDA/AMDGPU launchers, group-0 tasks run on the active/default stream while nonzero groups run on pooled non-blocking streams, so a sparse loop inside with qd.stream_parallel(): can enqueue struct_for before its list generation has completed, producing races or empty/stale iteration lists. The prelude offloads need the same group id as their struct_for.
```cpp
CUDAContext::get_instance().make_current();
CUDADriver::get_instance().context_synchronize();
```
Keep qd.sync scoped to stream semantics
Switching runtime synchronization to context_synchronize()/device_synchronize() makes qd.sync() block all streams, not just default-stream work. With explicit streams, this introduces a global barrier that stalls unrelated user-stream kernels and removes expected overlap, which is a behavior/performance regression for stream-based workflows. Synchronization here should stay stream-scoped (or the public semantics and all callers must be adjusted consistently).
The prepare_task lambda extraction dropped several non-obvious comments explaining adstack gate roles, lazy-claim buffer rationale, device-side reducer mechanics, shape-entry unit-stability, and grid-dim capping rationale. Restore them. Co-authored-by: Cursor <cursoragent@cursor.com>
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🔴 python/quadrants/lang/ast/ast_transformer.py | 59% | 1531-1532,1534,1537,1539,1541,1543 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 52% | 29,35,466-467,471,474,478-480,483-487,489-490,493-494,504-505,510,516-518 |
| 🔴 python/quadrants/lang/ast/symbol_resolver.py | 0% | 59-60,66-67,69-70,72-79,81-84,86-89 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 45% | 1-2,4,7,11,18,22-23,26,30,32,36,39,47-50,53,64,66-72,74,77,81,88,92-93,96,100,102,105,108,113,118,122,130-133,136,147,149-155,157,160,164,171,178-179,185,188 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 314,330,345-346,489-493 |

Diff coverage: 80% · Overall: 67% · 624 lines, 126 missing
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🔴 python/quadrants/lang/ast/ast_transformer.py | 59% | 1531-1532,1534,1537,1539,1541,1543 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 52% | 29,35,466-467,471,474,478-480,483-487,489-490,493-494,504-505,510,516-518 |
| 🔴 python/quadrants/lang/ast/symbol_resolver.py | 0% | 59-60,66-67,69-70,72-79,81-84,86-89 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 45% | 1-2,4,7,11,18,22-23,26,30,32,36,39,47-50,53,64,66-72,74,77,81,88,92-93,96,100,102,105,108,113,118,122,130-133,136,147,149-155,157,160,164,171,178-179,185,188 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 314,330,345-346,489-493 |

Diff coverage: 80% · Overall: 74% · 624 lines, 126 missing
…c-4-stream-pool
# Conflicts:
#   quadrants/runtime/amdgpu/kernel_launcher.cpp
#   quadrants/runtime/cuda/kernel_launcher.cpp
…pool' into hp/streams-quadrantsic-4-stream-pool
@codex review
💡 Codex Review
Reviewed commit: 8e329a5b16
```python
if isinstance(func_node, ast.Attribute) and func_node.attr == "stream_parallel":
    return True
if isinstance(func_node, ast.Name) and func_node.id == "stream_parallel":
    return True
```
Restrict stream_parallel detection to Quadrants symbols
The name-based fallback here accepts any with ...stream_parallel(): call even when it resolves to a non-Quadrants symbol, because it returns True purely on attr == "stream_parallel" / id == "stream_parallel". This can silently reinterpret user-defined context managers as Quadrants stream-parallel blocks instead of raising a syntax error, changing kernel behavior and skipping user context-manager logic; the check should only pass when resolution proves the symbol is quadrants.stream_parallel.
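A resolution-aware check along the lines Codex suggests might look like the following sketch. It only tracks `import quadrants [as alias]` bindings and ignores `from`-imports and reassignment, so it is deliberately simpler than what the real symbol_resolver would need:

```python
import ast

def quadrants_aliases(tree):
    # Collect local names bound to the quadrants module by plain imports.
    aliases = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for a in node.names:
                if a.name == "quadrants":
                    aliases.add(a.asname or a.name)
    return aliases

def is_quadrants_stream_parallel(call_func, aliases):
    # Accept only `<quadrants alias>.stream_parallel`, rejecting
    # look-alike user-defined context managers.
    return (isinstance(call_func, ast.Attribute)
            and call_func.attr == "stream_parallel"
            and isinstance(call_func.value, ast.Name)
            and call_func.value.id in aliases)

good = ast.parse("import quadrants as qd\nwith qd.stream_parallel(): pass")
bad = ast.parse("import other as qd2\nwith qd2.stream_parallel(): pass")

def check(tree):
    aliases = quadrants_aliases(tree)
    with_stmt = tree.body[1]
    return is_quadrants_stream_parallel(with_stmt.items[0].context_expr.func, aliases)

print(check(good), check(bad))  # True False
```

Unlike the name-only fallback, this never classifies `qd2.stream_parallel()` from an unrelated module as a Quadrants stream-parallel block.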
```cpp
#if defined(QD_WITH_CUDA)
CUDADriver::get_instance().stream_synchronize(nullptr);
CUDAContext::get_instance().make_current();
CUDADriver::get_instance().context_synchronize();
```
Keep qd.sync scoped to the default stream
This switches synchronization from stream-level to context/device-wide, so qd.sync() now blocks on work from all streams instead of just the default stream. That introduces a global barrier and can eliminate intended overlap for explicit streams (e.g., unrelated kernels on separate streams are forced to complete), which regresses stream-based execution behavior and contradicts the new runtime_ops.sync() contract.
PR change report

| File | LoC | Added | Removed |
|---|---|---|---|
| tests/python/test_streams.py | 413 | +413 | |
| quadrants/program/program_stream.cpp | 151 | +151 | |
| python/quadrants/lang/stream.py | 111 | +111 | |
| quadrants/runtime/cuda/kernel_launcher.cpp | 323 | +89 | -32 |
| quadrants/runtime/amdgpu/kernel_launcher.cpp | 307 | +76 | -23 |
| python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 435 | +68 | -2 |
| python/quadrants/lang/ast/ast_transformer.py | 1272 | +28 | -3 |
| python/quadrants/lang/kernel.py | 562 | +24 | -3 |
| quadrants/rhi/amdgpu/amdgpu_context.h | 111 | +24 | |
| python/quadrants/lang/ast/symbol_resolver.py | 48 | +23 | |
| quadrants/program/program_stream.h | 21 | +21 | |
| quadrants/ir/statements.h | 1307 | +20 | -3 |
| quadrants/rhi/cuda/cuda_context.h | 115 | +18 | -1 |
| quadrants/python/export_stream.cpp | 17 | +17 | |
| quadrants/ir/frontend_ir.h | 806 | +12 | |
| quadrants/codegen/llvm/llvm_compiled_data.h | 96 | +9 | -3 |
| quadrants/rhi/amdgpu/amdgpu_context.cpp | 166 | +7 | -2 |
| quadrants/rhi/amdgpu/amdgpu_profiler.cpp | 181 | +6 | -4 |
| quadrants/python/export_lang.cpp | 1048 | +6 | -3 |
| quadrants/rhi/cuda/cuda_context.cpp | 128 | +6 | -1 |
| quadrants/ir/frontend_ir.cpp | 1405 | +6 | |
| quadrants/rhi/amdgpu/amdgpu_device.cpp | 138 | +5 | -3 |
| quadrants/program/program.h | 222 | +5 | |
| tests/python/test_api.py | 460 | +5 | |
| tests/python/test_cache.py | 210 | +4 | -4 |
| quadrants/runtime/llvm/llvm_runtime_executor.cpp | 600 | +4 | -2 |
| quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h | 58 | +4 | -1 |
| quadrants/python/export.h | 24 | +4 | |
| quadrants/ir/statements.cpp | 392 | +3 | |
| quadrants/rhi/cuda/cuda_driver_functions.inc.h | 69 | +3 | |
| quadrants/transforms/lower_ast.cpp | 423 | +3 | |
| python/quadrants/lang/__init__.py | 51 | +2 | |
| quadrants/transforms/offload.cpp | 602 | +2 | |
| tests/python/test_perf_dispatch.py | 418 | +1 | -1 |
| quadrants/analysis/gen_offline_cache_key.cpp | 562 | +1 | |
| quadrants/codegen/amdgpu/codegen_amdgpu.cpp | 429 | +1 | |
| quadrants/codegen/cuda/codegen_cuda.cpp | 628 | +1 | |
| quadrants/program/program.cpp | 403 | +1 | |
| python/quadrants/lang/runtime_ops.py | 4 | | |

Total: 39 file(s) changed, +1184 -91 code lines.
Persistent scratch buffers (result_buffer, arg_buffer, runtime_context) must use nullptr (default stream) for malloc_async/mem_free_async so the operations serialize with all non-blocking streams. Using active_stream caused use-after-free when the active stream changed between launches.
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 python/quadrants/lang/__init__.py | 100% | |
| 🔴 python/quadrants/lang/ast/ast_transformer.py | 71% | 1534,1537,1539,1541,1543 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 72% | 471,474,478-480,483-487,510,516-518 |
| 🔴 python/quadrants/lang/ast/symbol_resolver.py | 9% | 66-67,69-70,72-79,81-84,86-89 |
| 🔴 python/quadrants/lang/kernel.py | 67% | 573,581,586,599,668,673 |
| 🔴 python/quadrants/lang/stream.py | 63% | 30,36,45-51,59-62,66-72,100,105,128-134,142-145,149-155,185 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 88% | 14-18,23-27,253,309-310,312-314,316-320,325-326,328-330,332-335,340-341,343-346,348-351,465,489-493 |

Diff coverage: 79% · Overall: 79% · 624 lines, 132 missing
PR change report

| File | LoC | Added | Removed |
|---|---|---|---|
| tests/python/test_streams.py | 413 | +413 | |
| quadrants/program/program_stream.cpp | 151 | +151 | |
| python/quadrants/lang/stream.py | 111 | +111 | |
| quadrants/runtime/cuda/kernel_launcher.cpp | 322 | +83 | -27 |
| quadrants/runtime/amdgpu/kernel_launcher.cpp | 306 | +70 | -18 |
| python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 435 | +68 | -2 |
| python/quadrants/lang/ast/ast_transformer.py | 1272 | +28 | -3 |
| python/quadrants/lang/kernel.py | 562 | +24 | -3 |
| quadrants/rhi/amdgpu/amdgpu_context.h | 111 | +24 | |
| python/quadrants/lang/ast/symbol_resolver.py | 48 | +23 | |
| quadrants/program/program_stream.h | 21 | +21 | |
| quadrants/ir/statements.h | 1307 | +20 | -3 |
| quadrants/rhi/cuda/cuda_context.h | 115 | +18 | -1 |
| quadrants/python/export_stream.cpp | 17 | +17 | |
| quadrants/ir/frontend_ir.h | 806 | +12 | |
| quadrants/codegen/llvm/llvm_compiled_data.h | 96 | +9 | -3 |
| quadrants/rhi/amdgpu/amdgpu_context.cpp | 166 | +7 | -2 |
| quadrants/rhi/amdgpu/amdgpu_profiler.cpp | 181 | +6 | -4 |
| quadrants/python/export_lang.cpp | 1048 | +6 | -3 |
| quadrants/rhi/cuda/cuda_context.cpp | 128 | +6 | -1 |
| quadrants/ir/frontend_ir.cpp | 1405 | +6 | |
| quadrants/rhi/amdgpu/amdgpu_device.cpp | 138 | +5 | -3 |
| quadrants/program/program.h | 222 | +5 | |
| tests/python/test_api.py | 460 | +5 | |
| tests/python/test_cache.py | 210 | +4 | -4 |
| quadrants/runtime/llvm/llvm_runtime_executor.cpp | 600 | +4 | -2 |
| quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h | 58 | +4 | -1 |
| quadrants/python/export.h | 24 | +4 | |
| quadrants/ir/statements.cpp | 392 | +3 | |
| quadrants/rhi/cuda/cuda_driver_functions.inc.h | 69 | +3 | |
| quadrants/transforms/lower_ast.cpp | 423 | +3 | |
| python/quadrants/lang/__init__.py | 51 | +2 | |
| quadrants/transforms/offload.cpp | 602 | +2 | |
| tests/python/test_perf_dispatch.py | 418 | +1 | -1 |
| quadrants/analysis/gen_offline_cache_key.cpp | 562 | +1 | |
| quadrants/codegen/amdgpu/codegen_amdgpu.cpp | 429 | +1 | |
| quadrants/codegen/cuda/codegen_cuda.cpp | 628 | +1 | |
| quadrants/program/program.cpp | 403 | +1 | |
| python/quadrants/lang/runtime_ops.py | 4 | | |

Total: 39 file(s) changed, +1172 -81 code lines.
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough