
[Perf] Streams 1-4 #410

Open
hughperkins wants to merge 199 commits into main from hp/streams-quadrantsic-4-stream-pool

Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Introduces qd.create_stream() and qd.create_event() for launching
kernels on separate CUDA streams with event-based synchronization.
The qd_stream kwarg on kernel calls routes the launch to a specific
stream. Non-CUDA backends return no-op handles (0). Routes kernel
launcher memory ops through the active stream.
Mirrors the CUDA stream implementation for HIP: adds stream_ member
to AMDGPUContext, stream_destroy/stream_wait_event/malloc_async/
mem_free_async to HIP driver functions, and AMDGPU branches in all
Program stream/event methods. Converts AMDGPU kernel launcher to use
async memory operations through the active stream. CPU backend
returns 0 handles (no-op).
Introduces stream_parallel() for running top-level for-loop blocks on
separate GPU streams. The AST transformer maps 'with qd.stream_parallel()'
blocks to stream-parallel group IDs, which propagate through IR lowering
and offloading to the CUDA/AMDGPU kernel launchers. Each unique group ID
gets its own stream at launch time. Includes validation that all top-level
kernel statements must be stream_parallel blocks (no mixing), and offline
cache key support.
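The transform described above can be sketched generically with Python's `ast` module. This is a hypothetical illustration of its shape — the function name and the exact form of the no-mixing check are assumptions based on this description, not the actual Quadrants implementation:

```python
import ast

def assign_stream_parallel_groups(src: str) -> dict[int, int]:
    """Map each top-level `with qd.stream_parallel():` block in the first
    function of `src` to a unique group ID, enforcing the rule that
    top-level statements are either all stream_parallel blocks or none."""
    fn = ast.parse(src).body[0]

    def is_sp(stmt: ast.stmt) -> bool:
        return (isinstance(stmt, ast.With)
                and isinstance(stmt.items[0].context_expr, ast.Call)
                and isinstance(stmt.items[0].context_expr.func, ast.Attribute)
                and stmt.items[0].context_expr.func.attr == "stream_parallel")

    flags = [is_sp(s) for s in fn.body]
    if any(flags) and not all(flags):
        raise SyntaxError("cannot mix stream_parallel blocks with other "
                          "top-level kernel statements")
    # Each unique block gets its own group ID, which would then ride along
    # through lowering/offloading to pick a stream at launch time.
    blocks = (s for s in fn.body if is_sp(s))
    return {s.lineno: gid for gid, s in enumerate(blocks, start=1)}

src = """\
def kernel():
    with qd.stream_parallel():
        pass
    with qd.stream_parallel():
        pass
"""
print(assign_stream_parallel_groups(src))  # {2: 1, 4: 2}
```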
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
Batch the device_result_buffer free into the stream pipeline before the
sync barrier, matching the CUDA kernel launcher's ordering for
consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they
are properly ordered on the active stream, matching the CUDA launcher.
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in
CI under contention, and print actual timings for diagnostics.
…ead_local

Mirror the CUDA fixes: guard stream_synchronize against handle==0 to
avoid unintentional default stream sync, and make AMDGPUContext::stream_
thread_local for thread-safety.
…adrantsic-3-stream-parallel

# Conflicts:
#	python/quadrants/lang/stream.py
Prevents stale group IDs from leaking if insert_for is called after a
path that set a non-zero stream_parallel_group_id, matching the reset
pattern of all other ForLoopConfig fields.
Add an error check in begin_stream_parallel() to prevent nesting, which
would produce undefined group ID semantics.
…context safety

Add comments explaining that streams are created/destroyed per launch
(stream pooling as future optimization), and that RuntimeContext sharing
across concurrent streams is safe because kernels only read from it.
Replace per-launch stream_create/stream_destroy with acquire_stream/
release_stream on CUDAContext and AMDGPUContext. Streams are cached in
a pool and reused across invocations, avoiding the driver-level overhead
of stream creation (~5-50us) on every kernel launch in hot loops.
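The pooling scheme can be sketched in Python with stand-in create/destroy callables — a minimal illustration of the lock-guarded LIFO reuse and shutdown cleanup described here, not the actual CUDAContext/AMDGPUContext code:

```python
import threading

class StreamPool:
    """Lock-guarded LIFO pool: streams are created lazily, reused across
    launches, and destroyed only when the pool itself shuts down."""

    def __init__(self, create, destroy):
        self._create = create    # stand-in for driver stream_create
        self._destroy = destroy  # stand-in for driver stream_destroy
        self._pool = []
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self._pool:
                return self._pool.pop()  # LIFO: most recently used first
        return self._create()            # pool empty: pay creation cost once

    def release(self, stream):
        with self._lock:
            self._pool.append(stream)

    def shutdown(self):
        # The destructor-cleanup step: destroy all pooled streams.
        with self._lock:
            for s in self._pool:
                self._destroy(s)
            self._pool.clear()

created = []
pool = StreamPool(create=lambda: created.append(object()) or created[-1],
                  destroy=lambda s: None)
for _ in range(5):                       # 5 launches, 2 streams each
    streams = [pool.acquire(), pool.acquire()]
    for s in streams:
        pool.release(s)
print(len(created))  # 2 — the pool stabilizes at the per-launch high-water mark
```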
@hughperkins hughperkins changed the base branch from main to hp/streams-quadrantsic-3-stream-parallel March 12, 2026 01:35
Calls a stream_parallel kernel 5 times in a loop to verify that pooled
streams are correctly reused with correct results each iteration.
@hughperkins
Collaborator Author

Review from Opus 4.6:

What it does: Introduces a stream pool in both CUDAContext and AMDGPUContext so that GPU streams are reused across kernel launches rather than created and destroyed every time.
The kernel launchers now call acquire_stream() / release_stream() instead of stream_create() / stream_destroy().
What's good:

  1. Clear motivation. The previous branch even had a comment saying "a stream pool could reduce overhead for hot loops" — this PR implements exactly that and removes the TODO
    comment. Nice follow-through.
  2. Symmetry. The CUDA and AMDGPU implementations are perfectly mirrored, which is easy to review and maintain.
  3. Thread safety. Both acquire_stream and release_stream are guarded by lock_, which is correct since the context is a singleton shared across threads.
  4. Test coverage. The test runs 5 iterations of a kernel with two stream_parallel blocks, which exercises the pool reuse path.

Concerns / suggestions:

  1. Unbounded pool growth. Streams are never destroyed — release_stream just pushes back into the vector forever. If a spike causes many concurrent streams to be allocated,
    they'll remain in the pool indefinitely. Consider adding a high-water mark or trimming idle streams, or at minimum destroying them in the context destructor.
  2. No cleanup in destructor. The stream_pool_ vector will leak GPU streams when the context is destroyed. You should iterate and destroy all pooled streams in the destructor.
  3. Lock contention. Every acquire_stream / release_stream takes the global lock_ mutex. In the hot loop case this PR is optimizing for, the acquire and release happen close
    together per launch. This is probably fine for now, but if you ever acquire/release streams from many threads concurrently, a lock-free pool or thread-local free-lists would
    be better. Not needed now, just worth noting.
  4. std::vector as a stack is fine but pop_back() + push_back() means LIFO reuse, which is actually good — recently used streams are more likely to be warm in driver caches.
  5. Test could be stronger. The test validates correctness but doesn't actually verify that streams are being reused (e.g., that no new streams are created on the 2nd+
    iteration). A unit test that directly calls acquire_stream / release_stream and checks pool size would give more confidence the pooling logic itself works. Though I
    understand that may be harder to do from Python.

Overall, it's a solid, focused change. The main actionable item is making sure pooled streams get destroyed when the context shuts down.

@hughperkins
Collaborator Author

Response from the original Opus that wrote this PR:

  1. Unbounded pool growth -- In practice this is bounded by the number of unique stream_parallel group IDs per kernel, which is typically 2-4. The pool won't grow beyond that
    peak. Not a real concern.
  2. No cleanup in destructor -- Technically valid, but both CUDAContext and AMDGPUContext are singletons allocated with new and intentionally never deleted (the CUDA destructor
    is even commented out). The OS reclaims GPU resources on process exit. That said, adding cleanup is trivially cheap and good hygiene -- I could add it.
  3. Lock contention -- Reviewer already acknowledges it's fine for now. Agreed.
  4. LIFO reuse -- Just an observation, and a positive one.
  5. Stronger test -- Fair point but acquire_stream/release_stream are C++ methods not exposed via pybind, so testing pool internals from Python isn't practical. The correctness
    test (5 iterations with correct results) is the right level of testing from this side.

I think I might get some opinions from a fresh Opus.

@hughperkins
Collaborator Author

The reviewer Opus said the following when I asked "what are your thoughts on concern 1?":

Honestly, on reflection it's probably overstated. The number of streams acquired per launch is determined by the number of distinct stream_parallel groups in a kernel — that's
a compile-time property, typically small (single digits). The pool will quickly stabilize at the high-water mark and stay there. There's no mechanism that would cause a
"spike" of stream allocation that then subsides, since the acquire/release pattern is synchronous within each launch (acquire all, launch, sync, release all).
GPU streams are also lightweight when idle — they're essentially just driver-side command queues, not large resource allocations.
So the "unbounded growth" framing is misleading. The real actionable concern is just concern 2 — destructor cleanup. You should still destroy the pooled streams when the
context shuts down to avoid leaking driver handles. But worrying about pool size management or high-water trimming is over-engineering for this use case.

@hughperkins
Collaborator Author

So both Opuses concur that the only actionable concern is destructor cleanup, concern 2.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:51
@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (736545f63)

File Coverage Missing
🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 74% 180,353,363,411,505,517
🟢 tests/python/test_streams.py 100%

Diff coverage: 86% · Overall: 74% · 42 lines, 6 missing

Full annotated report

hughperkins and others added 6 commits May 2, 2026 15:04
Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
…ucer ctx param

Base branch added &ctx parameter to ensure_per_task_float_heap_post_reducer,
moved cap_blocks outside allocas scope, and added safety comments in both
CUDA and AMDGPU kernel launchers. Integrated into prepare_task lambda.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins hughperkins changed the title [Perf] Streams 4 [Perf] Streams 1-4 May 3, 2026
@hughperkins hughperkins changed the base branch from hp/streams-quadrantsic-3-stream-parallel to main May 3, 2026 08:54
@hughperkins
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fee0868fc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +172 to +176
for (auto &[sid, s] : stream_by_id) {
  CUDADriver::get_instance().stream_synchronize(s);
}
for (auto &[sid, s] : stream_by_id) {
  CUDAContext::get_instance().release_stream(s);


P1 Badge Restore and release stream pool resources on launch failure

The stream-parallel path does not guard cleanup with try/finally, so if cuda_module->launch(...) or stream_synchronize(...) throws (e.g., invalid launch config or device assert), execution exits before restoring active_stream and before returning acquired streams to the pool. After that first failure, later kernels in the same thread can run on a stale stream and leaked pooled streams accumulate, which makes subsequent launches behave unpredictably.
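The shape of the fix this finding calls for — releasing acquired streams and restoring the active stream even when a launch throws — looks roughly like this (Python stand-ins for the C++ try/catch; all callables here are hypothetical):

```python
def launch_stream_parallel(stream_by_id, launch, synchronize,
                           release, restore_active_stream):
    """Run all per-group launches and syncs; guarantee cleanup on failure."""
    try:
        for sid, s in stream_by_id.items():
            launch(s)
        for sid, s in stream_by_id.items():
            synchronize(s)
    finally:
        # Runs even if launch/synchronize raised: streams go back to the
        # pool and the thread-local active stream is restored.
        for sid, s in stream_by_id.items():
            release(s)
        restore_active_stream()

released = []

def failing_launch(s):
    raise RuntimeError("invalid launch config")

try:
    launch_stream_parallel(
        {1: "s1", 2: "s2"},
        launch=failing_launch,
        synchronize=lambda s: None,
        release=released.append,
        restore_active_stream=lambda: released.append("active-restored"),
    )
except RuntimeError:
    pass
print(released)  # ['s1', 's2', 'active-restored']
```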


Comment on lines +153 to +157
for (auto &[sid, s] : stream_by_id) {
  AMDGPUDriver::get_instance().stream_synchronize(s);
}
for (auto &[sid, s] : stream_by_id) {
  AMDGPUContext::get_instance().release_stream(s);


P1 Badge Guard AMDGPU stream-parallel cleanup with exception safety

This AMDGPU stream-parallel section has the same exception-safety gap as CUDA: if a launch or stream_synchronize fails, control flow skips releasing stream_by_id entries and skips restoring active_stream. That leaves thread-local stream state corrupted for future kernels and leaks pooled HIP streams after any runtime launch error.


Comment on lines +46 to +49
if prog is None:
    try:
        return impl.get_runtime().prog
    except Exception:


P2 Badge Avoid destroying stale handles through a new runtime Program

When the owning Program has been collected (e.g., after qd.reset()), _destroy_prog() falls back to impl.get_runtime().prog and destroys the old handle through whichever backend is currently active. If the arch changed, this can either leak the original GPU stream/event (CPU fallback no-op) or call destroy on an incompatible backend handle, which can raise driver errors during cleanup.


Wrap the launch+synchronize section in try/catch so that acquired
streams are returned to the pool and active_stream is restored even
when a launch or stream_synchronize throws.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins
Collaborator Author

@codex review

@github-actions

github-actions Bot commented May 3, 2026

Coverage Report (8fee0868f)

File Coverage Missing
🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 74% 180,353,363,411,505,517
🟢 tests/python/test_streams.py 100%

Diff coverage: 86% · Overall: 67% · 42 lines, 6 missing

Full annotated report


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac4b825074


offloaded_struct_for->is_bit_vectorized = for_stmt->is_bit_vectorized;
offloaded_struct_for->num_cpu_threads = std::min(for_stmt->num_cpu_threads, config.cpu_max_num_threads);
offloaded_struct_for->mem_access_opt = mem_access_opt;
offloaded_struct_for->stream_parallel_group_id = for_stmt->stream_parallel_group_id;


P1 Badge Propagate stream group to struct_for prelude offloads

This change only tags the final struct_for offload with stream_parallel_group_id, but the generated prelude tasks (clear_list/listgen in the same function) remain at group 0. In the new CUDA/AMDGPU launchers, group-0 tasks run on the active/default stream while nonzero groups run on pooled non-blocking streams, so a sparse loop inside with qd.stream_parallel(): can enqueue struct_for before its list generation has completed, producing races or empty/stale iteration lists. The prelude offloads need the same group id as their struct_for.


Comment on lines +199 to +200
CUDAContext::get_instance().make_current();
CUDADriver::get_instance().context_synchronize();


P2 Badge Keep qd.sync scoped to stream semantics

Switching runtime synchronization to context_synchronize()/device_synchronize() makes qd.sync() block all streams, not just default-stream work. With explicit streams, this introduces a global barrier that stalls unrelated user-stream kernels and removes expected overlap, which is a behavior/performance regression for stream-based workflows. Synchronization here should stay stream-scoped (or the public semantics and all callers must be adjusted consistently).


The prepare_task lambda extraction dropped several non-obvious comments
explaining adstack gate roles, lazy-claim buffer rationale, device-side
reducer mechanics, shape-entry unit-stability, and grid-dim capping
rationale. Restore them.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

github-actions Bot commented May 3, 2026

Coverage Report (65d5cb92d)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🔴 python/quadrants/lang/ast/ast_transformer.py 59% 1531-1532,1534,1537,1539,1541,1543
🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 52% 29,35,466-467,471,474,478-480,483-487,489-490,493-494,504-505,510,516-518
🔴 python/quadrants/lang/ast/symbol_resolver.py 0% 59-60,66-67,69-70,72-79,81-84,86-89
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 45% 1-2,4,7,11,18,22-23,26,30,32,36,39,47-50,53,64,66-72,74,77,81,88,92-93,96,100,102,105,108,113,118,122,130-133,136,147,149-155,157,160,164,171,178-179,185,188
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 314,330,345-346,489-493

Diff coverage: 80% · Overall: 67% · 624 lines, 126 missing

Full annotated report

@github-actions

github-actions Bot commented May 4, 2026

Coverage Report (6f84bcd10)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🔴 python/quadrants/lang/ast/ast_transformer.py 59% 1531-1532,1534,1537,1539,1541,1543
🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 52% 29,35,466-467,471,474,478-480,483-487,489-490,493-494,504-505,510,516-518
🔴 python/quadrants/lang/ast/symbol_resolver.py 0% 59-60,66-67,69-70,72-79,81-84,86-89
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 45% 1-2,4,7,11,18,22-23,26,30,32,36,39,47-50,53,64,66-72,74,77,81,88,92-93,96,100,102,105,108,113,118,122,130-133,136,147,149-155,157,160,164,171,178-179,185,188
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 314,330,345-346,489-493

Diff coverage: 80% · Overall: 74% · 624 lines, 126 missing

Full annotated report

…c-4-stream-pool

# Conflicts:
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/kernel_launcher.cpp
…pool' into hp/streams-quadrantsic-4-stream-pool
@hughperkins
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e329a5b16


Comment on lines +483 to +486
if isinstance(func_node, ast.Attribute) and func_node.attr == "stream_parallel":
    return True
if isinstance(func_node, ast.Name) and func_node.id == "stream_parallel":
    return True


P1 Badge Restrict stream_parallel detection to Quadrants symbols

The name-based fallback here accepts any with ...stream_parallel(): call even when it resolves to a non-Quadrants symbol, because it returns True purely on attr == "stream_parallel" / id == "stream_parallel". This can silently reinterpret user-defined context managers as Quadrants stream-parallel blocks instead of raising a syntax error, changing kernel behavior and skipping user context-manager logic; the check should only pass when resolution proves the symbol is quadrants.stream_parallel.
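A tightened check would resolve the base name instead of matching on the attribute alone; a hypothetical sketch (the alias set and shadowing test are assumptions about how the symbol resolver could feed this check):

```python
import ast

# Assumed: the resolver tracks which names are bound to the Quadrants module.
QUADRANTS_ALIASES = {"qd", "quadrants"}

def is_quadrants_stream_parallel(call: ast.Call, local_names: set[str]) -> bool:
    """Accept `qd.stream_parallel()` only when the base name is a known
    Quadrants alias and is not shadowed by a local binding."""
    func = call.func
    if isinstance(func, ast.Attribute) and func.attr == "stream_parallel":
        base = func.value
        return (isinstance(base, ast.Name)
                and base.id in QUADRANTS_ALIASES
                and base.id not in local_names)
    return False

good = ast.parse("qd.stream_parallel()").body[0].value
bad = ast.parse("other.stream_parallel()").body[0].value
print(is_quadrants_stream_parallel(good, set()),
      is_quadrants_stream_parallel(bad, set()))  # True False
```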


#if defined(QD_WITH_CUDA)
CUDADriver::get_instance().stream_synchronize(nullptr);
CUDAContext::get_instance().make_current();
CUDADriver::get_instance().context_synchronize();


P2 Badge Keep qd.sync scoped to the default stream

This switches synchronization from stream-level to context/device-wide, so qd.sync() now blocks on work from all streams instead of just the default stream. That introduces a global barrier and can eliminate intended overlap for explicit streams (e.g., unrelated kernels on separate streams are forced to complete), which regresses stream-based execution behavior and contradicts the new runtime_ops.sync() contract.


@github-actions

github-actions Bot commented May 5, 2026

PR change report (b5554ca62)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

File LoC Added Removed
tests/python/test_streams.py 413 +413
quadrants/program/program_stream.cpp 151 +151
python/quadrants/lang/stream.py 111 +111
quadrants/runtime/cuda/kernel_launcher.cpp 323 +89 -32
quadrants/runtime/amdgpu/kernel_launcher.cpp 307 +76 -23
python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 435 +68 -2
python/quadrants/lang/ast/ast_transformer.py 1272 +28 -3
python/quadrants/lang/kernel.py 562 +24 -3
quadrants/rhi/amdgpu/amdgpu_context.h 111 +24
python/quadrants/lang/ast/symbol_resolver.py 48 +23
quadrants/program/program_stream.h 21 +21
quadrants/ir/statements.h 1307 +20 -3
quadrants/rhi/cuda/cuda_context.h 115 +18 -1
quadrants/python/export_stream.cpp 17 +17
quadrants/ir/frontend_ir.h 806 +12
quadrants/codegen/llvm/llvm_compiled_data.h 96 +9 -3
quadrants/rhi/amdgpu/amdgpu_context.cpp 166 +7 -2
quadrants/rhi/amdgpu/amdgpu_profiler.cpp 181 +6 -4
quadrants/python/export_lang.cpp 1048 +6 -3
quadrants/rhi/cuda/cuda_context.cpp 128 +6 -1
quadrants/ir/frontend_ir.cpp 1405 +6
quadrants/rhi/amdgpu/amdgpu_device.cpp 138 +5 -3
quadrants/program/program.h 222 +5
tests/python/test_api.py 460 +5
tests/python/test_cache.py 210 +4 -4
quadrants/runtime/llvm/llvm_runtime_executor.cpp 600 +4 -2
quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h 58 +4 -1
quadrants/python/export.h 24 +4
quadrants/ir/statements.cpp 392 +3
quadrants/rhi/cuda/cuda_driver_functions.inc.h 69 +3
quadrants/transforms/lower_ast.cpp 423 +3
python/quadrants/lang/__init__.py 51 +2
quadrants/transforms/offload.cpp 602 +2
tests/python/test_perf_dispatch.py 418 +1 -1
quadrants/analysis/gen_offline_cache_key.cpp 562 +1
quadrants/codegen/amdgpu/codegen_amdgpu.cpp 429 +1
quadrants/codegen/cuda/codegen_cuda.cpp 628 +1
quadrants/program/program.cpp 403 +1
python/quadrants/lang/runtime_ops.py 4

Total: 39 file(s) changed, +1184 -91 code lines.

Full per-function report

Persistent scratch buffers (result_buffer, arg_buffer, runtime_context)
must use nullptr (default stream) for malloc_async/mem_free_async so
the operations serialize with all non-blocking streams. Using
active_stream caused use-after-free when the active stream changed
between launches.
@github-actions

github-actions Bot commented May 5, 2026

Coverage Report (b5554ca62)

File Coverage Missing
🟢 python/quadrants/lang/__init__.py 100%
🔴 python/quadrants/lang/ast/ast_transformer.py 71% 1534,1537,1539,1541,1543
🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 72% 471,474,478-480,483-487,510,516-518
🔴 python/quadrants/lang/ast/symbol_resolver.py 9% 66-67,69-70,72-79,81-84,86-89
🔴 python/quadrants/lang/kernel.py 67% 573,581,586,599,668,673
🔴 python/quadrants/lang/stream.py 63% 30,36,45-51,59-62,66-72,100,105,128-134,142-145,149-155,185
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 88% 14-18,23-27,253,309-310,312-314,316-320,325-326,328-330,332-335,340-341,343-346,348-351,465,489-493

Diff coverage: 79% · Overall: 79% · 624 lines, 132 missing

Full annotated report

@github-actions

github-actions Bot commented May 5, 2026

PR change report (8a7cdd795)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

File LoC Added Removed
tests/python/test_streams.py 413 +413
quadrants/program/program_stream.cpp 151 +151
python/quadrants/lang/stream.py 111 +111
quadrants/runtime/cuda/kernel_launcher.cpp 322 +83 -27
quadrants/runtime/amdgpu/kernel_launcher.cpp 306 +70 -18
python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 435 +68 -2
python/quadrants/lang/ast/ast_transformer.py 1272 +28 -3
python/quadrants/lang/kernel.py 562 +24 -3
quadrants/rhi/amdgpu/amdgpu_context.h 111 +24
python/quadrants/lang/ast/symbol_resolver.py 48 +23
quadrants/program/program_stream.h 21 +21
quadrants/ir/statements.h 1307 +20 -3
quadrants/rhi/cuda/cuda_context.h 115 +18 -1
quadrants/python/export_stream.cpp 17 +17
quadrants/ir/frontend_ir.h 806 +12
quadrants/codegen/llvm/llvm_compiled_data.h 96 +9 -3
quadrants/rhi/amdgpu/amdgpu_context.cpp 166 +7 -2
quadrants/rhi/amdgpu/amdgpu_profiler.cpp 181 +6 -4
quadrants/python/export_lang.cpp 1048 +6 -3
quadrants/rhi/cuda/cuda_context.cpp 128 +6 -1
quadrants/ir/frontend_ir.cpp 1405 +6
quadrants/rhi/amdgpu/amdgpu_device.cpp 138 +5 -3
quadrants/program/program.h 222 +5
tests/python/test_api.py 460 +5
tests/python/test_cache.py 210 +4 -4
quadrants/runtime/llvm/llvm_runtime_executor.cpp 600 +4 -2
quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h 58 +4 -1
quadrants/python/export.h 24 +4
quadrants/ir/statements.cpp 392 +3
quadrants/rhi/cuda/cuda_driver_functions.inc.h 69 +3
quadrants/transforms/lower_ast.cpp 423 +3
python/quadrants/lang/__init__.py 51 +2
quadrants/transforms/offload.cpp 602 +2
tests/python/test_perf_dispatch.py 418 +1 -1
quadrants/analysis/gen_offline_cache_key.cpp 562 +1
quadrants/codegen/amdgpu/codegen_amdgpu.cpp 429 +1
quadrants/codegen/cuda/codegen_cuda.cpp 628 +1
quadrants/program/program.cpp 403 +1
python/quadrants/lang/runtime_ops.py 4

Total: 39 file(s) changed, +1172 -81 code lines.

Full per-function report

@github-actions

github-actions Bot commented May 5, 2026

Coverage Report (8a7cdd795)

File Coverage Missing
🟢 python/quadrants/lang/__init__.py 100%
🔴 python/quadrants/lang/ast/ast_transformer.py 71% 1534,1537,1539,1541,1543
🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py 72% 471,474,478-480,483-487,510,516-518
🔴 python/quadrants/lang/ast/symbol_resolver.py 9% 66-67,69-70,72-79,81-84,86-89
🔴 python/quadrants/lang/kernel.py 67% 573,581,586,599,668,673
🔴 python/quadrants/lang/stream.py 63% 30,36,45-51,59-62,66-72,100,105,128-134,142-145,149-155,185
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 88% 14-18,23-27,253,309-310,312-314,316-320,325-326,328-330,332-335,340-341,343-346,348-351,465,489-493

Diff coverage: 79% · Overall: 79% · 624 lines, 132 missing

Full annotated report
