perf(diffusion): FLUX.1 performance optimization kernels (#187) #189
## Summary

Adds custom CUDA kernels and a cuBLAS-backed batched matmul to optimize FLUX.1 diffusion inference.

## Changes

### CUDA kernels (`native/ops/nn/diffusion/flux_kernels.cuh`)

- `layer_norm_simple_kernel`: simple layer normalization (no affine parameters)
- `modulate_kernel`: AdaLN modulation, `y = x * (1 + scale) + shift`
- `gated_residual_kernel`: gated residual, `y = residual + gate * value`
- `scale_tensor_kernel`: element-wise scaling
- `concat_axis1_kernel` / `split_axis1_kernel`: concatenation/splitting along axis 1
- `add_broadcast_kernel`: broadcasting addition
- `layer_norm_modulate_fused_kernel`: fused LayerNorm + modulation

### cuBLAS integration (`native/ops/matmul/batched.cu`)

- `batched_matmul_fp32` using cuBLAS `cublasSgemmStridedBatched`

### Python API updates (`src/pygpukit/diffusion/models/flux/ops.py`)

## Test plan

- Unit tests: `tests/test_flux_kernels.py`
- Benchmark

Run `python -m pytest tests/test_flux_kernels.py -v` to verify.

Closes #187
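For reviewers, the element-wise kernel semantics listed above can be sketched as NumPy reference implementations. This is a minimal sketch: only the `modulate` and `gated_residual` formulas come from the PR description; the normalization axis, the `eps` default, and the function signatures are assumptions.

```python
import numpy as np

def layer_norm_simple(x, eps=1e-6):
    # Normalize over the last axis; no learnable scale/bias ("no affine").
    # eps is an assumed default, not taken from the PR.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def modulate(x, shift, scale):
    # AdaLN modulation: y = x * (1 + scale) + shift
    return x * (1.0 + scale) + shift

def gated_residual(residual, gate, value):
    # Gated residual: y = residual + gate * value
    return residual + gate * value

def layer_norm_modulate_fused(x, shift, scale, eps=1e-6):
    # Fused LayerNorm + modulation: should match composing the two ops above.
    return modulate(layer_norm_simple(x, eps), shift, scale)

def batched_matmul_fp32(a, b):
    # Strided batched GEMM semantics: one independent matmul per batch index,
    # as performed by cuBLAS strided-batched SGEMM in fp32.
    return a.astype(np.float32) @ b.astype(np.float32)
```

A natural correctness check (and presumably what the unit tests in `tests/test_flux_kernels.py` verify) is that the fused kernel agrees with the unfused composition on the same inputs.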
🤖 Generated with Claude Code