Release v0.2.19 #178

m96-chan · 2026-01-01T01:02:01Z

Summary

Release v0.2.19 with FLUX.1 image generation, lazy model loading, and infrastructure improvements.

New Features

FLUX.1 Image Generation

Full FLUX.1-schnell transformer implementation (19 joint + 38 single blocks)
Flow matching Euler scheduler
GPU-native operations (transpose, batched matmul, RoPE)
RoPE frequency caching for efficiency
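
For readers unfamiliar with flow matching, the following is a minimal sketch of an Euler sampler, not the scheduler shipped in this PR. It assumes a plain linear noise schedule from 1.0 to 0.0 (FLUX pipelines typically apply a shifted schedule), and the names `linear_sigmas`, `euler_sample`, and `velocity_fn` are hypothetical.

```python
def linear_sigmas(num_steps):
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean sample)."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dsigma = v(x, sigma) from sigma=1 to sigma=0 with Euler steps."""
    sigmas = linear_sigmas(num_steps)
    x = list(noise)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)
        dt = sigma_next - sigma  # negative: each step moves toward less noise
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: on the straight path x_sigma = (1 - sigma) * data + sigma * noise
# the exact velocity is (noise - data), so Euler integration recovers `data`.
data = [0.5, -1.0, 2.0]
noise = [1.5, 0.25, -0.75]


def exact_velocity(x, sigma):
    return [n - d for n, d in zip(noise, data)]


sample = euler_sample(exact_velocity, noise, num_steps=4)
```

In a real pipeline `velocity_fn` would be the transformer forward pass conditioned on the text embeddings; the toy velocity here only demonstrates the update rule.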

Lazy Model Loading with Streaming

StreamingStrategy.EAGER - Load all at once (default)
StreamingStrategy.PROGRESSIVE - Load during first forward
StreamingStrategy.LAYER_BY_LAYER - Minimal memory usage

cuBLAS Dynamic Loader

Runtime DLL loading without compile-time CUDA Toolkit
Auto-detection of cuBLASLt versions (13/12/11)
Graceful fallback to native kernels
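
The loader in this PR is implemented in native code; the idea can be sketched in Python with `ctypes`. This is an illustration under assumptions: only `cublas64_13.dll` and `libcublas.so` are named in the commit message, the other candidate file names are guesses following the same pattern, and `try_load` and `pick_backend` are hypothetical helpers.

```python
import ctypes

# Candidate names, newest first. Only cublas64_13.dll and libcublas.so appear
# in the commit message; the remaining names are assumptions.
CANDIDATES = (
    "cublas64_13.dll", "cublas64_12.dll", "cublas64_11.dll",
    "libcublas.so.13", "libcublas.so.12", "libcublas.so.11", "libcublas.so",
)


def try_load(names=CANDIDATES):
    """Return (handle, name) for the first library that loads, else (None, None)."""
    for name in names:
        try:
            return ctypes.CDLL(name), name
        except OSError:
            continue  # not installed under this name; try the next candidate
    return None, None


def pick_backend(handle):
    """Fall back to the project's own kernels when no library could be loaded."""
    return "cublas" if handle is not None else "native"


# A name that cannot exist demonstrates the fallback path.
missing_handle, missing_name = try_load(("no_such_cublas_library_for_test.so",))
backend = pick_backend(missing_handle)
```

Resolving the library at runtime is what removes the compile-time dependency on the CUDA Toolkit: the build never links against cuBLAS, and a machine without it still gets a working (slower) path.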

C++ Kernel Profiler

Built-in CUDA kernel profiling
Minimal overhead timing
Per-kernel statistics
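
As a rough picture of scoped timing with per-kernel aggregation, here is a CPU-side sketch. It uses `time.perf_counter` as a stand-in for CUDA events, so it measures host wall-clock time rather than GPU execution time, and `KernelStats` is a hypothetical name, not the API added in this PR.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class KernelStats:
    """Aggregates wall-clock timings per kernel name (stand-in for CUDA events)."""

    def __init__(self):
        self._records = defaultdict(list)

    @contextmanager
    def scoped(self, name):
        # Timing stops when the block exits, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self._records[name].append(time.perf_counter() - start)

    def summary(self):
        return {
            name: {
                "count": len(ts),
                "total_s": sum(ts),
                "mean_s": sum(ts) / len(ts),
                "max_s": max(ts),
            }
            for name, ts in self._records.items()
        }


stats = KernelStats()
for _ in range(3):
    with stats.scoped("softmax"):
        sum(i * i for i in range(1000))  # placeholder for a kernel launch
report = stats.summary()
```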

HuggingFace T5 Encoder

Sharded safetensors support
Full T5 encoder implementation
Automatic encoder detection

DiT Architecture

PixArt transformer with AdaLN-Zero
Self/cross attention with GQA
Patch/timestep/2D sincos embeddings
GEGLU FFN
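
The AdaLN-Zero conditioning listed above can be sketched on a single feature vector. This is a simplified illustration, assuming the usual formulation `x + gate * sublayer(LN(x) * (1 + scale) + shift)`; the function names are hypothetical, and in the real model shift, scale, and gate are produced by a projection of the timestep embedding.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_block(x, sublayer, shift, scale, gate):
    """Compute x + gate * sublayer(LN(x) * (1 + scale) + shift), per feature."""
    normed = layer_norm(x)
    modulated = [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]
    out = sublayer(modulated)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


def double(h):
    """Placeholder sublayer standing in for attention or the FFN."""
    return [2.0 * v for v in h]


x = [1.0, 2.0, 4.0, 8.0]
zeros = [0.0] * len(x)
ones = [1.0] * len(x)

# With zero-initialized modulation the gate is 0, so the block is the identity.
identity_out = adaln_zero_block(x, double, zeros, zeros, zeros)
gated_out = adaln_zero_block(x, double, zeros, zeros, ones)
```

The "Zero" in the name refers to that initialization: every block starts as an identity mapping, and the gate opens as training proceeds.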

New GPU Operations

transpose_4d_0213, transpose_3d_012
gpu_batched_matmul, gpu_softmax
gpu_apply_rope
cross_attention, conv2d, group_norm
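
As a reading aid, the sketch below gives reference semantics for `transpose_4d_0213` on nested lists. It assumes the suffix encodes the axis permutation (0, 2, 1, 3), which the PR text does not state explicitly; the GPU kernel itself is not shown here.

```python
def transpose_4d_0213(x):
    """Reference semantics: out[b][h][s][d] == x[b][s][h][d] for nested lists."""
    return [
        [[list(row[h]) for row in batch] for h in range(len(batch[0]))]
        for batch in x
    ]


# Shape [batch=1, seq=2, heads=3, head_dim=1]
x = [[[[0], [1], [2]],
      [[10], [11], [12]]]]
y = transpose_4d_0213(x)  # shape [1, 3, 2, 1]
```

Under that assumption the typical use is moving from a `[batch, seq, heads, head_dim]` layout to `[batch, heads, seq, head_dim]` before attention, so each head sees a contiguous sequence.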

Files Changed

New: src/pygpukit/diffusion/ - Image generation module
New: native/ops/nn/diffusion/ - CUDA kernels for diffusion
Modified: src/pygpukit/llm/ - Streaming strategies
Modified: native/core/ - cuBLAS loader, profiler

Test Plan

FLUX.1 image generation produces correct output
Ruff lint passes
Mypy type check passes
cmake-check passes

🤖 Generated with Claude Code

## cuBLAS Dynamic Loader (Issue #134)

- Dynamic loading of cuBLAS library (cublas64_13.dll / libcublas.so)
- Supports GEMM: sgemm, dgemm, hgemm, gemm_ex (mixed precision)
- Supports GEMV: sgemv, dgemv
- Row-major convenience wrappers for Python API
- Python bindings: cublas_is_available, cublas_get_version, cublas_test_*

## C++ Kernel Profiler (Issue #150)

- Native C++ profiler using CUDA Driver API (cuEvent*)
- ScopedTimer class for RAII-based timing
- KernelProfiler for aggregating multiple kernel records
- Python bindings with automatic native backend detection
- Chrome trace export support

Test results (RTX 5090, CUDA 13.1):

- cuBLAS loaded: cublas64_13.dll v13.2.0
- SGEMM/HGEMM/DGEMM: all pass
- Profiler: native C++ backend active

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add memory-mapped model loading with on-demand GPU loading for large models (70B+).

## Core Implementation (Rust)

- LazyTensor: GPU caching with LRU eviction
- LazyModelLoader: Multi-file SafeTensors loader with memory budgeting
- TensorState enum: OnDisk, Loading, OnGpu, Evicted
- Layer management: get_layer_tensors, layer_size, is_layer_loaded, layer_state

## Loading Strategies (Python)

- SimpleStreaming: Load/unload each layer (minimal VRAM)
- SlidingWindow: Keep N layers, prefetch ahead (balanced)
- AutoLRU: Automatic LRU eviction (best performance)

## API

- LazyModelLoader(memory_budget, enable_eviction)
- LayerStreamingContext for managed streaming
- create_streaming_context() factory function

## Usage

```python
loader = LazyModelLoader(memory_budget=8 * 1024**3)
loader.load_file("model.safetensors")
with LayerStreamingContext(loader, SlidingWindow(4), num_layers=32) as ctx:
    for i in range(32):
        ctx.prepare(i)
        hidden = layers[i](hidden)
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
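
To make the memory-budget and LRU-eviction behaviour concrete, here is a small pure-Python sketch. It is not the Rust `LazyTensor` implementation: `LruTensorCache` and `load_fn` are hypothetical, and byte strings stand in for GPU buffers.

```python
from collections import OrderedDict


class LruTensorCache:
    """Keeps loaded tensors under a byte budget, evicting least recently used first."""

    def __init__(self, memory_budget, load_fn):
        self.memory_budget = memory_budget
        self.load_fn = load_fn         # name -> bytes payload (stand-in for a GPU upload)
        self.resident = OrderedDict()  # name -> payload, least recently used first
        self.used = 0
        self.evictions = []

    def get(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name]
        payload = self.load_fn(name)
        # Evict until the new payload fits. A payload larger than the whole
        # budget is still admitted once the cache is empty.
        while self.resident and self.used + len(payload) > self.memory_budget:
            old_name, old_payload = self.resident.popitem(last=False)
            self.used -= len(old_payload)
            self.evictions.append(old_name)
        self.resident[name] = payload
        self.used += len(payload)
        return payload


cache = LruTensorCache(memory_budget=8, load_fn=lambda name: bytes(4))
for name in ["layer0", "layer1", "layer0", "layer2"]:
    cache.get(name)
```

In this run `layer1` is evicted rather than `layer0`, because the second access to `layer0` refreshed its position before `layer2` needed space.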

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused imports (F401)
- Fix f-string without placeholders (F541)
- Organize imports (I001)
- Remove unnecessary mode argument (UP015)
- Fix redefinition of unused import (F811)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tests that require the native CUDA module are now skipped when running in a CI environment without GPU support.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements complete diffusion model support for text-to-image generation:

Models:
- DiT (Diffusion Transformer) with AdaLN conditioning
- SD3Transformer (MMDiT architecture)
- FluxTransformer with guidance embedding
- VAE encoder/decoder with SafeTensors loading

Schedulers:
- EulerDiscreteScheduler (SDXL-style)
- DDIMScheduler (deterministic/stochastic)
- FlowMatchingScheduler (Rectified Flow for SD3/Flux)

Operations:
- GroupNorm (CPU fallback)
- Cross-Attention (non-causal)
- Conv2D / Conv2DTranspose (im2col)
- AdaLN / AdaLN-Zero
- Sinusoidal timestep embedding

Text Encoders:
- CLIPTextEncoder (OpenCLIP-style)
- T5Encoder (T5-XXL for SD3/Flux)

Pipeline:
- Text2ImagePipeline with unified interface
- Demo mode (works without model weights)
- Batch generation support

Example:
- examples/image_generate.py with CLI interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix variable shadowing issue where input_ids/attention_mask were first defined as lists then reassigned to numpy arrays, confusing mypy.

- Add explicit type annotations for input_ids and attention_mask
- Rename intermediate list variables to ids_list and mask_list

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implement CUDA kernels for diffusion model operations:

- GroupNorm: F32/BF16/FP16 variants for VAE/UNet
- AdaLN/AdaLN-Zero: Adaptive Layer Normalization for DiT
- Cross-Attention: Non-causal attention for text-to-image
- Conv2D: im2col, col2im, 1x1 and 3x3 direct convolutions

Files added:

- native/ops/nn/diffusion/: groupnorm, adaln, cross_attention, conv2d kernels
- native/bindings/nn/diffusion.cpp: pybind11 bindings

Python ops updated to use native kernels when available:

- group_norm.py, adaln.py, cross_attention.py, conv2d.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
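
For reference, the computation a GroupNorm kernel performs can be written in a few lines of plain Python. This sketch omits the learned per-channel weight and bias and works on a `[C][N]` nested list (channels by flattened spatial positions); it is a reference for the math only, not the CUDA kernel from this commit.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize each group of channels to zero mean and unit variance."""
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("channels must divide evenly into groups")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        # Statistics are shared across all channels and positions in the group.
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in group])
    return out


x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
y = group_norm(x, num_groups=2)
```

Because each group is normalized independently, the second group (values 10 to 40) ends up on the same scale as the first despite being ten times larger.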

- Fix out_channels from 4 to 8 for PixArt-Sigma (noise + variance)
- Add transformer subdirectory detection for HuggingFace diffusers format
- Add sharded T5 encoder detection with fallback to random embeddings
- Extract first 4 channels from 8-channel noise prediction

Tested with PixArt-Sigma-XL-2-512-MS:

- 10 steps in 24.49s (2.449s/step)
- Output: output/pixart_test.png

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
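
The channel extraction described above amounts to a slice along the channel axis. A minimal sketch, assuming a `[C][H][W]` nested-list layout with the noise prediction in the first half of the channels; `split_noise_and_variance` is a hypothetical helper name.

```python
def split_noise_and_variance(pred, latent_channels=4):
    """Split a [2 * latent_channels][H][W] prediction into (noise, variance)."""
    if len(pred) != 2 * latent_channels:
        raise ValueError(
            f"expected {2 * latent_channels} channels, got {len(pred)}"
        )
    return pred[:latent_channels], pred[latent_channels:]


# Eight 1x1 channels whose single value is the channel index.
pred = [[[float(c)]] for c in range(8)]
noise_pred, variance_pred = split_noise_and_variance(pred)
```

Only `noise_pred` feeds the scheduler step in this setup; the variance half is discarded at inference time.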

…support

- Add HFT5Encoder class using transformers library for proper T5 encoding
- Support sharded safetensors loading via Python safetensors library
- Auto-detect tokenizer in parent/tokenizer directory
- CPU fallback when PyTorch doesn't support GPU (e.g., RTX 5090)
- Update pipeline to prefer HFT5Encoder over simple T5Encoder

Tested with PixArt-Sigma + T5-XXL:

- T5 encoder on CPU (PyTorch lacks SM120 support)
- Diffusion model on GPU via PyGPUkit
- 20 steps in 55.9s (2.795s/step)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add _batched_matmul_loop() for when CUTLASS fails (SM120)
- Use batched_matmul in T5 self-attention (80s -> 30s)
- Remove HFT5Encoder (PyTorch dependency)
- T5 now uses native GPU matmul operations

Performance (RTX 5090, SM120):

- T5-XXL encoding: 80s -> 30s (2.7x speedup)
- batched_matmul [64,512,64]@[64,64,512]: 45ms

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
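
The loop fallback mentioned above presumably issues one 2D matmul per batch entry when the fused batched kernel is unavailable. The sketch below shows that shape contract on nested lists; it is an assumption about the approach, not the body of `_batched_matmul_loop()`.

```python
def matmul_2d(a, b):
    """Multiply an [M][K] matrix by a [K][N] matrix."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def batched_matmul_loop(a, b):
    """a: [B][M][K], b: [B][K][N] -> [B][M][N], one 2D matmul per batch entry."""
    if len(a) != len(b):
        raise ValueError("batch sizes differ")
    return [matmul_2d(a_i, b_i) for a_i, b_i in zip(a, b)]


a = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
b = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
c = batched_matmul_loop(a, b)
```

The trade-off is one kernel launch per batch entry instead of a single fused launch, which is slower than a true batched GEMM but needs only a working 2D matmul.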

Implements FLUX.1-schnell text-to-image generation:

- FluxTransformer with 19 joint + 38 single blocks
- Joint attention (image-text cross-attention)
- Single attention (self-attention on concatenated sequence)
- Flow matching Euler scheduler
- GPU-native ops for linear, transpose, matmul, softmax

Optimizations:

- GPU-native transpose_4d_0213 (18x faster than numpy)
- GPU-native transpose_3d_012 for K^T (22x faster)
- RoPE frequency caching to avoid recomputation

Known limitations:

- Modulation, layer_norm, gated_residual use numpy fallback
- Generation time ~420s (vs ~3s diffusers) - needs broadcast kernels

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
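
RoPE frequency caching can be illustrated with `functools.lru_cache`: the cos/sin tables depend only on sequence length, head dimension, and base, so they can be computed once and reused across blocks and steps. This sketch uses generic 1D RoPE with adjacent-pair rotation and a base of 10000, which are assumptions; the FLUX implementation in this PR may lay out positions and pairs differently.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def rope_tables(seq_len, head_dim, theta=10000.0):
    """Return (cos, sin) tables of shape [seq_len][head_dim // 2], cached per key."""
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    cos = tuple(tuple(math.cos(p * f) for f in inv_freq) for p in range(seq_len))
    sin = tuple(tuple(math.sin(p * f) for f in inv_freq) for p in range(seq_len))
    return cos, sin


def apply_rope(vec, pos, cos, sin):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by the angle for `pos`."""
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = cos[pos][i], sin[pos][i]
        out += [a * c - b * s, a * s + b * c]
    return out


cos, sin = rope_tables(4, 8)
cos_again, sin_again = rope_tables(4, 8)  # served from the cache
q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
q_pos0 = apply_rope(q, 0, cos, sin)
q_pos3 = apply_rope(q, 3, cos, sin)
```

Position 0 leaves the vector unchanged, and every other position is a pure rotation, so vector norms are preserved.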

- Remove unused N variable in dit/model.py
- Fix unused conditioning variable in dit/adaln.py
- Remove unused imports in flux/blocks.py
- Remove unused x_np in flux/model.py
- Add DiT transformer components (PixArt architecture)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The files use #pragma once but had orphaned #endif statements causing compilation errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

GPUArray uses nbytes() method, not size_bytes(). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Use the project's device_memset wrapper for CUDA API abstraction. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan and others added 13 commits December 31, 2025 10:44

fix(lint): resolve ruff B027 and UP037 errors in streaming.py

aa8fd7e

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

style: apply ruff format to streaming.py

c82155a

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan changed the title ~~feat(diffusion): add image generation module for SD3, Flux, PixArt~~ feat(diffusion): add FLUX.1 image generation module Jan 1, 2026

m96-chan self-assigned this Jan 1, 2026

m96-chan and others added 3 commits January 2, 2026 03:57

fix(cmake): remove orphaned #endif in diffusion kernels

5006afa

The files use #pragma once but had orphaned #endif statements causing compilation errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix(cmake): use nbytes() instead of size_bytes() in diffusion.inl

aa61015

GPUArray uses nbytes() method, not size_bytes(). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix(cmake): use device_memset wrapper instead of cudaMemset

5a8a98c

Use the project's device_memset wrapper for CUDA API abstraction. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan merged commit 0b5c13e into main Jan 1, 2026
13 checks passed

m96-chan changed the title ~~feat(diffusion): add FLUX.1 image generation module~~ Release v0.2.19 Jan 1, 2026

m96-chan mentioned this pull request Jan 1, 2026

perf(diffusion): FLUX.1 transformer performance optimization #187

Open

8 tasks
