feat: GPU-native CN TRT preprocessors + TRT perf phases 1-3 + CN residual caching by forkni · Pull Request #20 · dotsimulate/StreamDiffusion

forkni · 2026-06-13T02:38:23Z

Changes

10084bf perf: eliminate per-frame D2H syncs + surface silent PIL round-trips in preprocessors
e26843b fix: harden UNet cache key + dedicated TRT preprocessor stream
43b3070 feat: wire --save-goldens to ab_bench.py
e72e746 perf: Phase 3 TRT buffer LRU cache, direct GPU alloc, nvJPEG GPU decode
ac4adf6 perf: Phase 2 IPA event-sync+cache, ESRGAN lock shrink, CN antialias
cb69a49 perf: Phase 1 quick wins + Step 0 benchmark harness
78a343c test: add manual GPU smoke test for self-building TRT preprocessors
f20d70c fix: align web demo CN preprocessors (scribble_tensorrt, tile passthrough)
2d6ab7a fix: suppress TRT logger-mismatch warning in all build and load paths
a456bb2 fix: harden CN TRT preprocessors — NormalBae fallback init, dynamic shape alignment, error diagnostics, profile support
30a3458 feat: add GPU-native CN preprocessors (hed/scribble/normal_bae TRT), residency guard, autopreprocess registry
52b54c7 feat: add scribble preprocessor, GPU-native standard_lineart, lazy RealESRGAN init
251066a refactor: share one TRT build-log filter to drop logger-mismatch warning
131c211 fix: filter benign TRT myelin tactic-skip spam from VAE engine builds
816b9fe docs: add cn_cache_interval to td_config example (backend already wired)
68535fc perf: add --cn-cache-interval flag to offline benchmark
f94f1c9 perf: add CN residual caching with live cn_cache_interval control
d194bdc fix: bypass offline orchestrator in CN benchmark activation
76a77f0 perf: add ControlNet profiler regions and CN-enabled benchmark mode

Branch

feat/cn-trt-preprocessors-trt-perf -> SDTD_031_dev

- controlnet_module.py: import gpu_profiler; wrap per-CN prep in profiler.region('cn.prep') and the engine forward call (TRT + PyTorch paths) in profiler.region('cn.forward'). Both are no-ops when GPU_PROFILER is unset.
- profile_nsys.py: add --cn-scale FLOAT flag. When > 0, it activates the first registered ControlNet at that scale with a dummy gray image via update_control_image/update_controlnet_scale, so the offline benchmark can test the ~13ms/frame CN cost hypothesis (run A = baseline, run B = --cn-scale 0.35).

update_control_image_efficient returns early when _preprocessing_orchestrator is None (offline benchmark mode), so the benchmark never activated the ControlNet. Fix: directly inject a dummy fp16 CUDA tensor into controlnet_images[0] so the hook's 'img is not None' gate passes. Result: cn.forward now measures correctly at ~12ms P50. The 13ms wall-clock Δ matches the live-TD FPS drop from 27 to 20 FPS (1000/20 - 1000/27 ≈ 13 ms per frame).

interval=1 (default): disabled, CN runs every frame (no behaviour change).
interval=N: CN forward runs once, residuals reused for N-1 subsequent frames.

Invalidation keys: _images_version (bumped by image updates) + scale tuple hash (checked per-frame to handle direct controlnet_scales[i] writes that bypass the setter, e.g. from the stream_parameter_updater live-update path).

Changes:
- controlnet_module.py: 5 cache fields in __init__ (already applied), install() reset, set_cn_cache_interval() setter, cache hit/write in _unet_hook.
- config.py: cn_cache_interval param_map (default 1).
- wrapper.py: cn_cache_interval kwarg in __init__/_load_model/update_stream_params.
- stream_parameter_updater.py: cn_cache_interval in update_stream_params, delegates to _get_controlnet_pipeline().set_cn_cache_interval().

Local-only TD plumbing (td_manager.py, td_osc_handler.py, StreamDiffusionExt__td.py) also updated but not committed (gitignored files).
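
The interval and invalidation rules above can be sketched as a small standalone cache. This is an illustration under assumptions, not the code in controlnet_module.py: the class name `ResidualCache`, its fields, and the `compute` callback are hypothetical, and plain Python values stand in for the residual tensors.

```python
class ResidualCache:
    """Reuse a computed value for up to `interval` frames unless inputs change."""

    def __init__(self, interval=1):
        self.interval = max(1, int(interval))
        self._key = None     # (images_version, hash of scale tuple)
        self._value = None   # cached residuals
        self._age = 0        # frames served from cache since the last compute

    def get(self, images_version, scales, compute):
        # The key is rebuilt every frame, so direct writes to the scale list
        # that bypass any setter are still detected.
        key = (images_version, hash(tuple(scales)))
        hit = (
            self.interval > 1
            and self._value is not None
            and key == self._key
            and self._age < self.interval - 1
        )
        if hit:
            self._age += 1
            return self._value
        # Miss: run the (expensive) forward pass and restart the window.
        self._value = compute()
        self._key = key
        self._age = 0
        return self._value


cache = ResidualCache(interval=3)
residuals = cache.get(images_version=0, scales=[0.35], compute=lambda: "residuals")
```

With interval=3 the forward pass runs on frame 1 and the result serves frames 2 and 3; a new image version or a changed scale forces a recompute immediately.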

…alESRGAN init

A) New ScribblePreprocessor subclasses HEDPreprocessor with scribble=True; registered as 'scribble' in core registry + __all__; two missing xinsir SDXL CNs (depth_xinsir_sdxl, scribble_sdxl) added to controlnet_registry.yaml.
B) standard_lineart: extract _compute_lineart_hwc helper, add _process_tensor_core override that stays on CUDA end-to-end (no PIL round-trip); numerically identical to PIL path (diff=0.0).
C) RealESRGANProcessor: remove eager _ensure_model_ready() from __init__; add _model_ready flag + double-checked _ensure_loaded_once(); constructor now takes 1ms (was blocking on download + TRT build).
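
The double-checked lazy initialisation in item C can be sketched as follows. The `_model_ready` flag and `_ensure_loaded_once()` name come from the description above; the `LazyModel` class, the `loader` callback, and the use of `threading.Lock` are assumptions made for illustration.

```python
import threading


class LazyModel:
    """Defer an expensive load from the constructor to the first call."""

    def __init__(self, loader):
        self._loader = loader        # hypothetical expensive factory (download, build)
        self._model = None
        self._model_ready = False
        self._lock = threading.Lock()

    def _ensure_loaded_once(self):
        if self._model_ready:        # fast path: no lock once loaded
            return
        with self._lock:
            if not self._model_ready:    # re-check: another thread may have won
                self._model = self._loader()
                self._model_ready = True  # set last, after the model is assigned

    def __call__(self, x):
        self._ensure_loaded_once()
        return self._model(x)


upscale = LazyModel(lambda: (lambda x: x * 2))  # constructor returns immediately
result = upscale(3)                              # load happens here, exactly once
```

The second check under the lock is what prevents two threads that both saw `_model_ready == False` from loading the model twice.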

- _BuildLogFilter: add _BENIGN_WARN tuple to drop 'logger passed into createInferBuilder differs' notices; track in suppressed_warn counter; emit one-line summary at end of FP16 and FP8 build paths.
- _ensure_build_logger_registered(): new idempotent helper that creates a throwaway trt.Builder(BUILD_TRT_LOGGER) once at import time so BUILD_TRT_LOGGER wins the global singleton slot before polygraphy's TRT_LOGGER (load path) or any fresh trt.Logger() (standalone tools) can register first. Guards against no-CUDA / TRT init failure with try/except.
- compile_depth_anything_tensorrt / compile_raft_tensorrt: import and use BUILD_TRT_LOGGER instead of constructing fresh trt.Logger() objects, eliminating leak site 3 at the source.
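
The drop-and-count pattern behind _BuildLogFilter can be illustrated with the standard library's logging module. This is an analogy only: the PR's filter sits on the TensorRT build logger, whose interface is not shown here, so the `logging.Filter` subclass and the `summary()` helper below are assumptions.

```python
import logging

# Substrings of messages considered benign (taken from the description above).
_BENIGN_WARN = ("logger passed into createInferBuilder differs",)


class BuildLogFilter(logging.Filter):
    """Drop known-benign warnings and count how many were suppressed."""

    def __init__(self):
        super().__init__()
        self.suppressed_warn = 0

    def filter(self, record):
        msg = record.getMessage()
        if any(fragment in msg for fragment in _BENIGN_WARN):
            self.suppressed_warn += 1
            return False  # swallow the record
        return True       # let everything else through

    def summary(self):
        # One-line summary intended for the end of a build.
        return f"suppressed {self.suppressed_warn} benign TRT warning(s)"


build_log = logging.getLogger("trt_build_demo")
build_filter = BuildLogFilter()
build_log.addFilter(build_filter)
```

Counting the suppressed lines and reporting them once keeps the filtering visible, so a sudden jump in the count is still noticeable.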

- Add examples/benchmark/ab_bench.py: warmup + timed frames, torch.cuda.Event timing, GPU_PROFILER region stats, JSON keyed by git-SHA + config-hash; supports bare-pipeline and full config (CN/IPA/ESRGAN) modes.
- pipeline.py: replace per-frame torch.cat stock_noise rotation with preallocated ping-pong buffers; precompute expanded timestep tensors for TCD/non-batched sequential loop (eliminates t.view(1).repeat per step).
- processors/base.py: eliminate hidden D2H sync in validate_tensor_input (.max() > 1.0 → dtype-based uint8 check before .to() cast); add profiler regions proc.core / proc.tensor_to_pil / proc.pil_to_tensor.
- sfast/__init__.py: gate enable_cuda_graph off when TRT UNet is active (detect via duck-type dump_profile attribute).
- ipadapter_embedding.py: add ipa.clip_encode / ipa.sync profiler regions.
- realesrgan_trt.py: add esrgan.infer / esrgan.sync profiler regions.

ipadapter_embedding.py:
- Replace blocking _ipadapter_stream.synchronize() with CUDA-event record + current_stream().wait_event(). This is a GPU-side dependency only; the CPU thread no longer stalls waiting for CLIP encode to finish.
- Add per-preprocessor embedding cache (_last_input_ptr / _cached_embeds): when the same style-image tensor is reused across streaming frames, CLIP re-encode and the GPU→CPU tensor_to_pil round-trip are skipped entirely.
- Lazy-allocate torch.cuda.Event (noted in an __init__ comment; actual allocation on first use).

realesrgan_trt.py (RealESRGANEngine.infer):
- Move set_tensor_address calls inside _inference_lock (they were incorrectly outside the lock, creating a race window if two threads called infer simultaneously).
- Remove torch.cuda.synchronize(): GPU stream ordering serialises the downstream .clamp()/.clone() calls that the caller enqueues on the same stream; the global CPU sync blocked the thread for the full TRT inference duration.
- Shrink profiler.region('esrgan.infer') to cover only the execute_async_v3 enqueue (no longer wraps the synchronize); 'esrgan.sync' region removed with the sync.

base.py (_ensure_target_size_tensor):
- Add antialias=True to F.interpolate bilinear resize. This low-pass filters ControlNet conditioning maps when downscaling (the bilinear kernel's support is widened by the scale factor, matching Pillow's downsampling); no-op on upscale.

trt_base.py (TensorRTEngine):
- allocate_buffers: torch.empty(...).to(device=device) → torch.empty(..., device=device). Eliminates CPU alloc + H2D copy for every engine I/O tensor at startup or after dynamic realloc.
- infer(): replace per-shape-change alloc-and-discard with a small LRU cache (_buf_cache, capacity 4).
  - Fast path (common streaming case: fixed resolution) is a dict-comprehension equality check with zero malloc overhead.
  - Slow path (resolution switch) checks the LRU before torch.empty: a cache hit restores pre-allocated GPU tensors and skips both malloc and cudaFree.
  - Cache miss falls through to the existing realloc logic and stores the result in the LRU. Evicts the LRU entry (OrderedDict.popitem(last=False)) when over capacity.

demo/realtime-img2img/util.py (bytes_to_pt):
- decode_jpeg(byte_tensor, device='cuda') when CUDA is available. Routes through torchvision nvJPEG, which decodes the JPEG payload directly to a CUDA tensor; eliminates the CPU decode + H2D DMA transfer on every input frame.
- Fallback to CPU decode_jpeg on non-CUDA machines (unchanged behaviour).

item 8 (static_shapes): already implemented via config.py:138 'static_shapes' flag wired through builder.py + models.py get_minmax_dims; verification task is to run the existing guard test with static_shapes: true on the GPU box.
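
The fast-path / LRU-hit / miss-and-evict flow described for infer() can be sketched with an OrderedDict. This is a simplified illustration, not the TensorRTEngine code: the `BufferLRU` class, its `ensure` method, and the `alloc` callback are hypothetical, and bytearrays stand in for GPU tensors so the example runs without CUDA.

```python
from collections import OrderedDict


class BufferLRU:
    """Keep I/O buffers for the most recently used shape sets."""

    def __init__(self, alloc, capacity=4):
        self._alloc = alloc           # hypothetical allocator: shape -> buffer
        self._capacity = capacity
        self._cache = OrderedDict()   # shape key -> {name: buffer}, oldest first
        self._shapes = None           # key of the currently bound buffers
        self.buffers = None

    def ensure(self, shapes):
        key = tuple(sorted((name, tuple(shape)) for name, shape in shapes.items()))
        if key == self._shapes:       # fast path: shapes unchanged, no allocation
            return self.buffers
        if key in self._cache:        # slow path, hit: restore existing buffers
            self._cache.move_to_end(key)
            self.buffers = self._cache[key]
        else:                         # miss: allocate, store, evict if over capacity
            self.buffers = {name: self._alloc(shape) for name, shape in key}
            self._cache[key] = self.buffers
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # drop least recently used
        self._shapes = key
        return self.buffers


lru = BufferLRU(lambda shape: bytearray(shape[0] * shape[1]), capacity=4)
bufs = lru.ensure({"input": (512, 512)})
```

A capacity of 4 keeps memory bounded while still covering a user toggling between a handful of resolutions.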

Adds two new flags to the run() entrypoint:
  --save-goldens         capture first N output frames and write them as <sha>_<cfg_hash>_golden_NN.png alongside the JSON
  --n-golden-frames N    how many frames to capture (default 5)

Internally _run_loop now accepts n_capture and returns a (frame_times, captured) tuple; _to_pil() converts PIL / numpy / torch.Tensor outputs to a saveable PIL Image with fallback warnings for unknown types.

Usage on GPU box after the antialias or any visual-output change:
  # before commit:
  python examples/benchmark/ab_bench.py --save-goldens
  # after commit:
  python examples/benchmark/ab_bench.py --save-goldens
  # diff: open <sha_before>_*_golden_00.png vs <sha_after>_*_golden_00.png
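
One way the `<sha>_<cfg_hash>_golden_NN.png` names could be produced is sketched below. Only the filename pattern comes from the description above; the hashing scheme (SHA-256 over sorted-key JSON, truncated to 8 hex characters) and the function names are assumptions, and ab_bench.py may derive its config hash differently.

```python
import hashlib
import json


def config_hash(cfg, length=8):
    """Stable short hash of a config dict, independent of key order."""
    blob = json.dumps(cfg, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:length]


def golden_name(sha, cfg, index):
    """Build a golden-frame filename following <sha>_<cfg_hash>_golden_NN.png."""
    return f"{sha}_{config_hash(cfg)}_golden_{index:02d}.png"


cfg = {"cn_scale": 0.35, "steps": 2}  # illustrative config values
name = golden_name("10084bf", cfg, 0)
```

Because the config hash is identical before and after a commit when the config is unchanged, the two sets of goldens differ only in the SHA prefix and can be paired by index.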

Finding A (engine_manager.py + wrapper.py):
- get_engine_path gains build_static_batch param; UNET prefix now appends --sbatch{0|1} so a static-batch and a dynamic-batch engine are never silently co-located in the same directory. Stale dynamic engine reuse (the cause of the l2tc VALIDATE FAIL warning) is prevented without requiring a manual cache wipe.
- UNet get_engine_path call in wrapper passes build_static_batch=True to match the hardcoded build opts at wrapper.py:2183-2184.
- Fix misleading log (was reporting self.static_shapes for the UNet, which the build opts ignore; now reports actual build_static_batch / build_dynamic_shape values).

Finding B (trt_base.py):
- TensorRTEngine.activate() creates a dedicated torch.cuda.Stream instead of caching current_stream().cuda_stream. execute_async_v3 now runs on the dedicated stream, eliminating the per-frame implicit cudaStreamSynchronize TRT inserts when stream handle 0x0 is used.
- Event-based cross-stream sync added in infer(): _pre_exec_event gates the dedicated stream on the preceding copy_() calls (run on current stream); _post_exec_event gates the current stream on execute completion before _postprocess reads the output tensors. record_stream marks output buffers live on the current stream for allocator safety.
- _process_tensor_core stops forcing current_stream().cuda_stream onto infer(); the engine uses its own _cuda_stream (dedicated stream handle).
- Fixes scribble, HED, depth, and pose preprocessors simultaneously (shared TensorRTEngine base). temporal_net and realesrgan have their own engine wrappers and are not affected by this change.
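
The cache-key hardening in Finding A amounts to encoding every build option that changes the engine into its path. A minimal sketch, assuming a directory layout and a function name that are purely illustrative (only the `--sbatch{0|1}` suffix is taken from the description above):

```python
from pathlib import PurePosixPath


def unet_engine_dir(root, model_id, build_static_batch):
    """Return an engine directory whose name encodes the static-batch option."""
    # Hypothetical prefix layout; the real get_engine_path encodes more options.
    prefix = f"unet--{model_id}--sbatch{int(bool(build_static_batch))}"
    return PurePosixPath(root) / prefix


static_dir = unet_engine_dir("engines", "sd-turbo", build_static_batch=True)
dynamic_dir = unet_engine_dir("engines", "sd-turbo", build_static_batch=False)
```

Since the two variants resolve to different directories, an engine built with one setting can no longer be picked up by a run that expects the other.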

…in preprocessors

Tier 1 (standard_lineart.py): Replace the torch.any / boolean-index / float(torch.median) pattern in _compute_lineart_hwc with torch.nanmedian over a where-masked tensor. normalization_factor is now a 0-dim CUDA tensor, so there is no .item() / host sync per frame. Numerics identical: nanmedian over unmasked pixels == median(intensity[threshold_mask]); empty-mask case still floors to 16 via nan_to_num.

Tier 2 (depth.py, hed.py): Both override _process_tensor_core but still round-trip through PIL every frame (HF depth pipeline / controlnet_aux require PIL input), so the base.py one-time [GPU-residency] warning never fired. Import _pil_fallback_warned + _base_logger from base.py and emit the warning. Points users to depth_tensorrt / hed_tensorrt. No behavior change.

Also: fix pre-existing ruff F821 in ab_bench.py by adding a TYPE_CHECKING guard for PIL.Image and dropping redundant explicit string quotes (from __future__ import annotations already defers all annotations).

forkni added 19 commits June 12, 2026 22:12


forkni wants to merge 19 commits into SDTD_031_dev from feat/cn-trt-preprocessors-trt-perf
