feat: GPU-native CN TRT preprocessors + TRT perf phases 1-3 + CN residual caching #20
Open
forkni wants to merge 19 commits into
- controlnet_module.py: import gpu_profiler; wrap per-CN prep in
profiler.region('cn.prep') and the engine forward call (TRT + PyTorch
paths) in profiler.region('cn.forward') — no-op when GPU_PROFILER unset
- profile_nsys.py: add --cn-scale FLOAT flag; when > 0 activates the
first registered ControlNet at that scale with a dummy gray image via
update_control_image/update_controlnet_scale so the offline benchmark
can measure the ~13ms/frame CN cost hypothesis (run A=baseline,
run B=--cn-scale 0.35)
update_control_image_efficient bails when _preprocessing_orchestrator is None (offline benchmark mode). Directly inject a dummy fp16 CUDA tensor into controlnet_images[0] so the hook's 'img is not None' gate passes. Result: cn.forward now measures correctly at ~12ms P50 (13ms wall-clock Δ matching the live-TD FPS drop from 27→20FPS).
interval=1 (default): disabled, CN runs every frame (no behaviour change).
interval=N: CN forward runs once, residuals reused for N-1 subsequent frames.
Invalidation keys: _images_version (bumped by image updates) + scale tuple hash (checked per-frame to handle direct controlnet_scales[i] writes that bypass the setter, e.g. from stream_parameter_updater live-update path).
Changes:
- controlnet_module.py: 5 cache fields in __init__ (already applied), install() reset, set_cn_cache_interval() setter, cache hit/write in _unet_hook.
- config.py: cn_cache_interval param_map (default 1).
- wrapper.py: cn_cache_interval kwarg in __init__/_load_model/update_stream_params.
- stream_parameter_updater.py: cn_cache_interval in update_stream_params, delegates to _get_controlnet_pipeline().set_cn_cache_interval().
Local-only TD plumbing (td_manager.py, td_osc_handler.py, StreamDiffusionExt__td.py) also updated but not committed (gitignored files).
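The interval and invalidation rules can be sketched in plain Python as follows. The class name `ResidualCache`, its `get` signature, and the `compute` callback are hypothetical; the real logic lives in `_unet_hook` and may differ in detail. The sketch only assumes what the message states: a refresh every N frames, plus invalidation on an image-version bump or a change in the scale tuple.

```python
class ResidualCache:
    """Reuse a cached result for up to `interval - 1` frames after a refresh."""

    def __init__(self, interval=1):
        self.interval = max(1, int(interval))
        self._key = None    # (images_version, scales hash) at the last refresh
        self._value = None  # cached residuals (any object in this sketch)
        self._age = 0       # frames served from the current value, refresh included

    def get(self, images_version, scales, compute):
        # Hash the scales every frame so direct list writes are also detected.
        key = (images_version, hash(tuple(scales)))
        stale = (
            self.interval == 1            # caching disabled
            or self._value is None        # nothing cached yet
            or key != self._key           # image or scale changed
            or self._age >= self.interval # reuse budget exhausted
        )
        if stale:
            self._value = compute()
            self._key = key
            self._age = 1
        else:
            self._age += 1
        return self._value
```

With `interval=3`, `compute` runs on frame 1 and again on frame 4, unless the image version or any scale changes in between.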
…alESRGAN init
A) new ScribblePreprocessor subclasses HEDPreprocessor with scribble=True; registered as 'scribble' in core registry + __all__; two missing xinsir SDXL CNs (depth_xinsir_sdxl, scribble_sdxl) added to controlnet_registry.yaml
B) standard_lineart: extract _compute_lineart_hwc helper, add _process_tensor_core override that stays on CUDA end-to-end (no PIL round-trip); numerically identical to PIL path (diff=0.0)
C) RealESRGANProcessor: remove eager _ensure_model_ready() from __init__; add _model_ready flag + double-checked _ensure_loaded_once(); constructor now 1ms (was blocking download+TRT build)
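Item C describes a double-checked lazy load. A minimal sketch of that pattern is below; the `LazyModel` class, the `loader` callable, and the `_load_lock` field are illustrative assumptions, and only `_model_ready` and `_ensure_loaded_once` are names taken from the message.

```python
import threading


class LazyModel:
    """Defer an expensive load from the constructor to the first real use."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical expensive callable
        self._model = None
        self._model_ready = False
        self._load_lock = threading.Lock()

    def _ensure_loaded_once(self):
        if self._model_ready:            # fast path: no lock once loaded
            return
        with self._load_lock:
            if self._model_ready:        # re-check: another thread may have won
                return
            self._model = self._loader()
            self._model_ready = True     # set last, after the model is assigned

    def process(self, x):
        self._ensure_loaded_once()
        return self._model(x)
```

The constructor stays cheap because nothing is downloaded or built until `process` is first called, and concurrent first calls trigger a single load.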
…residency guard, autopreprocess registry
…hape alignment, error diagnostics, profile support
- _BuildLogFilter: add _BENIGN_WARN tuple to drop 'logger passed into createInferBuilder differs' notices; track in suppressed_warn counter; emit one-line summary at end of FP16 and FP8 build paths.
- _ensure_build_logger_registered(): new idempotent helper that creates a throwaway trt.Builder(BUILD_TRT_LOGGER) once at import time so BUILD_TRT_LOGGER wins the global singleton slot before polygraphy's TRT_LOGGER (load path) or any fresh trt.Logger() (standalone tools) can register first. Guards against no-CUDA / TRT init failure with try/except.
- compile_depth_anything_tensorrt / compile_raft_tensorrt: import and use BUILD_TRT_LOGGER instead of constructing fresh trt.Logger() objects, eliminating leak site 3 at the source.
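The suppress-and-count idea behind _BENIGN_WARN can be illustrated with the standard library's logging module. This is an assumption-laden stand-in: the PR's _BuildLogFilter may hook the TensorRT logger rather than Python logging, and the class name `BenignWarningFilter` and its `summary` method are invented for the example.

```python
import logging


class BenignWarningFilter(logging.Filter):
    """Drop known-harmless warnings and count how many were dropped."""

    # Substrings that mark a warning as benign (taken from the commit message).
    _BENIGN_WARN = ("logger passed into createInferBuilder differs",)

    def __init__(self):
        super().__init__()
        self.suppressed_warn = 0

    def filter(self, record):
        msg = record.getMessage()
        if record.levelno == logging.WARNING and any(
            needle in msg for needle in self._BENIGN_WARN
        ):
            self.suppressed_warn += 1
            return False  # returning False drops the record
        return True

    def summary(self):
        """One-line summary intended for the end of a build."""
        return f"suppressed {self.suppressed_warn} benign TensorRT warning(s)"
```

Attaching the filter to a handler keeps real warnings visible while the counter preserves evidence that something was hidden.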
- Add examples/benchmark/ab_bench.py: warmup+timed frames, torch.cuda.Event timing, GPU_PROFILER region stats, JSON keyed by git-SHA + config-hash; supports bare-pipeline and full config (CN/IPA/ESRGAN) modes
- pipeline.py: replace per-frame torch.cat stock_noise rotation with preallocated ping-pong buffers; precompute expanded timestep tensors for TCD/non-batched sequential loop (eliminates t.view(1).repeat per step)
- processors/base.py: eliminate hidden D2H sync in validate_tensor_input (.max() > 1.0 → dtype-based uint8 check before .to() cast); add profiler regions proc.core / proc.tensor_to_pil / proc.pil_to_tensor
- sfast/__init__.py: gate enable_cuda_graph off when TRT UNet is active (detect via duck-type dump_profile attribute)
- ipadapter_embedding.py: add ipa.clip_encode / ipa.sync profiler regions
- realesrgan_trt.py: add esrgan.infer / esrgan.sync profiler regions
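One way to key benchmark JSON by git SHA plus config hash is sketched below. The message does not say how ab_bench.py computes either part, so the SHA-256 choice, the 8-character truncation, the 7-character SHA prefix, and the function names are all assumptions.

```python
import hashlib
import json


def config_hash(config, length=8):
    """Hash a config dict so that key order does not change the result."""
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def result_key(git_sha, config):
    """Combine a (caller-supplied) git SHA with the config hash."""
    return f"{git_sha[:7]}_{config_hash(config)}"
```

Serializing with sorted keys means two runs with the same settings land under the same key even if the config was assembled in a different order.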
ipadapter_embedding.py:
- Replace blocking _ipadapter_stream.synchronize() with CUDA-event record +
current_stream().wait_event() — GPU-side dependency only; CPU thread no longer
stalls waiting for CLIP encode to finish
- Add per-preprocessor embedding cache (_last_input_ptr / _cached_embeds):
same style-image tensor reused across streaming frames skips CLIP re-encode
and the GPU→CPU tensor_to_pil round-trip entirely
- Lazy-allocate torch.cuda.Event: __init__ only carries a comment; the actual allocation happens on first use
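The per-preprocessor embedding cache can be pictured as a one-entry cache keyed on the input tensor's data pointer. In the sketch below, `EmbeddingCache` and the `encode` callback are hypothetical; `_last_input_ptr` and `_cached_embeds` are the field names from the message, and `data_ptr()` is the standard torch.Tensor method (any object exposing it works here).

```python
class EmbeddingCache:
    """Skip re-encoding when the same input buffer is passed again."""

    def __init__(self, encode):
        self._encode = encode          # hypothetical: the expensive CLIP encode
        self._last_input_ptr = None
        self._cached_embeds = None

    def get(self, image):
        ptr = image.data_ptr()         # address of the tensor's storage
        if ptr != self._last_input_ptr or self._cached_embeds is None:
            # New buffer (or first call): encode and remember the pointer.
            self._cached_embeds = self._encode(image)
            self._last_input_ptr = ptr
        return self._cached_embeds
```

A pointer comparison is cheap but cannot see in-place writes to the same buffer or an allocator handing the same address to a new tensor; the message does not say how the PR handles those cases.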
realesrgan_trt.py (RealESRGANEngine.infer):
- Move set_tensor_address calls inside _inference_lock (was incorrectly outside
the lock, creating a race window if two threads called infer simultaneously)
- Remove torch.cuda.synchronize() — GPU stream ordering serialises downstream
.clamp()/.clone() calls that the caller enqueues on the same stream; the
global CPU-sync blocked the thread for the full TRT inference duration
- Shrink profiler.region('esrgan.infer') to cover only execute_async_v3 enqueue
(no longer wraps the synchronize); 'esrgan.sync' region removed with the sync
base.py (_ensure_target_size_tensor):
- Add antialias=True to F.interpolate bilinear resize: applies a low-pass
(anti-aliasing) filter when downscaling ControlNet conditioning maps; no-op on upscale
trt_base.py (TensorRTEngine):
- allocate_buffers: torch.empty(...).to(device=device) → torch.empty(..., device=device); eliminates CPU alloc + H2D copy for every engine I/O tensor at startup or after dynamic realloc
- infer(): replace per-shape-change alloc-and-discard with a small LRU cache (_buf_cache, capacity 4). Fast path (common streaming case: fixed resolution) is a dict-comprehension equality check with zero malloc overhead. Slow path (resolution-switch) checks the LRU before torch.empty: a cache hit restores pre-allocated GPU tensors and skips both malloc and cudaFree. Cache miss falls through to the existing realloc logic and stores the result in the LRU. Evicts the LRU entry (OrderedDict.popitem(last=False)) when over capacity.
demo/realtime-img2img/util.py (bytes_to_pt):
- decode_jpeg(byte_tensor, device='cuda') when CUDA is available: routes through torchvision nvJPEG, which decodes the JPEG payload directly to a CUDA tensor; eliminates the CPU decode + H2D DMA transfer on every input frame
- Fallback to CPU decode_jpeg on non-CUDA machines (unchanged behaviour)
item 8 (static_shapes): already implemented via config.py:138 'static_shapes' flag wired through builder.py + models.py get_minmax_dims; verification task is to run the existing guard test with static_shapes: true on the GPU box.
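The fast-path / LRU / miss sequence described for infer() can be sketched without any GPU code. Here `BufferLRU` and the `allocate` callback are hypothetical stand-ins for the engine's buffer handling; only the `_buf_cache` name, the default capacity of 4, and the `OrderedDict.popitem(last=False)` eviction come from the message.

```python
from collections import OrderedDict


class BufferLRU:
    """Keep the buffers for the last few shape sets instead of reallocating."""

    def __init__(self, allocate, capacity=4):
        self._allocate = allocate      # hypothetical: shapes dict -> buffers
        self._capacity = capacity
        self._buf_cache = OrderedDict()
        self._active_key = None
        self._active = None

    def get(self, shapes):
        key = tuple(sorted((name, tuple(s)) for name, s in shapes.items()))
        if key == self._active_key:    # fast path: shapes unchanged
            return self._active
        if key in self._buf_cache:     # slow path, cache hit: reuse old buffers
            self._buf_cache.move_to_end(key)
            buffers = self._buf_cache[key]
        else:                          # cache miss: allocate and remember
            buffers = self._allocate(shapes)
            self._buf_cache[key] = buffers
            if len(self._buf_cache) > self._capacity:
                self._buf_cache.popitem(last=False)  # evict least recently used
        self._active_key, self._active = key, buffers
        return buffers
```

Switching back and forth between a handful of resolutions then allocates each buffer set once, and only a fifth distinct shape set forces an eviction.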
Adds two new flags to the run() entrypoint:
--save-goldens capture first N output frames and write them as
<sha>_<cfg_hash>_golden_NN.png alongside the JSON
--n-golden-frames N how many frames to capture (default 5)
Internally _run_loop now accepts n_capture and returns a (frame_times,
captured) tuple; _to_pil() converts PIL / numpy / torch.Tensor outputs to
a saveable PIL Image with fallback warnings for unknown types.
Usage on GPU box after the antialias or any visual-output change:
# before commit:
python examples/benchmark/ab_bench.py --save-goldens
# after commit:
python examples/benchmark/ab_bench.py --save-goldens
# diff: open <sha_before>_*_golden_00.png vs <sha_after>_*_golden_00.png
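The (frame_times, captured) contract and the golden file naming might look roughly like this. It is a sketch under assumptions: `run_loop`, `step`, and `golden_name` are illustrative names, host-side `time.perf_counter` stands in for the torch.cuda.Event timing the benchmark uses, and capturing from the timed frames (rather than warmup frames) is a guess.

```python
import time


def run_loop(step, n_warmup, n_frames, n_capture=0):
    """Run warmup frames, then time `n_frames` and keep the first `n_capture` outputs."""
    for _ in range(n_warmup):
        step()                         # warmup results are discarded
    frame_times, captured = [], []
    for i in range(n_frames):
        start = time.perf_counter()
        out = step()
        frame_times.append((time.perf_counter() - start) * 1e3)  # milliseconds
        if i < n_capture:
            captured.append(out)
    return frame_times, captured


def golden_name(sha, cfg_hash, index):
    """Build the <sha>_<cfg_hash>_golden_NN.png file name with a 2-digit index."""
    return f"{sha}_{cfg_hash}_golden_{index:02d}.png"
```

Because the SHA is part of the name, the before and after runs write to different files and can be compared side by side.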
Finding A (engine_manager.py + wrapper.py):
- get_engine_path gains build_static_batch param; UNET prefix now appends
--sbatch{0|1} so a static-batch and a dynamic-batch engine are never
silently co-located in the same directory. Stale dynamic engine reuse
(the cause of the l2tc VALIDATE FAIL warning) is prevented without
requiring a manual cache wipe.
- UNet get_engine_path call in wrapper passes build_static_batch=True to
match the hardcoded build opts at wrapper.py:2183-2184.
- Fix misleading log (was reporting self.static_shapes for the UNet, which
the build opts ignore; now reports actual build_static_batch / build_
dynamic_shape values).
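The effect of the --sbatch suffix is that the build mode becomes part of the cache path. A toy version is shown below; `unet_engine_dir`, its parameters, and the directory layout are hypothetical, and the real get_engine_path encodes many more build options in the prefix.

```python
from pathlib import Path


def unet_engine_dir(root, prefix, build_static_batch):
    """Append --sbatch0 or --sbatch1 so static and dynamic builds never share a directory."""
    suffix = f"--sbatch{int(bool(build_static_batch))}"
    return Path(root) / f"{prefix}{suffix}"
```

An engine built with dynamic batch then resolves to a different directory than a static-batch one, so a stale engine is simply not found instead of being loaded and failing validation.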
Finding B (trt_base.py):
- TensorRTEngine.activate() creates a dedicated torch.cuda.Stream instead
of caching current_stream().cuda_stream. execute_async_v3 now runs on
the dedicated stream, eliminating the per-frame implicit
cudaStreamSynchronize TRT inserts when stream handle 0x0 is used.
- Event-based cross-stream sync added in infer(): _pre_exec_event gates
the dedicated stream on the preceding copy_() calls (run on current
stream); _post_exec_event gates the current stream on execute completion
before _postprocess reads the output tensors. record_stream marks
output buffers live on the current stream for allocator safety.
- _process_tensor_core stops forcing current_stream().cuda_stream onto
infer(); the engine uses its own _cuda_stream (dedicated stream handle).
- Fixes scribble, HED, depth, and pose preprocessors simultaneously
(shared TensorRTEngine base). temporal_net and realesrgan have their
own engine wrappers and are not affected by this change.
…in preprocessors
Tier 1 (standard_lineart.py): Replace the torch.any / boolean-index / float(torch.median) pattern in _compute_lineart_hwc with torch.nanmedian over a where-masked tensor. normalization_factor is now a 0-dim CUDA tensor, so there is no .item() / host sync per frame. Numerics identical: nanmedian over unmasked pixels == median(intensity[threshold_mask]); empty-mask case still floors to 16 via nan_to_num.
Tier 2 (depth.py, hed.py): Both override _process_tensor_core but still round-trip through PIL every frame (HF depth pipeline / controlnet_aux require PIL input), so the base.py one-time [GPU-residency] warning never fired. Import _pil_fallback_warned + _base_logger from base.py and emit the warning. Points users to depth_tensorrt / hed_tensorrt. No behavior change.
Also: fix pre-existing ruff F821 in ab_bench.py: add TYPE_CHECKING guard for PIL.Image and drop redundant explicit string quotes (from __future__ import annotations already defers all annotations).
Changes
Branch
feat/cn-trt-preprocessors-trt-perf->SDTD_031_dev