
feat: GPU-native CN TRT preprocessors + TRT perf phases 1-3 + CN residual caching #20

Open
forkni wants to merge 19 commits into SDTD_031_dev from feat/cn-trt-preprocessors-trt-perf

forkni commented Jun 13, 2026

Collaborator

Changes

  • 10084bf perf: eliminate per-frame D2H syncs + surface silent PIL round-trips in preprocessors
  • e26843b fix: harden UNet cache key + dedicated TRT preprocessor stream
  • 43b3070 feat: wire --save-goldens to ab_bench.py
  • e72e746 perf: Phase 3 TRT buffer LRU cache, direct GPU alloc, nvJPEG GPU decode
  • ac4adf6 perf: Phase 2 IPA event-sync+cache, ESRGAN lock shrink, CN antialias
  • cb69a49 perf: Phase 1 quick wins + Step 0 benchmark harness
  • 78a343c test: add manual GPU smoke test for self-building TRT preprocessors
  • f20d70c fix: align web demo CN preprocessors (scribble_tensorrt, tile passthrough)
  • 2d6ab7a fix: suppress TRT logger-mismatch warning in all build and load paths
  • a456bb2 fix: harden CN TRT preprocessors — NormalBae fallback init, dynamic shape alignment, error diagnostics, profile support
  • 30a3458 feat: add GPU-native CN preprocessors (hed/scribble/normal_bae TRT), residency guard, autopreprocess registry
  • 52b54c7 feat: add scribble preprocessor, GPU-native standard_lineart, lazy RealESRGAN init
  • 251066a refactor: share one TRT build-log filter to drop logger-mismatch warning
  • 131c211 fix: filter benign TRT myelin tactic-skip spam from VAE engine builds
  • 816b9fe docs: add cn_cache_interval to td_config example (backend already wired)
  • 68535fc perf: add --cn-cache-interval flag to offline benchmark
  • f94f1c9 perf: add CN residual caching with live cn_cache_interval control
  • d194bdc fix: bypass offline orchestrator in CN benchmark activation
  • 76a77f0 perf: add ControlNet profiler regions and CN-enabled benchmark mode

Branch

feat/cn-trt-preprocessors-trt-perf -> SDTD_031_dev

forkni added 19 commits June 12, 2026 22:12
- controlnet_module.py: import gpu_profiler; wrap per-CN prep in
  profiler.region('cn.prep') and the engine forward call (TRT + PyTorch
  paths) in profiler.region('cn.forward') — no-op when GPU_PROFILER unset
- profile_nsys.py: add --cn-scale FLOAT flag; when > 0 activates the
  first registered ControlNet at that scale with a dummy gray image via
  update_control_image/update_controlnet_scale so the offline benchmark
  can measure the ~13ms/frame CN cost hypothesis (run A=baseline,
  run B=--cn-scale 0.35)
update_control_image_efficient bails when _preprocessing_orchestrator
is None (offline benchmark mode). Directly inject a dummy fp16 CUDA
tensor into controlnet_images[0] so the hook's 'img is not None' gate
passes. Result: cn.forward now measures correctly at ~12 ms P50 (a 13 ms
wall-clock delta, matching the live-TD FPS drop from 27 to 20 FPS).
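
As a quick sanity check on those figures, the per-frame time implied by a drop from 27 to 20 FPS can be computed directly. This is a worked calculation only, not repository code; `frame_time_ms` is a helper introduced here.

```python
def frame_time_ms(fps: float) -> float:
    """Per-frame wall-clock time in milliseconds at a given frame rate."""
    return 1000.0 / fps


delta_ms = frame_time_ms(20) - frame_time_ms(27)
print(f"{frame_time_ms(27):.1f} ms -> {frame_time_ms(20):.1f} ms, delta {delta_ms:.1f} ms")
```

The delta comes out just under 13 ms per frame, in line with the measured cn.forward cost.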
interval=1 (default): disabled, CN runs every frame (no behaviour change).
interval=N: CN forward runs once, residuals reused for N-1 subsequent frames.
Invalidation keys: _images_version (bumped by image updates) + scale tuple hash
(checked per-frame to handle direct controlnet_scales[i] writes that bypass
the setter, e.g. from stream_parameter_updater live-update path).

Changes:
- controlnet_module.py: 5 cache fields in __init__ (already applied), install()
  reset, set_cn_cache_interval() setter, cache hit/write in _unet_hook.
- config.py: cn_cache_interval param_map (default 1).
- wrapper.py: cn_cache_interval kwarg in __init__/_load_model/update_stream_params.
- stream_parameter_updater.py: cn_cache_interval in update_stream_params, delegates
  to _get_controlnet_pipeline().set_cn_cache_interval().

Local-only TD plumbing (td_manager.py, td_osc_handler.py, StreamDiffusionExt__td.py)
also updated but not committed (gitignored files).
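
The interval and invalidation rules above can be sketched as a small standalone class. This is an illustration under assumptions, not the code in controlnet_module.py: `ResidualCache`, its field names, and the `forward` callable are hypothetical, and the real hook stores its state on the module itself.

```python
from typing import Any, Callable, Optional, Tuple


class ResidualCache:
    """Reuse ControlNet residuals for up to `interval - 1` frames after a forward pass."""

    def __init__(self, interval: int = 1) -> None:
        self.interval = max(1, int(interval))
        self._residuals: Optional[Any] = None
        self._key: Optional[Tuple[int, int]] = None
        self._frames_since_forward = 0

    def get(self, images_version: int, scales: Tuple[float, ...],
            forward: Callable[[], Any]) -> Any:
        # Recomputed every frame so direct writes to the scales are noticed.
        key = (images_version, hash(tuple(scales)))
        hit = (
            self.interval > 1
            and self._residuals is not None
            and key == self._key
            and self._frames_since_forward < self.interval - 1
        )
        if hit:
            self._frames_since_forward += 1
            return self._residuals
        # Miss: interval == 1, the key changed, or the reuse budget is spent.
        self._residuals = forward()
        self._key = key
        self._frames_since_forward = 0
        return self._residuals
```

With `interval=3`, six consecutive frames with unchanged inputs trigger two forward passes; changing either the image version or any scale forces a fresh pass on the next frame.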
…alESRGAN init

A) new ScribblePreprocessor subclasses HEDPreprocessor with scribble=True; registered as 'scribble'
   in core registry + __all__; two missing xinsir SDXL CNs (depth_xinsir_sdxl, scribble_sdxl)
   added to controlnet_registry.yaml
B) standard_lineart: extract _compute_lineart_hwc helper, add _process_tensor_core override that
   stays on CUDA end-to-end (no PIL round-trip); numerically identical to PIL path (diff=0.0)
C) RealESRGANProcessor: remove eager _ensure_model_ready() from __init__; add _model_ready flag
   + double-checked _ensure_loaded_once(); constructor now 1ms (was blocking download+TRT build)
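
The double-checked pattern in (C) looks roughly like this. It is a minimal sketch assuming a generic `loader` callable; `LazyModel` and `_load_lock` are names made up here, and only `_model_ready` and `_ensure_loaded_once` come from the commit message.

```python
import threading


class LazyModel:
    """Defers an expensive load from the constructor to first use."""

    def __init__(self, loader):
        self._loader = loader           # hypothetical callable: download + build
        self._model = None
        self._model_ready = False
        self._load_lock = threading.Lock()

    def _ensure_loaded_once(self):
        if self._model_ready:           # fast path: no lock once loaded
            return
        with self._load_lock:
            if self._model_ready:       # second check: another thread may have won
                return
            self._model = self._loader()
            self._model_ready = True    # set last, after the model is assigned

    def process(self, x):
        self._ensure_loaded_once()
        return self._model(x)
```

The constructor does no work beyond storing the callable, so construction stays cheap and the load cost is paid by whichever caller reaches `process` first.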
…hape alignment, error diagnostics, profile support
- _BuildLogFilter: add _BENIGN_WARN tuple to drop 'logger passed into
  createInferBuilder differs' notices; track in suppressed_warn counter;
  emit one-line summary at end of FP16 and FP8 build paths.
- _ensure_build_logger_registered(): new idempotent helper that creates a
  throwaway trt.Builder(BUILD_TRT_LOGGER) once at import time so BUILD_TRT_LOGGER
  wins the global singleton slot before polygraphy's TRT_LOGGER (load path)
  or any fresh trt.Logger() (standalone tools) can register first. Guards
  against no-CUDA / TRT init failure with try/except.
- compile_depth_anything_tensorrt / compile_raft_tensorrt: import and use
  BUILD_TRT_LOGGER instead of constructing fresh trt.Logger() objects,
  eliminating leak site 3 at the source.
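
The suppress-and-count idea behind _BuildLogFilter can be illustrated with Python's standard logging module. This is an analogue under assumptions: the repository's filter may be built on TensorRT's own logger interface instead, and `BenignWarningFilter` and `summary` are hypothetical names. Only the `_BENIGN_WARN` tuple, the `suppressed_warn` counter, and the quoted message fragment come from the commit message.

```python
import logging


class BenignWarningFilter(logging.Filter):
    """Drops known-benign warnings and counts how many were suppressed."""

    # Substrings treated as benign; this one is quoted from the commit message.
    _BENIGN_WARN = ("logger passed into createInferBuilder differs",)

    def __init__(self):
        super().__init__()
        self.suppressed_warn = 0

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        if record.levelno == logging.WARNING and any(s in message for s in self._BENIGN_WARN):
            self.suppressed_warn += 1
            return False  # drop the record
        return True       # everything else passes through unchanged

    def summary(self) -> str:
        """One-line summary, intended to be emitted once at the end of a build."""
        return f"suppressed {self.suppressed_warn} benign warning(s)"
```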
- Add examples/benchmark/ab_bench.py: warmup+timed frames, torch.cuda.Event
  timing, GPU_PROFILER region stats, JSON keyed by git-SHA + config-hash;
  supports bare-pipeline and full config (CN/IPA/ESRGAN) modes
- pipeline.py: replace per-frame torch.cat stock_noise rotation with
  preallocated ping-pong buffers; precompute expanded timestep tensors
  for TCD/non-batched sequential loop (eliminates t.view(1).repeat per step)
- processors/base.py: eliminate hidden D2H sync in validate_tensor_input
  (.max() > 1.0 → dtype-based uint8 check before .to() cast); add
  profiler regions proc.core / proc.tensor_to_pil / proc.pil_to_tensor
- sfast/__init__.py: gate enable_cuda_graph off when TRT UNet is active
  (detect via duck-type dump_profile attribute)
- ipadapter_embedding.py: add ipa.clip_encode / ipa.sync profiler regions
- realesrgan_trt.py: add esrgan.infer / esrgan.sync profiler regions
ipadapter_embedding.py:
- Replace blocking _ipadapter_stream.synchronize() with CUDA-event record +
  current_stream().wait_event() — GPU-side dependency only; CPU thread no longer
  stalls waiting for CLIP encode to finish
- Add per-preprocessor embedding cache (_last_input_ptr / _cached_embeds):
  same style-image tensor reused across streaming frames skips CLIP re-encode
  and the GPU→CPU tensor_to_pil round-trip entirely
- Lazy-allocate the torch.cuda.Event: __init__ only notes the field in a comment;
  the actual allocation happens on first use

realesrgan_trt.py (RealESRGANEngine.infer):
- Move set_tensor_address calls inside _inference_lock (was incorrectly outside
  the lock, creating a race window if two threads called infer simultaneously)
- Remove torch.cuda.synchronize() — GPU stream ordering serialises downstream
  .clamp()/.clone() calls that the caller enqueues on the same stream; the
  global CPU-sync blocked the thread for the full TRT inference duration
- Shrink profiler.region('esrgan.infer') to cover only execute_async_v3 enqueue
  (no longer wraps the synchronize); 'esrgan.sync' region removed with the sync

base.py (_ensure_target_size_tensor):
- Add antialias=True to the F.interpolate bilinear resize: applies a low-pass
  filter (the bilinear kernel widened by the downscale factor) when downscaling
  ControlNet conditioning maps; no-op on upscale
trt_base.py (TensorRTEngine):
- allocate_buffers: torch.empty(...).to(device=device) →
  torch.empty(..., device=device) — eliminates CPU alloc + H2D copy for every
  engine I/O tensor at startup or after dynamic realloc
- infer(): replace per-shape-change alloc-and-discard with a small LRU cache
  (_buf_cache, capacity 4).  Fast path (common streaming case: fixed resolution)
  is a dict-comprehension equality check — zero malloc overhead.  Slow path
  (resolution-switch) checks the LRU before torch.empty: a cache hit restores
  pre-allocated GPU tensors and skips both malloc and cudaFree.  Cache miss
  falls through to the existing realloc logic and stores the result in the LRU.
  Evicts the LRU entry (OrderedDict.popitem(last=False)) when over capacity.
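
The fast-path / LRU-hit / miss flow described for infer() can be sketched without any GPU dependency. This is an illustration only: `BufferCache` and the `allocate` callable are hypothetical, and the real code keys and stores torch tensors inside TensorRTEngine.

```python
from collections import OrderedDict


class BufferCache:
    """Keeps the most recently used buffer sets, keyed by their shapes."""

    def __init__(self, allocate, capacity: int = 4):
        self._allocate = allocate        # hypothetical allocator: shapes -> buffers
        self._capacity = capacity
        self._cache = OrderedDict()
        self._current_key = None
        self._current = None

    def get(self, shapes: dict):
        key = tuple(sorted((name, tuple(shape)) for name, shape in shapes.items()))
        if key == self._current_key:     # fast path: same shapes as the last call
            return self._current
        if key in self._cache:           # slow path, hit: reuse earlier buffers
            self._cache.move_to_end(key)
            buffers = self._cache[key]
        else:                            # slow path, miss: allocate and remember
            buffers = self._allocate(shapes)
            self._cache[key] = buffers
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict the least recently used
        self._current_key, self._current = key, buffers
        return buffers
```

`popitem(last=False)` only behaves as LRU eviction because every hit calls `move_to_end`; without that call the OrderedDict would evict in insertion order instead.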

demo/realtime-img2img/util.py (bytes_to_pt):
- decode_jpeg(byte_tensor, device='cuda') when CUDA is available — routes
  through torchvision nvJPEG, which decodes the JPEG payload directly to a
  CUDA tensor; eliminates the CPU decode + H2D DMA transfer on every input frame
- Fallback to CPU decode_jpeg on non-CUDA machines (unchanged behaviour)

item 8 (static_shapes): already implemented via config.py:138 'static_shapes'
flag wired through builder.py + models.py get_minmax_dims; verification task
is to run the existing guard test with static_shapes: true on the GPU box.
Adds two new flags to the run() entrypoint:
  --save-goldens        capture first N output frames and write them as
                        <sha>_<cfg_hash>_golden_NN.png alongside the JSON
  --n-golden-frames N   how many frames to capture (default 5)

Internally _run_loop now accepts n_capture and returns a (frame_times,
captured) tuple; _to_pil() converts PIL / numpy / torch.Tensor outputs to
a saveable PIL Image with fallback warnings for unknown types.

Usage on GPU box after the antialias or any visual-output change:
  # before commit:
  python examples/benchmark/ab_bench.py --save-goldens
  # after commit:
  python examples/benchmark/ab_bench.py --save-goldens
  # diff: open <sha_before>_*_golden_00.png vs <sha_after>_*_golden_00.png
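
The capture logic and file naming can be sketched as below. This is a simplified illustration, not the code in ab_bench.py: `run_loop`, `golden_name`, and the `step` callable are hypothetical, the timing uses the wall clock instead of torch.cuda.Event, and the two-digit index is inferred from the `golden_NN` / `golden_00` pattern above.

```python
import time
from typing import Any, Callable, List, Tuple


def golden_name(sha: str, cfg_hash: str, index: int) -> str:
    """Builds <sha>_<cfg_hash>_golden_NN.png with a two-digit frame index."""
    return f"{sha}_{cfg_hash}_golden_{index:02d}.png"


def run_loop(step: Callable[[int], Any], n_frames: int,
             n_capture: int = 0) -> Tuple[List[float], List[Any]]:
    """Times every frame and keeps the first n_capture outputs."""
    frame_times: List[float] = []
    captured: List[Any] = []
    for i in range(n_frames):
        start = time.perf_counter()
        out = step(i)
        frame_times.append((time.perf_counter() - start) * 1e3)  # ms, wall clock
        if len(captured) < n_capture:
            captured.append(out)
    return frame_times, captured
```

Because the file name embeds both the commit SHA and the config hash, goldens from before and after a change can sit side by side in the same directory.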
Finding A (engine_manager.py + wrapper.py):
- get_engine_path gains build_static_batch param; UNET prefix now appends
  --sbatch{0|1} so a static-batch and a dynamic-batch engine are never
  silently co-located in the same directory.  Stale dynamic engine reuse
  (the cause of the l2tc VALIDATE FAIL warning) is prevented without
  requiring a manual cache wipe.
- UNet get_engine_path call in wrapper passes build_static_batch=True to
  match the hardcoded build opts at wrapper.py:2183-2184.
- Fix misleading log (was reporting self.static_shapes for the UNet, which
  the build opts ignore; now reports actual build_static_batch / build_
  dynamic_shape values).
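
The effect of the --sbatch suffix can be shown with a small path helper. This is a hypothetical reduction of get_engine_path: the real function takes more parameters, and `unet_engine_dir`, `base`, and `prefix` are names made up for this sketch.

```python
from pathlib import Path


def unet_engine_dir(base: Path, prefix: str, build_static_batch: bool) -> Path:
    """Appends --sbatch{0|1} so the two build variants get distinct directories."""
    return base / f"{prefix}--sbatch{int(build_static_batch)}"


static_dir = unet_engine_dir(Path("engines"), "unet", build_static_batch=True)
dynamic_dir = unet_engine_dir(Path("engines"), "unet", build_static_batch=False)
```

Since the flag is part of the directory name, an engine built with one setting can never be picked up by a run that expects the other.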

Finding B (trt_base.py):
- TensorRTEngine.activate() creates a dedicated torch.cuda.Stream instead
  of caching current_stream().cuda_stream.  execute_async_v3 now runs on
  the dedicated stream, eliminating the per-frame implicit
  cudaStreamSynchronize TRT inserts when stream handle 0x0 is used.
- Event-based cross-stream sync added in infer(): _pre_exec_event gates
  the dedicated stream on the preceding copy_() calls (run on current
  stream); _post_exec_event gates the current stream on execute completion
  before _postprocess reads the output tensors.  record_stream marks
  output buffers live on the current stream for allocator safety.
- _process_tensor_core stops forcing current_stream().cuda_stream onto
  infer(); the engine uses its own _cuda_stream (dedicated stream handle).
- Fixes scribble, HED, depth, and pose preprocessors simultaneously
  (shared TensorRTEngine base).  temporal_net and realesrgan have their
  own engine wrappers and are not affected by this change.
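
The event-based ordering in Finding B can be sketched with plain PyTorch stream primitives. This is an illustration under assumptions: `run_on_dedicated_stream` and `work` are hypothetical, `work` stands in for the execute_async_v3 call, and the stream and events are created per call for brevity whereas the commit describes creating the stream once in activate().

```python
import torch


def run_on_dedicated_stream(x: torch.Tensor, work) -> torch.Tensor:
    """Runs work(x) on a side stream, ordered against the current stream by events."""
    if not x.is_cuda:
        return work(x)  # CPU fallback so the sketch also runs without a GPU

    current = torch.cuda.current_stream(x.device)
    side = torch.cuda.Stream(device=x.device)   # dedicated, non-default stream
    pre_exec = torch.cuda.Event()
    post_exec = torch.cuda.Event()

    pre_exec.record(current)        # marks the point after earlier copies into x
    side.wait_event(pre_exec)       # the side stream waits on the GPU, not the CPU
    with torch.cuda.stream(side):
        out = work(x)               # stand-in for the engine execution
    post_exec.record(side)
    current.wait_event(post_exec)   # later reads on the current stream are ordered
    out.record_stream(current)      # tell the caching allocator about the extra user
    x.record_stream(side)
    return out
```

Neither `wait_event` call blocks the host thread; both only add a dependency between the two GPU streams.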
…in preprocessors

Tier 1 (standard_lineart.py): Replace the torch.any / boolean-index / float(torch.median)
pattern in _compute_lineart_hwc with torch.nanmedian over a where-masked tensor.
normalization_factor is now a 0-dim CUDA tensor — no .item() / host sync per frame.
Numerics identical: nanmedian over unmasked pixels == median(intensity[threshold_mask]);
empty-mask case still floors to 16 via nan_to_num.
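
The Tier 1 change can be illustrated by putting the two patterns side by side. This is a minimal sketch, not the code in standard_lineart.py: both function names are hypothetical, and it assumes the floor of 16 applies only to the empty-mask case.

```python
import torch


def masked_median_sync(intensity, mask, floor=16.0):
    """Old pattern: torch.any, boolean indexing and float() each force a host sync."""
    if not torch.any(mask):
        return floor
    return float(torch.median(intensity[mask]))


def masked_median_async(intensity, mask, floor=16.0):
    """New pattern: the result stays a 0-dim tensor on the input's device."""
    nan = torch.full_like(intensity, float("nan"))
    masked = torch.where(mask, intensity, nan)   # masked-out pixels become NaN
    median = torch.nanmedian(masked)             # NaN only when the mask is empty
    return torch.nan_to_num(median, nan=floor)


intensity = torch.tensor([[10.0, 200.0, 30.0], [40.0, 250.0, 60.0]])
mask = intensity > 50.0
```

On a CUDA input the second function never reads a value back to the host, so the normalization factor can feed later tensor ops without stalling the stream.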

Tier 2 (depth.py, hed.py): Both override _process_tensor_core but still round-trip through
PIL every frame (HF depth pipeline / controlnet_aux require PIL input), so the base.py
one-time [GPU-residency] warning never fired. Import _pil_fallback_warned + _base_logger
from base.py and emit the warning. Points users to depth_tensorrt / hed_tensorrt.
No behavior change.

Also: fix pre-existing ruff F821 in ab_bench.py — add TYPE_CHECKING guard for PIL.Image
and drop redundant explicit string quotes (from __future__ import annotations already
defers all annotations).