19 commits
76a77f0
perf: add ControlNet profiler regions and CN-enabled benchmark mode
forkni Jun 5, 2026
d194bdc
fix: bypass offline orchestrator in CN benchmark activation
forkni Jun 5, 2026
f94f1c9
perf: add CN residual caching with live cn_cache_interval control
forkni Jun 5, 2026
68535fc
perf: add --cn-cache-interval flag to offline benchmark
forkni Jun 5, 2026
816b9fe
docs: add cn_cache_interval to td_config example (backend already wired)
forkni Jun 6, 2026
131c211
fix: filter benign TRT myelin tactic-skip spam from VAE engine builds
forkni Jun 8, 2026
251066a
refactor: share one TRT build-log filter to drop logger-mismatch warning
forkni Jun 8, 2026
52b54c7
feat: add scribble preprocessor, GPU-native standard_lineart, lazy Re…
forkni Jun 8, 2026
30a3458
feat: add GPU-native CN preprocessors (hed/scribble/normal_bae TRT), …
forkni Jun 8, 2026
a456bb2
fix: harden CN TRT preprocessors — NormalBae fallback init, dynamic s…
forkni Jun 8, 2026
2d6ab7a
fix: suppress TRT logger-mismatch warning in all build and load paths
forkni Jun 9, 2026
f20d70c
fix: align web demo CN preprocessors (scribble_tensorrt, tile passthr…
forkni Jun 9, 2026
78a343c
test: add manual GPU smoke test for self-building TRT preprocessors
forkni Jun 10, 2026
cb69a49
perf: Phase 1 quick wins + Step 0 benchmark harness
forkni Jun 10, 2026
ac4adf6
perf: Phase 2 IPA event-sync+cache, ESRGAN lock shrink, CN antialias
forkni Jun 10, 2026
e72e746
perf: Phase 3 TRT buffer LRU cache, direct GPU alloc, nvJPEG GPU decode
forkni Jun 10, 2026
43b3070
feat: wire --save-goldens to ab_bench.py
forkni Jun 10, 2026
e26843b
fix: harden UNet cache key + dedicated TRT preprocessor stream
forkni Jun 10, 2026
10084bf
perf: eliminate per-frame D2H syncs + surface silent PIL round-trips …
forkni Jun 11, 2026
6 changes: 6 additions & 0 deletions configs/td_config.yaml.example
@@ -78,6 +78,12 @@ engine_dir: "engines/td"

# ControlNet configuration (disabled)
use_controlnet: false
# cn_cache_interval: recompute CN residuals only every N frames and reuse the cached residuals in between.
# 1 = disabled (default, always recompute). 2+ = skip the CN forward pass on intermediate frames.
# Safe to change live; the cache is invalidated automatically on a control-image or scale change.
# Note: the cache key does NOT include t_index_list, so avoid changing the batch config mid-stream
# while caching is active. The practical risk is low.
cn_cache_interval: 1

# IPAdapter configuration (disabled)
use_ipadapter: false
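
As a rough illustration of the cn_cache_interval behaviour described in the comments above, the following is a minimal sketch, not the PR's implementation. The class name IntervalCache, its get() method, and the (control image id, scale) key are hypothetical; the only behaviour taken from the config comments is "recompute every N frames, reuse in between, invalidate when the key changes, allow live changes to the interval".

```python
from typing import Any, Callable, Hashable


class IntervalCache:
    """Hypothetical helper: recompute a value only every `interval` calls."""

    def __init__(self, compute: Callable[..., Any], interval: int = 1) -> None:
        self.compute = compute
        self.interval = interval  # may be reassigned between calls ("live")
        self._key: Hashable = None
        self._value: Any = None
        self._valid = False
        self._age = 0  # number of calls served by the current cached value

    def get(self, key: Hashable, *args: Any) -> Any:
        # interval <= 1 means "always recompute", matching the documented default.
        interval = max(1, int(self.interval))
        stale = (not self._valid) or key != self._key or self._age >= interval
        if stale:
            # Key changed (e.g. new control image or scale) or the cached
            # value has already served `interval` frames: run the forward pass.
            self._value = self.compute(*args)
            self._key = key
            self._valid = True
            self._age = 0
        self._age += 1
        return self._value
```

With interval=2 this sketch calls compute on frames 0, 2, 4 and serves the cached value on frames 1 and 3. Anything that affects the result but is left out of the key (the config comment names t_index_list) would not trigger a recompute, which is the caveat the note describes.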
23 changes: 21 additions & 2 deletions demo/realtime-img2img/controlnet_registry.yaml
@@ -55,7 +55,7 @@ available_controlnets:
- id: "tile_sd15"
name: "Tile/Feedback"
model_id: "lllyasviel/control_v11f1e_sd15_tile"
default_preprocessor: "feedback"
default_preprocessor: "passthrough"
default_scale: 0.6
description: "Uses image feedback for enhanced details"
preprocessor_params:
@@ -116,8 +116,27 @@ available_controlnets:
- id: "tile_sdxl"
name: "Tile/Feedback"
model_id: "xinsir/controlnet-tile-sdxl-1.0"
default_preprocessor: "feedback"
default_preprocessor: "passthrough"
default_scale: 0.6
description: "Uses image feedback for enhanced details (SDXL)"
preprocessor_params:
image_resolution: 512

- id: "depth_xinsir_sdxl"
name: "Depth Detection (xinsir)"
model_id: "xinsir/controlnet-depth-sdxl-1.0"
default_preprocessor: "depth_tensorrt"
default_scale: 0.8
description: "Estimates depth information from images — xinsir SDXL variant"
preprocessor_params:
detect_resolution: 518
image_resolution: 512

- id: "scribble_sdxl"
name: "Scribble"
model_id: "xinsir/controlnet-scribble-sdxl-1.0"
default_preprocessor: "scribble_tensorrt"
default_scale: 0.8
description: "Produces sketch-like scribble edge conditioning (SDXL)"
preprocessor_params:
image_resolution: 512
32 changes: 21 additions & 11 deletions demo/realtime-img2img/util.py
@@ -29,21 +29,31 @@ def bytes_to_pil(image_bytes: bytes) -> Image.Image:

def bytes_to_pt(image_bytes: bytes) -> torch.Tensor:
"""
Convert JPEG/PNG bytes directly to PyTorch tensor using torchvision

Convert JPEG bytes directly to a GPU float32 tensor via torchvision nvJPEG.

Decodes on CUDA when available (nvJPEG path), eliminating the CPU decode and
the host→device copy of the decoded pixels that the CPU path incurs. Falls
back to CPU decode on machines without CUDA.

Args:
image_bytes: Raw image bytes (JPEG/PNG format)

image_bytes: Raw JPEG bytes. torchvision's decode_jpeg decodes JPEG only,
so PNG bytes need a separate decode path.


Returns:
torch.Tensor: Image tensor with shape (C, H, W), values in [0, 1], dtype float32
torch.Tensor: Image tensor with shape (C, H, W), values in [0, 1],
dtype float32, on the same device as the decode.
"""
# Convert bytes to tensor for torchvision
byte_tensor = torch.frombuffer(image_bytes, dtype=torch.uint8)

# Decode JPEG/PNG directly to tensor (C, H, W) format, uint8 [0, 255]
image_tensor = decode_jpeg(byte_tensor)

# Convert to float32 and normalize to [0, 1]

# Decode directly on GPU when CUDA is available — nvJPEG avoids the
# CPU decode + H2D copy incurred by the plain decode_jpeg(byte_tensor) call.
if torch.cuda.is_available():
image_tensor = decode_jpeg(byte_tensor, device="cuda")
else:
image_tensor = decode_jpeg(byte_tensor)

# Normalise to [0, 1] on the decode device (no extra host round-trip).
image_tensor = image_tensor.float() / 255.0

return image_tensor
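
Because torchvision's decode_jpeg expects JPEG data, a caller that may receive either JPEG or PNG bytes could check the leading magic bytes before choosing a decode path. The sketch below is one way to do that in plain Python; the function name sniff_image_format is hypothetical and is not part of this PR or of torchvision.

```python
JPEG_MAGIC = b"\xff\xd8\xff"        # JPEG start-of-image marker plus next marker prefix
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # fixed 8-byte PNG signature


def sniff_image_format(image_bytes: bytes) -> str:
    """Hypothetical helper: classify raw bytes as 'jpeg', 'png', or 'unknown'."""
    if image_bytes[:3] == JPEG_MAGIC:
        return "jpeg"
    if image_bytes[:8] == PNG_MAGIC:
        return "png"
    return "unknown"
```

A caller might route "jpeg" to the nvJPEG path in bytes_to_pt and send anything else through a CPU decoder such as the existing bytes_to_pil; that routing is an assumption about how the helper could be used, not something the diff shows.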