Optimize RFDETR nano segmentation GPU inference pipeline by misrasaurabh1 · Pull Request #810 · codeflash-ai/inference

misrasaurabh1 · 2026-05-24T06:41:52Z

Summary

Adds RFDETR TRT CUDA graph scheduling/warmup improvements and depth-2 frame pipelining so CPU preprocessing/materialization overlaps GPU inference.
Adds an RFDETR workflow fast path for direct local detections and fused dense postprocess on GPU via Triton.
Reduces postprocess/materialization overhead with pinned host buffers, fixed-limit prediction copies, limited mask resize work, float deferred boxes, and tuned mask-resize launch parameters.
Adds an optimization log documenting successful changes, rejected experiments, and profiling artifacts.
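
The depth-2 frame pipelining described above can be illustrated with a minimal sketch. It is not the code in this PR: `preprocess`, `infer`, `materialize`, and `run_pipelined` are hypothetical stand-ins, and a single worker thread plays the role of the asynchronous GPU stage. It only shows the scheduling idea, where CPU work for frame N+1 runs while inference for frame N is still in flight.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def preprocess(frame):
    # Hypothetical stand-in for CPU preprocessing (resize, normalize, pinned copy).
    return frame * 2


def infer(tensor):
    # Hypothetical stand-in for the asynchronous GPU stage (e.g. a graph replay).
    return tensor + 1


def materialize(output):
    # Hypothetical stand-in for CPU-side materialization of predictions.
    return {"result": output}


def run_pipelined(frames, depth=2):
    """Keep up to `depth` frames in flight and yield results in input order."""
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=1) as gpu_worker:
        for frame in frames:
            # This CPU work overlaps with inference submitted for earlier frames.
            tensor = preprocess(frame)
            in_flight.append(gpu_worker.submit(infer, tensor))
            if len(in_flight) >= depth:
                # Block only on the oldest frame once the pipeline is full.
                yield materialize(in_flight.popleft().result())
        # Drain the frames still in flight after the input is exhausted.
        while in_flight:
            yield materialize(in_flight.popleft().result())


results = list(run_pipelined([1, 2, 3], depth=2))
```

With `depth=1` the loop degenerates to strictly sequential processing, which makes it a convenient baseline when measuring how much overlap the deeper pipeline buys.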

Successful checkpoint commits and log sections

15d0b8765 Optimize RFDETR TRT instance segmentation scheduling - log: Async RFDETR Stage Scheduling, RFDETR Instance TRT Graph Cache By Default
69c62eec3 Pipeline workflow inference across frames - log: Pipeline Workflow CPU And GPU Work Across Frames
725cf386f Use direct local workflow detections for RFDETR - log: Direct Local Workflow Detections And Remove Redundant RFDETR GPU Work
6d05b23d0 Fuse RFDETR workflow postprocess - log: Fused RFDETR Dense Postprocess And Pipeline Rebalance
9c1555fca, cc621b7f7, 5f75eb0f9, a27da1ce8, d493fe3f3, 628848e7b, a3dded1b0 RFDETR preprocessing optimizations - log: NumPy RFDETR PIL Tensor Conversion, Direct PIL RFDETR Resize, RFDETR Channel-Wise CHW Normalization, Reusable Pinned RFDETR Preprocess Buffer, Direct NumPy Ufunc RFDETR Channel Fill, Cached RFDETR Normalization Constants, RFDETR Fused CPU Normalization Constants
b5665c07b, bceeabd32, 4cabee945, 7e15da64e workflow fast path / deferred stream waits - log: RFDETR Fast Path Deferred Current-Stream Waits, RFDETR TRT Pre-Request Workflow Fast Path, Single-Step Workflow Runner Fast Path
ac871c4ec, 7a43246c8, f14d629cd, 42bc92948, e2de14fce, d05bb4ebe GPU fused postprocess and pinned materialization - log: RFDETR Pinned Host Detection Materialization, RFDETR Limited Deferred Mask Resize, RFDETR Deferred Float Boxes, RFDETR Seven-Row Deferred Mask Resize, RFDETR Fixed-Limit Pinned Prediction Copy
7a85bcb26, e8bab19b1 TensorRT graph capture warmup/results handling - log: Diagnostic: Current TensorRT Graph Replay Ceiling

The full experiment trail is in development/stream_interface/rfdetr_nano_seg_trt_optimization_log.md.

Performance / validation

Benchmark command:
PYTHONPATH=/app/inference_models python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4 --pipeline_depth 2

Representative accepted checkpoint:
frames=538 elapsed=2.20s fps=244.86

Graph replay ceiling diagnostics:

replay only: 4.066539 ms / 245.91 fps
input copy + graph: 4.076774 ms / 245.29 fps
input copy + graph + output clones: 4.105473 ms / 243.58 fps
model.forward graph path: 4.107662 ms / 243.45 fps
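
The fps figures above follow directly from the per-frame latencies (fps = 1000 / latency in ms). As a quick check, the sketch below recomputes the replay-only ceiling and compares it with the representative checkpoint's 244.86 fps; the helper name `fps_from_latency_ms` is introduced here only for illustration.

```python
def fps_from_latency_ms(latency_ms):
    # One frame per `latency_ms` milliseconds, expressed as frames per second.
    return 1000.0 / latency_ms


replay_ceiling_fps = fps_from_latency_ms(4.066539)  # "replay only" diagnostic
measured_fps = 244.86                               # representative accepted checkpoint

fraction_of_ceiling = measured_fps / replay_ceiling_fps
print(f"{replay_ceiling_fps:.2f} fps ceiling, pipeline at {fraction_of_ceiling:.1%}")
# prints "245.91 fps ceiling, pipeline at 99.6%"
```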

Latest depth-2 nsys artifact:

/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.nsys-rep
/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.sqlite

That trace shows a p50 gap of 42.496 us between the end of one graph and the start of the next, with only 4.352 us of true idle time; CPU work is overlapped, so the run is bound by the TensorRT CUDA graph body.

Correctness gate for promoted changes that alter outputs:
classes must be unchanged and boxes must stay within 5 px; promoted RFDETR changes were checked with graph-vs-standard comparisons over all frames.
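
A minimal sketch of such a gate is shown below. The actual comparison script is not part of this description, so the function name `frames_match`, the `(class_id, box)` tuple layout, and the assumption that both paths emit detections in the same order with the same count are all illustrative.

```python
def frames_match(reference, candidate, box_tolerance_px=5.0):
    """Compare two detection lists for one frame.

    Each detection is a hypothetical (class_id, (x1, y1, x2, y2)) tuple.
    Assumes both lists are in the same order (an assumption of this sketch).
    """
    if len(reference) != len(candidate):
        return False
    for (ref_cls, ref_box), (cand_cls, cand_box) in zip(reference, candidate):
        # Classes must be identical.
        if ref_cls != cand_cls:
            return False
        # Every box coordinate must stay within the pixel tolerance.
        if any(abs(r - c) > box_tolerance_px for r, c in zip(ref_box, cand_box)):
            return False
    return True


standard = [(2, (10.0, 20.0, 110.0, 220.0))]
graph = [(2, (12.5, 19.0, 108.0, 223.0))]
gate_passed = frames_match(standard, graph)
```

Running this per frame over the whole video and requiring every frame to pass would correspond to the all-frame comparison mentioned above.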

aseembits93 added 30 commits May 23, 2026 09:59

Record RFDETR persistent cache experiment

1b6a5ef

Record RFDETR query-index empty retest

c13fe64

Record RFDETR limited mask allocation retest

c4af952

Record RFDETR CUDA connections tuning

2f7e3f0

Record RFDETR graph stream priority test

ea7062d

Record RFDETR caller-stream clone test

c4d37b1

Record RFDETR NCU selector analysis

2d7b242

Record RFDETR enqueue profile flag test

0dc6c17

Record RFDETR opt2 TRT rebuild

a8673b2

Record RFDETR preprocessing override test

5232d3e

Record RFDETR highest priority stream test

b457b58

Record RFDETR clean post-stream check

3790576

Record RFDETR empty class filter test

5a6e228

Record RFDETR depth2 graph gap profile

f547904

Record RFDETR graph output pool test

14f6a88

Record RFDETR external graph input test

f492522

Record RFDETR graph input copy flag test

5f7403d

Record RFDETR captured output copy test

429b6f7

Record RFDETR two stage selector test

257b137

Record RFDETR TensorRT NVTX verbosity test

5e3c333

Record RFDETR TRT warmup count test

b5ee9bc

Record RFDETR postprocess stream copy test

234ab7a

Record RFDETR accepted depth2 profile

240ee16

Record RFDETR pooled graph input test

7723869

Record RFDETR alternate plan test

e8610d6

Record RFDETR selector warp test

b7b7e87

Record RFDETR mask resize block test

20dd422

Record RFDETR pinned view test

c52c57c

Record RFDETR caller stream wait test

197f4f7

Record RFDETR postprocess priority test

e7aa886

aseembits93 added 29 commits May 24, 2026 02:02

Record RFDETR vertical resize prototype

a6c7618

Record RFDETR local low-bubble nsys refresh

9a7790c

Record RFDETR H1688 scheduler profile

43b5751

Record RFDETR TRT batch constraint

29af3a6

Record RFDETR graph upload diagnostic

d2c7a87

Record RFDETR graph launch lead diagnostic

891ecd3

Record RFDETR TensorRT TopK scheduler profile

d5a5098

Record RFDETR output clone overlap rejection

22e2328

Record RFDETR TensorRT context knob audit

73ef03c

Record RFDETR 128x64 GEMM scheduler profile

ecbb9ca

Record RFDETR 32x32 GEMM scheduler profile

e046600

Record RFDETR depthwise conv scheduler profile

b279f81

Record RFDETR TensorRT layer profile

3f17259

Record RFDETR fused mask resize profile

72978ff

Record RFDETR fused selector profile

2a22ed9

Record RFDETR preprocess resize profile

94ae33b

Record RFDETR exact-count D2H rejection

2557ba0

Record RFDETR split-k GEMM profile

b86e845

Record RFDETR continued depth-2 nsys trace

42cea10

Record RFDETR implicit GEMM scheduler profile

4be6f8b

Record RFDETR split-k 128 GEMM scheduler profile

20ec002

Record RFDETR 128x64x32 GEMM scheduler profile

5287d39

Record RFDETR indexed implicit GEMM profile

83621e4

Record RFDETR 128x128 NT GEMM scheduler profile

f309e35

Record RFDETR 64x32 TN GEMM scheduler profile

5a9495c

Record RFDETR opt4 TensorRT rebuild result

efb87ff

Refresh RFDETR official package metadata

ef956c1

Record RFDETR sparse TensorRT rebuild result

658b781

Record RFDETR aux0 TensorRT rebuild result

00c6209

misrasaurabh1 requested a review from grzegorz-roboflow as a code owner May 24, 2026 06:41

misrasaurabh1 wants to merge 361 commits into main from codex/rfdetr-gpu-optimization-checkpoint