
Optimize RFDETR nano segmentation GPU inference pipeline #810

Open: misrasaurabh1 wants to merge 361 commits into main from codex/rfdetr-gpu-optimization-checkpoint

Summary

  • Adds RFDETR TRT CUDA graph scheduling/warmup improvements and depth-2 frame pipelining so CPU preprocessing/materialization overlaps GPU inference.
  • Adds an RFDETR workflow fast path for direct local detections and fused dense postprocess on GPU via Triton.
  • Reduces postprocess/materialization overhead with pinned host buffers, fixed-limit prediction copies, limited mask resize work, float deferred boxes, and tuned mask-resize launch parameters.
  • Adds an optimization log documenting successful changes, rejected experiments, and profiling artifacts.
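The depth-2 frame pipelining in the first bullet can be pictured with a small, CPU-only sketch. It is not the code in this PR: `preprocess`, `infer`, `materialize`, and `run_pipelined` are hypothetical stand-ins, and a single worker thread plays the role of the GPU stream. It only illustrates how keeping up to two frames in flight lets CPU work for one frame overlap inference for another.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def preprocess(frame):
    # Hypothetical CPU stage (resize, normalize, copy to a pinned buffer).
    return frame * 2


def infer(tensor):
    # Hypothetical stand-in for the GPU inference call.
    return tensor + 1


def materialize(raw):
    # Hypothetical CPU stage that turns raw outputs into detections.
    return {"result": raw}


def run_pipelined(frames, depth=2):
    """Keep up to `depth` frames in flight; results come back in input order."""
    results = []
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for frame in frames:
            # Preprocessing runs on the calling thread while the worker
            # may still be busy with the previous frame.
            in_flight.append(gpu.submit(infer, preprocess(frame)))
            if len(in_flight) >= depth:
                # Wait for the oldest frame; the newest one keeps running.
                results.append(materialize(in_flight.popleft().result()))
        while in_flight:
            # Drain the frames still in flight at the end of the stream.
            results.append(materialize(in_flight.popleft().result()))
    return results


if __name__ == "__main__":
    print(run_pipelined([1, 2, 3], depth=2))
```

With `depth=1` the same loop degenerates to strictly serial processing, which makes it a convenient baseline when comparing against the pipelined run.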

Successful checkpoint commits and log sections

  • 15d0b8765 Optimize RFDETR TRT instance segmentation scheduling - log: Async RFDETR Stage Scheduling, RFDETR Instance TRT Graph Cache By Default
  • 69c62eec3 Pipeline workflow inference across frames - log: Pipeline Workflow CPU And GPU Work Across Frames
  • 725cf386f Use direct local workflow detections for RFDETR - log: Direct Local Workflow Detections And Remove Redundant RFDETR GPU Work
  • 6d05b23d0 Fuse RFDETR workflow postprocess - log: Fused RFDETR Dense Postprocess And Pipeline Rebalance
  • 9c1555fca, cc621b7f7, 5f75eb0f9, a27da1ce8, d493fe3f3, 628848e7b, a3dded1b0 RFDETR preprocessing optimizations - log: NumPy RFDETR PIL Tensor Conversion, Direct PIL RFDETR Resize, RFDETR Channel-Wise CHW Normalization, Reusable Pinned RFDETR Preprocess Buffer, Direct NumPy Ufunc RFDETR Channel Fill, Cached RFDETR Normalization Constants, RFDETR Fused CPU Normalization Constants
  • b5665c07b, bceeabd32, 4cabee945, 7e15da64e workflow fast path / deferred stream waits - log: RFDETR Fast Path Deferred Current-Stream Waits, RFDETR TRT Pre-Request Workflow Fast Path, Single-Step Workflow Runner Fast Path
  • ac871c4ec, 7a43246c8, f14d629cd, 42bc92948, e2de14fce, d05bb4ebe GPU fused postprocess and pinned materialization - log: RFDETR Pinned Host Detection Materialization, RFDETR Limited Deferred Mask Resize, RFDETR Deferred Float Boxes, RFDETR Seven-Row Deferred Mask Resize, RFDETR Fixed-Limit Pinned Prediction Copy
  • 7a85bcb26, e8bab19b1 TensorRT graph capture warmup/results handling - log: Diagnostic: Current TensorRT Graph Replay Ceiling

The full experiment trail is in development/stream_interface/rfdetr_nano_seg_trt_optimization_log.md.

Performance / validation

Benchmark command:
PYTHONPATH=/app/inference_models python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4 --pipeline_depth 2

Representative accepted checkpoint:
frames=538 elapsed=2.20s fps=244.86
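Dividing 538 frames by the printed 2.20 s gives slightly less than the reported fps. The likely explanation, assumed here rather than confirmed from the benchmark script, is that fps is computed from the unrounded elapsed time and the elapsed value is rounded only for display. A quick check:

```python
frames = 538
reported_fps = 244.86
printed_elapsed_s = 2.20

# Elapsed time implied by the reported fps (assumes fps = frames / elapsed).
implied_elapsed_s = frames / reported_fps

# fps you would get from the rounded, printed elapsed value instead.
fps_from_printed = frames / printed_elapsed_s

print(f"implied elapsed: {implied_elapsed_s:.4f} s")
print(f"fps from printed elapsed: {fps_from_printed:.2f}")
```

The implied elapsed time rounds to the printed 2.20 s, so the two figures are consistent.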

Graph replay ceiling diagnostics:

  • replay only: 4.066539 ms / 245.91 fps
  • input copy + graph: 4.076774 ms / 245.29 fps
  • input copy + graph + output clones: 4.105473 ms / 243.58 fps
  • model.forward graph path: 4.107662 ms / 243.45 fps
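The fps values in the diagnostics above follow from the per-frame latencies as `1000 / latency_ms`. The sketch below reproduces that conversion and compares the accepted checkpoint with the replay-only ceiling; the helper name `fps_from_ms` is illustrative and not taken from the repository.

```python
def fps_from_ms(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


# Latencies copied from the graph replay ceiling diagnostics.
diagnostics_ms = {
    "replay only": 4.066539,
    "input copy + graph": 4.076774,
    "input copy + graph + output clones": 4.105473,
    "model.forward graph path": 4.107662,
}

for name, latency_ms in diagnostics_ms.items():
    print(f"{name}: {fps_from_ms(latency_ms):.2f} fps")

# Fraction of the replay-only ceiling reached by the accepted checkpoint.
checkpoint_fps = 244.86
ceiling_ratio = checkpoint_fps / fps_from_ms(diagnostics_ms["replay only"])
print(f"checkpoint / replay-only ceiling: {ceiling_ratio:.4f}")
```

By this measure the end-to-end workflow runs at about 99.6% of the bare graph replay rate.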

Latest depth-2 nsys artifact:

  • /tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.nsys-rep
  • /tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.sqlite

That trace shows a p50 gap of 42.496 us from one graph's end to the next graph's start and only 4.352 us of true idle time. CPU work is overlapped, so the run is bound by the TensorRT CUDA graph body.
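For readers unfamiliar with the metric, a p50 end-to-next-start gap is the median of the gaps between consecutive graph launches. The sketch below computes it from a list of `(start_ns, end_ns)` intervals. The intervals are synthetic placeholders, not values from the trace, and extracting real intervals from the nsys export is left out because the query depends on the export schema.

```python
from statistics import median


def launch_gaps_us(intervals_ns):
    """Return gaps in microseconds between each interval's end and the next start."""
    ordered = sorted(intervals_ns)
    return [
        (next_start - cur_end) / 1000.0
        for (_, cur_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Synthetic (start_ns, end_ns) pairs for three graph launches, for illustration only.
example_intervals_ns = [
    (0, 4_000_000),
    (4_042_000, 8_000_000),
    (8_044_000, 12_000_000),
]

gaps_us = launch_gaps_us(example_intervals_ns)
print(f"p50 gap: {median(gaps_us):.3f} us")
```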

Correctness gate for promoted output-changing changes:
classes unchanged, boxes within 5 px, with all-frame graph-vs-standard comparisons used on promoted RFDETR changes.
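A gate of this shape can be sketched as follows. This is an illustration under stated assumptions, not the comparison script used for this PR: detections are assumed to be `(class_id, (x1, y1, x2, y2))` tuples already matched by index between the standard and graph paths, and the 5 px tolerance is applied per box coordinate. The name `passes_gate` is hypothetical.

```python
def passes_gate(reference, candidate, box_tol_px=5.0):
    """Check one frame: same classes, every box coordinate within box_tol_px."""
    if len(reference) != len(candidate):
        return False
    for (ref_cls, ref_box), (cand_cls, cand_box) in zip(reference, candidate):
        if ref_cls != cand_cls:
            return False
        if any(abs(r - c) > box_tol_px for r, c in zip(ref_box, cand_box)):
            return False
    return True


def all_frames_pass(reference_frames, candidate_frames, box_tol_px=5.0):
    """Apply the per-frame gate to every frame of a run."""
    if len(reference_frames) != len(candidate_frames):
        return False
    return all(
        passes_gate(ref, cand, box_tol_px)
        for ref, cand in zip(reference_frames, candidate_frames)
    )


if __name__ == "__main__":
    standard = [[(2, (10.0, 20.0, 110.0, 220.0))]]
    graph = [[(2, (12.5, 18.0, 108.0, 223.0))]]
    print(all_frames_pass(standard, graph))
```

A real gate would also need a matching step (for example by IoU) when detection order can differ between the two paths.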
