Optimize RFDETR nano segmentation GPU inference pipeline #810
Open
misrasaurabh1 wants to merge 361 commits into
Summary
Successful checkpoint commits and log sections
- 15d0b8765 Optimize RFDETR TRT instance segmentation scheduling - log: Async RFDETR Stage Scheduling, RFDETR Instance TRT Graph Cache By Default
- 69c62eec3 Pipeline workflow inference across frames - log: Pipeline Workflow CPU And GPU Work Across Frames
- 725cf386f Use direct local workflow detections for RFDETR - log: Direct Local Workflow Detections And Remove Redundant RFDETR GPU Work
- 6d05b23d0 Fuse RFDETR workflow postprocess - log: Fused RFDETR Dense Postprocess And Pipeline Rebalance
- 9c1555fca, cc621b7f7, 5f75eb0f9, a27da1ce8, d493fe3f3, 628848e7b, a3dded1b0 RFDETR preprocessing optimizations - log: NumPy RFDETR PIL Tensor Conversion, Direct PIL RFDETR Resize, RFDETR Channel-Wise CHW Normalization, Reusable Pinned RFDETR Preprocess Buffer, Direct NumPy Ufunc RFDETR Channel Fill, Cached RFDETR Normalization Constants, RFDETR Fused CPU Normalization Constants
- b5665c07b, bceeabd32, 4cabee945, 7e15da64e workflow fast path / deferred stream waits - log: RFDETR Fast Path Deferred Current-Stream Waits, RFDETR TRT Pre-Request Workflow Fast Path, Single-Step Workflow Runner Fast Path
- ac871c4ec, 7a43246c8, f14d629cd, 42bc92948, e2de14fce, d05bb4ebe GPU fused postprocess and pinned materialization - log: RFDETR Pinned Host Detection Materialization, RFDETR Limited Deferred Mask Resize, RFDETR Deferred Float Boxes, RFDETR Seven-Row Deferred Mask Resize, RFDETR Fixed-Limit Pinned Prediction Copy
- 7a85bcb26, e8bab19b1 TensorRT graph capture warmup/results handling - log: Diagnostic: Current TensorRT Graph Replay Ceiling

The full experiment trail is in development/stream_interface/rfdetr_nano_seg_trt_optimization_log.md.

Performance / validation
Benchmark command:

    PYTHONPATH=/app/inference_models python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4 --pipeline_depth 2

Representative accepted checkpoint:

    frames=538 elapsed=2.20s fps=244.86

Graph replay ceiling diagnostics:

- 4.066539 ms / 245.91 fps
- 4.076774 ms / 245.29 fps
- 4.105473 ms / 243.58 fps
- 4.107662 ms / 243.45 fps

Latest depth-2 nsys artifact:

- /tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.nsys-rep
- /tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.sqlite

That trace shows a p50 graph end-to-next-start gap of 42.496 us and true idle of only 4.352 us; CPU work is overlapped and the run is TensorRT CUDA graph-body bound.

Correctness gate for promoted output-changing changes: classes unchanged, boxes within 5 px, with all-frame graph-vs-standard comparisons used on promoted RFDETR changes.
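For reviewers who want to picture the correctness gate, here is a minimal sketch of a per-frame check. It is not the code used in this PR. `FrameDetections`, `frame_passes_gate`, and `failing_frames` are hypothetical names, and the sketch assumes that both paths emit detections in the same order, that boxes are `(x1, y1, x2, y2)` in pixels, and that "within 5 px" is applied inclusively to each coordinate.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FrameDetections:
    """Hypothetical container for one frame's detections."""
    class_ids: List[int]
    boxes: List[Box]


def frame_passes_gate(reference: FrameDetections,
                      candidate: FrameDetections,
                      box_tol_px: float = 5.0) -> bool:
    """True if classes match exactly and every box coordinate is within tolerance."""
    # Classes must be unchanged (same ids, same order; ordering is an assumption).
    if reference.class_ids != candidate.class_ids:
        return False
    if len(reference.boxes) != len(candidate.boxes):
        return False
    # Each coordinate may drift by at most box_tol_px (inclusive).
    for ref_box, cand_box in zip(reference.boxes, candidate.boxes):
        if any(abs(r - c) > box_tol_px for r, c in zip(ref_box, cand_box)):
            return False
    return True


def failing_frames(reference_run: Sequence[FrameDetections],
                   candidate_run: Sequence[FrameDetections],
                   box_tol_px: float = 5.0) -> List[int]:
    """Compare every frame of two runs and return the indices that fail the gate."""
    if len(reference_run) != len(candidate_run):
        raise ValueError("runs must cover the same number of frames")
    return [
        index
        for index, (ref, cand) in enumerate(zip(reference_run, candidate_run))
        if not frame_passes_gate(ref, cand, box_tol_px)
    ]
```

An empty list from `failing_frames(standard_run, graph_run)` would correspond to an all-frame graph-vs-standard comparison passing under these assumptions.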
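The fps values in the graph replay ceiling diagnostics are the reciprocals of the per-replay latencies, so the accepted checkpoint can be compared directly against that ceiling. The short sketch below redoes the arithmetic using only the numbers reported above; the function names are illustrative and not part of the repository.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def fraction_of_ceiling(measured_fps: float, ceiling_fps: float) -> float:
    """How much of the replay-only ceiling the end-to-end run achieves."""
    return measured_fps / ceiling_fps


# Latencies copied from the graph replay ceiling diagnostics.
REPLAY_LATENCIES_MS = [4.066539, 4.076774, 4.105473, 4.107662]
# End-to-end fps copied from the representative accepted checkpoint.
CHECKPOINT_FPS = 244.86

replay_fps = [latency_ms_to_fps(ms) for ms in REPLAY_LATENCIES_MS]
best_ceiling_fps = max(replay_fps)

for ms, fps in zip(REPLAY_LATENCIES_MS, replay_fps):
    print(f"{ms:.6f} ms -> {fps:.2f} fps")
print(f"checkpoint reaches {fraction_of_ceiling(CHECKPOINT_FPS, best_ceiling_fps):.2%} of the best ceiling")
```

The checkpoint lands less than half a percent below the fastest replay-only figure, which is consistent with the run being bound by the CUDA graph body rather than by CPU-side work.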