diff --git a/development/stream_interface/rfdetr_nano_seg_trt_optimization_log.md b/development/stream_interface/rfdetr_nano_seg_trt_optimization_log.md
new file mode 100644
index 0000000000..bd22a7296b
--- /dev/null
+++ b/development/stream_interface/rfdetr_nano_seg_trt_optimization_log.md
@@ -0,0 +1,3231 @@
+# RFDETR Nano Seg TRT Optimization Log
+
+Benchmark command:
+
+```bash
+PYTHONPATH=/app/inference_models python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4
+```
+
+Hardware observed: Tesla T4, CUDA driver 580.159.04, PyTorch 2.6.0+cu124.
+
+## 2026-05-22
+
+### Baseline
+
+- Hypothesis: Establish the current end-to-end workflow FPS and identify CPU/GPU bottlenecks before changing code.
+- Command: benchmark command above.
+- Result: `frames=538 elapsed=7.45s fps=72.23`.
+- Profiling:
+  - `nsys profile --trace=cuda,nvtx,osrt,cudnn,cublas --sample=none --cpuctxsw=none`.
+  - CUDA API summary showed `cudaStreamSynchronize` as the largest CUDA API cost: 17,552 calls, 2.141s total API time.
+  - Kernel summary showed TRT kernels plus PyTorch postprocess kernels; visible postprocess costs included `topk`, sort/indexing, and mask resize.
+- Learning: RFDETR TRT inference already has a CUDA graph cache implementation, but the benchmark path was paying explicit CPU waits around preprocessing, TRT execution, and postprocessing.
+
+### Existing CUDA Graph Cache Enabled By Env
+
+- Hypothesis: The existing TensorRT CUDA graph cache should reduce per-frame launch overhead for the static RFDETR-nano input shape.
+- Change: No code change. Ran with `ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True`.
+- Result: `frames=538 elapsed=7.28s fps=73.95`.
+- Learning: Graph replay helps, but the requested benchmark command does not set this env var, and stage-level synchronizations still leave performance on the table.
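A minimal sketch that restates the baseline figures above in per-frame terms and computes the relative gain of the graph-cache run. It uses only numbers quoted in the two entries above; the helper names are hypothetical. The `cudaStreamSynchronize` total comes from an `nsys`-profiled run, whose wall time may differ from the unprofiled 7.45 s, so the per-frame share is indicative only.

```python
# Figures copied from the Baseline and graph-cache entries above.
FRAMES = 538
BASELINE_ELAPSED_S = 7.45
BASELINE_FPS = 72.23
GRAPH_ENV_FPS = 73.95
SYNC_CALLS = 17_552
SYNC_TOTAL_S = 2.141


def per_frame_ms(total_seconds: float, frames: int) -> float:
    """Average milliseconds spent per frame."""
    return total_seconds / frames * 1000.0


def relative_gain_pct(new_fps: float, old_fps: float) -> float:
    """Throughput gain of new_fps over old_fps, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


frame_budget_ms = per_frame_ms(BASELINE_ELAPSED_S, FRAMES)
sync_ms_per_frame = per_frame_ms(SYNC_TOTAL_S, FRAMES)
sync_calls_per_frame = SYNC_CALLS / FRAMES
graph_gain_pct = relative_gain_pct(GRAPH_ENV_FPS, BASELINE_FPS)

print(f"frame budget:         {frame_budget_ms:.2f} ms")
print(f"sync API time/frame:  {sync_ms_per_frame:.2f} ms")
print(f"sync calls/frame:     {sync_calls_per_frame:.1f}")
print(f"graph-cache env gain: {graph_gain_pct:+.2f}%")
```

With the logged numbers this works out to roughly 13.85 ms per frame, about 3.98 ms of `cudaStreamSynchronize` API time per frame spread over about 33 calls, and a gain of about 2.4% from enabling the graph cache by env.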
+
+### Async RFDETR Stage Scheduling
+
+- Hypothesis: Replace RFDETR instance-segmentation CPU synchronizations with CUDA stream dependencies so CPU work can continue while GPU work is queued.
+- Change:
+  - Added `synchronize=True` parameter to `infer_from_trt_engine(...)` to preserve existing default behavior.
+  - RFDETR instance segmentation calls TRT with `synchronize=False`.
+  - Replaced RFDETR pre/forward/post `stream.synchronize()` calls with `wait_stream(...)` dependencies.
+  - Made CUDA graph replay stream wait on the caller stream and the caller stream wait on graph replay, so asynchronous graph replay remains ordered without a CPU sync.
+- Result without graph env: `frames=538 elapsed=7.30s fps=73.66`.
+- Result with graph env: `frames=538 elapsed=7.00s fps=76.84`.
+- Correctness: Compared optimized graph path against non-graph TRT path on 32 frames from `vehicles_312px.mp4`; class IDs matched exactly and max box delta was `0` px.
+- Learning: Async scheduling helps modestly by itself and unlocks the graph cache benefit when graph replay is available.
+
+### RFDETR Instance TRT Graph Cache By Default
+
+- Hypothesis: Since RFDETR-nano-seg has static batch/input shape in this benchmark, enabling the graph cache by default for the RFDETR instance TRT model should make the requested command use the faster path without requiring env setup.
+- Change: If no caller/env graph cache is supplied, `RFDetrForInstanceSegmentationTRT.from_pretrained(...)` now creates a `TRTCudaGraphCache` with the model's default cache capacity.
+- Result on requested command: best observed `frames=538 elapsed=7.07s fps=76.12`; final verification repeat `frames=538 elapsed=7.14s fps=75.39`.
+- Learning: End-to-end FPS improved from `72.23` to `75.39-76.12` (+4.4% to +5.4%) on the exact command.
+
+### Rejected: Shared cv2 Preprocess Path
+
+- Hypothesis: The shared cv2 stretch preprocessing path may be faster than RFDETR's PIL resize/to-tensor path.
+- Change tested: No committed code change. Manually ran shared `pre_process_numpy_image(...)` into RFDETR TRT forward/postprocess.
+- Result: Class order changed by frame 2 (`[7, 2, 2, 2]` vs `[2, 7, 2, 2]`).
+- Learning: Even if boxes might remain close, this is too risky for the explicit "classes don't change" invariant. Keep RFDETR's PIL-compatible preprocessing.
+
+### Rejected: Best-Class-Per-Query Postprocess
+
+- Hypothesis: Replace flat top-k over `(queries * classes)` with one best valid class per query.
+- Change tested: No committed code change.
+- Result: Not semantics-preserving. Frame 81 had two above-threshold classes for the same query in the legacy path.
+- Learning: RFDETR can emit multiple valid classes for one query; a max-per-query shortcut drops detections.
+
+### Rejected: Threshold-First Exact Postprocess
+
+- Hypothesis: Select all above-threshold valid query/class pairs first, then sort/top-k only those candidates to preserve flat-top-k semantics while reducing work.
+- Change tested: Temporary code only; reverted.
+- Correctness: Matched legacy postprocess on 128 video frames with exact class IDs and `0` px max box delta.
+- Result: Requested workflow was `frames=538 elapsed=7.08s fps=75.97`, effectively flat/slightly worse than the graph-cache default result.
+- Learning: The remaining end-to-end bottleneck is not improved enough by this postprocess rewrite; do not keep the added complexity.
+
+### Batched RFDETR RLE Mask Alignment
+
+- Hypothesis: The workflow v3 adapter asks instance segmentation models that support RLE for `mask_format="rle"`, then converts masks to polygons for the v3 response. RFDETR's RLE branch resized each selected mask one at a time; using the same batched mask alignment as the dense path should reduce GPU launch and resize overhead while preserving RLE output.
+- Change: `post_process_instance_segmentation_results_to_rle_masks(...)` now calls `align_instance_segmentation_results(...)` once for the selected masks, then encodes each aligned boolean mask to COCO RLE.
+- Correctness: Compared dense postprocess masks against decoded RLE masks on 64 frames from `vehicles_312px.mp4`; masks matched exactly, class IDs matched exactly, and max box delta was `0` px.
+- Micro-result: RFDETR RLE postprocess improved from `3.19 ms/frame` to `2.58 ms/frame` over 120 video frames.
+- Result on requested command: `frames=538 elapsed=6.86s fps=78.43`.
+- Learning: After TRT graph replay, mask alignment/encoding is a meaningful share of the model-side cost. Batched GPU resize is preferable for this small-frame RFDETR workflow despite temporarily materializing dense aligned masks.
+
+### Rejected: GPU/Tensor Preprocessing For Numpy Frames
+
+- Hypothesis: Convert cv2 numpy frames to CUDA tensors immediately and use RFDETR's tensor preprocessing path to avoid PIL resize/to-tensor overhead.
+- Change tested: No committed code change. Manually converted BGR `uint8` frames to CUDA `float32` CHW `[0, 1]` tensors and called the existing tensor preprocessing branch.
+- Result: Not output-order preserving. By frame 4, the tensor path returned the same boxes/classes in a different confidence order; same-index box comparison had a max delta of `123` px.
+- Learning: The PIL-compatible numpy preprocessing path remains necessary for the benchmark invariant. Revisit only with an order-insensitive downstream contract or a more exact PIL-equivalent GPU resize.
+
+### Avoid RLE Round-Trip For Polygon Responses
+
+- Hypothesis: Workflow v3 polygon responses should not ask `inference_models` instance segmentation backends for RLE masks. The previous adapter path requested RLE whenever the model supported it, then decoded RLE back to polygons, adding avoidable RLE encode/decode work.
+- Change: `InferenceModelsInstanceSegmentationAdapter.map_inference_kwargs(...)` now requests `mask_format="rle"` only when `response_mask_format == "rle"`. Polygon responses use the model's dense default and convert dense masks directly to polygons.
+- Correctness: Compared dense-to-polygon against RLE-decode-to-polygon on 64 frames from `vehicles_312px.mp4`; polygons matched exactly, class IDs matched exactly, and max box delta was `0` px.
+- Micro-result: Adapter-like mask postprocess for polygon output improved from `3.40 ms/frame` with RLE round-trip to `2.24 ms/frame` with dense masks over 160 video frames.
+- Result on requested command: `frames=538 elapsed=6.08s fps=88.54`.
+- Learning: After model-side optimizations, response format selection became a large workflow-level bottleneck. Avoiding unnecessary RLE round-trips is safe for polygon outputs and preserves explicit RLE responses.
+
+### RFDETR Preprocess Channel Swap After PIL Resize
+
+- Hypothesis: In the common BGR numpy input path, RFDETR can resize the original contiguous BGR image with PIL and swap channels after `to_tensor()`. PIL resize is channel independent, so this should be equivalent to making an RGB numpy copy before PIL resize while avoiding that pre-resize copy.
+- Change: RFDETR numpy preprocessing now tracks whether color channels need swapping and applies `[2, 1, 0]` on the tensor after PIL resize/to-tensor for 3-channel images.
+- Correctness: Compared against the previous pre-PIL BGR-to-RGB path on 64 frames from `vehicles_312px.mp4`; preprocessed tensors matched exactly (`max_tensor_diff=0`), class IDs matched exactly, and max box delta was `0` px.
+- Micro-result: isolated RFDETR preprocessing measured `2.16 ms/frame` over 240 frames after the change.
+- Result on requested command: `frames=538 elapsed=6.04s fps=89.04`.
+- Learning: Small host-side image copies are visible at this frame size. Exact channel-order transformations can be moved across per-channel PIL resize safely.
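The reasoning in the channel-swap entry above (a resize that treats each channel independently commutes with a channel permutation) can be sketched in plain Python. This is an illustration only: `resize_plane` is a simple bilinear stand-in and not PIL's resampling code, and every name here is hypothetical rather than taken from the RFDETR preprocessing module.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def resize_plane(plane: List[List[int]], new_w: int, new_h: int) -> List[List[int]]:
    """Bilinear resize of one channel plane (stand-in for a per-channel resize)."""
    h, w = len(plane), len(plane[0])
    out = []
    for y in range(new_h):
        # Map the output pixel center back to source coordinates and clamp.
        sy = min(max((y + 0.5) * h / new_h - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(new_w):
            sx = min(max((x + 0.5) * w / new_w - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
            bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
            row.append(int(round(top * (1 - fy) + bottom * fy)))
        out.append(row)
    return out


def resize_image(image: Image, new_w: int, new_h: int) -> Image:
    """Resize each of the three channels independently, then re-interleave."""
    planes = [
        resize_plane([[px[c] for px in row] for row in image], new_w, new_h)
        for c in range(3)
    ]
    return [
        [(planes[0][y][x], planes[1][y][x], planes[2][y][x]) for x in range(new_w)]
        for y in range(new_h)
    ]


def swap_channels(image: Image) -> Image:
    """Apply the [2, 1, 0] channel permutation (BGR <-> RGB)."""
    return [[(px[2], px[1], px[0]) for px in row] for row in image]


# Tiny 3x2 BGR test image with arbitrary values.
bgr = [
    [(10, 200, 30), (40, 50, 250), (70, 80, 90)],
    [(0, 255, 128), (15, 25, 35), (220, 110, 5)],
]

swap_then_resize = resize_image(swap_channels(bgr), new_w=4, new_h=4)
resize_then_swap = swap_channels(resize_image(bgr, new_w=4, new_h=4))
assert swap_then_resize == resize_then_swap
```

Because each output plane depends only on its own input plane, reordering the planes before or after the resize yields identical pixels, which is consistent with the `max_tensor_diff=0` check logged above.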
+
+### Avoid Redundant Empty-Mask Scan Before Polygon Contours
+
+- Hypothesis: `masks2poly()` and `masks2multipoly()` do an `np.any()` scan before `cv2.findContours()`, but OpenCV already returns no contours for empty masks. Removing the pre-scan should avoid a second full-mask traversal for non-empty masks.
+- Change: Removed the explicit `np.any()` empty-mask fast path and let `mask2poly()` / `mask2multipoly()` handle empty contours.
+- Correctness: Compared old and new polygon conversion on RFDETR dense masks from 40 frames; polygon arrays matched exactly. Empty-mask smoke check still returns `(0, 2)` `float32` polygon output.
+- Micro-result: RFDETR dense mask polygon conversion improved from `0.18 ms/frame` to `0.11 ms/frame` over 240 frames.
+- Result on requested command: `frames=538 elapsed=5.99s fps=89.84`.
+- Test note: Attempted `PYTHONPATH=/app/inference_models pytest -q inference_models/tests/unit_tests/models/common/test_rle_utils.py tests/inference/unit_tests/models/test_rfdetr.py`, but collection failed before running tests with `ModuleNotFoundError: No module named 'tests.conftest'`.
+- Learning: At high FPS, even small CPU mask scans are measurable; rely on the contour operation's empty output instead of scanning twice.
+
+### Rejected: Pydantic model_construct For Polygon Responses
+
+- Hypothesis: Bypassing Pydantic validation with `model_construct()` for instance segmentation response objects or polygon `Point` objects could reduce workflow response construction overhead.
+- Change tested: No committed code change. Built RFDETR polygon response objects with `model_construct()` for all response models, and separately with only `Point.model_construct()`.
+- Result: Both variants were slower in the local response-construction harness. Full construct measured `0.76 ms/frame` vs `0.57 ms/frame` for normal constructors; point-only construct measured `0.64 ms/frame` vs `0.54 ms/frame` and emitted Pydantic serializer warnings for NumPy scalar values.
+- Learning: Pydantic construction is not the next profitable target in this form. Keep validated constructors and look elsewhere for workflow overhead.
+
+### Rejected: Workflow v3 RLE Response For Local Conversion
+
+- Hypothesis: Workflow v3 converts polygon predictions back into `sv.Detections` masks via `supervision.from_inference`; asking the local model response for RLE could avoid polygon rasterization in that conversion.
+- Change tested: Temporary workflow v3 local request change only; passed `response_mask_format="rle"` to `InstanceSegmentationInferenceRequest`.
+- Result on requested command: `frames=538 elapsed=6.25s fps=86.12`, slower than the dense polygon-response path.
+- Learning: For this RFDETR-nano segmentation workflow, model-side RLE encoding plus local conversion is still more expensive than the dense-to-polygon path. Keep polygon responses as the local workflow default unless the caller explicitly asks for RLE.
+
+### Current Exact Command Checkpoint
+
+- Command: benchmark command above.
+- Result on requested command: `frames=538 elapsed=5.85s fps=91.94`.
+- Learning: The accumulated committed changes now improve the original `72.23 fps` baseline by about `27%` on the exact benchmark command.
+
+### Avoid Eager Workflow Detection UUID Generation
+
+- Hypothesis: `convert_inference_detections_batch_to_sv_detections(...)` calls `str(uuid.uuid4())` as a `dict.get(...)` default, so UUIDs are generated for every detection even when local model responses already include `detection_id`.
+- Change: Generate a UUID only when `detection_id` is missing from a raw prediction, preserving present values including `None`.
+- Correctness: Smoke test patched `uuid.uuid4` to raise and converted a response containing `detection_id="known-id"`; conversion preserved the ID without calling UUID generation.
+- Result on requested command: `frames=538 elapsed=5.84s fps=92.12`.
+- Learning: This is a small but safe workflow-level CPU reduction; RFDETR local response objects already pay the necessary `detection_id` creation cost.
+
+### Pipeline Workflow CPU And GPU Work Across Frames
+
+- Hypothesis: The optimized path was still serialized at the pipeline level: workflow CPU response construction for frame `N` had to finish before GPU preprocessing/inference/postprocess for frame `N+1` could start. Allowing multiple workflow frame batches in flight should overlap CPU-side workflow work with GPU work while preserving sink dispatch order.
+- Change: `InferencePipeline.init_with_workflow(...)` now allows ordered in-flight workflow batches via `max_inflight_workflow_batches`, defaulting to `3`. Generic `init_with_custom_logic(...)` keeps the old single-worker default unless explicitly configured.
+- Correctness: Compared sequential workflow execution (`max_inflight_workflow_batches=1`) against the new default on all 538 frames from `vehicles_312px.mp4`; frame order matched, class IDs matched exactly, and max box delta was `0` px.
+- Tuning: `max_inflight=2` measured `121.24 fps`; `max_inflight=3` measured best at `141.01 fps`; `max_inflight=4` regressed to `135.86 fps`.
+- Result on requested command: `frames=538 elapsed=3.82s fps=141.01`.
+- Learning: Once per-frame CPU paths were reduced enough, cross-frame pipeline concurrency became the largest remaining gain. The exact benchmark improved from the original `72.23 fps` baseline to `141.01 fps` (+95%).
+
+### Direct Local Workflow Detections And Remove Redundant RFDETR GPU Work
+
+- Hypothesis: Nsight Systems on the pipelined path showed the largest visible GPU costs were RFDETR PyTorch postprocess kernels (`topk`, radix sort/indexing, mask resize), while workflow v3 still built polygon API responses and converted them back into `sv.Detections`. Avoiding redundant postprocess sorting, avoiding a one-image CUDA `stack`, and directly constructing local workflow `sv.Detections` from `inference_models` detections should reduce both GPU and CPU work.
+- Change:
+  - Removed redundant confidence re-sorts after `select_topk_predictions(...)`; `torch.topk(..., sorted=True)` already returns descending scores and the later boolean filters preserve order.
+  - Added a single-image RFDETR preprocessing fast path that uses `unsqueeze(0)` instead of `torch.stack([tensor])` after the host-to-device copy.
+  - Added a local workflow v3 fast path for `InferenceModelsInstanceSegmentationAdapter` when active learning is disabled. It runs adapter preprocess/predict, converts dense `InstanceDetections` directly to `sv.Detections`, attaches the same workflow metadata fields, and falls back to the existing response path otherwise.
+- Correctness:
+  - Old sorted dense postprocess vs new unsorted postprocess on all 538 frames: class IDs exact, boxes exact, confidences exact, dense masks exact.
+  - Single-image `unsqueeze(0)` batch vs previous one-image `torch.stack(...)` equivalent on 128 frames: `max_tensor_diff=0`.
+  - Existing workflow fallback vs new local fast path on all 538 frames: frame order matched, class IDs exact, and max box delta was `0` px.
+- Micro-result: Dense RFDETR postprocess improved from `2.585 ms/frame` to `1.548 ms/frame` over 240 frames when synchronized around postprocess.
+- Tuning: After the postprocess/preprocess changes, `max_inflight=3` remained best in the clean sink harness (`143.64 fps`), with `max_inflight=4` at `142.30 fps`, `5` at `137.45 fps`, and `6` at `130.28 fps`.
+- Result on requested command: `frames=538 elapsed=3.46s fps=155.28`. In-memory prototype of the same direct workflow fast path measured `157.61 fps`.
+- Learning: After cross-frame pipelining, the biggest remaining workflow overhead was the local API response round trip, not the sink conversion itself. Direct local `sv.Detections` construction preserves the benchmark's class/box contract and lets the pipeline spend more time feeding the serialized GPU path.
+
+### Rejected: Thread-Local TRT CUDA Graph Replay
+
+- Hypothesis: The three workflow workers still serialize RFDETR TRT graph replay on a model-level lock and shared inference stream. Giving each worker a thread-local inference stream and thread-local CUDA graph cache could let independent TensorRT execution contexts replay concurrently and increase GPU utilization.
+- Change tested: Temporary code only; created thread-local inference streams and one-entry `TRTCudaGraphCache` instances, using the model lock only when a thread-local graph cache was empty.
+- Result: Severe regression. The requested command only reached `[progress] frames=50 fps=2.09` during graph capture/warmup, so the run was stopped and the patch reverted.
+- Learning: Per-worker CUDA graph capture/context setup is far too expensive for this path, and concurrent graph replay needs a pre-warmed context/graph pool rather than lazy per-thread capture in the hot pipeline.
+
+### Rejected: Guarded Threshold-First RFDETR Instance Postprocess
+
+- Hypothesis: The benchmark confidence threshold is high enough that only a few RFDETR query/class scores survive, so selecting scores above threshold first could avoid the global top-k/radix-sort work visible in Nsight Systems.
+- Change tested: Temporary `common.py` helper for dense/RLE instance segmentation that selected candidates above the minimum threshold, used that path only when the candidate count was no larger than `num_queries`, sorted survivors by confidence, and otherwise fell back to the existing `select_topk_predictions(...)` semantics.
+- Correctness: Compared the temporary selector against the previous top-k selector on raw TRT outputs for all 538 frames from `vehicles_312px.mp4`: detection counts matched, class IDs matched exactly, max box delta was `0` px, confidences matched exactly, and dense masks matched exactly.
+- Result on requested command: Two exact passes measured `frames=538 elapsed=3.49s fps=154.33` and `frames=538 elapsed=3.49s fps=153.98`, below the current best committed path.
+- Learning: Even with exact outputs, the extra CUDA `nonzero`/candidate-selection work and synchronization risk do not beat PyTorch's current global top-k path end-to-end. A profitable version likely needs a fused CUDA/Triton kernel that thresholds, remaps, compacts, and gathers boxes/masks in one pass without host-visible candidate counting.
+
+### Fused RFDETR Dense Postprocess And Pipeline Rebalance
+
+- Hypothesis: Nsight Systems still showed RFDETR PyTorch postprocess kernels (`topk`, radix sort/indexing, mask resize) after the direct workflow fast path. Fusing top-score selection, class remapping, box decode, and workflow mask resize should reduce postprocess kernel launch overhead and shift the best workflow pipeline depth.
+- Change:
+  - Added a gated Triton RFDETR dense postprocess path for the benchmark-shape case: scalar threshold, one image, no padding, no static crop, dense masks. The general PyTorch path remains the fallback.
+  - The fused selector walks global scores in descending order only until the next score falls below threshold, preserving the old top-k/filter semantics without always materializing 100 selections.
+  - For the local workflow RFDETR TRT fast path, the selected count stays on GPU through a Triton mask resize and is copied only at the existing NumPy conversion boundary; public model postprocess still returns exact-sized tensors by default.
+  - Changed the workflow default `max_inflight_workflow_batches` from `3` to `2` after retuning.
+- Correctness:
+  - Default postprocess vs previous PyTorch selector on all 538 frames: detection counts matched, class IDs exact, max box delta `0` px, confidences exact, dense masks exact.
+  - Deferred workflow postprocess mode vs default exact-sized postprocess on all 538 frames: detection counts matched, class IDs exact, max box delta `0` px, dense masks exact.
+- Micro-result: Synchronized dense postprocess over 240 cached raw TRT outputs improved from `2.055 ms/frame` to `0.408 ms/frame`.
+- Pipeline tuning: With the deferred fused path, depth `3` measured `frames=538 elapsed=3.44s fps=156.61`; depth `2` measured `frames=538 elapsed=3.10s fps=173.39`.
+- Result on requested command: `frames=538 elapsed=3.07s fps=175.12`.
+- Learning: Once postprocess count synchronization and mask resize were moved to the workflow conversion boundary, the optimal pipeline depth dropped from `3` to `2`; the third in-flight worker became extra contention instead of useful overlap.
+
+### Rejected: Blocked Triton Mask Resize Programs
+
+- Hypothesis: The fused mask resize nsys profile showed `_resize_selected_masks_kernel` as the top postprocess kernel. Processing several detections per Triton program could reduce launch-grid overhead while still supporting up to 100 detections.
+- Change tested: Temporary code only; changed the mask resize kernel from one detection per program to four detections per program with a smaller pixel block.
+- Correctness: Deferred workflow postprocess vs default exact-sized postprocess on all 538 frames still matched detection counts, class IDs, boxes, and dense masks exactly.
+- Result on requested command: `frames=538 elapsed=3.18s fps=168.99`, slower than the committed single-detection program layout.
+- Learning: The larger per-program vector shape hurt this T4 path more than the lower program count helped. Keep the simpler mask kernel and look next at reducing H2D preprocessing transfer or avoiding full-size mask materialization.
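The selector semantics that the fused postprocess and the rejected postprocess variants above are checked against can be sketched in plain Python. This is a CPU illustration, not the Triton or PyTorch code: the function names are hypothetical, and the strict `>` threshold comparison is an assumption, since the log does not say whether the boundary is inclusive.

```python
from typing import List, Tuple

Detection = Tuple[int, int, float]  # (query index, class index, score)


def flatten(scores: List[List[float]]) -> List[Detection]:
    """Flatten a queries x classes score grid into (query, class, score) items."""
    return [(q, c, s) for q, row in enumerate(scores) for c, s in enumerate(row)]


def topk_then_filter(scores: List[List[float]], k: int, threshold: float) -> List[Detection]:
    """Legacy-style semantics: flat top-k over all pairs, then threshold filter."""
    ranked = sorted(flatten(scores), key=lambda d: d[2], reverse=True)
    return [d for d in ranked[:k] if d[2] > threshold]


def walk_until_below(scores: List[List[float]], k: int, threshold: float) -> List[Detection]:
    """Walk scores in descending order and stop at the first one not above threshold."""
    selected: List[Detection] = []
    for d in sorted(flatten(scores), key=lambda d: d[2], reverse=True):
        if len(selected) == k or d[2] <= threshold:
            break
        selected.append(d)
    return selected


def best_class_per_query(scores: List[List[float]], threshold: float) -> List[Detection]:
    """Rejected shortcut: keep only the single best class of each query."""
    out: List[Detection] = []
    for q, row in enumerate(scores):
        c = max(range(len(row)), key=row.__getitem__)
        if row[c] > threshold:
            out.append((q, c, row[c]))
    return sorted(out, key=lambda d: d[2], reverse=True)


# Query 0 has two classes above threshold, like frame 81 in the log.
scores = [
    [0.90, 0.70, 0.10],
    [0.20, 0.60, 0.05],
]

legacy = topk_then_filter(scores, k=4, threshold=0.5)
early_stop = walk_until_below(scores, k=4, threshold=0.5)
shortcut = best_class_per_query(scores, threshold=0.5)

print(legacy)
print(shortcut)
```

In this toy grid the early-stopping walk returns the same three detections as top-k followed by filtering, while the per-query shortcut drops the second class of query 0.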
+
+### Rejected: Pinned RFDETR Preprocess Transfer
+
+- Hypothesis: The fused-path nsys profile showed Host-to-Device preprocessing copies as the largest remaining memory operation. Pinning the already-normalized CPU tensor and using a non-blocking CUDA copy could improve transfer overlap.
+- Change tested: Temporary `pre_processing.py` helper that called `tensor.pin_memory().to(device, non_blocking=True)` for CPU tensors moving to CUDA.
+- Result on requested command: `frames=538 elapsed=3.19s fps=168.61`, slower than the committed pageable transfer path.
+- Learning: Per-frame pinning overhead outweighed any asynchronous-copy benefit for this 312x312 tensor. A useful transfer optimization likely needs reusable pinned buffers or moving normalization/conversion to GPU from a smaller uint8 transfer, not pinning after CPU float normalization.
+
+### Rejected: Sparse Early Return In Triton Mask Resize
+
+- Hypothesis: The benchmark has only 1-7 detections per frame, while the fused mask resize kernel is launched over 100 possible detection slots. Returning immediately for detection slots greater than the GPU-side selected count could reduce unnecessary bilinear math.
+- Change tested: Temporary code only; added a runtime `if det_index >= count: return` branch in `_resize_selected_masks_kernel`.
+- Correctness: Deferred workflow postprocess vs default exact-sized postprocess on all 538 frames matched detection counts, class IDs, boxes, and dense masks exactly.
+- Result on requested command: `frames=538 elapsed=3.16s fps=170.11`, slower than the committed mask kernel.
+- Learning: The runtime branch/control-flow cost and changed Triton code generation outweighed the skipped masked arithmetic on T4. Keep the straight-line mask kernel for now.
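For the preprocessing transfer discussed in the pinned-memory entry above, a back-of-the-envelope sketch of the payload size may help put the copy in context. It assumes a 3-channel 312x312 input, as described in the log, and compares `float32` with `uint8`. The host-to-device bandwidth constant is a purely hypothetical round number, not a measurement from this log, and the frame budget uses the `175.12 fps` fused checkpoint.

```python
CHANNELS, HEIGHT, WIDTH = 3, 312, 312
FLOAT32_ITEMSIZE = 4
UINT8_ITEMSIZE = 1

# Hypothetical effective H2D bandwidth; NOT measured in this log.
ASSUMED_H2D_BYTES_PER_S = 10e9
# Fused-checkpoint throughput quoted earlier in the log.
CHECKPOINT_FPS = 175.12


def tensor_bytes(itemsize: int) -> int:
    """Size in bytes of one CHW input tensor with the given element size."""
    return CHANNELS * HEIGHT * WIDTH * itemsize


float32_bytes = tensor_bytes(FLOAT32_ITEMSIZE)
uint8_bytes = tensor_bytes(UINT8_ITEMSIZE)

# Idealized copy time at the assumed bandwidth, ignoring launch latency.
copy_ms = float32_bytes / ASSUMED_H2D_BYTES_PER_S * 1000.0
frame_budget_ms = 1000.0 / CHECKPOINT_FPS

print(f"float32 payload: {float32_bytes:,} bytes")
print(f"uint8 payload:   {uint8_bytes:,} bytes")
print(f"idealized copy:  {copy_ms:.3f} ms of a {frame_budget_ms:.2f} ms frame budget")
```

Under that assumed bandwidth the raw byte movement is a small fraction of the per-frame budget, which is one possible reading of why per-frame pinning overhead did not pay off here; fixed per-copy latency and scheduling effects are not modeled.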
+
+### Rejected: GPU Normalize After Uint8 Preprocess Transfer
+
+- Hypothesis: Instead of transferring the normalized float32 RFDETR input, keep the PIL resize on CPU for pixel compatibility, transfer the resized uint8 image to GPU, and perform channel reorder plus normalization on GPU. This should reduce H2D bytes by roughly 4x.
+- Change tested: Temporary `pre_processing.py` path that converted the PIL-resized image to a uint8 tensor on CUDA, then did CHW conversion and normalization on GPU.
+- Correctness: Compared the previous CPU-normalized tensor path against the GPU-normalized path on all 538 frames: max tensor diff rounded to `0.00000000`, detection counts matched, class IDs matched exactly, and max box delta was `0` px.
+- Result on requested command: depth `2` measured `frames=538 elapsed=3.15s fps=170.80`; retuning depth `3` measured `frames=538 elapsed=3.43s fps=157.08`.
+- Learning: The smaller H2D copy did not compensate for the additional GPU conversion/normalization kernels and stream contention. Any useful version likely needs a single fused conversion kernel and careful scheduling, or reusable CPU-side normalized transfer remains better.
+
+### Rejected: Packed Workflow Detection Metadata Copy
+
+- Hypothesis: The local workflow conversion copies selected boxes, confidences, class IDs, and masks from GPU separately. Packing non-mask detection fields in the fused selector kernel and copying them as one tensor could reduce D2H API overhead.
+- Change tested: Temporary fused selector output of `[x1, y1, x2, y2, confidence, class_id]` for workflow conversion, using the packed tensor instead of separate `xyxy`, `confidence`, and `class_id` copies.
+- Correctness: Deferred workflow postprocess vs default exact-sized postprocess on all 538 frames matched detection counts, class IDs, and dense masks exactly; max box delta was `0.5` px due to packed float boxes instead of rounded int boxes.
+- Result on requested command: Two exact passes measured `frames=538 elapsed=3.08s fps=174.63` and `frames=538 elapsed=3.09s fps=174.32`, not better than the committed checkpoint.
+- Learning: The extra selector stores and altered code generation offset the small D2H-call reduction. Keep the simpler separate tensors unless a future fused CPU conversion removes more overhead.
+
+### Rejected: Disable TRT CUDA Graph Replay
+
+- Hypothesis: CUDA graph replay clones graph output buffers every frame, contributing visible D2D copy time. Running the standard TensorRT async path could avoid those clones and be competitive after postprocess fusion.
+- Change tested: No code change; ran the requested command with `ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=False`.
+- Result on requested command: `frames=538 elapsed=3.11s fps=173.27`, below the committed CUDA graph path.
+- Learning: TRT graph replay still wins overall; the graph-launch reduction is more valuable than removing output-buffer clone traffic.
+
+### NumPy RFDETR PIL Tensor Conversion
+
+- Hypothesis: The RFDETR preprocessing path still spent CPU time in `TF.to_tensor(...)` followed by `TF.normalize(...)` after the PIL-compatible resize. A single NumPy float32 conversion/normalization step should preserve the same resized pixels while reducing Python/PyTorch transform overhead.
+- Change: Added a gated RFDETR numpy/PIL preprocessing fast path for 3-channel normalized inputs. It converts the PIL-resized image to `float32` NumPy, applies channel swap and normalization in HWC layout, then creates the CHW tensor from a contiguous transpose. Non-normalized or non-3-channel cases fall back to the previous torchvision path.
+- Correctness: Compared the previous `TF.to_tensor`/`TF.normalize` path against the NumPy path on all 538 frames: max tensor diff `0.00000072`, detection counts matched, class IDs matched exactly, and max box delta was `0` px.
+- Micro-result: Preprocess-only loop over 128 frames improved from `1.957 ms/frame` to `1.890 ms/frame`.
+- Pipeline tuning: Default depth `2` measured `frames=538 elapsed=2.81s fps=191.24`; serial depth `3` measured `frames=538 elapsed=2.94s fps=182.96`, so depth `2` remains best.
+- Result on requested command: `frames=538 elapsed=2.81s fps=191.24`.
+- Learning: Keeping the PIL resize source-of-truth while reducing conversion overhead gives a real pipeline gain; after postprocess fusion, small CPU preprocessing savings matter because they improve producer/consumer balance at depth `2`.
+
+### Rejected: Exact-Capacity Fused Mask Resize
+
+- Hypothesis: The benchmark emits only 1-7 detections per frame, but the deferred fused workflow path launches `_resize_selected_masks_kernel` over the full 100-detection capacity. Synchronizing the selected count before mask resize, allocating exactly that many mask planes, and launching only those programs could reduce the dominant custom kernel time.
+- Change tested: Temporary code only; in deferred fused postprocess, copied the selected count to CPU before mask resize, sliced detection tensors immediately, and passed an exact `output_capacity` to `fused_resize_selected_masks(...)`.
+- Correctness: Compared non-deferred exact-sized postprocess against the exact-capacity deferred path on all 538 frames: max selected count `7`, class IDs exact, max box delta `0` px, confidences exact, and dense masks exact.
+- Pipeline tuning: Depth `2` measured `frames=538 elapsed=3.27s fps=164.47`; depth `3` measured `frames=538 elapsed=3.12s fps=172.18`.
+- Learning: The earlier CPU count synchronization breaks the useful overlap from the deferred path. Even though it reduces mask resize work, preserving the GPU-side count until workflow conversion is faster overall.
+
+### Rejected: OpenCV RFDETR Resize Fast Path
+
+- Hypothesis: Replacing PIL resize with OpenCV bilinear resize in RFDETR preprocessing could reduce CPU producer time while preserving class IDs and keeping boxes within 5 px.
+- Change tested: Temporary script-only prototype; resized the `312x176` video frames to the `312x312` TRT input with `cv2.resize`, then applied the same BGR-to-RGB swap and normalization.
+- Correctness: Compared against the current PIL path on all 538 frames. Counts differed on 7 frames, class IDs differed on 14 frames, and max box delta reached `183` px.
+- Micro-result: Prototype preprocessing measured `2.969 ms/frame` vs current `2.013 ms/frame` over 128 frames.
+- Learning: PIL interpolation is part of the effective RFDETR input contract for this checkpoint. OpenCV is both slower here and not prediction-compatible.
+
+### Direct PIL RFDETR Resize
+
+- Hypothesis: The torchvision `TF.resize(...)` PIL wrapper adds Python overhead around a PIL bilinear resize. Calling `PIL.Image.resize(...)` directly should preserve pixels while shaving preprocessing overhead.
+- Change: In the RFDETR numpy preprocessing branch, replaced `TF.resize(pil, ...)` with direct `pil.resize((width, height), Image.Resampling.BILINEAR)` before the existing NumPy normalize/CHW conversion.
+- Correctness: Reproduced the old `TF.resize(..., antialias=True)` path over all 538 frames and compared tensors against the patched model preprocessing: max tensor diff `0.0000000000`, so classes and boxes are unchanged.
+- Pipeline tuning: Depth `2` measured `frames=538 elapsed=2.78s fps=193.48`; depth `3` measured `frames=538 elapsed=2.97s fps=181.10`, so depth `2` remains best.
+- Result on requested command: best isolated run `frames=538 elapsed=2.78s fps=193.48`; repeat isolated run measured `frames=538 elapsed=2.82s fps=190.99`.
+- Learning: Removing the torchvision wrapper is a small, exact CPU-side cleanup. The end-to-end gain is near run-to-run noise, but the highest isolated benchmark moved slightly upward and the change is byte-equivalent for the model input.
+
+### Rejected: Workflow Fast Path Inference Mode Wrapper
+
+- Hypothesis: The direct local workflow fast path executes RFDETR TRT preprocess, predict, and postprocess without needing autograd. Wrapping that section in `torch.inference_mode()` could reduce PyTorch dispatch overhead around tensor copies and postprocess kernels.
+- Change tested: Temporary code only; added `with torch.inference_mode():` around `model.preprocess(...)`, `model.predict(...)`, and fused `model._model.post_process(...)` in the instance segmentation workflow fast path.
+- Correctness: Compared model execution outside vs inside `torch.inference_mode()` on all 538 frames: class IDs exact and max box delta `0` px.
+- Result on requested command: repeat runs measured `frames=538 elapsed=2.80s fps=191.83` and `frames=538 elapsed=2.81s fps=191.27`.
+- Learning: The TRT path already produces tensors with no autograd work worth removing; the wrapper is neutral within noise and does not justify extra workflow code.
+
+### RFDETR Channel-Wise CHW Normalization
+
+- Hypothesis: The NumPy RFDETR preprocessing path still creates a normalized HWC float array and then makes a contiguous CHW copy. Writing normalized channels directly into a CHW float32 output should avoid one layout-conversion allocation.
+- Change: `_pil_image_to_normalized_tensor(...)` now reads the resized PIL image as uint8, normalizes each selected channel into a preallocated CHW float32 array, and returns that array directly as the tensor backing storage.
+- Correctness: Reproduced the prior HWC-float/transpose formula over all 538 frames and compared tensors against the patched preprocessing path: max tensor diff `0.0000000000`, so classes and boxes are unchanged.
+- Micro-result: Preprocess-only loop over 128 frames measured `1.980 ms/frame`; the isolated conversion prototype measured `0.605 ms/frame` vs `0.622 ms/frame` for the prior conversion helper. +- Pipeline tuning: Depth `2` measured `frames=538 elapsed=2.80s fps=192.42` and `frames=538 elapsed=2.79s fps=193.16`; depth `3` measured `frames=538 elapsed=3.10s fps=173.33`. +- Learning: This is an exact, small allocation cleanup. It does not materially shift the bottleneck or pipeline depth, but it keeps the preprocessing path leaner without changing model inputs. + +### Rejected: Keep Deferred Query Indices As Int32 + +- Hypothesis: Nsight Systems still showed a tiny PyTorch copy/cast kernel after fused selection. The deferred mask resize kernel can read int32 query indices directly, so skipping `query_indices.to(dtype=torch.long)` in the GPU-deferred path could remove one kernel launch and D2D copy per frame. +- Change tested: Temporary code only; when `return_cpu_count=False`, `fused_select_topk_boxes(...)` returned int32 query indices instead of casting them to int64. +- Correctness: Deferred workflow postprocess vs exact-sized postprocess on all 538 frames matched class IDs, boxes, and dense masks exactly; max box delta `0` px. +- Result on requested command: repeat runs measured `frames=538 elapsed=2.80s fps=192.18` and `frames=538 elapsed=2.83s fps=189.79`, below the current checkpoint band. +- Learning: Removing this small cast does not improve end-to-end throughput; the scheduling and pipeline balance dominate over this tiny kernel. + +### Rejected: Deterministic Local Workflow Detection IDs + +- Hypothesis: The direct local workflow fast path still creates a UUID per detection. Reusing the request inference ID plus a detection index would reduce Python UUID work during CPU-side `sv.Detections` construction. 
+- Change tested: Temporary code only; when `inference_id` was present, generated detection IDs as `"{inference_id}-{index}"` in the local instance segmentation workflow fast path. +- Correctness: The change runs after tensor-to-NumPy conversion and does not affect model classes or boxes. +- Result on requested command: repeat runs measured `frames=538 elapsed=2.79s fps=193.07` and `frames=538 elapsed=2.83s fps=189.99`, not a clear improvement. +- Learning: UUID generation is not a measurable limiter after the fused/deferred path; keep the existing identifier behavior. + +### Rejected: PIL Image FromBuffer Wrapper + +- Hypothesis: `Image.fromarray(...)` may copy contiguous OpenCV frames before PIL resize. For uint8 HWC contiguous images, `Image.frombuffer(...)` could avoid that copy while preserving channel values. +- Change tested: Temporary `_pil_from_hwc_uint8(...)` helper that used `Image.frombuffer("RGB", ...)` for contiguous 3-channel uint8 images and fell back to `Image.fromarray(...)` otherwise. +- Correctness: Compared patched preprocessing against the previous `Image.fromarray(...)` path on all 538 frames: max tensor diff `0.0000000000`, so classes and boxes are unchanged. +- Micro-result: Isolated conversion prototype measured `1.446 ms/frame` vs `1.455 ms/frame` for the `fromarray` path over 128 frames. +- Result on requested command: repeat runs measured `frames=538 elapsed=2.81s fps=191.23` and `frames=538 elapsed=2.82s fps=191.04`, below the current checkpoint band. +- Learning: Any copy saved by `frombuffer` is too small to matter, and PIL resize plus downstream pipeline scheduling dominate this part of preprocessing. + +### Reusable Pinned RFDETR Preprocess Buffer + +- Hypothesis: Nsight Systems still showed the normalized RFDETR input Host-to-Device copy as the largest memory operation. 
Filling a reusable pinned CPU tensor directly in CHW layout and copying it to CUDA with `non_blocking=True` should reduce CPU-side transfer blocking and improve overlap with the GPU pipeline. +- Change: For the single-image CUDA numpy preprocessing path, `_pil_image_to_normalized_tensor(...)` now writes normalized channels into a thread-local pinned `torch.float32` CHW buffer. `pre_process_network_input(...)` uses a non-blocking device copy from pinned memory and records a per-thread CUDA event so the host buffer is not reused until the prior H2D copy is complete. Batch and non-CUDA paths keep the normal NumPy-backed tensor behavior. +- Correctness: Compared the pinned integrated preprocessing path against the previous non-pinned CHW formula on all 538 frames: max tensor diff `0.0000000000`, so classes and boxes are unchanged. +- Micro-result: Preprocess-only loop over 128 frames measured `1.663 ms/frame`, down from the prior ~`1.98 ms/frame` band. +- Pipeline tuning: Depth `1` measured `frames=538 elapsed=4.15s fps=129.55`; depth `2` measured `frames=538 elapsed=2.64s fps=203.56` and `frames=538 elapsed=2.62s fps=205.28`; depth `3` measured `frames=538 elapsed=3.19s fps=168.47`. +- Result on requested command: best isolated run `frames=538 elapsed=2.62s fps=205.28`. +- Learning: Reusable pinned memory is the first preprocessing transfer change that helps end-to-end. The earlier per-frame `.pin_memory()` experiment was slower because it paid pinning cost every frame; reusing the pinned storage preserves the transfer benefit without that allocation cost. + +### Rejected: Retune Fused Mask Resize Pixel Block + +- Hypothesis: `_resize_selected_masks_kernel` is still the largest custom kernel in Nsight Systems. Changing the per-program pixel block from 256 could improve occupancy or reduce Triton program count on T4. +- Change tested: Temporary code only; tried `block_size=512` with `num_warps=8`, then `block_size=128` with `num_warps=4`. 
+- Correctness: The `512` variant matched exact-sized postprocess on all 538 frames, including dense masks and max box delta `0` px. The `128` variant changes only tile shape and uses the same math. +- Result on requested command: `512/8` measured `frames=538 elapsed=2.68s fps=200.67`; `128/4` measured `frames=538 elapsed=2.66s fps=202.41`, both below the committed `256/4` path. +- Learning: The current 256-pixel tile remains the best balance. Larger tiles likely hurt register/occupancy behavior, while smaller tiles add too many programs. + +### Direct NumPy Ufunc RFDETR Channel Fill + +- Hypothesis: The pinned preprocessing path still allocates a temporary float32 array for each channel via `astype(...)` before copying into the pinned CHW output. Writing each uint8 channel directly into the destination with NumPy ufuncs should remove those temporary arrays. +- Change: Replaced per-channel `astype(np.float32)` temporaries with `np.multiply(..., out=channel, casting="unsafe")` directly into the normalized output channel, followed by in-place mean/std normalization. +- Correctness: Compared the patched preprocessing path against the prior temporary-channel formula on all 538 frames: max tensor diff `0.0000000000`, so classes and boxes are unchanged. +- Micro-result: Isolated channel-fill prototype measured `0.434 ms/frame` vs `0.476 ms/frame` for the previous temporary-channel fill. Integrated preprocess-only loop over 128 frames measured `1.638 ms/frame`. +- Result on requested command: isolated depth `2` runs measured `frames=538 elapsed=2.61s fps=206.40` and `frames=538 elapsed=2.59s fps=207.82`. +- Learning: Small CPU allocation reductions still matter after the pinned H2D change because they improve the producer side of the two-frame pipeline without changing GPU semantics. + +### Cached RFDETR Normalization Constants + +- Hypothesis: `_pil_image_to_normalized_tensor(...)` rebuilds NumPy mean/std arrays and a float32 scale scalar every frame. 
Caching these immutable normalization constants per thread should remove small repeated allocations from the producer path. +- Change: Added a thread-local normalization constants cache keyed by the configured mean/std values, reusing the float32 mean array, std array, and `1/255` scale across frames. +- Correctness: Compared cached-constant preprocessing against the prior ufunc-fill formula on all 538 frames: max tensor diff `0.0000000000`, so classes and boxes are unchanged. +- Pipeline tuning: Depth `2` runs measured `frames=538 elapsed=2.58s fps=208.14` and `frames=538 elapsed=2.58s fps=208.26`; depth `3` measured `frames=538 elapsed=3.07s fps=175.07`. +- Learning: At this point, even small per-frame Python/NumPy allocations are visible in the two-frame pipeline balance. + +### Rejected: Double-Buffered Pinned Preprocess Buffers + +- Hypothesis: The reusable pinned preprocessing buffer waits on the previous H2D copy before reusing host memory. Alternating between two pinned host buffers could let CPU normalization for the next frame proceed while the previous pinned copy is still in flight. +- Change tested: Temporary code only; replaced the single thread-local pinned buffer with a two-buffer ping-pong and per-buffer CUDA copy events. +- Correctness: Compared double-buffered preprocessing against the prior single-buffer formula on all 538 frames: max tensor diff `0.0000000000`. +- Result on requested command: repeat runs measured `frames=538 elapsed=2.60s fps=206.93` and `frames=538 elapsed=2.60s fps=206.71`, below the single-buffer cached path. +- Learning: The extra buffer/event bookkeeping outweighed any overlap benefit. The single reusable pinned buffer remains better for the current two-frame pipeline. + +### Rejected: RFDETR No-Op Preprocessing Bypass + +- Hypothesis: The benchmark model has no static crop, grayscale, or contrast preprocessing configured, but RFDETR still calls the generic numpy preprocessing helper. 
Bypassing that helper in the no-op case could remove branch overhead and default crop metadata construction. +- Change tested: Temporary code only; skipped `apply_pre_processing_to_numpy_image(...)` when preprocessing overrides were absent and static crop/grayscale/contrast configs were all `None`, constructing the default `StaticCropOffset` directly. +- Correctness: Compared against the previous preprocessing path on all 538 frames: max tensor diff `0.0000000000`, so classes and boxes are unchanged. +- Result on requested command: repeat runs measured `frames=538 elapsed=2.60s fps=207.17` and `frames=538 elapsed=2.62s fps=205.04`, below the cached-normalization checkpoint. +- Learning: The generic no-op helper is not a meaningful limiter; keeping the established shared helper is better than adding a special branch here. + +### Rejected: Same-Stream RFDETR TRT Postprocess + +- Hypothesis: Nsight Systems showed substantial CUDA event overhead around cross-stream raw output handoff. Running RFDETR TRT postprocess on the inference stream and removing raw-output `record_stream(...)` calls could reduce event bookkeeping, at the cost of less overlap between next-frame inference and previous-frame postprocess. +- Change tested: Temporary code only; changed `RFDetrForInstanceSegmentationTRT.post_process(...)` to use `_inference_stream` instead of `_post_process_stream` and removed the raw result `record_stream(...)` loop. +- Correctness: Deferred fused postprocess vs exact-sized postprocess on all 538 frames matched class IDs, dense masks, and boxes exactly; max box delta `0` px. +- Result on requested command: repeat runs measured `frames=538 elapsed=2.59s fps=207.84` and `frames=538 elapsed=2.63s fps=204.87`, below the current checkpoint band. +- Learning: The existing separate postprocess stream still pays for itself. Keeping postprocess overlapped with the next TensorRT replay is better than removing the stream/event bookkeeping. 
+ +### Rejected: Float Boxes In Deferred Fused Postprocess + +- Hypothesis: The deferred RFDETR workflow path still launches tiny PyTorch `round` and `int` kernels for selected boxes. Returning float `xyxy` tensors through workflow conversion could remove those kernels while staying within the allowed 5 px box tolerance. +- Change tested: Temporary code only; returned `selected_boxes` instead of `selected_boxes.round().int()` when `defer_fused_postprocess_count=True`. +- Correctness: Compared deferred vs exact-sized postprocess on 120 frames: counts and class IDs matched, with max box delta `0.5` px. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.63s fps=204.86`, below the current checkpoint band. A simultaneous depth `2`/`3` run was discarded because both processes contended for the GPU. +- Learning: Removing these tiny kernels is not enough to improve the full pipeline, and keeping integer boxes preserves the established output type. + +### Rejected: Retire Completed Workflow Futures Out Of Order + +- Hypothesis: The two-frame workflow pipeline may create CUDA graph bubbles when a completed out-of-order future still counts against the in-flight limit until earlier frames dispatch. Moving completed futures into a ready map immediately while preserving ordered emission could free worker slots sooner. +- Change tested: Temporary `InferencePipeline` scheduler change using `concurrent.futures.wait(..., FIRST_COMPLETED)` for multi-worker inference. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.64s fps=203.71`; depth `3` measured `frames=538 elapsed=3.10s fps=173.46`. +- Learning: The graph bubbles are dominated by whole-frame stage balance, not by completed futures being held for ordered dispatch. The extra scheduler bookkeeping was not useful here. 
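
The out-of-order retirement idea can be sketched with the standard library alone. This is a minimal illustration, not the `InferencePipeline` change that was tested: `run_ordered`, `fake_infer`, and the `ready`/`pending` maps are hypothetical names, and a thread pool stands in for the workflow workers.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_ordered(frames, infer, max_in_flight=2):
    """Retire futures as soon as they finish, but emit results in frame order."""
    emitted = []
    ready = {}    # frame index -> result, completed but not yet emitted
    pending = {}  # future -> frame index, still counted against the in-flight limit
    next_to_emit = 0

    def retire(done):
        nonlocal next_to_emit
        for future in done:
            ready[pending.pop(future)] = future.result()
        # Emit every result whose predecessors have all been emitted.
        while next_to_emit in ready:
            emitted.append(ready.pop(next_to_emit))
            next_to_emit += 1

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for index, frame in enumerate(frames):
            # Free a slot as soon as any in-flight future completes.
            while len(pending) >= max_in_flight:
                done, _ = wait(pending, return_when=FIRST_COMPLETED)
                retire(done)
            pending[pool.submit(infer, frame)] = index
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            retire(done)
    return emitted


def fake_infer(frame):
    # Even frames are slower, so odd frames tend to finish out of order.
    time.sleep(0.02 if frame % 2 == 0 else 0.001)
    return frame * 10


results = run_ordered(range(6), fake_infer)
print(results)
```

The ready map keeps emission ordered while a completed future stops occupying a worker slot, which is the extra bookkeeping the experiment above found was not worth its cost at depth `2`.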
+ +### RFDETR Fast Path Deferred Current-Stream Waits + +- Hypothesis: In the direct local RFDETR TRT workflow path, `pre_process(...)`, `forward(...)`, and `post_process(...)` are called back-to-back by the same fast path. The intermediate current-stream waits after preprocessing and forward add CUDA event edges even though `forward(...)` already waits on the preprocessing stream and `post_process(...)` already waits on the inference stream. +- Change: Added `defer_cuda_stream_sync` for RFDETR TRT dense-mask workflow execution. The fast path passes it through preprocess and predict, and `RFDetrForInstanceSegmentationTRT` skips only the redundant current-stream waits after preprocessing and forward. The postprocess-to-CPU conversion wait is unchanged. +- Pipeline tuning: Depth `2` measured `frames=538 elapsed=2.57s fps=209.12`, `frames=538 elapsed=2.61s fps=206.46`, and `frames=538 elapsed=2.56s fps=210.18`. Depth `3` measured `frames=538 elapsed=2.74s fps=196.27`, so depth `2` remains best. +- Nsight Systems: New report for analysis: `/tmp/rfdetr_stream_wait_20260523_031606.nsys-rep` with SQLite export `/tmp/rfdetr_stream_wait_20260523_031606.sqlite`. Under profiler, throughput improved to `frames=538 elapsed=3.09s fps=173.88`. +- Graph spacing: Compared to the clean local baseline profile `/tmp/rfdetr_gap_local_20260523_031231.nsys-rep`, post-warmup graph end-to-next-start gaps improved from p90 `7635.910 us`, p95 `8175.037 us`, p99 `8748.532 us`, mean `2734.651 us` to p90 `4091.966 us`, p95 `4392.314 us`, p99 `4934.320 us`, mean `2050.445 us` after skipping the first 100 graph launches. +- Learning: Removing redundant wait edges reduces the long graph bubbles visible in Nsight while preserving the explicit stream dependencies that matter. The run is now more tightly constrained by the CUDA graph forward pass plus fused postprocess, and depth `3` is still worse because extra workers add CPU/GPU contention. 
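
The graph-spacing numbers above are end-to-next-start gaps over post-warmup launches. A minimal sketch of that calculation follows, assuming the launch intervals have already been extracted from the profile as `(start_us, end_us)` pairs sorted by start time; `gap_stats` and the synthetic data are hypothetical, and the percentile interpolation may differ from whatever tool produced the logged values.

```python
import statistics


def gap_stats(launches, skip=100):
    """Percentiles of end-to-next-start gaps, ignoring the first `skip` launches."""
    steady = launches[skip:]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(steady, steady[1:])]
    cuts = statistics.quantiles(gaps, n=100, method="inclusive")  # 99 cut points
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": cuts[94],
        "p99": cuts[98],
        "mean": statistics.fmean(gaps),
    }


# Synthetic timeline: 3600 us graphs, large gaps during warmup, small gaps afterwards.
launches = []
t = 0.0
for i in range(300):
    start, end = t, t + 3600.0
    launches.append((start, end))
    gap = 5000.0 if i < 100 else 700.0 + (i % 5) * 50.0
    t = end + gap

stats = gap_stats(launches, skip=100)
print(stats)
```

Skipping the warmup launches keeps one-time costs out of the tail percentiles, so p90 through p99 reflect steady-state bubbles rather than startup.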
+ +### Rejected: Postprocess-Stream CPU Conversion + +- Hypothesis: After deferring the intermediate waits, the remaining postprocess current-stream wait might be avoidable by performing the workflow tensor-to-NumPy copies under the RFDETR postprocess stream context. +- Change tested: Temporary code only; skipped the postprocess current-stream wait and wrapped local workflow conversion in `torch.cuda.stream(model._model._post_process_stream)`. +- Result on requested command: after fixing a temporary helper-name typo, depth `2` measured `frames=538 elapsed=2.65s fps=202.97`. +- Learning: The synchronization still has to happen before CPU predictions are materialized, and moving those copies onto the postprocess stream made the normal run slower. Keep the postprocess wait at the model boundary. + +### Rejected: Skip RFDETR Output Record Stream In Fast Path + +- Hypothesis: The fast RFDETR TRT dense-mask workflow path waits for postprocess before returning to CPU conversion, so `record_stream(...)` on the three TensorRT output clones might be redundant allocator bookkeeping. +- Change tested: Temporary code only; skipped `result_element.record_stream(self._post_process_stream)` when `defer_cuda_stream_sync=True`. +- Correctness: Compared exact-sized postprocess against deferred fused postprocess on 160 frames with the fast-path flags: counts, classes, and boxes matched exactly. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.60s fps=206.61`, below the current `210.18` FPS checkpoint. +- Learning: The explicit allocator stream handoff is still worth keeping; removing it likely shifts synchronization or allocator pressure elsewhere. + +### Rejected: Pooled TensorRT CUDA Graph Output Copies + +- Hypothesis: CUDA graph replay clones every TensorRT output each frame. Replacing those per-frame clone allocations with a two-slot reusable output-copy pool could preserve overlap while reducing allocator and CUDA event churn. 
+- Change tested: Temporary code only; added a tuple lease around pooled output-copy buffers and released the slot from RFDETR postprocess after its stream consumed the raw TensorRT outputs. +- Correctness: After preserving the lease wrapper through RFDETR `forward(...)`, compared exact-sized postprocess against deferred fused postprocess on 120 frames: counts, classes, and boxes matched exactly. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.62s fps=205.23`, below the current checkpoint. +- Learning: The extra Python lease, event, and slot bookkeeping costs more than the clone allocations it avoids. The TensorRT output clone path is not worth changing this way. + +### Rejected: Fuse RFDETR Sigmoid Into Triton Selector + +- Hypothesis: The fast fused postprocess path still launches a PyTorch sigmoid kernel over logits before the Triton selector scans the same scores. Computing sigmoid inside the selector could remove one kernel launch and one intermediate tensor. +- Change tested: Temporary code only; passed raw logits to `_select_topk_boxes_kernel` and computed `1 / (1 + exp(-logit))` in Triton before top-k selection. The fallback path still materialized `logits_sigmoid` only if fused selection was unavailable. +- Correctness: Compared the fused path against the non-fused PyTorch fallback on 160 frames by monkeypatching the fused selector off for the reference. Counts, classes, and boxes matched exactly. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.60s fps=206.54`, below the current `210.18` FPS checkpoint. +- Learning: The extra Triton `exp` work inside the selector costs more than the standalone PyTorch sigmoid kernel in this shape. Keep the sigmoid as a separate highly optimized PyTorch elementwise kernel. 
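
As a plain-Python illustration of the math involved (not the Triton kernel), the sketch below applies the sigmoid and then selects the top-k scores. The `threshold` parameter and the helper names are assumptions for illustration. Because the sigmoid is strictly increasing, ranking by raw logits picks the same indices; only the reported score values need the transform.

```python
import math


def sigmoid(logit):
    # Algebraically equal to 1 / (1 + exp(-logit)); the branch avoids
    # overflow in exp() for large negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def topk_scores(logits, k, threshold):
    """Return (index, score) for the k highest sigmoid scores at or above threshold."""
    scored = [(i, sigmoid(x)) for i, x in enumerate(logits)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scored[:k] if s >= threshold]


logits = [-2.0, 0.5, 3.0, -0.1, 1.2]
selected = topk_scores(logits, k=3, threshold=0.5)

# Ranking by raw logits gives the same indices as ranking by sigmoid scores.
by_logit = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:3]
print(selected, by_logit)
```

Whether the transform runs inside the selector or as a separate elementwise pass does not change which entries are selected; the experiment above only measured which placement was faster.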
+ +### Rejected: Blocked-Detection Triton Mask Resize + +- Hypothesis: `_resize_selected_masks_kernel` launches a program for every detection row and pixel tile, including no-op rows beyond the selected count. Handling multiple detection rows per program could reduce Triton program count and improve the largest custom kernel. +- Change tested: Temporary code only; changed the mask resize kernel to process `block_detections` rows by `block_size` pixels per program. Tried `block_detections=4, block_size=128` and then `block_detections=2, block_size=128`. +- Correctness: The `4x128` variant matched the non-fused PyTorch fallback on 120 frames: counts, classes, and boxes matched exactly. +- Result on requested command: `4x128` measured `frames=538 elapsed=2.59s fps=207.68`; `2x128` measured `frames=538 elapsed=2.60s fps=207.12`, both below the current checkpoint. +- Learning: The reduced program count does not compensate for the larger vector/register footprint on this T4 workload. The original one-detection, 256-pixel tile remains better. + +### Rejected: Packed RFDETR Metadata Copy + +- Hypothesis: Workflow conversion performs separate small D2H copies for count, boxes, confidence, class IDs, and masks. Packing boxes, confidence, and class IDs into one Triton-produced float32 metadata tensor could reduce tiny D2H calls and remove the deferred path's box round/int kernels. +- Change tested: Temporary code only; `_select_topk_boxes_kernel` wrote a `(100, 6)` packed metadata tensor `[x1, y1, x2, y2, score, class_id]`, and the local workflow fast path copied that tensor once before slicing CPU arrays. +- Correctness: Compared against the non-fused PyTorch fallback on 160 frames: counts and classes matched exactly, max box delta `0.5` px. +- Result on requested command: depth `2` repeat runs measured `frames=538 elapsed=2.57s fps=209.35` and `frames=538 elapsed=2.57s fps=209.72`, close but still below the current `210.18` FPS checkpoint. 
+- Learning: Reducing small D2H copies alone does not beat the added Triton stores and changed CPU formatting. Keep the simpler separate metadata tensors. + +### Rejected: Skip TensorRT Input Record Stream In Fast Path + +- Hypothesis: In the RFDETR local workflow fast path, the preprocessed input tensor remains alive until postprocess has waited on inference, so `pre_processed_images.record_stream(inference_stream)` in the TensorRT wrapper might be redundant allocator bookkeeping. +- Change tested: Temporary code only; added a `record_input_stream` flag to `infer_from_trt_engine(...)` and disabled it only when `defer_cuda_stream_sync=True`. +- Correctness: Compared exact-sized postprocess against deferred fused postprocess on 120 frames: counts, classes, and boxes matched exactly. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.58s fps=208.39`, below the current checkpoint. +- Learning: The input allocator stream handoff is still useful or its removal shifts synchronization elsewhere. Keep the TensorRT wrapper's input `record_stream(...)`. + +### Rejected: Inline Workflow Metadata Attachment + +- Hypothesis: The local workflow fast path constructs `sv.Detections`, then walks those objects again to attach prediction type and parent/root coordinate metadata. Filling those arrays during conversion could reduce CPU-side object mutation before the next frame can feed CUDA graph replay. +- Change tested: Temporary code only; extended `_convert_inference_models_detections_to_sv_detections(...)` to accept `images` and `prediction_type`, then skipped `attach_prediction_type_info_to_sv_detections_batch(...)` and `attach_parents_coordinates_to_batch_of_sv_detections(...)` in the fast path. +- Correctness: This change runs after tensor-to-NumPy conversion and does not affect model classes or boxes. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.59s fps=207.76`, below the current checkpoint. 
+- Learning: The generic metadata helpers are not the current limiter; inlining their work added enough Python/object overhead to hurt throughput. + +### Rejected: Pinned Host Workflow Prediction Copy + +- Hypothesis: Workflow conversion copies mask tensors and small metadata tensors from GPU to CPU synchronously with `.cpu().numpy()`. Staging RFDETR deferred outputs through reusable pinned CPU tensors on a copy stream could reduce D2H blocking and create room to overlap CPU metadata work. +- Change tested: Temporary code only; allocated thread-local pinned CPU buffers for `xyxy`, confidence, class IDs, and dense masks, copied with `non_blocking=True` on a thread-local CUDA stream, then synchronized once before constructing `sv.Detections`. +- Correctness: The change only affects CPU materialization; no model classes or boxes are changed. +- Result on requested command: depth `2` measured `frames=538 elapsed=3.50s fps=153.82`. +- Learning: Pinned staging at the workflow boundary is much slower than PyTorch's direct `.cpu().numpy()` path here. The explicit stream synchronization and large pinned mask buffer mechanics dominate any potential overlap benefit. + +### RFDETR TRT Pre-Request Workflow Fast Path + +- Hypothesis: The local workflow path still builds `inference_images`, creates a Pydantic `InstanceSegmentationInferenceRequest`, dumps it back into a dict, and runs adapter image loading before reaching the RFDETR TRT fast path. For the benchmark's local dense-mask RFDETR TRT case, the workflow image already has a BGR NumPy frame, so the model can be called directly before request construction. +- Change: `run_locally(...)` now attempts an RFDETR-TRT-specific fast path before constructing the request. It loads the model, uses `WorkflowImageData.numpy_image` directly, passes minimal model kwargs, keeps `defer_cuda_stream_sync=True`, and still uses the existing `sv.Detections` conversion and workflow metadata helpers. 
+- Correctness: Over all 538 frames, `load_image_bgr({"type": "numpy_object", "value": frame})` matched the direct `frame` pixels exactly (`bad_pixels=0`). Deferred fused postprocess matched exact-sized postprocess with `bad_counts=0`, `bad_classes=0`, and `max_box_delta=0` px. +- Result on requested command: depth `2` runs measured `frames=538 elapsed=2.53s fps=212.35` and `frames=538 elapsed=2.52s fps=213.23`, improving the previous `210.18` FPS checkpoint. +- Learning: Avoiding per-frame request construction, request dump, numpy payload wrapping, adapter image loading, and repeated kwarg mapping keeps the CPU producer closer to the CUDA graph replay cadence without changing model inputs. + +### Rejected: Cached RFDETR PreProcessingOverrides In Workflow Fast Path + +- Hypothesis: The new RFDETR pre-request fast path creates a default `PreProcessingOverrides` object per frame. Reusing one immutable default instance could remove a small allocation. +- Change tested: Temporary code only; replaced the per-call default override object with a module-level `_RFDETR_NO_PREPROCESSING_OVERRIDES = PreProcessingOverrides()`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.57s fps=208.98`, below the per-call override fast path. +- Learning: This was either noise-sensitive or interacted poorly with the surrounding path; keep the simpler per-call object that produced the better repeated benchmark. + +### Rejected: Cache RFDETR Fast Path Model Reference + +- Hypothesis: Even after the pre-request fast path, `run_locally(...)` still calls `model_manager.add_model(...)` and `_try_run_rfdetr_trt_fast_path(...)` indexes the manager every frame. Caching the loaded RFDETR adapter on the workflow block could avoid model-manager cache refresh and lock overhead. 
+- Change tested: Temporary code only; stored `_rfdetr_trt_fast_path_model` and `_rfdetr_trt_fast_path_model_id` after first lookup, skipped `add_model(...)` when the cached ID matched, and reused the cached adapter. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.55s fps=210.59`, below the simpler pre-request fast path. +- Learning: The added per-frame Python attribute checks and branches outweigh the manager lookup savings in this benchmark. Keep the direct manager call. + +### Rejected: Submit Next Workflow Batch Before Sink Emit + +- Profile: Captured the current pre-request fast path with Nsight Systems at `/tmp/rfdetr_request_bypass_20260523_041056.nsys-rep` and exported `/tmp/rfdetr_request_bypass_20260523_041056.sqlite`. Under profiler, depth `2` measured `frames=538 elapsed=3.01s fps=178.65`. After skipping the first 100 graph launches, CUDA graph duration was stable at p50 `3590.535 us`, while graph end-to-next-start gap was p50 `1116.078 us`, p90 `3721.247 us`, p95 `4175.294 us`, p99 `5359.476 us`. +- Hypothesis: In the multi-worker `InferencePipeline` loop, emitting a completed ordered result before submitting the next batch may leave a worker slot idle and widen the graph replay gap. Submitting the next batch immediately after a slot frees, then emitting the ordered sink result, could reduce CPU-side bubbles while preserving `max_inflight_workflow_batches=2`. +- Change tested: Temporary code only; collected completed ordered results in `ready_to_emit`, submitted the current frame as soon as the pending count dropped below the worker limit, and emitted the collected results afterward. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.55s fps=210.92` and `frames=538 elapsed=2.54s fps=211.51`, below the current `213.23` FPS checkpoint. +- Learning: Sink emission is not the source of the remaining graph bubbles for this benchmark. 
Keep the original simpler ordered scheduler and continue focusing on model/postprocess conversion costs. + +### Rejected: Keep Deferred RFDETR Query Indices Int32 + +- Hypothesis: The deferred fused postprocess path zero-fills `query_indices` and converts it from int32 to int64 before the Triton mask resize kernel, even though the Triton kernel only needs int32 indices. Using an uninitialized int32 tensor for the deferred path could remove a fill kernel and an int32-to-int64 copy kernel. +- Change tested: Temporary code only; changed `query_indices` from `torch.zeros(...)` to `torch.empty(...)` and skipped `.to(dtype=torch.long)` when `return_cpu_count=False`. +- Correctness: Compared exact-sized postprocess against deferred fused postprocess on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.56s fps=209.81`, below the current checkpoint. +- Learning: The removed kernels are not on the critical path enough to offset allocator or scheduling side effects. Keep the existing zeroed int32 tensor and long conversion. + +### Rejected: Query Pinned Preprocess Copy Event Before Synchronize + +- Hypothesis: The reusable pinned preprocessing buffer synchronizes its previous H2D copy before every reuse. With pipeline depth `2`, the previous copy for that thread should usually be complete, so checking `copy_event.query()` before `copy_event.synchronize()` could avoid unnecessary synchronization overhead. +- Change tested: Temporary code only; changed `_get_pinned_normalized_buffer(...)` to synchronize only when the recorded copy event had not completed. +- Correctness: Compared exact-sized postprocess against deferred fused postprocess on all 538 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `212.27 fps`, `214.53 fps`, `211.70 fps`, and `211.77 fps`. 
The high run beat the checkpoint, but the repeated runs did not consistently clear the current `213.23` FPS best. +- Learning: The query-before-sync guard is too noise-sensitive here and likely adds API overhead on the common completed-event path. Keep the simpler unconditional event synchronize. + +### Single-Step Workflow Runner Fast Path + +- Hypothesis: The benchmark workflow is a single image input, one instance-segmentation model step, and one output selecting `$steps.segmentation.predictions`. Even after RFDETR model fast paths, each frame still pays generic workflow runtime assembly, validation, execution-data-manager/coordinator setup, step scheduling, and output construction before the next CUDA graph can be launched. A guarded direct runner for this exact one-step shape should reduce CPU bubbles without changing model execution. +- Change: `WorkflowRunner` now caches a fast path for workflows with exactly one `roboflow_core/roboflow_instance_segmentation_model@v3` step, one image input, no input substitutions, no serialization/preview mode, and one `predictions` output. The fast path constructs `WorkflowImageData` directly from `VideoFrame`, calls the initialized block with static manifest parameters, and returns the same output field shape. Other workflows fall back to the generic execution engine. +- Correctness: Compared the generic execution engine against the fast runner on all 538 frames from `vehicles_312px.mp4`; counts and class IDs matched exactly and max box delta was `0` px. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=216.19` and `frames=538 elapsed=2.46s fps=218.97`, improving the previous `213.23` FPS checkpoint. +- Profile: Nsight Systems capture `/tmp/rfdetr_single_step_fast_20260523_043134.nsys-rep` exported to `/tmp/rfdetr_single_step_fast_20260523_043134.sqlite`. Under profiler, depth `2` measured `frames=538 elapsed=2.54s fps=212.13`. 
After skipping the first 100 graph launches, CUDA graph duration was p50 `3811.684 us`; graph end-to-next-start gap was p50 `738.036 us`, p90 `792.180 us`, p95 `817.677 us`, p99 `921.074 us`, down from the prior p50 `1116.078 us`, p90 `3721.247 us`, p95 `4175.294 us`. +- Learning: The remaining graph gaps were partly generic workflow orchestration overhead. The most valuable CPU work now is removing frame-level workflow machinery around the already-optimized RFDETR block while preserving the normal workflow path for non-trivial graphs. + +### Rejected: Omit Video Metadata In Single-Step Workflow Fast Path + +- Hypothesis: The single model/output fast path does not consume `WorkflowImageData.video_metadata`, so skipping per-frame `VideoMetadata` construction might reduce CPU work between graph launches. +- Change tested: Temporary code only; omitted `VideoMetadata(...)` construction and passed no `video_metadata` into the fast path's `WorkflowImageData`. +- Correctness: Compared the generic execution engine against the fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.47s fps=217.60` and `frames=538 elapsed=2.50s fps=215.45`, below the current `218.97` FPS checkpoint. +- Learning: This object construction is not a reliable bottleneck, and omitting the metadata may perturb surrounding scheduling without improving throughput. Keep the fast path semantically closer to the generic runner. + +### Rejected: Call Local Instance Segmentation Block Directly + +- Hypothesis: The single-step workflow fast path still calls the block's generic `run(...)`, which computes the confidence value and branches on local vs remote execution every frame. For the local benchmark, precomputing confidence once and calling `run_locally(...)` directly could shave CPU work before the next CUDA graph launch. 
+- Change tested: Temporary code only; required the cached single-step fast path to be local, precomputed the manifest confidence value, and called `step.run_locally(...)` instead of `step.run(...)`. +- Correctness: Compared the generic execution engine against the direct local fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.46s fps=218.48` and `frames=538 elapsed=2.46s fps=218.33`, close but still below the current `218.97` FPS checkpoint. +- Learning: The generic block method wrapper is not the next meaningful limiter. Keep the broader fast path that still supports the block's normal local/remote dispatch. + +### Rejected: Derive Detection IDs From Inference ID + +- Hypothesis: The RFDETR workflow conversion creates one inference UUID per frame and one detection UUID per detection. Deriving detection IDs from the existing inference ID could reduce CPU work in result materialization after the single-step workflow fast path removed larger orchestration overhead. +- Change tested: Temporary code only; when `inference_id` was available, set detection IDs to `f"{inference_id}.{detection_idx}"` instead of calling `uuid.uuid4()` per detection, preserving the old UUID behavior when no inference ID exists. +- Correctness: Compared the generic execution engine against the fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=215.79`, below the current checkpoint. +- Learning: Per-detection UUID generation is not the limiting CPU cost in this path, or the changed string construction perturbs allocation enough to lose. Keep the existing detection UUID behavior. + +### Rejected: GPU Normalize Resized RFDETR Input + +- Hypothesis: The latest profile shows the steady float32 H2D input copy costs about `188 us/frame` (`1168128` bytes per frame). 
Copying the resized uint8 HWC image to GPU and using a fused Triton kernel to produce normalized float32 CHW input could replace that with a 4x smaller H2D transfer plus GPU work. +- Change tested: Temporary code only; for CUDA single-image RFDETR preprocessing, copied the PIL-resized uint8 image through a pinned uint8 buffer and launched a Triton HWC-uint8-to-normalized-CHW kernel on the preprocessing stream. +- Correctness: Compared the new GPU-normalize path against the previous CPU-normalize path on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_tensor_delta=4.76837158203125e-07`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.58s fps=208.79`, far below the current checkpoint. +- Learning: On this T4 workload, the extra device allocation, uint8 copy path, and normalization kernel cost more than the saved H2D bandwidth. Keep the CPU vectorized normalization into pinned float32. + +### Rejected: Pass Single RFDETR Image Without List Wrapper + +- Hypothesis: The RFDETR TRT workflow fast path always creates a one-element Python list of NumPy images before preprocessing. Passing the single NumPy frame directly could reduce small Python overhead and let preprocessing avoid a list allocation. +- Change tested: Temporary code only; used `images[0].numpy_image` when the batch size was one and kept the old list comprehension for larger batches. +- Correctness: Compared the generic execution engine against the fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=215.65`, below the current checkpoint. +- Learning: The single-image direct path changes preprocessing internals enough to hurt throughput; keep the explicit one-element list. + +### Rejected: Cache Workflow Class Names As NumPy Array + +- Hypothesis: Local workflow conversion rebuilds class-name strings with a Python loop for every frame. 
Caching `model.class_names` as a NumPy array and indexing it with `class_id` could reduce result materialization work after the single-step runner fast path removed larger orchestration costs. +- Change tested: Temporary code only; cached `model.class_names` on the adapter as `_workflow_class_names_np` and used vectorized indexing for in-range class IDs, falling back to the original loop for out-of-range IDs. +- Correctness: Compared the generic execution engine against the fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.46s fps=219.07`, then repeated at `frames=538 elapsed=2.47s fps=217.57`; not stable enough to beat the current `218.97` FPS checkpoint. +- Learning: Class-name construction is too small/noise-sensitive to checkpoint. Keep the simpler existing loop. + +### Rejected: Larger Triton Mask Resize Pixel Tile + +- Hypothesis: `_resize_selected_masks_kernel` is the largest custom postprocess kernel. Increasing its pixel tile from `256` to `512` could reduce program count and launch-side work without changing the one-detection-per-program layout that outperformed prior blocked-detection variants. +- Change tested: Temporary code only; changed `fused_resize_selected_masks(...)` block size from `256` to `512`. +- Correctness: Compared exact-sized postprocess against deferred fused postprocess on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `219.52 fps`, then `218.91 fps`, then `215.93 fps`; not stable enough to checkpoint over the current `218.97` FPS best. +- Learning: The larger tile can occasionally win but is too variable on this T4 workload. Keep the original `256` pixel tile. 
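
The program-count argument in the tile-size hypothesis above can be checked with simple ceiling division. The sketch below is illustrative only: the `312 x 176` output size is an assumption (it is not stated in this log) chosen because it reproduces the `215` tiles per detection seen in the `(100, 215, 1)` grid reported by the Nsight Compute capture elsewhere in this log, and `programs_per_detection` is a hypothetical helper, not a function from the codebase.

```python
def programs_per_detection(pixel_count: int, tile: int) -> int:
    """Number of tile-sized programs needed to cover pixel_count pixels (ceiling division)."""
    return (pixel_count + tile - 1) // tile


# Assumption: resized mask output is 312 x 176 pixels (not confirmed by the log).
ASSUMED_WIDTH, ASSUMED_HEIGHT = 312, 176
# Assumption: one grid row per RFDETR detection slot, 100 slots in total.
ASSUMED_DETECTION_SLOTS = 100

pixels = ASSUMED_WIDTH * ASSUMED_HEIGHT
tiles_256 = programs_per_detection(pixels, 256)
tiles_512 = programs_per_detection(pixels, 512)

print(f"tile=256: {tiles_256} programs/detection, {ASSUMED_DETECTION_SLOTS * tiles_256} total")
print(f"tile=512: {tiles_512} programs/detection, {ASSUMED_DETECTION_SLOTS * tiles_512} total")
```

Under these assumptions the larger tile roughly halves the program count (215 to 108 per detection), yet the measured FPS did not improve consistently, which fits the conclusion that launch-side program count is not the limiter here.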
+ +### Rejected: Skip Per-Frame Status Update Without Handlers + +- Hypothesis: `_emit_inference_result(...)` builds a DEBUG status payload with frame IDs, timestamps, and source IDs for every frame, even when no `status_update_handlers` are configured. Guarding that call could reduce CPU work in the inference thread before submitting the next frame. +- Change tested: Temporary code only; wrapped the hot per-frame `send_inference_pipeline_status_update(...)` call in `if self._status_update_handlers`. +- Correctness: The change does not affect model execution or prediction contents; it only skips status-update allocation when no handlers exist. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=216.14`, below the current checkpoint. +- Learning: This status payload is not the current bottleneck, or the branch perturbs the tight loop. Keep the existing status update behavior. + +### Rejected: Prebind Single-Step Workflow Manifest Values + +- Hypothesis: The single-step workflow runner fast path still reads manifest attributes and `step.run` from closure objects every frame. Capturing the static values once when the fast path is built could reduce Python attribute lookup overhead. +- Change tested: Temporary code only; captured `step.run` and all static manifest parameters into local closure variables, then used those locals for the per-frame model block call. +- Correctness: Compared the generic execution engine against the fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=217.09`, below the current checkpoint. +- Learning: Attribute lookup in this closure is not the limiting cost. Keep the clearer manifest-based call. 
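
For readers unfamiliar with the prebinding technique tested above, here is a minimal sketch of the two call styles. `Manifest`, `Step`, and the parameter names `confidence` and `max_detections` are hypothetical stand-ins, not the real workflow block or manifest classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    # Hypothetical static step parameters.
    confidence: float
    max_detections: int


class Step:
    # Hypothetical stand-in for an initialized workflow block.
    def run(self, image, confidence, max_detections):
        return (image, confidence, max_detections)


def build_manifest_based_runner(step: Step, manifest: Manifest):
    def run_frame(image):
        # Looks up step.run and each manifest attribute on every call.
        return step.run(
            image,
            confidence=manifest.confidence,
            max_detections=manifest.max_detections,
        )

    return run_frame


def build_prebound_runner(step: Step, manifest: Manifest):
    # Resolve the bound method and static values once, when the fast path is built.
    run = step.run
    confidence = manifest.confidence
    max_detections = manifest.max_detections

    def run_frame(image):
        # Per-frame call reads only closure locals.
        return run(image, confidence=confidence, max_detections=max_detections)

    return run_frame


step, manifest = Step(), Manifest(confidence=0.4, max_detections=100)
manifest_based = build_manifest_based_runner(step, manifest)
prebound = build_prebound_runner(step, manifest)
```

Both runners return the same result for the same frame. The prebound version snapshots the values at build time, so it is only safe when the manifest is truly static, which is one more reason to prefer the manifest-based call when the speedup is not measurable.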
+ +### Rejected: Torch Inference Mode Around RFDETR Fast Path + +- Hypothesis: The RFDETR workflow fast path enters PyTorch/TensorRT preprocessing, forward, postprocess, and tensor-to-NumPy conversion without an explicit `torch.inference_mode()` guard. Adding it could reduce autograd/version-counter overhead around the CUDA graph and fused kernels. +- Change tested: Temporary code only; imported `torch` in the workflow block and wrapped the RFDETR TRT fast path pre-process, forward, post-process, and conversion in `with torch.inference_mode():`. +- Correctness: Compared the generic execution engine against the fast runner on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=215.70`, below the current checkpoint. +- Learning: The per-frame inference-mode context or its interaction with graph-replayed tensors costs more than any autograd savings. Keep the existing fast path without an extra context manager. + +### Rejected: Borrow TensorRT CUDA Graph Output Buffers + +- Hypothesis: The CUDA graph replay path clones all TensorRT output buffers before RFDETR postprocess. For the depth-2 workflow fast path, each worker consumes postprocess results before it takes another frame, so returning thread-scoped graph output buffers directly could let postprocess run on the graph outputs and remove device-to-device clone work between graph launches. +- Change tested: Temporary code only; added an explicit `borrow_cuda_graph_outputs` flag to the TRT graph path, keyed borrowed graph states by worker thread, returned the cached graph output buffers without cloning, and enabled the flag only in the RFDETR TRT workflow fast path. +- Correctness: Compared borrowed graph outputs against cloned graph outputs on 120 frames after slicing by the deferred valid count: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.62s fps=205.72` and `frames=538 elapsed=2.61s fps=206.36`, well below the current `218.97` FPS checkpoint. +- Learning: Removing the D2D output clones is not enough to overcome the extra per-thread graph state and scheduling cost in this workload. Keep the existing cloned-output graph replay path. + +### Rejected: Fuse Box Rounding Into Triton Selector + +- Hypothesis: The deferred fused postprocess still launches PyTorch kernels for `selected_boxes.round().int()` after the Triton top-k selector. Writing rounded integer boxes directly from `_select_topk_boxes_kernel` could remove that post-selector work from the gap between CUDA graph launches. +- Change tested: Temporary code only; made the selector allocate `boxes` as `int32`, stored clipped coordinates with `+0.5` for positive-coordinate rounding, and skipped the Python-side `round().int()` when fused boxes were already integer. +- Correctness: Compared fused output against the unfused PyTorch fallback on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.47s fps=218.19` and `frames=538 elapsed=2.48s fps=216.58`, not consistently above the current `218.97` FPS checkpoint. +- Learning: The small box-conversion kernels are visible in Nsight Systems but are not throughput-limiting at depth `2`. Keep the simpler float-box selector plus existing PyTorch rounding. + +### Rejected: More Warps For Triton Mask Resize + +- Profile: Nsight Compute capture `/tmp/rfdetr_resize_kernel_depth2_current.ncu-rep` sampled `_resize_selected_masks_kernel`; current launches use grid `(100, 215, 1)`, block `(128, 1, 1)`, 32 registers/thread, and full theoretical occupancy on T4 for the sampled kernels. +- Hypothesis: The current mask resize keeps the 256-pixel tile but launches with `num_warps=4`. 
Increasing to `num_warps=8` could map the same per-program vector work across more lanes without changing the tile size or the one-detection-per-program layout. +- Change tested: Temporary code only; changed `fused_resize_selected_masks(...)` to launch `_resize_selected_masks_kernel` with `num_warps=8`. +- Correctness: Compared deferred fused postprocess against exact-sized postprocess on 120 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0`, `max_conf_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=217.27` and `frames=538 elapsed=2.49s fps=215.66`, below the current `218.97` FPS checkpoint. +- Learning: The current 4-warp launch is already well-balanced for this T4 kernel. More warps change scheduling without reducing the end-to-end graph gap. + +### Rejected: Direct Pinned Host Input To CUDA Graph + +- Hypothesis: The current TRT graph path first copies the normalized pinned CPU tensor to a temporary CUDA tensor, then copies that CUDA tensor into the captured graph input buffer. Returning the pinned CPU tensor from RFDETR preprocessing and copying it directly into the graph input buffer could remove the temporary CUDA input tensor and one device-to-device copy. +- Change tested: Temporary code only; added a guarded `keep_cuda_graph_input_on_host` path for the RFDETR TRT workflow fast path, skipped CUDA transfer in preprocessing, let the CUDA graph input copy accept pinned CPU input with `non_blocking=True`, and recorded the pinned-buffer reuse event after graph-stream consumption. +- Correctness: Compared direct pinned-host graph input against the normal CUDA-input graph path on 120 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=215.65` and `frames=538 elapsed=2.49s fps=216.35`, below the current `218.97` FPS checkpoint. 
+- Learning: The existing temporary CUDA input allows the H2D copy to overlap on the preprocessing stream before graph replay. Moving H2D into the graph input copy reduces a D2D copy but puts the larger H2D transfer closer to the critical path, widening the effective graph gap. + +### Rejected: Bit-Pack Dense Masks Before D2H Copy + +- Hypothesis: Dense masks dominate prediction Device-to-Host bytes. Packing selected boolean masks into bytes on GPU before copying to CPU could reduce mask D2H payload by roughly 8x, then CPU `np.unpackbits(...)` could restore the existing `sv.Detections.mask` shape. +- Change tested: Temporary code only; added a Triton `_pack_bool_masks_kernel`, used it in the local workflow conversion for CUDA bool masks, copied packed `uint8` data to CPU, and unpacked with little-endian bit order before constructing `sv.Detections`. +- Correctness: The standalone packer matched a CUDA bool tensor exactly. Full `InferencePipeline` comparison against the normal mask-copy path on all 538 frames matched counts, class IDs, boxes, and dense masks exactly: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.50s fps=214.93` and `frames=538 elapsed=2.48s fps=217.31`, below the current `218.97` FPS checkpoint. +- Learning: The extra Triton launch and CPU unpack work cost more than the saved D2H bandwidth for these small per-frame mask payloads. Keep the direct bool mask copy. + +### RFDETR Fused CPU Normalization Constants + +- Hypothesis: The RFDETR PIL/NumPy preprocessing fast path still performs three full-array operations per channel: multiply by `1/255`, subtract mean, then divide by std. Precomputing `1/(255*std)` and `-mean/std` should reduce this to multiply plus add per channel while preserving prediction outputs. 
+- Change: `_get_normalization_constants(...)` now caches per-channel `multiplier` and `bias`, and `_pil_image_to_normalized_tensor(...)` applies `image * multiplier + bias` directly into the CHW float32 buffer. +- Correctness: Compared the fused-normalization path against the torchvision fallback path on all 538 frames from `vehicles_312px.mp4`: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`, `max_tensor_delta=7.152557373046875e-07`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.46s fps=218.60`, `frames=538 elapsed=2.46s fps=219.03`, and `frames=538 elapsed=2.46s fps=218.97`. +- Profile: Nsight Systems capture `/tmp/rfdetr_fused_norm_20260523_054913.nsys-rep` exported to `/tmp/rfdetr_fused_norm_20260523_054913.sqlite`; under profiler, depth `2` measured `frames=538 elapsed=2.56s fps=210.22`. After skipping the first 100 graph launches, CUDA graph duration was p50 `3787.381 us`; graph end-to-next-start gap was p50 `763.060 us`, p90 `852.370 us`, p95 `892.396 us`, p99 `1042.918 us`. +- Learning: This is a small CPU-side gain but it is algebraically simple and keeps the benchmark at the current best band. The remaining bottleneck is still the graph replay plus the postprocess/materialization tail rather than normalization alone. + +### Rejected: Skip Pinned Buffer Reuse Synchronize + +- Hypothesis: The RFDETR workflow fast path runs preprocess, forward, postprocess, and result materialization before the same worker thread reuses its thread-local pinned normalization buffer. The previous H2D copy should therefore already be complete, so skipping `_get_pinned_normalized_buffer(...)`'s `copy_event.synchronize()` in this fast path could remove a small CPU API wait. +- Change tested: Temporary code only; added a guarded `skip_pinned_buffer_reuse_sync` flag through RFDETR preprocessing and enabled it only for the RFDETR TRT workflow fast path. 
+- Correctness: Compared skip-sync against the normal synchronized path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.47s fps=218.19` and `frames=538 elapsed=2.50s fps=215.27`, below the current `219.03` FPS checkpoint. +- Learning: The event synchronize is either already cheap when completed or helps maintain better copy/launch ordering. Keep the explicit pinned-buffer reuse synchronization. + +### Rejected: Skip Empty Class Filter Helper + +- Hypothesis: The benchmark workflow does not set `class_filter`, but the RFDETR workflow fast path still calls `filter_out_unwanted_classes_from_sv_detections_batch(...)`, which immediately returns when no filter is provided. Avoiding the no-op function call could shave a small amount of Python result-materialization overhead. +- Change tested: Temporary code only; guarded calls to `filter_out_unwanted_classes_from_sv_detections_batch(...)` with `if class_filter:` in the inference-models and RFDETR TRT workflow fast paths. +- Correctness: Compared the guarded path against the previous always-call behavior on all 538 frames through `InferencePipeline`: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.47s fps=217.59` and `frames=538 elapsed=2.49s fps=215.87`, below the current `219.03` FPS checkpoint. +- Learning: The no-op helper call is too small to matter, and the extra branch may perturb the tight path. Keep the simpler existing call. + +### Rejected: Single-Lock TRT CUDA Graph Cache Lookup + +- Hypothesis: The TRT CUDA graph replay path checks `cache_key not in trt_cuda_graph_cache` and then indexes `trt_cuda_graph_cache[cache_key]`, acquiring the cache lock twice per frame on the steady path. A `get(...)` method that moves the key to the LRU tail under one lock could reduce Python/API overhead between graph launches. 
+- Change tested: Temporary code only; added `TRTCudaGraphCache.get(...)` and used it in `_execute_trt_engine(...)` before deciding whether to capture or replay a graph. +- Correctness: Compared standard TensorRT execution against CUDA graph execution on 120 frames after the cache change: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.45s fps=219.26`, `frames=538 elapsed=2.46s fps=218.76`, then `frames=538 elapsed=2.48s fps=217.02`; not stable enough to checkpoint over the current `219.03` FPS best. +- Learning: The double lock is not a reliable limiter, and the altered lookup path is noise-sensitive. Keep the existing cache API. + +### Rejected: Static RFDETR PreProcessingOverrides + +- Hypothesis: The RFDETR TRT workflow fast path constructs the same `PreProcessingOverrides(False, False, False)` object every frame. Reusing a module-level instance could remove a small allocation from the CPU path before graph replay. +- Change tested: Temporary code only; added a module-level `RFDETR_TRT_PRE_PROCESSING_OVERRIDES` and passed it to `model._model.pre_process(...)` in `_try_run_rfdetr_trt_fast_path(...)`. +- Correctness: Full-video `InferencePipeline` comparison against an equivalent overrides object matched all 538 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.45s fps=219.53`, then repeated at `frames=538 elapsed=2.49s fps=215.74`; not stable enough to checkpoint. +- Learning: Per-frame override-object construction is below the noise floor. Keep the simpler local construction. + +### Rejected: In-Place RFDETR Postprocess Sigmoid + +- Hypothesis: The RFDETR TRT workflow path postprocesses TensorRT output clones that are not reused after postprocess. Applying `sigmoid_()` to the logits in place could avoid allocating a separate sigmoid tensor and reduce postprocess memory traffic. 
+- Change tested: Temporary code only; added a gated `inplace_sigmoid` option to `post_process_instance_segmentation_results(...)`, passed it through TRT instance segmentation postprocess, and enabled it only in the RFDETR TRT workflow fast path. +- Correctness: Compared in-place sigmoid against the default out-of-place sigmoid on all 538 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0`, `max_conf_delta=0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=217.36` and `frames=538 elapsed=2.48s fps=216.73`, below the current `219.03` FPS checkpoint. +- Learning: The standalone sigmoid allocation is not the bottleneck; in-place mutation likely changes scheduling or allocator behavior enough to lose. Keep the out-of-place PyTorch sigmoid. + +### Rejected: Fuse RFDETR Sigmoid Into Triton Selector + +- Hypothesis: The dense-mask RFDETR fused postprocess path still runs a PyTorch sigmoid over logits before the Triton selector. Letting `_select_topk_boxes_kernel` load raw logits and apply sigmoid internally could remove a kernel launch and temporary tensor between CUDA graph replay and postprocess. +- Change tested: Temporary code only; added an `apply_sigmoid` constexpr to the Triton selector, lazily skipped the global PyTorch sigmoid when the fused path handled a frame, and kept the original sigmoid fallback for unsupported metadata. +- Correctness: Compared fused selector output against the PyTorch fallback on 120 frames including masks: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=1.1920928955078125e-07`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=216.53` and `frames=538 elapsed=2.49s fps=215.82`, below the current `219.03` FPS checkpoint. +- Learning: The selector is already doing enough reduction work that adding sigmoid math slows it down more than the separate PyTorch sigmoid costs. Keep sigmoid outside the selector. 
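
As background for why fusing sigmoid into a selector can be output-preserving at all: sigmoid is strictly increasing, so ranking raw logits and ranking sigmoid scores pick the same indices. The sketch below uses made-up logits and a plain Python top-k, not the Triton selector. It also shows that the reported `max_conf_delta=1.1920928955078125e-07` is exactly `2**-23`, the float32 machine epsilon, which is consistent with a rounding-level difference.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def topk_indices(values, k):
    """Indices of the k largest values, highest first; ties go to the lower index."""
    return sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]


# Made-up per-query logits; real RFDETR outputs are not reproduced here.
logits = [-2.0, 0.5, 3.0, -0.25, 1.75, -4.0]
scores = [sigmoid(v) for v in logits]

k = 3
from_logits = topk_indices(logits, k)
from_scores = topk_indices(scores, k)

# float32 machine epsilon, written out exactly.
FLOAT32_EPS = 2.0 ** -23
```

Selection order is therefore not the risk in such a fusion. As the entry above records, the cost was the extra math inside a kernel that was already doing the reduction.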
+ +### Rejected: Compact Borrowed TRT Masks Before Next Graph + +- Hypothesis: CUDA graph replay clones all TensorRT outputs, including the fixed `(100, 78, 78)` mask tensor, before the graph stream can accept the next replay. The video usually keeps only about 5 detections, so borrowing graph outputs, selecting boxes, compacting only selected masks, and then resizing from the compact buffer could reduce graph-to-graph spacing. +- Change tested: Temporary code only; added a guarded no-clone CUDA graph output path, a Triton gather kernel for selected masks, a compact-mask resize kernel, and enabled the path only in the RFDETR TRT workflow fast path with pipeline depth `2`. +- Correctness: Compared compact borrowed outputs against the cloned-output fused path on 120 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=216.53`, below the current `219.03` FPS checkpoint. +- Learning: Replacing the full mask clone with selector/gather stream choreography adds enough pre-next-graph work to lose throughput. The existing clone is cheaper than the synchronization structure needed to borrow graph outputs safely. + +### RFDETR Pinned Host Detection Materialization + +- Hypothesis: The workflow conversion path performs separate blocking `.cpu().numpy()` copies for boxes, confidences, classes, and dense masks after the deferred fused count read. Reusing thread-local pinned host buffers and enqueueing all D2H copies before one stream synchronize should reduce result-materialization overhead while preserving independent NumPy arrays for queued predictions. +- Change: Added a CUDA-only conversion fast path in the instance segmentation workflow block. 
When RFDETR fused postprocess provides a deferred valid count, the converter copies selected boxes, confidences, classes, and masks into reusable pinned CPU buffers with `non_blocking=True`, synchronizes once, then returns normal copied NumPy arrays so queued sink payloads are not backed by reusable storage. +- Correctness: Compared pinned conversion against the previous `.cpu().numpy()` conversion on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.44s fps=220.62`, `frames=538 elapsed=2.45s fps=219.88`, and `frames=538 elapsed=2.45s fps=220.00`. +- Profile: Nsight Systems capture `/tmp/rfdetr_pinned_conversion_20260523_063547.nsys-rep` exported to `/tmp/rfdetr_pinned_conversion_20260523_063547.sqlite`; under profiler, depth `2` measured `frames=538 elapsed=2.58s fps=208.40`. After skipping the first 100 graph launches, CUDA graph duration was p50 `3782.980 us`; graph end-to-next-start gap was p50 `806.163 us`, p90 `904.306 us`, p95 `921.649 us`, p99 `1010.096 us`. CUDA API `cudaStreamSynchronize` calls dropped from `2702` in the earlier profile to `1088`. +- Learning: The deferred GPU count still gates CPU materialization, but grouping the remaining D2H copies onto pinned buffers removes enough per-frame synchronization/API overhead to move the checkpoint above the previous `219.03` FPS band. + +### Rejected: Fixed-Capacity RFDETR Conversion Buffers + +- Hypothesis: The pinned conversion checkpoint grows thread-local host buffers when selected detection count increases, producing extra `cudaHostAlloc` calls in the Nsight profile. Allocating the full 100 RFDETR detection slots on first use could avoid reallocations during timed frames. +- Change tested: Temporary code only; forced `_get_rfdetr_conversion_buffers(...)` to allocate at least 100 rows while still copying only the selected-count slice.
+- Correctness: Compared fixed-capacity pinned conversion against the forced `.cpu().numpy()` fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.45s fps=219.73`, `frames=538 elapsed=2.44s fps=220.67`, then `frames=538 elapsed=2.48s fps=216.81`; not stable enough to keep over the dynamic-capacity pinned conversion checkpoint. +- Learning: The larger pinned allocation changes memory behavior enough to introduce variance, and avoiding a few growth allocations does not reliably improve steady-state throughput. Keep the dynamic grow-to-needed-count buffers. + +### Rejected: Fold RFDETR Workflow Metadata Attachment + +- Hypothesis: After pinned D2H conversion, the RFDETR workflow fast path still makes separate Python passes for prediction type attachment, no-op class filtering, and parent-coordinate metadata. Folding prediction type and parent metadata into one RFDETR-specific helper and skipping the empty class filter could reduce CPU materialization work. +- Change tested: Temporary code only; added `_attach_rfdetr_fast_path_metadata(...)` to attach prediction type plus root/parent IDs, coordinates, and dimensions in a single pass, and used it only in `_try_run_rfdetr_trt_fast_path(...)`. +- Correctness: Compared folded metadata against the previous helper chain on all 538 frames, ignoring random detection IDs: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes=0`, `bad_data=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.47s fps=218.02` and `frames=538 elapsed=2.45s fps=219.31`, below the pinned-conversion checkpoint. +- Learning: The helper calls are not the limiter, and folding them changes enough Python allocation/order behavior to lose throughput. Keep the established helper chain. 
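
Both experiments above were measured against the pinned-conversion checkpoint, whose staging pattern (a reusable buffer that grows only when needed, plus an independent copy for queued results) can be illustrated without a GPU. In this sketch a `bytearray` stands in for pinned host memory and `bytes(...)` stands in for the NumPy `.copy()`. `StagingBuffer` is a hypothetical class, not `_get_rfdetr_conversion_buffers(...)`.

```python
class StagingBuffer:
    """Reusable byte buffer that is reallocated only when a larger payload arrives."""

    def __init__(self) -> None:
        self._storage = bytearray()
        self.grow_count = 0

    def stage(self, payload: bytes) -> memoryview:
        needed = len(payload)
        if needed > len(self._storage):
            # Grow to the needed size; smaller payloads reuse the existing storage.
            self._storage = bytearray(needed)
            self.grow_count += 1
        view = memoryview(self._storage)[:needed]
        view[:] = payload  # stands in for the device-to-host copy
        return view


buf = StagingBuffer()

aliased = buf.stage(b"frame-1")       # view backed by the reusable storage
owned = bytes(buf.stage(b"frame-1"))  # independent copy, safe to queue

buf.stage(b"frame-2")                 # next frame reuses the same storage

aliased_after = bytes(aliased)        # now reflects the overwritten storage
growths = buf.grow_count
```

The aliased view changes when the next frame is staged, while the owned copy does not, which is why the checkpoint pays for the extra host copy before handing results to the sink queue.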
+ +### Rejected: Reuse Deferred Fused Postprocess CUDA Buffers + +- Hypothesis: The internal deferred RFDETR fused postprocess path allocates same-shaped CUDA tensors for scores, classes, boxes, query indices, count, and resized masks each frame. Reusing thread-local buffers for those fixed-capacity outputs could reduce allocator and Python overhead while preserving the deferred GPU count path. +- Change tested: Temporary code only; added thread-local output buffers in `fused_postprocess.py` and enabled them only when `_try_fused_instance_segmentation_post_process(...)` runs with `defer_count=True`. +- Correctness: Compared deferred fused postprocess with reused buffers against exact-sized postprocess on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.44s fps=220.45`, then `frames=538 elapsed=2.49s fps=216.02`; not stable enough to keep over the pinned-conversion checkpoint. +- Learning: PyTorch's caching allocator already handles these fixed shapes well enough. Thread-local reuse changes object lifetime/stream behavior and can degrade scheduling, so keep per-frame tensor creation. + +### Rejected: Direct-Owned CPU Detection Tensors + +- Hypothesis: The pinned conversion checkpoint copies GPU outputs into reusable pinned CPU tensors, synchronizes once, then copies those pinned NumPy views into independent arrays for queue safety. Allocating fresh CPU tensors per frame and returning NumPy views directly could remove the extra host copy while preserving result ownership. +- Change tested: Temporary code only; replaced the reusable pinned buffers with fresh CPU tensors for boxes, confidences, classes, and masks, copied CUDA tensors into them synchronously, and returned `.numpy()` views without `.copy()`. 
+- Correctness: Compared direct-owned CPU conversion against the forced `.cpu().numpy()` fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.52s fps=213.17`, well below the pinned-conversion checkpoint. +- Learning: Per-frame CPU tensor allocation and blocking D2H copies cost much more than the extra host copy from reusable pinned buffers. Keep pinned staging plus independent NumPy copies. + +### Rejected: Skip Inference ID In Predictions-Only Workflow Fast Path + +- Hypothesis: The single-step workflow runner returns only `predictions`, but the RFDETR block still generates a workflow inference UUID and attaches it to every `sv.Detections` before the runner discards the enclosing `inference_id` field. Skipping that UUID/data-field work for predictions-only fast paths could reduce CPU materialization overhead. +- Change tested: Temporary code only; passed an internal `include_inference_id=False` flag from `WorkflowRunner`'s single-step fast path into the RFDETR TRT fast path and skipped the workflow inference ID generation/attachment when false. +- Correctness: Compared conversion with and without inference ID attachment on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. This intentionally omitted the `inference_id` data field, so it was not suitable to keep unless it clearly won. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.49s fps=216.36`, below the pinned-conversion checkpoint. +- Learning: UUID/inference-ID attachment is not a meaningful limiter, and removing it weakens metadata behavior. Keep the normal inference ID path. 
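
The "pinned staging plus independent NumPy copies" decision in the Direct-Owned CPU Detection Tensors entry above depends on an ownership distinction: a view into a reusable staging buffer is overwritten by the next frame, while a copy is not. The sketch below illustrates only that distinction, using a plain `bytearray` as a stand-in for the pinned host buffer. `ReusableStaging`, `convert_borrowed`, and `convert_owned` are hypothetical names for illustration and are not the workflow's real helpers.

```python
class ReusableStaging:
    """Stand-in for a reusable pinned host buffer (hypothetical helper)."""

    def __init__(self, capacity):
        self._buffer = bytearray(capacity)

    def stage(self, payload):
        # Simulates the device-to-host copy landing in the reusable buffer.
        n = len(payload)
        if n > len(self._buffer):
            raise ValueError("payload larger than staging capacity")
        view = memoryview(self._buffer)
        view[:n] = payload
        # The returned view shares storage with the staging buffer.
        return view[:n]


def convert_borrowed(staging, payload):
    """Return a view backed by reusable storage (unsafe for queued payloads)."""
    return staging.stage(payload)


def convert_owned(staging, payload):
    """Copy out of the staging buffer so the result owns its own bytes."""
    return bytes(staging.stage(payload))


if __name__ == "__main__":
    staging = ReusableStaging(8)
    borrowed = convert_borrowed(staging, b"frame-1")
    owned = convert_owned(staging, b"frame-1")
    # The next frame reuses the same staging buffer.
    convert_owned(staging, b"frame-2")
    print(bytes(borrowed), owned)
```

After the second frame is staged, the borrowed view reflects the new frame's bytes while the owned copy still holds the first frame, which is why a sink that queues payloads needs the extra host copy.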
+ +### Rejected: Split Mask D2H Copy Onto Separate Stream + +- Hypothesis: In the pinned conversion checkpoint, boxes/confidences/classes/masks are copied to pinned CPU buffers on the same CUDA stream and synchronized together. Copying the larger dense mask payload on a side stream while the current stream handles small metadata copies could overlap CPU class-name/object preparation with mask D2H. +- Change tested: Temporary code only; added a thread-local mask-copy CUDA stream, launched the mask pinned copy there after waiting on the current stream, synchronized the current stream for small metadata, built class names, then synchronized the mask stream before constructing `sv.Detections`. +- Correctness: Compared split-stream pinned conversion against the forced `.cpu().numpy()` fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.50s fps=215.61`, below the pinned-conversion checkpoint. +- Learning: The side-stream wait/synchronize overhead is larger than any overlap available in the small CPU metadata window. Keep the single-stream grouped pinned copies. + +### Rejected: Localize RFDETR Conversion Lookups + +- Hypothesis: The workflow conversion loop repeatedly reads `model.class_names`, calls `len(model.class_names)`, and recomputes `len(sv_detections)` for per-detection metadata arrays. Caching those values in local variables could shave small Python overhead after pinned D2H conversion. +- Change tested: Temporary code only; cached `model.class_names` and its length before the loop and cached `len(sv_detections)` once before filling detection IDs, parent IDs, image dimensions, and inference IDs. +- Correctness: Compared the localized conversion against the forced `.cpu().numpy()` fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.45s fps=220.02`, then `frames=538 elapsed=2.48s fps=216.62`; not stable enough to keep over the pinned-conversion checkpoint. +- Learning: These Python lookups are below the benchmark noise floor, and the altered bytecode/allocation order can regress scheduling. Keep the original straightforward conversion loop. + +### Rejected: NumPy Array Mask Copy + +- Hypothesis: The pinned conversion checkpoint uses `mask_view.copy()` to produce queue-safe owned mask arrays. A local micro-benchmark on `(5, 312, 312)` bool masks showed `np.array(mask_view, copy=True)` slightly faster, so swapping only the mask copy idiom might shave host materialization time without changing ownership. +- Change tested: Temporary code only; replaced `mask_buffer[:valid_count].numpy().copy()` with `np.array(mask_buffer[:valid_count].numpy(), copy=True)`. +- Correctness: Compared the alternate mask copy against the forced `.cpu().numpy()` fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=217.17`, below the pinned-conversion checkpoint. +- Learning: The isolated NumPy copy micro-benchmark does not predict full pipeline behavior; keep the direct ndarray `.copy()` path. + +### Rejected: Internal Single-Image RFDETR Preprocess Fast Path + +- Hypothesis: The RFDETR TRT workflow always passes a one-element image list to preprocessing. Handling that case before the generic list-normalization loop could avoid list copying, per-frame append bookkeeping, and the final `len(tensors)` branch while preserving the same image object and preprocessing math. 
+- Change tested: Temporary code only; added a guarded `len(images) == 1` branch in `pre_process_network_input(...)` that directly preprocesses the single image, transfers it to the target device, records the pinned-copy event, and returns `unsqueeze(0)` plus a one-element metadata list. +- Correctness: Compared the list fast path against the existing generic single-frame 4D NumPy batch path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`, `max_tensor_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=216.84` and `frames=538 elapsed=2.47s fps=217.64`, below the pinned-conversion checkpoint. +- Learning: The generic loop overhead is not meaningful, and adding another top-level branch/function path likely perturbs Python scheduling. Keep the existing preprocessing control flow. + +### Rejected: Cache RFDETR TRT Confidence Threshold + +- Hypothesis: RFDETR TRT postprocess constructs a `ConfidenceFilter` and resolves the same custom confidence threshold every frame. Caching the resolved threshold on the model instance could remove small Python work between graph replay and fused postprocess. +- Change tested: Temporary code only; added a per-instance `_confidence_threshold_cache` keyed by `(confidence, id(recommended_parameters))` and passed the cached threshold into dense and RLE postprocess. +- Correctness: On a sampled video frame, a cache miss followed by a cache hit produced matching detections: `counts 4 4`, `classes_equal=True`, `masks_equal=True`, `max_box_delta=0`, `max_conf_delta=0.0`; the cached scalar also matched `ConfidenceFilter.get_threshold(...)`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.46s fps=218.80` and `frames=538 elapsed=2.47s fps=217.45`, below the pinned-conversion checkpoint. 
+- Learning: Confidence threshold construction is below the limiter; changing the model object state and postprocess bytecode does not tighten the graph-to-graph gap. Keep the original local `ConfidenceFilter` path. + +### Rejected: Two-Slot TRT CUDA Graph Output Copies + +- Hypothesis: The CUDA graph replay path allocates fresh result tensors with `buf.clone()` for each TensorRT output after every replay. For the requested depth `2` workflow, alternating between two reusable output-copy slots could keep ownership safe while reducing allocator work between graph launches. +- Change tested: Temporary code only; added optional `cuda_graph_output_copy_slots=2` plumbing to the TRT replay path and enabled it only in the RFDETR workflow fast path, copying graph outputs into alternating reusable CUDA tensors instead of cloning into fresh tensors. +- Correctness: Compared slot-copy outputs against the clone path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.46s fps=218.78` and `frames=538 elapsed=2.45s fps=219.73`, not an improvement over the pinned-conversion checkpoint. +- Learning: PyTorch's cached allocation for the result clones is not the source of the remaining graph gap. Keeping reusable borrowed CUDA buffers also adds a depth-specific ownership assumption, so leave the clone path unchanged. + +### Current Depth-2 Nsight Profile + +- Request: Generate a fresh Nsight Systems capture on the current accepted code path while keeping pipeline depth fixed at `2`. 
+- Profile: Nsight Systems capture `/tmp/rfdetr_depth2_graphtrace_local_20260523_073518.nsys-rep` exported to `/tmp/rfdetr_depth2_graphtrace_local_20260523_073518.sqlite`; CSV summaries are `/tmp/rfdetr_depth2_graphtrace_local_20260523_073518_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphtrace_local_20260523_073518_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphtrace_local_20260523_073518_stats_cuda_api_sum.csv`. +- Result under profiler: depth `2` measured `frames=538 elapsed=2.58s fps=208.77`. +- Graph spacing: The corrected capture includes `538` CUDA graph traces on stream `39`. After skipping the first 100 launches, CUDA graph duration was p50 `3781.637 us`, p90 `3812.413 us`, p95 `3817.378 us`, p99 `3824.886 us`; graph end-to-next-start gap was p50 `803.827 us`, p90 `891.474 us`, p95 `911.633 us`, p99 `1134.530 us`, mean `810.141 us`. +- Note: An earlier same-turn capture omitted `PYTHONPATH=/app/inference_models`, used the installed package, and did not include CUDA graph replay; ignore `/tmp/rfdetr_depth2_current_20260523_073219.*` and `/tmp/rfdetr_depth2_graphtrace_20260523_073343.*` for this optimization thread. + +### Rejected: Keep Deferred Query Indices Int32 + +- Hypothesis: The deferred fused postprocess path converts `query_indices` from `int32` to `int64` even though the Triton mask resize kernel can consume `int32` directly. Removing that conversion could eliminate one small CUDA kernel between TensorRT graph replay and postprocess. +- Change tested: Temporary code only; when `fused_select_topk_boxes(..., return_cpu_count=False)` returned deferred outputs, it returned `query_indices` as `int32` instead of `query_indices.to(dtype=torch.long)`. +- Correctness: Compared deferred fused postprocess against the exact-sized path on all 538 frames: `max_count=7`, `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.44s fps=220.25`, `frames=538 elapsed=2.45s fps=219.49`, and `frames=538 elapsed=2.46s fps=218.85`; not stable enough to keep over the pinned-conversion checkpoint. +- Learning: The int32-to-int64 conversion kernel is visible but not a reliable limiter. Removing it changes downstream scheduling enough that FPS still falls into the noisy lower band. + +### Rejected: Avoid Zero-Filling Deferred Query Indices + +- Hypothesis: `fused_select_topk_boxes(...)` allocates `query_indices` with `torch.zeros(...)`, launching a fill kernel even though the selector writes every query index that the deferred mask resize kernel reads. Switching to `torch.empty(...)` could remove a per-frame CUDA fill. +- Change tested: Temporary code only; changed the `query_indices` allocation in `fused_select_topk_boxes(...)` from `torch.zeros(...)` to `torch.empty(...)`. +- Correctness: Compared deferred fused postprocess against the exact-sized path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.46s fps=218.97`, `frames=538 elapsed=2.44s fps=220.54`, and `frames=538 elapsed=2.46s fps=218.43`; not stable enough to keep over the pinned-conversion checkpoint. +- Learning: The query-index zero fill is visible in the profile but not a stable throughput limiter. The allocation/fill behavior likely interacts with stream scheduling and allocator reuse, so keep the deterministic zero-filled tensor. + +### RFDETR Limited Deferred Mask Resize + +- Hypothesis: The deferred fused mask resize launches work for all 100 RFDETR query slots, but the benchmark video keeps at most 7 detections. 
Launching resize programs only for the common first few detections should reduce postprocess GPU work and tighten the graph-to-graph gap, while an overflow fallback can preserve correctness for frames with more detections. +- Change: Added an optional `detection_limit` to `fused_resize_selected_masks(...)`, threaded `deferred_mask_resize_detection_limit` through RFDETR dense postprocess, and enabled an 8-detection limit only in the RFDETR TRT workflow fast path. The postprocess metadata keeps the raw mask tensor and query indices, and workflow conversion reruns the full fused mask resize if the deferred GPU count exceeds the first-stage limit. +- Correctness: Compared the 8-limit deferred path against the exact-sized path on all 538 frames: `max_count=7`, `normal_bad_counts/classes/masks=[0, 0, 0]`, `max_box_delta=0.0`, `max_conf_delta=0.0`. Also forced `deferred_mask_resize_detection_limit=1` to exercise overflow recovery; 523 frames exceeded the limit and recovered full masks with `overflow_bad=[0, 0, 0]`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.45s fps=220.00`, `frames=538 elapsed=2.43s fps=221.07`, `frames=538 elapsed=2.42s fps=221.92`, and after formatting `frames=538 elapsed=2.42s fps=221.98`. +- Profile: Nsight Systems capture `/tmp/rfdetr_limited_mask_resize_20260523_075251.nsys-rep` exported to `/tmp/rfdetr_limited_mask_resize_20260523_075251.sqlite`; CSV summaries are `/tmp/rfdetr_limited_mask_resize_20260523_075251_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_limited_mask_resize_20260523_075251_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_limited_mask_resize_20260523_075251_stats_cuda_api_sum.csv`. Under profiler, depth `2` measured `frames=538 elapsed=2.54s fps=211.55`. +- Graph spacing: The capture includes `538` CUDA graph traces. 
After skipping the first 100 launches, CUDA graph duration was p50 `3782.628 us`, p90 `3815.915 us`, p95 `3818.414 us`, p99 `3829.039 us`; graph end-to-next-start gap was p50 `720.629 us`, p90 `805.716 us`, p95 `846.777 us`, p99 `998.453 us`, mean `723.399 us`. `_resize_selected_masks_kernel` dropped to `4.337 ms` total / `8.061 us` average from the previous profile's `40.815 ms` total / `75.864 us` average. +- Learning: The fixed 100-slot mask-resize grid was a real limiter for this low-detection-count stream. A small first-stage grid keeps the normal path fast, and the overflow recovery keeps the optimization safe for higher-count frames at the cost of extra work only when needed. + +### Rejected: Allocate Only Limited Deferred Mask Rows + +- Hypothesis: After limiting deferred mask resize to 8 detection rows, `fused_resize_selected_masks(...)` still allocated a `(100, H, W)` bool output tensor. Allocating only `(detection_limit, H, W)` for limited calls could reduce allocator and memory pressure. +- Change tested: Temporary code only; moved `detection_limit` clamping before output allocation and allocated the output tensor with `detection_limit` rows instead of `MAX_RFDETR_DETECTIONS` rows. +- Correctness: Compared the 8-limit deferred path against the exact-sized path on all 538 frames: `max_count=7`, `limited_mask_shape=(8, 176, 312)`, `normal_bad_counts/classes/masks=[0, 0, 0]`, `max_box_delta=0.0`, `max_conf_delta=0.0`. Forced overflow recovery with limit `1` still recovered 523 overflow frames with `overflow_bad=[0, 0, 0]`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.42s fps=221.88`, `frames=538 elapsed=2.43s fps=221.04`, and `frames=538 elapsed=2.43s fps=221.40`; not an improvement over the existing limited-resize checkpoint. +- Learning: The launch grid reduction matters; shrinking the cached output allocation does not reliably improve FPS. Keep the fixed output shape to avoid an extra shape variant in downstream code. 
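
The limited first-stage grid with overflow recovery described in the RFDETR Limited Deferred Mask Resize entry can be sketched as plain control flow. This is a CPU-only illustration under stated assumptions: `resize_rows` stands in for the fused resize kernel, `limited_resize_with_overflow` is a hypothetical name, and the real path performs the first stage on the GPU before the count is known and only checks for overflow during workflow conversion.

```python
MAX_DETECTIONS = 100  # RFDETR query slots, per the log


def resize_rows(raw_masks, query_indices, rows):
    """Stand-in for the fused mask resize over the first `rows` selected slots."""
    return [raw_masks[query_indices[i]] for i in range(rows)]


def limited_resize_with_overflow(raw_masks, query_indices, valid_count, detection_limit):
    """Resize a small first-stage grid, then rerun the full grid on overflow.

    Returns the masks for the valid detections and the total number of rows
    that were launched, so the cost of the fallback is visible.
    """
    launched = min(detection_limit, MAX_DETECTIONS)
    masks = resize_rows(raw_masks, query_indices, launched)
    if valid_count > launched:
        # Overflow: the first stage did not cover every detection, so redo
        # the resize for all slots using the retained raw masks and indices.
        masks = resize_rows(raw_masks, query_indices, MAX_DETECTIONS)
        launched += MAX_DETECTIONS
    return masks[:valid_count], launched


if __name__ == "__main__":
    raw = [f"mask{i}" for i in range(MAX_DETECTIONS)]
    order = list(reversed(range(MAX_DETECTIONS)))
    print(limited_resize_with_overflow(raw, order, valid_count=7, detection_limit=8))
    print(limited_resize_with_overflow(raw, order, valid_count=7, detection_limit=1))
```

With a limit of 8 and 7 detections only 8 rows are launched; with the forced limit of 1 the same masks come back but at the cost of the first-stage row plus a full 100-row rerun, which mirrors why the forced-overflow test above still produced correct masks.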
+ +### Rejected: Selector Kernel Four-Warp Launch + +- Hypothesis: After limiting mask resize work, `_select_topk_boxes_kernel` became the largest custom postprocess kernel. Reducing the Triton selector launch from `num_warps=8` to `num_warps=4` could reduce overhead if the 100x91 reduction was over-provisioned. +- Change tested: Temporary code only; changed `_select_topk_boxes_kernel` launch in `fused_select_topk_boxes(...)` from `num_warps=8` to `num_warps=4`. +- Correctness: Compared the limited deferred path against the exact-sized path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.44s fps=220.52`, `frames=538 elapsed=2.42s fps=221.86`, and `frames=538 elapsed=2.44s fps=220.50`; not an improvement over the existing limited-resize checkpoint. +- Learning: The selector's 9100-score reduction still benefits from the 8-warp configuration, or the launch is not the steady limiter. Keep the existing selector launch. + +### RFDETR Deferred Float Boxes + +- Hypothesis: The deferred fused workflow path rounds `selected_boxes` and converts them to `int32`, launching PyTorch `round` and copy/cast kernels after the Triton selector. The benchmark requirement allows boxes within 5 pixels, so returning float boxes directly from the fused selector could remove those kernels without changing classes or masks. +- Change: In the deferred fused RFDETR instance segmentation path, return `xyxy=selected_boxes` instead of `xyxy=selected_boxes.round().int()`. The non-deferred public postprocess path still returns rounded integer boxes. +- Correctness: Compared the limited deferred path against the exact-sized rounded path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.44s fps=220.87`, `frames=538 elapsed=2.44s fps=220.23`, `frames=538 elapsed=2.41s fps=223.00`, and `frames=538 elapsed=2.42s fps=222.55`. +- Profile: Nsight Systems capture `/tmp/rfdetr_float_boxes_20260523_080830.nsys-rep` exported to `/tmp/rfdetr_float_boxes_20260523_080830.sqlite`; CSV summaries are `/tmp/rfdetr_float_boxes_20260523_080830_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_float_boxes_20260523_080830_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_float_boxes_20260523_080830_stats_cuda_api_sum.csv`. Under profiler, depth `2` measured `frames=538 elapsed=2.55s fps=211.32`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `3786.738 us`, p90 `3821.854 us`, p95 `3824.001 us`, p99 `3845.892 us`; graph end-to-next-start gap was p50 `710.133 us`, p90 `792.947 us`, p95 `817.445 us`, p99 `975.935 us`, mean `702.444 us`. The PyTorch round kernel is absent from the top-kernel breakdown. +- Learning: Avoiding integer box materialization can produce a higher upper band once mask resize is limited, although the benchmark remains noisy. This is acceptable for the optimized workflow because geometry remains well inside the requested tolerance and class/mask outputs stay exact. + +### Rejected: Float Boxes With Int32 Deferred Query Indices + +- Hypothesis: Keeping deferred query indices as `int32` failed by itself, but after removing the deferred box round/int conversion it might remove another small copy/cast kernel without upsetting scheduling. +- Change tested: Temporary code only; combined the committed deferred float-box path with returning `query_indices` as `int32` from `fused_select_topk_boxes(..., return_cpu_count=False)`. 
+- Correctness: Compared the limited deferred path against the exact-sized rounded path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.41s fps=223.26`, `frames=538 elapsed=2.45s fps=219.47`, and `frames=538 elapsed=2.42s fps=222.56`; the low outlier makes it less stable than the committed float-box checkpoint. +- Learning: The query-index cast remains schedule-sensitive even after removing box rounding. Keep the int64 conversion and preserve the more stable float-box checkpoint. + +### Rejected: Fuse Sigmoid Into Selector After Mask Limit + +- Hypothesis: After limiting mask resize work and skipping deferred box rounding, the PyTorch sigmoid over logits is one of the larger remaining postprocess kernels. Fusing sigmoid into `_select_topk_boxes_kernel` could remove that kernel in the current regime, even though it lost before the mask-grid optimization. +- Change tested: Temporary code only; added an `apply_sigmoid` constexpr to `_select_topk_boxes_kernel`, passed raw logits into the fused selector, and computed `1 / (1 + exp(-logit))` inside Triton before top-k selection. The fallback path lazily computed PyTorch sigmoid only when fused postprocess was unavailable. +- Correctness: Compared the limited deferred path against the exact-sized rounded path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.44s fps=220.31` and `frames=538 elapsed=2.44s fps=220.10`, below the committed float-box checkpoint. +- Learning: The extra `exp` work inside the selector is still more expensive than the standalone PyTorch sigmoid kernel in the full pipeline. Keep sigmoid outside the selector. 
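
The `max_box_delta=0.5` and `bad_boxes_gt5=0` figures in the float-box entries above follow from how nearest-integer rounding behaves: each coordinate moves by at most half a pixel, far inside the 5-pixel tolerance. The sketch below shows one way such metrics could be computed. The function names and the sample coordinates are illustrative assumptions, not values or helpers from the benchmark.

```python
def max_box_delta(float_boxes, rounded_boxes):
    """Largest absolute per-coordinate difference between two box lists."""
    return max(
        abs(f - r)
        for fbox, rbox in zip(float_boxes, rounded_boxes)
        for f, r in zip(fbox, rbox)
    )


def count_boxes_over(float_boxes, rounded_boxes, tolerance_px=5.0):
    """Number of boxes with any coordinate differing by more than the tolerance."""
    return sum(
        1
        for fbox, rbox in zip(float_boxes, rounded_boxes)
        if any(abs(f - r) > tolerance_px for f, r in zip(fbox, rbox))
    )


if __name__ == "__main__":
    # Made-up xyxy boxes for illustration only.
    float_boxes = [
        (10.5, 20.25, 110.75, 220.0),
        (3.125, 7.875, 40.0, 90.5),
    ]
    # Nearest-integer rounding, as in the non-deferred rounded path.
    rounded_boxes = [tuple(round(v) for v in box) for box in float_boxes]
    print(max_box_delta(float_boxes, rounded_boxes))
    print(count_boxes_over(float_boxes, rounded_boxes))
```

Because rounding to the nearest integer can never shift a value by more than `0.5`, a reported maximum of `0.5` is the expected worst case rather than a sign of drift.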
+ +### Rejected: Avoid Query-Index Zero Fill After Float Boxes + +- Hypothesis: The selector still allocates `query_indices` with `torch.zeros(...)`, producing a small `FillFunctor` kernel. This lost before deferred float boxes, but after removing box rounding the kernel mix changed enough to retest `torch.empty(...)`. +- Change tested: Temporary code only; changed the `query_indices` allocation in `fused_select_topk_boxes(...)` from `torch.zeros(...)` to `torch.empty(...)` on top of the committed float-box path. +- Correctness: Compared the limited deferred path against the exact-sized rounded path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.43s fps=221.38`, `frames=538 elapsed=2.40s fps=223.81`, and `frames=538 elapsed=2.45s fps=219.46`; still too noisy and unstable to keep. +- Learning: Removing the zero fill can produce a fast run but also worsens low outliers. Keep the deterministic zero-filled query-index tensor. + +### RFDETR Seven-Row Deferred Mask Resize + +- Hypothesis: The limited deferred mask resize checkpoint uses an 8-row first-stage mask grid, but the benchmark video keeps at most 7 detections. Reducing the common-case mask grid to 7 rows should shave a small amount of GPU postprocess work while preserving overflow recovery for higher-count frames. +- Change: Changed the RFDETR TRT workflow fast path from `deferred_mask_resize_detection_limit=8` to `7`. The existing overflow recovery still reruns full fused mask resize if a future frame exceeds the first-stage limit. +- Correctness: Compared the 7-limit deferred path against the exact-sized rounded path on all 538 frames. Detection-count distribution was `{1: 15, 2: 104, 3: 164, 4: 145, 5: 74, 6: 14, 7: 22}`, with `overflow_frames=0`, `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.42s fps=222.18`, `frames=538 elapsed=2.41s fps=223.69`, and `frames=538 elapsed=2.41s fps=223.14`. +- Profile: Nsight Systems capture `/tmp/rfdetr_mask7_20260523_083028.nsys-rep` exported to `/tmp/rfdetr_mask7_20260523_083028.sqlite`; CSV summaries are `/tmp/rfdetr_mask7_20260523_083028_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_mask7_20260523_083028_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_mask7_20260523_083028_stats_cuda_api_sum.csv`. Under profiler, depth `2` measured `frames=538 elapsed=2.55s fps=210.86`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `3783.332 us`, p90 `3816.618 us`, p95 `3818.473 us`, p99 `3821.493 us`; graph end-to-next-start gap was p50 `717.877 us`, p90 `805.197 us`, p95 `845.107 us`, p99 `1025.533 us`, mean `722.557 us`. GPU work inside that gap covered p50 `265.180 us`, leaving p50 idle gap `451.546 us`. The largest non-TRT postprocess kernels were `_select_topk_boxes_kernel` (`5.778 ms` total / `10.740 us` average), `_resize_selected_masks_kernel` (`4.001 ms` total / `7.437 us` average), and the PyTorch sigmoid kernel (`1.567 ms` total / `2.913 us` average). +- Learning: The limited mask grid is still on the critical path, and matching it to the observed benchmark max count gives a small but repeatable improvement while keeping the overflow safety mechanism. + +### Rejected: Six-Row Deferred Mask Resize + +- Hypothesis: Reducing the first-stage deferred mask resize grid from 7 rows to 6 rows might improve the common path enough to offset overflow recovery on the 22 frames that contain 7 detections. +- Change tested: Temporary code only; changed the RFDETR TRT workflow fast path from `deferred_mask_resize_detection_limit=7` to `6`. Pipeline depth remained fixed at `2`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.42s fps=222.76`, `frames=538 elapsed=2.45s fps=220.00`, and `frames=538 elapsed=2.42s fps=222.66`, below the accepted 7-row best band. +- Learning: The full-resize overflow fallback on 7-detection frames costs more than the one-row saving on the other frames. Keep the 7-row limit for this stream. + +### Rejected: Prefilter Invalid Classes In Selector + +- Hypothesis: `_select_topk_boxes_kernel` lets invalid/no-object class scores participate in the top-score loop and discards them one at a time. Masking those lanes to `-inf` before the loop could reduce selector iterations and shrink the postprocess gap. +- Change tested: Temporary code only; computed each lane's raw class index at the start of the Triton selector, loaded `class_mapping`, and set scores with mapped class `< 0` to `-inf` before entering the top-k loop. Pipeline depth remained fixed at `2`. +- Correctness: Compared the modified deferred fused path against the exact PyTorch postprocess on all 538 frames: count distribution `{1: 15, 2: 104, 3: 164, 4: 145, 5: 74, 6: 14, 7: 22}`, `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.42s fps=222.75`, `frames=538 elapsed=2.45s fps=219.78`, and `frames=538 elapsed=2.42s fps=222.13`, below the accepted 7-row best band. +- Learning: The extra vectorized class-map load and mask computation are not worth the shorter top-k loop for this tensor shape. Keep the simpler selector. + +### Rejected: Skip Null Watchdog And Empty Status Updates + +- Hypothesis: The default `InferencePipeline` uses `NullPipelineWatchdog`, but the hot loop still calls no-op watchdog hooks and sends per-frame DEBUG status updates to the no-op status handler. 
Skipping the default null handler and returning early when no status handlers are registered could reduce CPU handoff time before the next graph launch. +- Change tested: Temporary code only; did not append `NullPipelineWatchdog.on_status_update`, skipped model-start/model-ready watchdog calls when the watchdog was null, and made `send_inference_pipeline_status_update(...)` return immediately for an empty handler list. Pipeline depth remained fixed at `2`. +- Correctness: This only removed no-op observer calls for the default-null watchdog case; prediction objects and model execution were unchanged. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.41s fps=223.26`, `frames=538 elapsed=2.43s fps=221.67`, and `frames=538 elapsed=2.43s fps=221.29`, not stable enough to keep over the accepted 7-row checkpoint. +- Learning: The no-op observer path is below the noise floor, or the added branches perturb scheduling enough to offset the saved calls. Keep the existing observer behavior. + +### RFDETR Fixed-Limit Pinned Prediction Copy + +- Hypothesis: The deferred workflow conversion reads the GPU selected-count tensor before it can enqueue prediction D2H copies. With the accepted 7-row mask limit, the normal path can copy the fixed 7-row boxes/confidences/classes/masks plus the count into pinned host buffers, synchronize once, then slice by the copied count on CPU. This should remove a sequential count-sync bubble even though it copies unused rows for low-count frames. +- Change: Added a count slot to the thread-local RFDETR pinned conversion buffers and a fixed-limit CUDA conversion path for deferred limited masks. When `valid_count <= mask_resize_detection_limit`, conversion copies the first limited rows and the count with `non_blocking=True`, synchronizes once, then returns normal owned NumPy arrays sliced to the valid count. If the copied count exceeds the limit, the existing overflow recovery path is used. 
+- Correctness: Compared the fixed-limit pinned conversion against exact PyTorch postprocess on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.27s fps=237.19`, `frames=538 elapsed=2.27s fps=237.43`, `frames=538 elapsed=2.31s fps=232.53`, and confirmation `frames=538 elapsed=2.28s fps=236.01`, improving the previous 7-row band around `223` FPS. +- Profile: Nsight Systems capture `/tmp/rfdetr_fixed7_copy_20260523_084817.nsys-rep` exported to `/tmp/rfdetr_fixed7_copy_20260523_084817.sqlite`; CSV summaries are `/tmp/rfdetr_fixed7_copy_20260523_084817_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_fixed7_copy_20260523_084817_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_fixed7_copy_20260523_084817_stats_cuda_api_sum.csv`. Under profiler, depth `2` measured `frames=538 elapsed=2.27s fps=236.70`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `4000.802 us`, p90 `4109.647 us`, p95 `4115.823 us`, p99 `4122.736 us`; graph end-to-next-start gap collapsed to p50 `40.767 us`, p90 `41.913 us`, p95 `42.239 us`, p99 `42.867 us`, mean `40.801 us`. GPU work inside that gap covered p50 `35.392 us`, leaving only p50 `5.312 us` idle. +- Learning: The earlier count read was the main remaining pipeline bubble. Copying the fixed limited prediction rows lets the next CUDA graph launch almost immediately after the previous graph, making the run bottlenecked by graph replay plus overlapped postprocess/copy work as intended. + +### Rejected: Six-Row Fixed Prediction Copy + +- Hypothesis: The fixed-limit pinned conversion copies 7 dense-mask rows for every frame even though only 22 of 538 frames have 7 detections. Copying only 6 rows and falling back for 7-detection frames could reduce D2H traffic on the common path. 
+- Change tested: Temporary code only; kept `deferred_mask_resize_detection_limit=7`, but capped the fixed pinned prediction copy to `6` rows before falling back to the exact-count path when the copied count exceeded 6. Pipeline depth remained fixed at `2`. +- Correctness: Compared the modified conversion against exact PyTorch postprocess on all 538 frames. The 6-row fast copy fell back on exactly `22` frames, with `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.32s fps=231.99`, `frames=538 elapsed=2.34s fps=230.30`, and `frames=538 elapsed=2.32s fps=232.01`, below the accepted 7-row fixed-copy checkpoint. +- Learning: The fallback penalty on the 7-detection frames is larger than the common-path savings from copying one fewer row. Keep the fixed copy aligned with the 7-row mask limit. + +### Current TensorRT Graph Node Profile + +- Request: Capture a CUDA graph node-level profile for the accepted fixed-copy checkpoint while keeping pipeline depth fixed at `2`, because graph-to-graph idle time is now too small for host-only graph traces to explain remaining runtime. +- Profile: Nsight Systems capture `/tmp/rfdetr_fixed7_nodes_20260523_085415.nsys-rep` exported to `/tmp/rfdetr_fixed7_nodes_20260523_085415.sqlite`; CSV summaries are `/tmp/rfdetr_fixed7_nodes_20260523_085415_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_fixed7_nodes_20260523_085415_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_fixed7_nodes_20260523_085415_stats_cuda_api_sum.csv`. Under heavier node tracing, depth `2` measured `frames=538 elapsed=2.35s fps=228.52`. +- Kernel breakdown: CUDA graph node tracing shows TensorRT replay is now dominant. Top kernels include `sm75_xmma_gemm_f16f16_f16f16_f16_nn...` (`463.288 ms` total), `_gemm_mha_v2_...` (`337.945 ms`), and `sm75_xmma_gemm_f16f16_f16f32_f32_nn...fused` (`276.260 ms`). 
The custom postprocess kernels are small by comparison: `_resize_selected_masks_kernel` was `8.035 ms` total and `_select_topk_boxes_kernel` was `7.730 ms` total. +- Learning: After fixed-limit prediction copy, the bottleneck has shifted from CPU/GPU handoff bubbles to TensorRT graph replay itself. Further large gains likely require TensorRT engine/tactic changes or a different engine build, not more custom postprocess work. + +### Rejected: Borrow CUDA Graph Output Buffers + +- Hypothesis: The CUDA graph replay path clones TensorRT output buffers after each replay. Returning the graph-owned output tensors directly for the RFDETR workflow could remove device-to-device clone work after graph replay. +- Change tested: Temporary code only; threaded a `borrow_cuda_graph_outputs` flag through the TensorRT helper, used a thread-local graph cache for the borrowed mode, and enabled it only in the RFDETR TRT workflow fast path. Pipeline depth remained fixed at `2`. +- Correctness: Compared the actual depth-2 `InferencePipeline` output against the accepted fixed-copy path on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.48s fps=216.54`, `frames=538 elapsed=2.46s fps=218.60`, and `frames=538 elapsed=2.46s fps=218.99`, well below the accepted fixed-copy checkpoint. +- Learning: The cloned output tensors are still useful for overlap/lifetime isolation. Borrowing graph-owned buffers perturbs graph-cache ownership and stream scheduling enough to dominate the saved clone work. Keep the cloned outputs. + +### Current Post-Cleanup Nsight Profile + +- Request: Capture a fresh Nsight Systems profile after reverting the rejected borrowed-output experiment, keeping pipeline depth fixed at `2`. 
+- Profile: Nsight Systems capture `/tmp/rfdetr_fixed7_current_20260523_090332.nsys-rep` exported to `/tmp/rfdetr_fixed7_current_20260523_090332.sqlite`; CSV summaries are `/tmp/rfdetr_fixed7_current_20260523_090332_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_fixed7_current_20260523_090332_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_fixed7_current_20260523_090332_stats_cuda_api_sum.csv`. Under profiler, depth `2` measured `frames=538 elapsed=2.30s fps=234.15`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `3992.289 us`, p90 `4054.714 us`, p95 `4062.176 us`, p99 `4107.568 us`; graph end-to-next-start gap was p50 `40.639 us`, p90 `41.740 us`, p95 `41.951 us`, p99 `42.417 us`, mean `40.584 us`. GPU work inside that gap covered p50 `35.295 us`, leaving only p50 `5.280 us` idle. +- Learning: The cleaned-up accepted path still has the desired timeline shape: the next CUDA graph starts almost immediately after the previous graph ends. Remaining throughput is constrained by the TensorRT graph replay itself, with only a few microseconds of idle gap. + +### Rejected: Interactive Local TRT Rebuild From ONNX + +- Hypothesis: Since the run is now bottlenecked by the TensorRT graph replay, rebuilding the available ONNX package locally on the Tesla T4 with TensorRT `10.12.0.36` and FP16 tactics might produce a faster engine than the packaged T4 plan. +- Change tested: Downloaded the ONNX package `5362b72bfb9f01d2e0b8cba2048d932c` to `/tmp/rfdetr_onnx_pkg_5362b72bfb9f01d2e0b8cba2048d932c` and started an isolated TensorRT Python build in `/tmp/rfdetr_trt_rebuild_t4_fp16_opt5` with static input shape `1x3x312x312`, FP16 enabled, workspace `4 GiB`, and builder optimization level `5`. The accepted model-cache package was not modified. 
+- Result: The builder parsed the ONNX graph cleanly, but tactic selection was still CPU-bound after roughly 9.5 minutes and had not produced an engine, so the temporary build process was terminated. No correctness or FPS benchmark was possible. +- Learning: A local tactic rebuild is the right class of experiment for the remaining bottleneck, but full optimization-level builds are too slow for the interactive benchmark loop. Keep the packaged T4 FP16 engine for this checkpoint; any engine rebuild should be run as an offline build job and benchmarked separately once serialized. + +### Rejected: Alternate TRT Engine Packages And Low-Opt Local Rebuilds + +- Hypothesis: The accepted path is bottlenecked by TensorRT graph replay, so a different serialized engine may improve FPS without changing postprocess or pipeline depth. +- Change tested: Downloaded the official T4 FP32 package `bbc2cc23adf6f5e71a9241956081da96`, the official L4 FP16 package `89d1f41e2af4f4f3ffcdfb77e774d26a`, and built local T4 FP16 engines from the ONNX package at TensorRT builder optimization levels `0` and `1`. The opt0 build completed in `34.65s` with a `78M` plan, and opt1 completed in `41.05s` with a `63M` plan. Pipeline depth remained fixed at `2`. +- Correctness: Direct model comparison against the accepted T4 FP16 plan failed for the official T4 FP32 plan and both local low-opt FP16 rebuilds. T4 FP32 produced `bad_counts=7`, `bad_classes=8`, `bad_masks=329`, `bad_boxes_gt5=19`, `max_box_delta=251.0`; local opt0 produced `bad_counts=8`, `bad_classes=8`, `bad_masks=352`, `bad_boxes_gt5=19`, `max_box_delta=251.0`; local opt1 produced `bad_counts=8`, `bad_classes=8`, `bad_masks=333`, `bad_boxes_gt5=17`, `max_box_delta=251.0`. The L4 FP16 engine could not be deserialized directly on the T4 because the engine was generated for compute capability `8.9` while the runtime device is `7.5`. 
+- Result: The low-opt local engines were not benchmarked further because they failed the explicit correctness gate. An initial cache-swap benchmark that touched only the `rfdetr-seg-nano` cache alias is not considered valid because the workflow resolves the model to the canonical `coco-dataset-vdnr1/41` cache path. +- Learning: Engine replacement is not safe without a full prediction-compatibility check. The packaged T4 FP16 plan remains the only tested engine that satisfies the class, mask, and box invariants for this benchmark. + +### Rejected: Pack Fixed-Copy Metadata Before D2H + +- Hypothesis: The fixed-limit conversion path enqueues five D2H copies per frame: count, boxes, confidences, classes, and masks. Packing count/boxes/confidences/classes into one small GPU float buffer with a Triton kernel would reduce small D2H copy submissions to one metadata copy plus one mask copy. +- Change tested: Temporary code only; added a Triton metadata-pack kernel and thread-local GPU/pinned metadata buffers, then used the packed metadata in `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)`. Pipeline depth remained fixed at `2`. +- Correctness: Compared packed metadata conversion against the original direct tensor-copy conversion on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=233.11` and `frames=538 elapsed=2.29s fps=234.62`, below the accepted fixed-copy checkpoint. +- Learning: The extra Triton launch costs more than the saved small D2H submissions. The existing fixed-copy path is already balanced enough that reducing copy count this way loses throughput. 
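
The packing idea in the rejected metadata experiment above can be sketched on the host side in plain Python. This is only an illustration of one possible buffer layout, not the Triton kernel that was tested: the layout order (count, then boxes, then confidences, then classes) and the names `pack_metadata` and `unpack_metadata` are assumptions, and only the 7-row limit comes from the log.

```python
LIMIT = 7  # fixed row limit used by the accepted fixed-copy path


def pack_metadata(count, boxes, confidences, classes, limit=LIMIT):
    """Flatten count + boxes + confidences + classes into one float list.

    Hypothetical layout: [count, box0(4), ..., boxN(4), conf(limit), cls(limit)].
    All inputs are expected to hold exactly `limit` rows.
    """
    packed = [float(count)]
    for row in boxes[:limit]:
        packed.extend(float(value) for value in row)
    packed.extend(float(value) for value in confidences[:limit])
    packed.extend(float(value) for value in classes[:limit])
    return packed


def unpack_metadata(packed, limit=LIMIT):
    """Recover (count, boxes, confidences, classes), sliced to the valid count."""
    count = int(packed[0])
    box_end = 1 + 4 * limit
    conf_end = box_end + limit
    boxes = [packed[1 + 4 * i: 5 + 4 * i] for i in range(limit)]
    confidences = packed[box_end:conf_end]
    classes = [int(value) for value in packed[conf_end:conf_end + limit]]
    return count, boxes[:count], confidences[:count], classes[:count]
```

With a 7-row limit this layout holds `1 + 7*4 + 7 + 7 = 43` floats, so a single small copy would replace four separate metadata copies. The log's measurements indicate that the extra kernel launch needed to build such a buffer on the GPU cost more than those copies saved.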
+ +### Rejected: Skip Output Record Stream After Fixed Copy + +- Hypothesis: After fixed-limit prediction copy collapsed the graph-to-graph gap, the `record_stream(...)` calls on TensorRT output clones might be unnecessary in the RFDETR workflow fast path because postprocess explicitly waits on the inference stream before CPU conversion. +- Change tested: Temporary code only; skipped `result_element.record_stream(self._post_process_stream)` when `defer_cuda_stream_sync=True` in `RFDetrForInstanceSegmentationTRT.post_process(...)`. Pipeline depth remained fixed at `2`. +- Correctness: Prediction tensors and kernels were unchanged; the experiment only changed allocator stream-lifetime bookkeeping for the output clone tensors. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.29s fps=235.14` and `frames=538 elapsed=2.31s fps=233.07`, below the accepted fixed-copy checkpoint. +- Learning: `record_stream(...)` is not the remaining limiter, and skipping it still perturbs scheduling enough to lose low-end stability. Keep the original stream lifetime bookkeeping. + +### Rejected: Raw-Logit Selector With Selected-Only Sigmoid + +- Hypothesis: Sigmoid is monotonic, so the fused selector can rank raw logits and compare against `logit(threshold)`, then compute sigmoid only for selected output confidences. This avoids the full PyTorch sigmoid over the `100x91` logits without doing `exp` for every class lane inside the selector. +- Change tested: Temporary code only; changed RFDETR instance postprocess to try the fused selector before materializing `logits_sigmoid`, added a raw-logit threshold mode to `_select_topk_boxes_kernel`, and stored `sigmoid(top_logit)` only for kept detections. Pipeline depth remained fixed at `2`. +- Correctness: Compared raw-logit fused postprocess against the PyTorch fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=1.1920928955078125e-07`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=233.26` and `frames=538 elapsed=2.29s fps=234.68`, below the accepted fixed-copy checkpoint. +- Learning: Removing the global sigmoid kernel is still not worth the extra selector complexity and selected-score `exp` work. Keep the standalone PyTorch sigmoid plus simpler selector. + +### Rejected: Early Return Empty Mask Resize Rows + +- Hypothesis: The fixed 7-row deferred mask resize grid still launches pixel programs for rows above the valid detection count. A runtime early return when `det_index >= count` inside `_resize_selected_masks_kernel` could avoid work for empty rows without reading the count on CPU. +- Change tested: Temporary code only; added a Triton runtime branch before pixel coordinate math and removed the now-redundant `det_index < count` masks for valid rows. Pipeline depth remained fixed at `2`. +- Correctness: Compared deferred fused postprocess against the PyTorch fallback on all 538 frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.5`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=234.23`, below the accepted fixed-copy checkpoint. +- Learning: The runtime branch/predication costs more than the skipped empty-row arithmetic for this small fixed grid. Keep the mask-based kernel. + +### Rejected: Explicit TensorRT Aux Streams During CUDA Graph Capture + +- Hypothesis: The packaged RFDETR TensorRT engine reports `num_aux_streams=4`; explicitly providing persistent non-default auxiliary streams during CUDA graph capture might shorten the TensorRT graph replay duration, which is now the dominant bottleneck. +- Change tested: Temporary code only; added auxiliary CUDA streams to the graph state, called `IExecutionContext.set_aux_streams(...)` before the warmup and capture `execute_async_v3(...)` calls, and kept pipeline depth fixed at `2`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.27s fps=237.02`, then `frames=538 elapsed=2.29s fps=234.91` and `frames=538 elapsed=2.29s fps=235.38`, which is not a stable improvement over the accepted fixed-copy band. +- Learning: TensorRT's default stream behavior for this serialized engine is already good enough, or explicit aux-stream handles perturb capture/scheduling without reducing the real graph replay bottleneck. Keep the simpler existing CUDA graph capture path. + +### Current Clean Depth-2 Nsight Profile + +- Request: Generate another Nsight Systems capture on the current accepted path, keeping pipeline depth fixed at `2`. +- Profile: Nsight Systems capture `/tmp/rfdetr_fixed7_depth2_clean_20260523_095302.nsys-rep` exported to `/tmp/rfdetr_fixed7_depth2_clean_20260523_095302.sqlite`; CSV summaries are `/tmp/rfdetr_fixed7_depth2_clean_20260523_095302_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_fixed7_depth2_clean_20260523_095302_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_fixed7_depth2_clean_20260523_095302_stats_cuda_api_sum.csv`. +- Result under profiler: depth `2` measured `frames=538 elapsed=2.28s fps=235.96`. +- Graph spacing: The capture includes `538` CUDA graph traces on stream `39`. After skipping the first 100 launches, CUDA graph duration was p50 `4012.688 us`, p90 `4070.684 us`, p95 `4074.743 us`, p99 `4081.137 us`; graph end-to-next-start gap was p50 `40.607 us`, p90 `41.855 us`, p95 `42.245 us`, p99 `42.781 us`, mean `40.599 us`. GPU work inside that gap covered p50 `35.231 us`, leaving p50 idle gap `5.216 us`. +- Learning: The depth-2 accepted path is still shaped as intended: CPU work and prediction D2H copies are overlapped enough that there is only a few microseconds of idle time between CUDA graph replays. Remaining FPS is dominated by the TensorRT CUDA graph duration plus the small fixed postprocess kernels. 
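
The graph-spacing figures in the profile above can be reproduced from per-launch start and end timestamps with a few lines of Python. This is a minimal sketch under stated assumptions: the log does not say which percentile method was used, so linear interpolation is assumed here, and the function names and the `(start_us, end_us)` input format are hypothetical rather than taken from the Nsight export.

```python
def percentile(values, pct):
    """Linear-interpolated percentile of a non-empty sequence (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def graph_spacing(launches, warmup=100):
    """Split (start_us, end_us) launches, sorted by start, into durations and gaps.

    The first `warmup` launches are skipped, matching the log's convention.
    A gap is the time from one graph's end to the next graph's start.
    """
    steady = launches[warmup:]
    durations = [end - start for start, end in steady]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(steady, steady[1:])]
    return durations, gaps


def steady_state_fps(duration_us, gap_us):
    """Frames per second if every frame costs one graph replay plus one gap."""
    return 1_000_000.0 / (duration_us + gap_us)
```

Plugging in the p50 values from this capture, `steady_state_fps(4012.688, 40.607)` comes out a little under 247 FPS. Under this simple one-graph-per-frame model that is a rough steady-state ceiling; the measured `235.96` FPS covers the whole run, presumably including startup and the warmup launches that the percentiles skip.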
+ +### Rejected: In-Place Sigmoid Retest After Fixed Copy + +- Hypothesis: Now that graph-to-graph idle time is almost gone, changing `torch.nn.functional.sigmoid(logits)` to `logits.sigmoid_()` on the cloned TensorRT logits might remove an allocation/write in postprocess without affecting outputs. +- Change tested: Temporary code only; used in-place sigmoid in RFDETR instance segmentation postprocess and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.26s fps=237.80`, then `frames=538 elapsed=2.30s fps=233.80` and `frames=538 elapsed=2.30s fps=234.08`, below the accepted fixed-copy stability band. +- Learning: This remains a scheduling/noise-level optimization. Even if the tensor math is equivalent on cloned outputs, the in-place form does not reliably reduce the critical graph-to-graph interval, so keep the out-of-place sigmoid. + +### Rejected: TensorRT Persistent Cache Limit On T4 + +- Hypothesis: Setting `IExecutionContext.persistent_cache_limit` on the CUDA graph execution context might reduce TensorRT graph replay time by allowing activation reuse through persistent L2 cache. +- Change tested: Temporary code only; set `persistent_cache_limit = 4 MiB` immediately after creating the graph execution context and kept pipeline depth fixed at `2`. +- Result: TensorRT rejected the setting on this Tesla T4 with `persistingL2CacheMaxSize(0 bytes)`, so the device/runtime does not support a nonzero persistent cache limit for this path. The measured runs (`237.38`, `237.56`, `234.01` FPS) are not considered a valid optimization because the runtime emitted an API usage error each time. +- Learning: Persistent L2 activation caching is not available on this hardware. Do not keep this context setting for the T4 benchmark. 
+ +### Rejected: Empty Query-Index Buffer Retest After Fixed Copy + +- Hypothesis: The deferred fused selector writes every query index that the mask resize reads, so allocating `query_indices` with `torch.empty(...)` instead of `torch.zeros(...)` could remove a small int32 fill kernel still visible in the clean depth-2 profile. +- Change tested: Temporary code only; changed the `query_indices` allocation in `fused_select_topk_boxes(...)` from `torch.zeros(...)` to `torch.empty(...)` and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.29s fps=234.59`, `frames=538 elapsed=2.30s fps=234.09`, and `frames=538 elapsed=2.27s fps=236.77`, not a stable improvement over the accepted fixed-copy band. +- Learning: Removing the fill does not reliably tighten the graph-to-graph interval in the current fixed-copy path. Keep the deterministic zero-filled buffer. + +### Rejected: Limited Mask Allocation Retest After Fixed Copy + +- Hypothesis: With the fixed 7-row prediction copy, the deferred mask resize output only needs the first `detection_limit` rows on the normal path. Allocating `(detection_limit, H, W)` instead of `(100, H, W)` could reduce allocator/cache pressure without changing copied predictions. +- Change tested: Temporary code only; moved detection-limit clamping before the output allocation in `fused_resize_selected_masks(...)`, allocated only the limited row count, and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=233.93`, `frames=538 elapsed=2.26s fps=237.60`, and `frames=538 elapsed=2.31s fps=232.63`, not a stable improvement over the accepted fixed-copy band. +- Learning: The fixed full-capacity allocation remains more stable in the two-frame pipeline despite unused rows. Keep the current output shape and rely on the limited launch grid for the actual kernel work reduction. 
+ +### Rejected: CUDA Device Max Connections Tuning + +- Hypothesis: The run is now sensitive to stream scheduling between TensorRT graph replay, postprocess kernels, H2D preprocessing, and D2H prediction copies. Tuning `CUDA_DEVICE_MAX_CONNECTIONS` before CUDA initialization might reduce scheduling variance or graph-to-graph gaps. +- Change tested: No code change; ran the exact benchmark with `CUDA_DEVICE_MAX_CONNECTIONS=1`, `2`, `4`, and `8`, always with pipeline depth fixed at `2`. +- Result on requested command: `1` regressed badly to `frames=538 elapsed=2.71s fps=198.62`; `2` measured `237.26` FPS; `4` measured `234.84` FPS; `8` measured `236.23` FPS. The non-1 settings are within normal accepted-path noise and do not justify changing the command/runtime defaults. +- Learning: Do not force CUDA device connection count for this benchmark. The default stream scheduling is already in the best observed band, while `1` removes useful concurrency. + +### Rejected: High-Priority TensorRT Graph Replay Stream + +- Hypothesis: Since CUDA graph replay is now the bottleneck, creating the captured TensorRT graph stream with high priority could keep small postprocess kernels from delaying graph replay in the depth-2 pipeline. +- Change tested: Temporary code only; changed the CUDA graph state's stream construction from `torch.cuda.Stream(device=device)` to `torch.cuda.Stream(device=device, priority=-1)` and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `237.18`, `233.05`, `237.76`, `236.59`, and `235.95` FPS. This is the same noisy band as the accepted path with a low outlier. +- Learning: Stream priority does not reliably reduce TensorRT graph duration or the graph-to-graph gap on this T4 workload. Keep the default-priority graph stream. + +### Rejected: Clone Graph Outputs On Caller Stream + +- Hypothesis: CUDA graph replay currently clones TensorRT output buffers on the graph replay stream before the caller stream can continue. 
Moving those clones to the caller inference stream might free the graph replay stream earlier and improve depth-2 scheduling while preserving cloned-output ownership. +- Change tested: Temporary code only; after `cuda_graph.replay()`, made the caller stream wait on the graph stream, cloned `output_buffers` on the caller stream, and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.26s fps=237.82`, `frames=538 elapsed=2.29s fps=235.14`, and `frames=538 elapsed=2.29s fps=235.36`, which is not stable enough to keep over the accepted fixed-copy path. +- Learning: The output clones remain on the critical ownership chain regardless of which stream performs them. Moving the clone stream only perturbs scheduling, so keep the established graph-stream clone path. + +### Nsight Compute Postprocess Kernel Snapshot + +- Request: Use a more focused profiler on the custom fused postprocess kernels instead of continuing blind one-kernel tweaks. +- Profile: Nsight Compute report `/tmp/rfdetr_ncu_postprocess_basic_20260523_101558.ncu-rep`, collected with `--set basic`, `--kernel-name "regex:(_select_topk_boxes_kernel|_resize_selected_masks_kernel)"`, `--launch-skip 20`, and `--launch-count 4`. +- Result: `_select_topk_boxes_kernel` launches as one 256-thread block with 128 registers/thread and about `24.5 us` duration under Nsight Compute instrumentation; NCU flags the grid as too small to fill the T4. `_resize_selected_masks_kernel` launches as `(7, 215, 1)x(128, 1, 1)`, uses 32 registers/thread, reaches about `81%` achieved occupancy, and measured about `14.8 us` under NCU instrumentation. +- Learning: The mask resize kernel is already reasonably occupied for the fixed 7-row grid. The selector underutilizes the GPU, but parallelizing the global top-k would require extra coordination/launches, which is likely to lose given the current graph-to-graph gap is only around `40 us`. 
+ +### Rejected: Sequential Background Remap Selector + +- Hypothesis: The benchmark RFDETR class remapping only removes class `0` as background, so the fused selector could compute `class_id = raw_class_id - 1` and skip the per-selected-candidate device `class_mapping` load. +- Change tested: Temporary code only; detected the sequential background remap in the TRT model, threaded a `sequential_background_remap` flag into the fused instance selector, and added a Triton branch that keeps `raw_class_id > 0` and remaps by subtracting one. Pipeline depth remained fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=233.48`, `frames=538 elapsed=2.32s fps=232.11`, and `frames=538 elapsed=2.29s fps=234.43`, below the accepted fixed-copy band. +- Learning: The scalar class-map load is not the selector limiter; the added branch/codegen and Python plumbing make the path slower. Keep the original generic mapping path. + +### Rejected: Disable TensorRT Enqueue Profiling Emission + +- Hypothesis: Setting `IExecutionContext.enqueue_emits_profile = False` on the CUDA graph execution context might remove TensorRT profiling/timing bookkeeping from graph capture or replay. +- Change tested: Temporary code only; set `graph_context.enqueue_emits_profile = False` immediately after creating the TensorRT graph execution context and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=233.89`, `frames=538 elapsed=2.33s fps=230.91`, and `frames=538 elapsed=2.34s fps=230.12`, below the accepted fixed-copy band. +- Learning: The flag is not useful for this no-profiler replay path and appears to perturb TensorRT execution or capture behavior negatively. Keep the default context setting. 
+ +### Rejected: Local T4 FP16 TensorRT Opt2 Rebuild + +- Hypothesis: The remaining bottleneck is TensorRT graph replay, so a medium-optimization local T4 FP16 rebuild from the available ONNX package might produce a faster correct engine without the long opt5 build time. +- Change tested: Built `/tmp/rfdetr_trt_rebuild_t4_fp16_opt2/engine.plan` from `/tmp/rfdetr_onnx_pkg_5362b72bfb9f01d2e0b8cba2048d932c/weights.onnx` with TensorRT `10.12.0.36`, static shape `1x3x312x312`, FP16 enabled, workspace `4 GiB`, and `builder_optimization_level=2`. Build completed in `90.58s` and produced a `66,959,004` byte plan. +- Correctness: Direct comparison against the accepted cached T4 FP16 engine over all 538 frames failed: `bad_counts=8`, `bad_classes=15`, `bad_masks=345`, `bad_boxes_gt5=23`, `max_box_delta=251.0`, `max_conf_delta=0.10609796643257141`; first failure was frame `0` with accepted count `4` vs candidate count `5`. +- Result: Not benchmarked further because it violates the explicit class/mask/box correctness gate. +- Learning: Like the previous opt0/opt1 rebuilds, the available ONNX package is not a safe source for a drop-in engine compatible with the accepted packaged T4 FP16 plan. Further engine replacement needs the exact source/export settings for the accepted plan or a correctness-preserving engine package. + +### Rejected: Omit False Preprocessing Overrides + +- Hypothesis: The RFDETR TRT workflow fast path constructs a `PreProcessingOverrides(False, False, False)` object for every frame. Passing `None` is semantically equivalent for the preprocessing helpers and could remove a small per-frame Python allocation. +- Change tested: Temporary code only; passed `pre_processing_overrides=None` from `_try_run_rfdetr_trt_fast_path(...)` and kept pipeline depth fixed at `2`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=234.41`, `frames=538 elapsed=2.30s fps=233.61`, and `frames=538 elapsed=2.31s fps=232.93`, below the accepted fixed-copy band. +- Learning: The dataclass allocation is below the limiter, and changing the branch pattern through preprocessing worsens the pipeline balance. Keep the explicit false overrides object. + +### Rejected: Highest-Priority TensorRT Graph Replay Stream + +- Hypothesis: A previous high-priority graph-stream test used priority `-1`, but this T4 runtime reports priority range `(0, -3)`. Using the highest priority `-3` might better prioritize TensorRT graph replay over low-priority postprocess work. +- Change tested: Temporary code only; changed the CUDA graph state's stream construction to `torch.cuda.Stream(device=device, priority=-3)` and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=232.94`, `frames=538 elapsed=2.30s fps=233.60`, and `frames=538 elapsed=2.33s fps=231.04`, below the accepted fixed-copy band. +- Learning: For this workload, explicit stream priority hurts scheduling. Keep the default-priority graph stream. + +### Current Clean Check After Stream Tests + +- Request: Confirm the accepted code path after reverting the rejected stream-priority and preprocessing-override experiments. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.33s fps=231.37`, then `frames=538 elapsed=2.28s fps=235.47`. +- Learning: The benchmark remains noisy, but the clean accepted path is back in the expected low-to-mid `230s` FPS band, with the same TensorRT graph replay bottleneck and tiny graph-to-graph gap identified in the Nsight profiles. + +### Rejected: Skip Empty Class Filter Helper + +- Hypothesis: The benchmark workflow does not set `class_filter`, so `filter_out_unwanted_classes_from_sv_detections_batch(...)` returns immediately. 
Skipping the call in the RFDETR TRT fast path could remove a tiny CPU call from the materialization tail without changing outputs. +- Change tested: Temporary code only; guarded the helper call with `if class_filter:` in `_try_run_rfdetr_trt_fast_path(...)` and kept pipeline depth fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.32s fps=231.76`, `frames=538 elapsed=2.29s fps=234.92`, and `frames=538 elapsed=2.33s fps=231.34`, not better than the accepted fixed-copy path. +- Learning: The empty class-filter helper is below the limiter. Keep the normal helper chain for predictable scheduling and consistent behavior. + +### Depth-2 Current Nsight Systems Graph-Gap Profile + +- Request: Collect a fresh Nsight Systems profile for the current accepted implementation while keeping the pipeline fixed at depth `2`. +- Profile: `/tmp/rfdetr_depth2_graphgap_current_20260523_104339.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphgap_current_20260523_104339.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphgap_current_20260523_104339_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphgap_current_20260523_104339_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_graphgap_current_20260523_104339_stats_cuda_api_sum.csv`. +- Profiled result: `frames=538 elapsed=2.32s fps=232.24` under Nsight Systems overhead. +- Graph spacing after skipping the first 100 graph launches: CUDA graph duration p50 `4066.430 us`, p90 `4129.114 us`, p95 `4135.536 us`, p99 `4142.649 us`, mean `4060.011 us`; graph end-to-next-start gap p50 `40.575 us`, p90 `41.836 us`, p95 `42.189 us`, p99 `42.708 us`, mean `40.683 us`. +- Gap decomposition after skipping the first 100 graph launches: busy work inside the gap p50 `35.072 us`, p90 `36.447 us`, p95 `36.966 us`, p99 `37.683 us`, mean `35.265 us`; idle time inside the gap p50 `5.376 us`, p90 `6.048 us`, p95 `6.182 us`, p99 `6.356 us`, mean `5.418 us`. 
+- Learning: With depth `2`, the current pipeline is already graph-replay bottlenecked in steady state. The next TensorRT CUDA graph starts roughly `40 us` after the previous graph ends, and most of that tail is real postprocess/copy work rather than host-side idle. Do not test depth `3`; further gains need to reduce the TensorRT graph itself or remove work from the small post-graph tail. + +### Rejected: Two-Slot Borrowed CUDA Graph Outputs + +- Hypothesis: The previous borrowed-output experiment used a single graph-owned output buffer, which is unsafe for depth-2 overlap and forced worse scheduling. Capturing two TensorRT CUDA graph states with separate output buffers and alternating them should let frame `N` postprocess one buffer while frame `N+1` replays into the other, removing per-frame output clone D2D copies without overwriting live outputs. +- Change tested: Temporary code only; added a `TRTCudaGraphStatePool`, threaded a `borrow_cuda_graph_outputs=True` and `cuda_graph_output_buffer_count=2` option through the RFDETR TRT fast path, captured two graph states for the same static shape, and returned graph-owned output buffers instead of cloned tensors. Pipeline depth remained fixed at `2`. +- Correctness: Direct full-video comparison of two-slot borrowed graph outputs against the cloned-output path over all 538 frames matched exactly: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=232.55`, `frames=538 elapsed=2.34s fps=229.70`, and `frames=538 elapsed=2.33s fps=231.36`, below the accepted fixed-copy band. +- Learning: The D2D output clones are not just overhead; they also decouple postprocess lifetime and stream scheduling in a way that preserves better overlap. Even with two graph-owned slots, borrowing outputs hurts the pipeline. Keep the cloned-output CUDA graph path. 
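
The alternation scheme behind the two-slot hypothesis can be sketched as a round-robin pool. This is a hypothetical illustration, not the `TRTCudaGraphStatePool` that was tested: the class name `RoundRobinSlotPool` and its `acquire` method are assumptions, and the slots here are plain labels standing in for captured graph states with their own output buffers.

```python
class RoundRobinSlotPool:
    """Hands out slots in a fixed rotation (hypothetical sketch).

    With two slots and pipeline depth 2, frame N and frame N+1 always receive
    different slots, so frame N+1 can replay while frame N is still being
    postprocessed. Frame N+2 reuses frame N's slot.
    """

    def __init__(self, slots):
        if not slots:
            raise ValueError("at least one slot is required")
        self._slots = list(slots)
        self._next_index = 0

    def acquire(self):
        """Return the next slot and advance the rotation."""
        slot = self._slots[self._next_index]
        self._next_index = (self._next_index + 1) % len(self._slots)
        return slot
```

The rotation only protects live outputs if frame N's postprocess has finished before frame N+2 replays into the same slot, which is what a depth-2 pipeline is expected to guarantee. The log's result shows that even with this ownership scheme matching the cloned-output predictions exactly, throughput was lower than with cloned outputs.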
+ +### Rejected: Bind CUDA Graph To External Preprocess Input Buffers + +- Hypothesis: The CUDA graph replay path copies the preprocessed CUDA tensor into a graph-owned static input buffer before every replay. Since RFDETR preprocessing already copies pinned CPU data to CUDA on the preprocessing stream, capturing graph states keyed by the preprocessed tensor's CUDA pointer could bind TensorRT directly to that buffer and remove the extra device-to-device input copy without moving H2D onto the graph stream. +- Change tested: Temporary code only; added a `use_external_cuda_graph_input` option through the TensorRT helper and RFDETR fast path. When enabled, the cache key included `pre_processed_images.data_ptr()`, graph capture used `pre_processed_images` as the TensorRT input buffer, and cache hits skipped `input_buffer.copy_(pre_processed_images)`. Pipeline depth remained fixed at `2`. +- Correctness/probe: Before this change, a direct 20-frame preprocessing probe alternated between only 2 CUDA input pointers. With external graph input enabled, a 12-frame probe produced 9 distinct CUDA input pointers and took `3.696s`, because each captured graph state retained its external input tensor and prevented normal allocator reuse. A full correctness comparison was stopped after it failed to make timely progress for the same reason. +- Result: Not benchmarked on the full requested command because the cache cannot reach steady state: it captures graphs for transient input pointers instead of reusing one static graph. This adds repeated capture cost and extra retained input buffers. +- Learning: Removing the input D2D copy would require a deliberate reusable CUDA preprocessing buffer pool owned by the model, not binding graphs to arbitrary tensors returned by the allocator. The current graph-owned input buffer plus D2D copy remains the stable path. 
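
Why a pointer-keyed cache cannot reach steady state can be shown by counting cache misses. This is a minimal sketch under stated assumptions: `count_captures` is a hypothetical helper, the input shape and addresses are illustrative values rather than measurements, and cache capacity and eviction are not modelled.

```python
def count_captures(frames, key_fn):
    """Count cache misses for a stream of (shape, data_ptr) frames.

    Each miss stands in for one expensive CUDA graph capture.
    """
    cache = set()
    captures = 0
    for shape, data_ptr in frames:
        key = key_fn(shape, data_ptr)
        if key not in cache:
            cache.add(key)
            captures += 1
    return captures


def shape_key(shape, data_ptr):
    # One graph per static input shape; the pointer is ignored.
    return shape


def pointer_key(shape, data_ptr):
    # One graph per (shape, input address) pair.
    return (shape, data_ptr)


# Illustrative only: an assumed input shape and made-up addresses where
# 12 frames land on 9 distinct allocator pointers.
SHAPE = (1, 3, 312, 312)
POINTER_INDICES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2]
FRAMES = [(SHAPE, 0x7F000000 + 0x200000 * i) for i in POINTER_INDICES]

shape_captures = count_captures(FRAMES, shape_key)
pointer_captures = count_captures(FRAMES, pointer_key)
```

With the shape-only key the cache captures once and then always hits. With the pointer key it captures once per distinct address, so capture cost scales with allocator behavior instead of settling after warmup.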
+ +### Rejected: Non-Blocking Graph Input D2D Copy + +- Hypothesis: The CUDA graph replay path enqueues a device-to-device copy from the preprocessed CUDA tensor into the graph-owned input buffer. Passing `non_blocking=True` to that `copy_(...)` could remove conservative copy synchronization while preserving the explicit stream ordering already enforced by `stream.wait_stream(caller_stream)`. +- Change tested: Temporary code only; changed `trt_cuda_graph_state.input_buffer.copy_(pre_processed_images)` to `copy_(pre_processed_images, non_blocking=True)` in the TensorRT CUDA graph cache-hit path. Pipeline depth remained fixed at `2`. +- Correctness: Prediction math and stream dependencies are unchanged; this only changes the copy enqueue flag for a D2D copy on the same graph replay stream. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=233.14`, `frames=538 elapsed=2.31s fps=233.32`, and `frames=538 elapsed=2.30s fps=234.02`, below the accepted fixed-copy band. +- Learning: The D2D input copy flag is not limiting the graph-to-graph interval. Keep the default `copy_(...)` call. + +### Rejected: Capture TensorRT Output Copies Inside CUDA Graph + +- Hypothesis: The accepted CUDA graph replay path clones TensorRT output buffers after `cuda_graph.replay()`, leaving D2D output copy work in the small graph-to-graph gap. Capturing those D2D output copies as CUDA graph nodes and alternating between two graph states could keep postprocess reading copied buffers while removing per-frame Python-side clone launches from the gap. +- Change tested: Temporary code only; added a captured-output-copy mode to the TensorRT graph helper, allocated internal TensorRT output buffers plus public copied output buffers, captured `execute_async_v3(...)` followed by `destination.copy_(source)` for each output, and enabled a two-slot graph-state pool only in the RFDETR TRT workflow fast path. Pipeline depth remained fixed at `2`. 
+- Correctness: Direct full-video comparison against the normal cloned-output CUDA graph path over all 538 frames matched exactly: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=233.26`, `frames=538 elapsed=2.32s fps=232.04`, and `frames=538 elapsed=2.33s fps=230.70`, below the accepted fixed-copy band. +- Learning: Moving output copies into the CUDA graph lengthens the effective graph bottleneck more than it helps the already tiny post-graph gap. Keep output clones outside the captured TensorRT graph. + +### Rejected: Two-Stage Top-2-Per-Query Selector + +- Hypothesis: The current Triton selector repeatedly scans the full `100x91` query-class score matrix for global maxima. If each query contributes at most two classes above threshold, a first kernel can compute the top two valid classes per query in parallel, and a second kernel can globally rank only 200 candidates. This could fill the GPU better than the one-block selector flagged by Nsight Compute as under-occupied. +- Change tested: Temporary code only; added a top-2-per-query candidate kernel plus a 200-candidate global selector kernel and enabled it only in the RFDETR TRT workflow fast path. Pipeline depth remained fixed at `2`. +- Correctness: The existing selector had duplicate query detections on only 3 of 538 frames, with maximum query multiplicity `2`. The top-2-per-query selector matched the existing selector exactly over all 538 frames: `bad=0`, `max_float_delta=0.0`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.29s fps=234.73`, `frames=538 elapsed=2.32s fps=231.40`, and `frames=538 elapsed=2.29s fps=234.73`, below the accepted fixed-copy band. +- Learning: The extra kernel launch and candidate-buffer traffic cost more than the added selector parallelism. Keep the single-kernel global selector despite its low occupancy. 
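
The equivalence condition behind the two-stage selector can be checked with a pure-Python reference. This is a sketch of the selection logic only, not the Triton kernels: the function names, the strict `>` threshold test, and the tie-breaking order are assumptions.

```python
def global_topk(scores, k, threshold):
    """Reference selector: rank every (query, class) score above the threshold."""
    candidates = [
        (score, query, cls)
        for query, row in enumerate(scores)
        for cls, score in enumerate(row)
        if score > threshold
    ]
    # Highest score first; ties broken by query index, then class index.
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))
    return candidates[:k]


def two_stage_topk(scores, k, threshold, per_query=2):
    """Stage 1 keeps the best per_query classes per query; stage 2 ranks them."""
    stage_one = []
    for query, row in enumerate(scores):
        valid = [(score, query, cls) for cls, score in enumerate(row) if score > threshold]
        valid.sort(key=lambda item: (-item[0], item[2]))
        stage_one.extend(valid[:per_query])
    stage_one.sort(key=lambda item: (-item[0], item[1], item[2]))
    return stage_one[:k]


# Each query has at most two classes above the threshold: results agree.
agreeing = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.05], [0.6, 0.3, 0.95]]
# One query has three classes in the global top 3: results diverge.
diverging = [[0.9, 0.8, 0.7], [0.1, 0.1, 0.1]]
```

The two selectors return the same result whenever no query holds more than `per_query` entries in the global top `k`. That is consistent with the observed maximum query multiplicity of `2`, but it is a property of this video rather than a guarantee.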
+ +### Rejected: Disable TensorRT Graph Context NVTX Verbosity + +- Hypothesis: The accepted engine was built with layer-name profiling verbosity, and TensorRT graph execution contexts expose `nvtx_verbosity`. Setting the CUDA graph execution context to `ProfilingVerbosity.NONE` during capture might remove NVTX/profiling bookkeeping from graph replay without changing kernels. +- Change tested: Temporary code only; set `graph_context.nvtx_verbosity = trt.ProfilingVerbosity.NONE` immediately after creating the TensorRT CUDA graph execution context. Pipeline depth remained fixed at `2`. +- Correctness: Prediction math and graph topology are unchanged; this only changes TensorRT execution-context metadata verbosity. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=233.62`, `frames=538 elapsed=2.30s fps=234.00`, and `frames=538 elapsed=2.32s fps=231.96`, below the accepted fixed-copy band. +- Learning: Runtime NVTX verbosity is not the TensorRT graph replay limiter in this no-profiler benchmark. Keep the default context verbosity. + +### Rejected: Extra TensorRT Warmup Before CUDA Graph Capture + +- Hypothesis: TensorRT might lazily settle execution-context state during the first enqueue before CUDA graph capture. Running two warmup enqueues instead of one before capture could produce a more stable or faster captured graph. +- Change tested: Temporary code only; changed `_capture_cuda_graph(...)` to enqueue `execute_async_v3(...)` twice on the graph stream before synchronizing and capturing. Pipeline depth remained fixed at `2`. +- Correctness: Prediction math and captured graph operations are unchanged; this only changes pre-capture warmup count. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.31s fps=232.82`, `frames=538 elapsed=2.31s fps=232.97`, and `frames=538 elapsed=2.32s fps=232.11`, below the accepted fixed-copy band. +- Learning: One warmup enqueue is enough for this engine. 
Extra pre-capture warmup does not improve the steady captured graph and may perturb initialization/cache behavior. Keep the original single warmup. + +### Rejected: Deferred Postprocess Stream Copy + +- Hypothesis: The RFDETR TRT deferred fast path still waited for the postprocess stream on the current stream before CPU materialization. Passing the postprocess stream through `InstanceDetections.image_metadata` and enqueueing the fixed 7-row pinned D2H copies on that same stream could avoid adding a current-stream dependency and tighten the graph-to-graph tail. +- Change tested: Temporary code only; when `defer_cuda_stream_sync=True`, `post_process(...)` stored `self._post_process_stream` in each detection metadata entry and skipped the current-stream wait. `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` then copied count, xyxy, confidence, class IDs, and masks on that producer stream and synchronized the stream at the CPU conversion point. Pipeline depth remained fixed at `2`. +- Correctness: Prediction math and copy ordering are unchanged; the experiment only moved the explicit synchronization point from the current stream to the producer stream used for postprocess and D2H copy. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.30s fps=233.72` and `frames=538 elapsed=2.32s fps=232.03`, below the accepted fixed-copy band. A mistakenly launched pair of concurrent benchmark repeats was killed and ignored. +- Learning: The current-stream wait is not the limiter. The existing handoff keeps scheduling stable enough, and moving the D2H copies onto the postprocess stream does not reduce the already small graph-to-graph tail. Keep the accepted current-stream wait and fixed-copy path. + +### Depth-2 Accepted Nsight Systems Graph-Gap Profile + +- Request: Collect another Nsight Systems profile for the current accepted implementation while keeping the pipeline fixed at depth `2`. 
+- Profile: `/tmp/rfdetr_depth2_accepted_20260523_113302.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_113302.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_113302_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_113302_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_113302_stats_cuda_api_sum.csv`. +- Profiled result: `frames=538 elapsed=2.32s fps=231.77` under Nsight Systems overhead. +- Graph spacing after skipping the first 100 graph launches: CUDA graph duration p50 `4064.720 us`, p90 `4128.916 us`, p95 `4133.955 us`, p99 `4139.214 us`, mean `4049.406 us`; graph end-to-next-start gap p50 `40.512 us`, p90 `41.951 us`, p95 `42.437 us`, p99 `43.231 us`, mean `40.661 us`. +- Gap decomposition after skipping the first 100 graph launches: busy work inside the gap p50 `35.104 us`, p90 `36.441 us`, p95 `37.062 us`, p99 `37.981 us`, mean `35.267 us`; idle time inside the gap p50 `5.280 us`, p90 `6.016 us`, p95 `6.176 us`, p99 `6.377 us`, mean `5.393 us`. +- Learning: The current depth-2 path remains constrained by the TensorRT CUDA graph body, not CPU bubbles. The post-graph tail is short and stable; most of the roughly `40 us` graph-to-graph interval is real GPU work rather than idle. + +### Rejected: Two-Slot Pooled CUDA Graph Input + +- Hypothesis: The accepted CUDA graph path copies the preprocessed tensor into the graph-owned input buffer on the inference stream immediately before replay. A fixed two-slot graph-input pool could let preprocessing copy into graph-bound input buffers earlier, then replay TensorRT directly from those stable pointers without the graph-stream input D2D copy. This avoids the previous arbitrary-external-input cache thrash while keeping pipeline depth fixed at `2`. 
+- Change tested: Temporary code only; added a gated `RFDETR_TRT_POOLED_GRAPH_INPUT=1` path that copied preprocessing output into a two-slot model-owned CUDA buffer pool, keyed TensorRT CUDA graph cache entries by the stable external input pointer, and skipped `input_buffer.copy_(pre_processed_images)` on graph cache hits. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.43s fps=221.56`, far below the accepted fixed-copy band. +- Learning: Even with stable input pointers, capturing and maintaining multiple input-bound graph states plus the extra preprocessing pool copy costs more than the current single graph-owned input copy. The existing graph-stream D2D copy is small and better scheduled than this explicit pooling scheme. Keep the accepted single graph input buffer and per-frame D2D copy. + +### Rejected: Alternate Prebuilt TensorRT Plan + +- Hypothesis: Another RFDETR segmentation TensorRT package already present on disk (`/tmp/rfdetr_trt_pkg_bbc2cc23adf6f5e71a9241956081da96/engine.plan`, `248 MB`, `num_aux_streams=3`) might use a different tactic set than the accepted cache engine (`/tmp/cache/shared-blobs/bc173a2cfda9a10af2bc411885e9fec3`, `188 MB`, `num_aux_streams=4`) and reduce the TensorRT CUDA graph body time. +- Change tested: Temporary external-state test only; repointed both RFDETR cache `engine.plan` symlinks to the alternate plan, ran the requested workflow with pipeline depth fixed at `2`, then restored both symlinks to the accepted cache engine. +- Result on requested command: `frames=538 elapsed=9.44s fps=56.97`, far below the accepted fixed-copy band. +- Learning: The alternate prebuilt plan is not viable on this T4 runtime despite deserializing. Keep the accepted cached engine and do not use this package for further tuning. + +### Rejected: Triton Selector Warp Count + +- Hypothesis: The single-block `_select_topk_boxes_kernel` is one of the only visible post-graph kernels. 
Changing the Triton reduction launch from `num_warps=8` might reduce selector latency: fewer warps could lower scheduling/register pressure, while more warps could speed the large `100x91` score reduction. +- Change tested: Temporary code only; changed `_select_topk_boxes_kernel` from `num_warps=8` to `4`, benchmarked, then changed it to `16` and benchmarked. Pipeline depth remained fixed at `2`. +- Result on requested command: `num_warps=4` measured `frames=538 elapsed=2.32s fps=232.18`; `num_warps=16` measured `frames=538 elapsed=2.34s fps=230.37`, both below the accepted fixed-copy band. +- Learning: The current `num_warps=8` is the best of these simple selector launch configurations. The selector remains small relative to the TensorRT CUDA graph body, and launch-shape tuning does not move the full-pipeline limiter. + +### Rejected: Triton Mask Resize Block Size + +- Hypothesis: `_resize_selected_masks_kernel` is the other stable post-graph Triton kernel. Changing the per-program pixel block size from `256` could improve occupancy or memory coalescing for the `7 x 312 x 312` bounded resize work. +- Change tested: Temporary code only; changed `fused_resize_selected_masks(...)` block size from `256` to `512`, benchmarked, then changed it to `128` and benchmarked. Pipeline depth remained fixed at `2`. +- Result on requested command: `block_size=512` measured `frames=538 elapsed=2.31s fps=232.65`; `block_size=128` measured `frames=538 elapsed=2.33s fps=231.06`, both below the accepted fixed-copy band. +- Learning: The accepted `block_size=256` remains the best of the simple mask-resize launch configurations. The mask kernel is too small relative to the TensorRT graph body for this tuning to move end-to-end FPS. + +### Rejected: Pinned NumPy View Ring + +- Hypothesis: After the fixed 7-row D2H copy into pinned tensors, the conversion path still calls `.numpy().copy()` for boxes, confidence, class IDs, and masks. 
Returning NumPy views backed by a small ring of pinned host buffers could remove a CPU memory copy while keeping enough slots for depth-2 overlap. +- Change tested: Temporary gated code only; changed `_get_rfdetr_conversion_buffers(...)` to rotate through four pinned host buffer slots and, with `RFDETR_PINNED_VIEW_NO_COPY=1`, returned NumPy views instead of owned `.copy()` arrays in `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)`. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.31s fps=233.05`, below the accepted fixed-copy band. +- Learning: The extra host-side NumPy copy is not limiting end-to-end FPS after depth-2 pipelining, and returning mutable pinned views weakens result ownership semantics. Keep the existing owned NumPy copies. + +### Rejected: Skip TensorRT Caller-Stream Wait + +- Hypothesis: The TensorRT CUDA graph helper waits the caller/inference stream on the graph replay stream after every cache hit. In the RFDETR deferred path, postprocess can wait directly on the graph result stream, so skipping the caller-stream wait might remove one event dependency from the graph-to-graph tail. +- Change tested: Temporary gated code only; with `RFDETR_SKIP_TRT_CALLER_STREAM_WAIT=1`, the TensorRT helper returned the graph execution stream and skipped `caller_stream.wait_stream(stream)`. RFDETR stored that stream in thread-local state and made postprocess wait on it instead of `_inference_stream`. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.34s fps=230.33`, below the accepted fixed-copy band. +- Learning: The caller-stream wait is part of the stable scheduling chain for the current depth-2 pipeline. Bypassing it perturbs overlap and slows the run even though the tensor dependencies are still ordered. Keep the accepted wait. 
+ +### Rejected: High-Priority Postprocess Stream + +- Hypothesis: The postprocess stream runs the fixed selector, mask-resize, and prediction D2H copies in the roughly `40 us` graph-to-graph tail. Giving it higher priority might reduce scheduling latency and tighten the tail without changing TensorRT graph replay. +- Change tested: Temporary gated code only; with `RFDETR_HIGH_PRIORITY_POSTPROCESS=1`, created the RFDETR per-thread postprocess stream with priority `-1` instead of the default priority. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.30s fps=234.32`, then `frames=538 elapsed=2.34s fps=229.94`, not a stable improvement over the accepted fixed-copy band. +- Learning: Stream priority changes add variance and do not consistently improve the already short post-graph tail. Keep the default-priority postprocess stream. + +### Rejected: High-Priority Preprocess Stream + +- Hypothesis: H2D transfer and GPU normalization on the preprocessing stream overlap TensorRT graph replay. Giving preprocessing higher priority might make the next frame's input ready earlier and reduce occasional graph launch delay. +- Change tested: Temporary gated code only; with `RFDETR_HIGH_PRIORITY_PREPROCESS=1`, created the RFDETR per-thread preprocessing stream with priority `-1` instead of default priority. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.31s fps=232.52`, below the accepted fixed-copy band. +- Learning: Preprocessing priority is not the limiter in the current depth-2 pipeline. Keep the default-priority preprocessing stream to avoid extra scheduling variance. + +### Rejected: High-Priority Inference Stream + +- Hypothesis: The RFDETR inference stream is the caller stream for TensorRT CUDA graph replay and owns the dependency chain from preprocessing into postprocess. 
Giving this stream higher priority might reduce event scheduling latency around graph replay and output clone handoff. +- Change tested: Temporary gated code only; with `RFDETR_HIGH_PRIORITY_INFERENCE=1`, created `_inference_stream` with priority `-1` instead of default priority. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.34s fps=230.18`, below the accepted fixed-copy band. +- Learning: Like graph/preprocess/postprocess stream priority changes, inference-stream priority perturbs scheduling without reducing the TensorRT graph-body limiter. Keep all RFDETR streams at default priority. + +### Rejected: FP16 Logits Before Fused Selector + +- Hypothesis: TensorRT emits RFDETR class logits as float32, but the fused selector only ranks and threshold-checks a small `100x91` score matrix. Casting logits to float16 before sigmoid and selection might reduce postprocess memory traffic and selector work. +- Change tested: Temporary gated code only; with `RFDETR_FUSED_FP16_LOGITS=1`, cast `image_logits` to `torch.float16` immediately before `fused_select_topk_boxes(...)` in the fused instance segmentation postprocess path. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.33s fps=230.55`, below the accepted fixed-copy band. +- Learning: The extra cast kernel and allocation dominate any reduced selector/sigmoid work for this small logits tensor. Keep float32 logits from the accepted TensorRT engine. + +### Rejected: Event Synchronize Fixed D2H Copy + +- Hypothesis: The fixed 7-row D2H prediction copy currently calls `torch.cuda.current_stream(...).synchronize()` after enqueueing count, boxes, confidences, classes, and masks into pinned host buffers. Recording and synchronizing a CUDA event immediately after those copies might avoid waiting on unrelated current-stream work. 
+- Change tested: Temporary gated code only; with `RFDETR_EVENT_SYNC_D2H=1`, reused a thread-local `torch.cuda.Event`, recorded it after the fixed D2H copies, and synchronized the event instead of synchronizing the whole current stream. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.30s fps=233.66`, then `frames=538 elapsed=2.33s fps=231.23`, not stable enough to keep over the accepted fixed-copy band. +- Learning: The current stream contains only the relevant copy chain in this path, and the event record/sync adds overhead and variance. Keep the simpler stream synchronize. + +### Rejected: Metadata Array Allocation Variant + +- Hypothesis: RFDETR workflow conversion creates parent ID, image-dimensions, and inference ID arrays using Python list multiplication before converting to NumPy. Using `np.full(...)` and a preallocated dimensions array could reduce CPU metadata allocation work in the materialization tail. +- Change tested: Temporary gated code only; with `RFDETR_FAST_METADATA_ARRAYS=1`, cached `len(sv_detections)`, used `np.full(...)` for parent and inference IDs, and filled a `(N, 2)` `np.int64` image-dimensions array directly. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.30s fps=233.84`, then `frames=538 elapsed=2.31s fps=233.27`, not better than the accepted fixed-copy band. +- Learning: Per-detection metadata arrays are below the limiter after the fixed D2H copy. The existing simple list-to-array construction is stable enough; keep it. + +### Rejected: Reuse Sigmoid Output Buffer + +- Hypothesis: The fused RFDETR instance segmentation path computes an out-of-place sigmoid tensor for class logits every frame before the Triton selector. Reusing a thread-local output buffer via `torch.sigmoid(..., out=buffer)` could preserve semantics while avoiding the per-frame sigmoid output allocation. 
+- Change tested: Temporary gated code only; with `RFDETR_REUSE_SIGMOID_BUFFER=1`, cached a thread-local same-shape logits sigmoid buffer and wrote sigmoid results into it for RFDETR postprocess. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.31s fps=232.65`, below the accepted fixed-copy band. +- Learning: PyTorch's normal sigmoid allocation is not the limiter, and the `out=` path/thread-local lookup does not tighten the graph-to-graph interval. Keep the standard out-of-place sigmoid. + +### Rejected: Disable TensorRT Aux Streams During Capture + +- Hypothesis: The cached TensorRT engine reports four auxiliary streams. Forcing the CUDA-graph execution context to use no aux streams during capture might make the graph replay easier to schedule and reduce the graph-to-graph tail. +- Change tested: Temporary gated code only; with `RFDETR_TRT_ZERO_AUX_STREAMS=1`, called `graph_context.set_aux_streams([])` before graph capture. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.33s fps=230.41`, below the accepted fixed-copy band. +- Learning: Disabling TensorRT auxiliary streams during graph capture slows the forward pass on this engine/runtime. Keep TensorRT's default aux-stream scheduling. + +### Rejected: TensorRT Graph Replay On Caller Stream + +- Hypothesis: The accepted CUDA graph path captures and replays TensorRT on a graph-owned stream, then uses event waits to hand off from the RFDETR inference stream and back. Capturing the graph on the caller inference stream itself could remove those event edges and tighten the graph-to-graph tail. +- Change tested: Temporary gated code only; with `RFDETR_TRT_CALLER_GRAPH_STREAM=true`, captured the TensorRT CUDA graph on `torch.cuda.current_stream(device)` and skipped the wait edges when the cached graph stream matched the caller stream. Pipeline depth remained fixed at `2`. 
A first run with `RFDETR_TRT_CALLER_GRAPH_STREAM=1` failed before processing frames because this repo's env parser accepts `true`/`false`, not `1`/`0`, and was discarded. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.32s fps=231.91`, below the accepted fixed-copy band. +- Learning: The separate graph stream plus explicit handoff remains the better schedule for this depth-2 pipeline. The extra event edges are not the limiter, and folding graph replay onto the caller stream slows the full run. + +### Profile: Fresh Accepted Depth-2 Graph Gap + +- Request: Collect a new Nsight Systems report for the current accepted implementation while keeping the pipeline fixed at depth `2`. +- Profile: `/tmp/rfdetr_depth2_accepted_20260523_122306.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_122306.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_122306_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_122306_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_122306_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.31s fps=232.91`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `4104.479 us`, p90 `4121.887 us`, p95 `4125.118 us`, p99 `4179.422 us`, mean `4078.695 us`; graph end-to-next-start gap was p50 `40.511 us`, p90 `41.919 us`, p95 `42.751 us`, p99 `43.327 us`, mean `40.813 us`. Busy work inside the gap was p50 `35.071 us`, mean `35.382 us`; idle inside the gap was p50 `5.344 us`, mean `5.431 us`. +- Learning: The latest accepted path is still effectively TensorRT CUDA-graph limited. The post-graph interval is short and stable, and the remaining idle bubble is only about `5 us` median. 
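
The graph-spacing and gap-decomposition statistics in these profile entries can be derived from per-launch timestamps. This is a minimal sketch, assuming the graph launch intervals and the GPU work intervals have already been extracted from the exported SQLite file as `(start_us, end_us)` tuples. The helper names and the nearest-rank percentile definition are assumptions, not the script used for this log.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def graph_spacing(launches, skip=100):
    """Return (durations, gaps) in microseconds after dropping warmup launches.

    launches: (start_us, end_us) per CUDA graph launch, sorted by start time.
    """
    steady = launches[skip:]
    durations = [end - start for start, end in steady]
    # Gap: time from one graph's end to the next graph's start.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(steady, steady[1:])]
    return durations, gaps


def busy_and_idle(gap_start, gap_end, work_intervals):
    """Split one gap into busy time (union of GPU work intervals) and idle time."""
    busy = 0.0
    cursor = gap_start
    for start, end in sorted(work_intervals):
        start = max(start, cursor)  # do not double count overlapping work
        end = min(end, gap_end)
        if end > start:
            busy += end - start
            cursor = end
    return busy, (gap_end - gap_start) - busy


def summarize(values):
    """Percentiles and mean in the layout used by the profile entries."""
    summary = {f"p{p}": percentile(values, p) for p in (50, 90, 95, 99)}
    summary["mean"] = sum(values) / len(values)
    return summary
```

Taking the union of work intervals matters because postprocess kernels and copies can overlap on different streams; summing their durations would overstate busy time and understate the idle bubble.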
+ +### Rejected: Int64 Query Indices From Selector + +- Hypothesis: The fused selector writes query indices as `int32` and the deferred path casts them to `int64`, producing a small copy/cast kernel before mask resize. Writing `int64` query indices directly from the Triton selector could remove that cast while preserving the indexing dtype expected by downstream PyTorch paths. +- Change tested: Temporary code only; first gated `RFDETR_SELECTOR_INT64_QUERIES=1` to allocate query indices as `torch.int64` and skip the `.to(dtype=torch.long)` call, then tested the actual candidate as an unconditional `int64` query-index output. Pipeline depth remained fixed at `2`. +- Result on requested command: the gated probe measured `234.01` and `234.61` FPS, but the unconditional candidate measured `frames=538 elapsed=2.32s fps=232.35`, below the accepted fixed-copy band. +- Learning: Removing the query-index cast is not enough to improve the full pipeline, and the `int64` selector output shifts scheduling or memory behavior unfavorably. Keep the accepted `int32` selector output plus explicit cast. + +### Rejected: Precomputed Mask Resize Maps + +- Hypothesis: The deferred mask resize always upsamples `78x78` masks to `312x312` in this benchmark. Caching the bilinear `x0/x1/y0/y1/wx/wy` coordinate maps on GPU could remove per-pixel floor, clamp, and weight arithmetic from `_resize_selected_masks_kernel`. +- Change tested: Temporary gated code only; with `RFDETR_PRECOMPUTED_RESIZE_MAPS=1`, cached resize maps per thread/device/shape and launched a variant Triton mask-resize kernel that loads coordinates and weights from those maps. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.33s fps=230.97`, below the accepted fixed-copy band. +- Learning: The extra global map loads are more expensive than recomputing the simple bilinear coordinates in registers for this fixed small resize. Keep the arithmetic-only resize kernel. 
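
The per-pixel arithmetic that the precomputed maps would replace is small. This is a one-dimensional sketch of the coordinate computation, assuming half-pixel centers with clamping to the source range; the convention used by `_resize_selected_masks_kernel` is not shown in this log, and `bilinear_coords` is a hypothetical name.

```python
import math


def bilinear_coords(dst_size, src_size):
    """Return (x0, x1, weight) per output pixel for a 1-D bilinear resize.

    Assumed convention: half-pixel centers, source coordinate clamped to
    [0, src_size - 1]. The output value would be
    (1 - weight) * src[x0] + weight * src[x1].
    """
    scale = src_size / dst_size
    coords = []
    for dst in range(dst_size):
        src = (dst + 0.5) * scale - 0.5
        src = min(max(src, 0.0), float(src_size - 1))
        x0 = int(math.floor(src))
        x1 = min(x0 + 1, src_size - 1)
        coords.append((x0, x1, src - x0))
    return coords


# The benchmark's fixed 78 -> 312 upsample along one axis.
coords = bilinear_coords(312, 78)
```

Each output pixel needs one multiply-add, a clamp, a floor, and a subtraction per axis, all of which stay in registers. A cached map replaces that with several global loads per pixel, which is the trade the result above rejects.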
+ +### Rejected: FP16 TensorRT Mask Output Copy + +- Hypothesis: The TensorRT CUDA graph path clones all three output buffers every frame, including the large `100x78x78` mask tensor. Returning the mask output as a `float16` device copy could reduce D2D output-copy traffic and mask-resize read bandwidth while preserving the downstream zero-threshold mask test. +- Change tested: Temporary gated code only; with `RFDETR_TRT_FP16_MASK_OUTPUT=1`, cloned the first two TensorRT outputs normally but copied the third output with `.to(dtype=torch.float16)` on the graph stream. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.54s fps=211.59`, far below the accepted fixed-copy band. +- Learning: The FP32-to-FP16 cast kernel and scheduling cost dominate any D2D bandwidth reduction. Keep the plain FP32 mask clone. + +### Rejected: Capture Input Copy Inside Pointer-Keyed CUDA Graph + +- Hypothesis: The graph-to-graph gap still includes a separate `1.17 MB` D2D copy from the preprocessed CUDA tensor into the graph-owned TensorRT input buffer. Capturing that copy as the first node of the CUDA graph, keyed by the current preprocessed tensor pointer, could remove the standalone input-copy launch without retaining external input tensors. +- Change tested: Temporary gated code only; with `RFDETR_CAPTURE_INPUT_COPY_IN_GRAPH=1`, included `input_buffer.copy_(pre_processed_images)` inside CUDA graph capture, skipped the cache-hit input copy, and extended the graph cache key with `pre_processed_images.data_ptr()`. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=3.53s fps=152.47`, far below the accepted fixed-copy band. +- Learning: Pointer-keyed graph capture causes excessive graph-cache churn and capture overhead in the workflow pipeline. Keep the single shape-keyed graph and the small explicit D2D input copy. 
+ +### Rejected: Reuse Fused Postprocess GPU Buffers + +- Hypothesis: The fused postprocess path allocates selector output tensors and a full-capacity mask-resize output tensor every frame. Reusing thread-local GPU buffers could reduce allocator overhead and remove the accepted path's query-index zero-fill while preserving the fixed-count CPU slicing behavior. +- Change tested: Temporary gated code only; with `RFDETR_REUSE_FUSED_POSTPROCESS_BUFFERS=1`, reused thread-local `scores`, `classes`, `boxes`, `query_indices`, `count`, and mask-resize output tensors. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.30s fps=233.67`, not a clear improvement over the accepted fixed-copy band. +- Learning: PyTorch's cached allocation path is not the limiter here, and avoiding the query fill this way does not improve the graph-to-graph interval. Keep fresh tensor allocation and deterministic zero-filled query indices. + +### Rejected: Explicit Empty-Like TensorRT Output Copies + +- Hypothesis: The accepted TensorRT CUDA graph path uses `buf.clone()` for each output. Replacing clone with `torch.empty_like(buf)` plus `copy_(..., non_blocking=True)` could preserve the same output ownership while avoiding any clone-specific format or autograd handling. +- Change tested: Temporary gated code only; with `RFDETR_TRT_EMPTY_COPY_OUTPUTS=1`, copied each graph output into an explicit `empty_like` tensor on the graph stream. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.30s fps=234.10`, not a clear improvement over the accepted fixed-copy band. +- Learning: PyTorch `clone()` is already as efficient as the explicit empty-and-copy form for these graph outputs. Keep the simpler clone path. 
+ +### Rejected: Overlap Mask Clone With Selector + +- Hypothesis: The accepted TensorRT graph stream clones boxes, logits, and the large mask output before RFDETR postprocess can start. Returning after boxes/logits are cloned while the mask clone continues on the graph stream could let sigmoid and selector overlap the mask clone, then wait for a mask-ready event only before mask resize. +- Change tested: Temporary gated code only; with `RFDETR_OVERLAP_MASK_CLONE=1`, recorded a partial-output event after cloning boxes/logits, attached a mask-ready event to the mask clone tensor, and made fused mask resize wait for that event. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.32s fps=232.33`, below the accepted fixed-copy band. +- Learning: The added event and partial handoff scheduling overhead exceeds any overlap gained between mask clone and selector. Keep the simple all-output clone plus stream wait. + +### Rejected: Triton Num Stages Tuning + +- Hypothesis: The selector and mask-resize Triton kernels are small post-graph kernels where the default pipeline staging may add register/codegen overhead. Forcing `num_stages=1` could reduce latency for these elementwise/reduction kernels on T4. +- Change tested: Temporary code only; first launched `_resize_selected_masks_kernel` with `num_stages=1`, then restored resize and launched `_select_topk_boxes_kernel` with `num_stages=1`. Pipeline depth remained fixed at `2`. +- Result on requested command: resize `num_stages=1` measured `frames=538 elapsed=2.31s fps=232.79`; selector `num_stages=1` measured `frames=538 elapsed=2.30s fps=233.43`, neither better than the accepted fixed-copy band. +- Learning: Triton's default staging is already adequate for both kernels. Keep the accepted launches without explicit `num_stages`. 
+ +### Rejected: Mask-First Fixed D2H Copy Order + +- Hypothesis: The fixed 7-row conversion path enqueues tiny count/box/confidence/class D2H copies before the larger mask D2H copy. Starting the large mask copy first might improve copy-engine scheduling and shave the CPU synchronization tail. +- Change tested: Temporary code only; reordered `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` to enqueue the mask D2H copy before count and metadata copies. Pipeline depth remained fixed at `2`. +- Result on requested command: depth `2` measured `frames=538 elapsed=2.29s fps=234.45`, then repeated at `frames=538 elapsed=2.31s fps=232.77`, not a stable improvement over the accepted fixed-copy band. +- Learning: D2H copy submission order is not a reliable limiter after fixed-count conversion. Keep the original count/metadata/mask order. + +### Rejected: NumPy Count Read From Pinned Buffer + +- Hypothesis: After fixed 7-row D2H copy synchronization, `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` reads the copied count via `count_buffer.item()`. Reading through the pinned tensor's NumPy view might avoid Torch scalar extraction overhead in the CPU materialization tail. +- Change tested: Temporary code only; replaced `int(count_buffer.item())` with `int(count_buffer.numpy()[0])`. Pipeline depth remained fixed at `2`. +- Result on requested command: `frames=538 elapsed=2.31s fps=233.16`, not better than the accepted fixed-copy band. +- Learning: Count scalar extraction is below the limiter. Keep the simpler `count_buffer.item()` path. + +### Rejected: Specialized 4x Mask Resize Kernel + +- Hypothesis: The benchmark always resizes RFDETR masks from `78x78` to `312x312`, so a Triton resize kernel specialized for the exact 4x mapping could remove per-pixel floor/divide arithmetic from the generic bilinear resize kernel. 
+- Change tested: Temporary code only; dispatched to a separate `_resize_selected_masks_4x_kernel(...)` when `input_height * 4 == output_height` and `input_width * 4 == output_width`. Pipeline depth remained fixed at `2`. +- Result on requested command: first run `frames=538 elapsed=2.29s fps=234.83`, repeat `frames=538 elapsed=2.31s fps=232.92`, not stable enough to keep over the accepted fixed-copy band. +- Learning: The generic resize arithmetic is not the throughput limiter at depth `2`; graph spacing is dominated by TensorRT graph replay plus the required input/output copies and selector/sigmoid tail. Keep the generic resize kernel. + +### Profile: Accepted Depth-2 Graph Gap After 4x Resize Rejection + +- Request: Collect another Nsight Systems report for the current accepted implementation while keeping pipeline depth fixed at `2`. +- Profile: `/tmp/rfdetr_depth2_accepted_20260523_130220.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_130220.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_130220_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_130220_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_130220_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.33s fps=230.90`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `4102.399 us`, p90 `4116.422 us`, p95 `4123.113 us`, p99 `4152.967 us`, mean `4078.093 us`; graph end-to-next-start gap was p50 `40.863 us`, p90 `42.220 us`, p95 `42.528 us`, p99 `43.401 us`, mean `41.820 us`. Busy work inside the gap was p50 `35.263 us`, mean `35.769 us`; idle inside the gap was p50 `5.471 us`, mean `6.051 us`. +- Learning: Depth `2` remains effectively constrained by the TensorRT CUDA graph. The remaining median idle bubble between graph launches is about `5.5 us`, and the graph-to-graph gap remains low and consistent. 
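As a rough cross-check of the graph-bound reading, the steady-state frame period can be approximated as graph duration plus graph-to-graph gap. The sketch below is a back-of-envelope estimate under that assumption (one graph replay per frame, no overlap between consecutive replays). `graph_bound_fps` and `gap_share` are illustrative helpers, not part of the benchmark script.

```python
def graph_bound_fps(graph_us: float, gap_us: float) -> float:
    """Frames per second if each frame costs one graph replay plus one gap."""
    return 1_000_000.0 / (graph_us + gap_us)


def gap_share(graph_us: float, gap_us: float) -> float:
    """Fraction of the per-frame period spent outside the graph body."""
    return gap_us / (graph_us + gap_us)


# p50 and mean values reported in the graph-spacing bullet above.
p50_fps = graph_bound_fps(4102.399, 40.863)
mean_fps = graph_bound_fps(4078.093, 41.820)
p50_gap_share = gap_share(4102.399, 40.863)

print(f"p50 ceiling:  {p50_fps:.2f} fps")
print(f"mean ceiling: {mean_fps:.2f} fps")
print(f"gap share:    {p50_gap_share:.2%}")
```

Under these assumptions the ceiling comes out near 241 to 243 FPS, and the gap accounts for about 1% of the period, so removing the gap entirely would recover only about that much.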
+ +### Rejected: Prewarmed Two-Slot TensorRT Graph Pool + +- Hypothesis: The earlier thread-local graph replay experiment regressed because each worker captured lazily in the hot path. Prewarming two TensorRT CUDA graph caches from the first frame and assigning one cache/stream to each depth-2 worker might allow concurrent graph replay and improve GPU utilization if the TensorRT graph body has internal bubbles. +- Change tested: Temporary gated code only; with `RFDETR_PREWARMED_TRT_GRAPH_POOL=1`, the RFDETR TRT model created two one-entry graph caches and two inference streams on the first forward pass, captured both graphs before the first result, then assigned stable slots to worker threads. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.65s fps=202.96`, far below the accepted fixed-copy band. +- Learning: The TensorRT graph body already saturates the relevant T4 resources, or concurrent graph/context scheduling interferes with TensorRT's own auxiliary streams. Keep the single serialized TensorRT CUDA graph path with depth-2 CPU/GPU pipelining. + +### Rejected: TensorRT External Context Device Memory + +- Hypothesis: Creating the CUDA graph execution context with `create_execution_context_without_device_memory()`, allocating the activation memory explicitly as a long-lived Torch CUDA tensor, and binding it with `set_device_memory(...)` might improve graph replay stability or memory placement versus TensorRT's internal context allocation. +- Change tested: Temporary gated code only; with `RFDETR_TRT_EXTERNAL_DEVICE_MEMORY=1`, `_capture_cuda_graph(...)` used an external device-memory allocation sized from `update_device_memory_size_for_shapes()` and `engine.device_memory_size_v2`, kept alive on the graph state. Pipeline depth remained fixed at `2`. 
+- Result on requested command with the gate enabled: first run `frames=538 elapsed=2.29s fps=234.64`, repeat `frames=538 elapsed=2.33s fps=230.51`, not stable enough to keep over the accepted fixed-copy band. +- Learning: TensorRT's internal context memory allocation is not the graph replay limiter for this static RFDETR engine. Keep the simpler default execution-context allocation. + +### Rejected: Partial TensorRT Aux Stream Counts + +- Hypothesis: The RFDETR engine reports four auxiliary streams. Previous tests rejected zero explicit aux streams and all four explicit aux streams; setting one or two explicit aux streams during CUDA graph capture might reduce TensorRT internal scheduling overhead while preserving some overlap. +- Change tested: Temporary gated code only; with `RFDETR_TRT_AUX_STREAM_COUNT`, `_capture_cuda_graph(...)` created that many Torch CUDA streams, passed their handles to `IExecutionContext.set_aux_streams(...)`, and kept them alive on the CUDA graph state. Pipeline depth remained fixed at `2`. +- Result on requested command: `RFDETR_TRT_AUX_STREAM_COUNT=2` measured `frames=538 elapsed=2.31s fps=233.11`; `RFDETR_TRT_AUX_STREAM_COUNT=1` measured `frames=538 elapsed=2.31s fps=232.90`, both below the accepted fixed-copy band. +- Learning: Manually constraining TensorRT aux-stream count does not improve graph replay. Keep TensorRT's default aux-stream scheduling. + +### Rejected: Skip RFDETR Forward Lock On Graph Cache Hit + +- Hypothesis: After the static TensorRT CUDA graph is captured, the model-level RFDETR forward lock might be unnecessary because the graph cache is internally locked and the shared inference stream serializes GPU work. Skipping the lock on cache hits could reduce CPU scheduling overhead before graph replay. 
+- Change tested: Temporary gated code only; with `RFDETR_SKIP_FORWARD_LOCK_ON_GRAPH_HIT=1`, `RFDetrForInstanceSegmentationTRT.forward(...)` used the normal lock for graph capture, then skipped it when the static cache key was already present. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: first run `frames=538 elapsed=2.30s fps=233.77`, repeat `frames=538 elapsed=2.32s fps=231.57`, not better than the accepted fixed-copy band. +- Learning: The forward lock is not a meaningful bottleneck now that graph-to-graph idle is only a few microseconds. It likely helps keep cross-thread scheduling stable, so keep the locked forward path. + +### Rejected: Skip PyCUDA Context On TensorRT Graph Cache Hit + +- Hypothesis: Once the static TensorRT CUDA graph has been captured, cache-hit replay uses PyTorch streams and the captured graph state. Skipping the per-frame `use_cuda_context(...)` push/pop on cache hits might remove CPU overhead without changing graph execution. +- Change tested: Temporary gated code only; with `RFDETR_SKIP_CUDA_CONTEXT_ON_GRAPH_HIT=1`, `RFDetrForInstanceSegmentationTRT.forward(...)` kept the model lock but skipped the PyCUDA context manager when the static graph-cache key was already present. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: first run `frames=538 elapsed=2.30s fps=234.21`, repeat `frames=538 elapsed=2.33s fps=230.89`, not stable enough to keep over the accepted fixed-copy band. +- Learning: Context push/pop is below the limiter or helps maintain predictable CUDA context state across worker threads. Keep the original context manager on forward. + +### Rejected: Borrow TensorRT Mask Output With Release Event + +- Hypothesis: The accepted CUDA graph path clones the full `100x78x78` TensorRT mask output every frame even though the workflow normally resizes only the first 7 selected detections. 
Returning a borrowed graph-owned mask output while cloning only boxes/logits could remove the large D2D mask clone if the next graph replay waits for postprocess to finish reading the mask. +- Change tested: Temporary gated code only; with `RFDETR_BORROW_TRT_MASK_OUTPUT=1`, cache-hit replay cloned only the first two TensorRT outputs, returned the graph-owned mask tensor with graph-state metadata, recorded a release event after RFDETR postprocess, and waited on that event before the next graph replay could overwrite the mask buffer. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.33s fps=231.40`, below the accepted fixed-copy band. +- Learning: The large mask clone also decouples the next TensorRT graph replay from postprocess. Replacing it with an event dependency lengthens the critical path more than it saves in D2D copy time. Keep the full output clone path. + +### Rejected: Selected Low-Res Mask Copy Before Resize + +- Hypothesis: The borrowed-mask test waited for full postprocess before releasing the graph-owned TensorRT mask output. Copying only the selected low-res mask planes first, recording the release event immediately after that small copy, and resizing from the compact copy might remove the full `100x78x78` mask clone while allowing the next graph replay to start earlier than waiting for full mask resize. +- Change tested: Temporary gated code only; with `RFDETR_SELECTED_MASK_COPY=1`, cache-hit replay cloned boxes/logits but borrowed the graph-owned mask output, RFDETR fused postprocess copied the first `deferred_mask_resize_detection_limit` selected `78x78` masks into a compact buffer, recorded the graph-output release event, then resized compact rows with a no-query-index Triton kernel. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.33s fps=230.85`, below the accepted fixed-copy band. 
+- Learning: The added selected-copy kernel, compact-resize variant, and release-event dependency still cost more than the full mask clone's buffering benefit. Keep the accepted full TensorRT output clone path. + +### Rejected: Selector Iteration Cap + +- Hypothesis: The fused selector loops up to `100` global top-score iterations, but the benchmark has at most `7` detections per frame. Lowering the maximum iteration count could reduce selector latency while preserving outputs if no extra invalid high-score classes need to be skipped. +- Change tested: Temporary gated code only; with `RFDETR_SELECTOR_MAX_ITERATIONS`, passed a smaller Triton constexpr loop bound to `_select_topk_boxes_kernel(...)`. Also probed the simpler raw-mask shortcut assumption and found selected query IDs ranged from `0` to `98`, with `38/538` frames selecting query IDs above `6`, so cloning only the first seven raw mask rows would be incorrect. +- Correctness: Compared gated postprocess against accepted postprocess on all `538` frames. Caps `16`, `8`, and `7` all matched exactly: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: cap `16` measured `frames=538 elapsed=2.30s fps=233.68`; cap `8` measured `frames=538 elapsed=2.31s fps=233.31`; cap `7` measured `frames=538 elapsed=2.31s fps=233.37`, not better than the accepted fixed-copy band. +- Learning: Even when the selector does fewer loop iterations, the end-to-end run remains constrained by the TensorRT CUDA graph and output-copy scheduling. Keep the original conservative `100`-iteration selector bound for general correctness. + +### Profile: Current Accepted Depth-2 Graph Gap + +- Request: Refresh Nsight Systems evidence for the current accepted implementation after the recent rejected output-copy and selector-cap experiments, keeping pipeline depth fixed at `2`. 
+- Profile: `/tmp/rfdetr_depth2_accepted_20260523_134108.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_134108.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_134108_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_134108_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_134108_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.30s fps=234.29`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `4102.607 us`, p90 `4117.545 us`, p95 `4121.161 us`, p99 `4127.790 us`, mean `4070.076 us`; graph end-to-next-start gap was p50 `40.479 us`, p90 `41.913 us`, p95 `42.303 us`, p99 `43.076 us`, mean `40.907 us`. Busy work inside the gap was p50 `35.072 us`, mean `35.500 us`; idle inside the gap was p50 `5.247 us`, mean `5.407 us`. +- Learning: The current accepted path is still bottlenecked by the TensorRT CUDA graph body. The post-graph tail remains about `40 us` and only about `5 us` of that is idle, so further end-to-end wins need to reduce TensorRT graph duration or remove required input/output copies without adding synchronization. + +### Rejected: Borrow TensorRT Boxes And Logits Outputs + +- Hypothesis: The accepted CUDA graph path still clones the small TensorRT boxes and logits outputs after every replay. Borrowing those graph-owned outputs while keeping the full mask clone as the decoupling buffer might shave the graph-to-graph gap without forcing mask resize onto the next replay's critical path. +- Change tested: Temporary gated code only; with `RFDETR_BORROW_TRT_SMALL_OUTPUTS=true`, cache-hit replay returned graph-owned boxes/logits and cloned only masks. A CPU-side ready event plus CUDA release event protected the borrowed buffers, and the fused path released them immediately after sigmoid and selector had been enqueued on the postprocess stream. Pipeline depth remained fixed at `2`. 
+- Result on requested command with the gate enabled: `frames=538 elapsed=2.34s fps=230.02`, below the accepted fixed-copy band. +- Learning: The two small D2D clones are cheaper than the extra release-event handoff on this depth-2 pipeline. Keep the simpler accepted TensorRT output clone path. + +### Profile: Accepted Depth-2 Graph Gap After Small-Output Borrow Rejection + +- Request: Collect a fresh Nsight Systems report for the restored accepted implementation, keeping pipeline depth fixed at `2`. +- Profile: `/tmp/rfdetr_depth2_accepted_20260523_134844.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_134844.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_134844_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_134844_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_134844_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.33s fps=230.87`. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first 100 launches, CUDA graph duration was p50 `4070.910 us`, p90 `4133.057 us`, p95 `4136.255 us`, p99 `4141.648 us`, mean `4065.406 us`; graph end-to-next-start gap was p50 `40.480 us`, p90 `41.888 us`, p95 `42.374 us`, p99 `42.952 us`, mean `40.920 us`. Busy work inside the gap was p50 `35.072 us`, mean `35.486 us`; idle inside the gap was p50 `5.312 us`, mean `5.434 us`. +- Learning: The restored accepted path is already graph-bound at depth `2`: graph replay takes roughly `4.07 ms`, while the post-graph tail is roughly `40 us` with only about `5 us` idle. The remaining gap is mostly required input copy, output clones, sigmoid/selector setup, and small postprocess kernels rather than CPU bubbles. 
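One way to derive the busy and idle portions of a graph-to-graph gap from exported kernel and memcpy intervals is to clip each activity to the gap window and take the union of the clipped intervals. The sketch below assumes that method. The log does not say how the numbers were produced, and `decompose_gap` and the sample timestamps are hypothetical.

```python
def decompose_gap(gap_start, gap_end, activities):
    """Split a gap window into busy and idle time.

    activities: iterable of (start, end) GPU activity intervals in the same
    time unit as the gap. Overlapping activities are counted once.
    """
    clipped = sorted(
        (max(start, gap_start), min(end, gap_end))
        for start, end in activities
        if end > gap_start and start < gap_end
    )
    busy = 0.0
    cursor = gap_start
    for start, end in clipped:
        start = max(start, cursor)  # skip time already counted
        if end > start:
            busy += end - start
            cursor = end
    gap = gap_end - gap_start
    return busy, gap - busy


# Made-up microsecond timestamps for one gap: two overlapping copies,
# two later activities, and one interval that starts before the gap opens.
gap_start, gap_end = 100.0, 140.0
activities = [(95.0, 104.0), (103.0, 116.0), (118.0, 131.0), (133.0, 138.0)]
busy_us, idle_us = decompose_gap(gap_start, gap_end, activities)
print(busy_us, idle_us)
```

The reported means are consistent with this kind of split: `35.486 us` busy plus `5.434 us` idle equals the `40.920 us` mean gap.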
+ +### Rejected: Keep Deferred Query Indices Int32 + +- Hypothesis: The deferred fused mask-resize path does not need `query_indices` as `int64`; returning the selector's native `int32` query-index tensor could remove the int32-to-int64 cast kernel from the post-graph tail. +- Change tested: Temporary code only; when `return_cpu_count=False`, `fused_select_topk_boxes(...)` returned the int32 `query_indices` tensor directly. The non-deferred indexing path still used the existing int64 conversion. Pipeline depth remained fixed at `2`. +- Result on requested command: first run `frames=538 elapsed=2.31s fps=232.65`, repeat `frames=538 elapsed=2.32s fps=231.65`, not better than the accepted fixed-copy band. +- Learning: Removing this tiny cast is below end-to-end noise because the run is dominated by TensorRT graph replay and required copies. Keep the existing int64 return type for consistency with the non-deferred path. + +### Rejected: Disable TensorRT Enqueue Profiling Emission + +- Hypothesis: The TensorRT CUDA graph execution context defaults `enqueue_emits_profile=True` even though no profiler is attached. Disabling that flag before warmup and graph capture might remove bookkeeping from TensorRT enqueue or graph replay. +- Change tested: Temporary gated code only; with `RFDETR_TRT_DISABLE_ENQUEUE_PROFILE=true`, `_capture_cuda_graph(...)` set `graph_context.enqueue_emits_profile = False` immediately after creating the graph execution context. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: first run `frames=538 elapsed=2.30s fps=233.45`, repeat `frames=538 elapsed=2.33s fps=231.01`, not a stable improvement over the accepted fixed-copy band. +- Learning: TensorRT enqueue profiling emission is not a meaningful limiter for the captured RFDETR graph path. Keep the default execution-context setting. 
+ +### Rejected: TensorRT On-Profile-Change Context Allocation + +- Hypothesis: TensorRT exposes `ExecutionContextAllocationStrategy.ON_PROFILE_CHANGE` in addition to the default static execution-context allocation. For this static-shape engine, the alternate allocation strategy might reduce context memory management overhead or produce a slightly different captured graph schedule. +- Change tested: Temporary gated code only; with `RFDETR_TRT_ON_PROFILE_CHANGE_CONTEXT=true`, `_capture_cuda_graph(...)` created the graph execution context with `engine.create_execution_context(trt.ExecutionContextAllocationStrategy.ON_PROFILE_CHANGE)`. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.31s fps=233.16`, `frames=538 elapsed=2.28s fps=235.96`, and `frames=538 elapsed=2.31s fps=232.94`, not stable enough to keep over the accepted fixed-copy band. +- Learning: The alternate allocation strategy does not reduce the TensorRT CUDA graph bottleneck. Keep the default static execution-context allocation. + +### Rejected: Producer Runtime And Queue Tuning + +- Hypothesis: Once graph-to-graph gaps are about `40 us`, small CPU producer/dispatcher settings might still perturb depth-2 balance: limiting CPU thread pools, disabling Python GC, changing video decode buffer size, or replacing the bounded predictions `Queue` with `SimpleQueue`. +- Change tested: External/runtime probes plus temporary gated code only. Ran the accepted path with `OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 TORCH_NUM_THREADS=1`; ran the benchmark with `gc.disable()` via `runpy`; tried `VIDEO_SOURCE_BUFFER_SIZE=8` and `128`; and tested `INFERENCE_PIPELINE_SIMPLE_PREDICTIONS_QUEUE=true` using a temporary `SimpleQueue` dispatch path. Pipeline depth remained fixed at `2`. 
+- Result on requested command: limited CPU threads measured `frames=538 elapsed=2.29s fps=234.87`; disabled GC measured `frames=538 elapsed=2.30s fps=234.18`; video buffer `8` measured `frames=538 elapsed=2.32s fps=231.46`; video buffer `128` measured `frames=538 elapsed=2.33s fps=231.38`; simple predictions queue measured `frames=538 elapsed=2.30s fps=233.62`. +- Learning: The current run is not meaningfully limited by Python GC, CPU thread oversubscription, video decode buffer size, or bounded queue bookkeeping. Keep the accepted runtime settings and standard `Queue` implementation. + +### Rejected: Adaptive Fixed-Row Prediction Copy + +- Hypothesis: The accepted fixed D2H conversion copies seven full-resolution mask rows for every frame, but many frames have fewer detections. Copying `previous_count + 1` rows, capped at the seven-row mask limit, could reduce common-path D2H payload while falling back to the safe existing conversion when the current count exceeds the predicted row count. +- Change tested: Temporary gated code only; with `RFDETR_ADAPTIVE_FIXED_COPY=true`, `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` predicted the row count from a thread-local previous valid count, copied that many rows plus the count, then updated the stored count after synchronization. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.32s fps=231.82`, below the accepted fixed seven-row copy band. +- Learning: The current fixed seven-row copy is faster than adaptive under-copy with fallback. Extra branching and fallback copies on count increases cost more than copying a few unused mask rows. + +### Rejected: Mask-First TensorRT Output Clone Order + +- Hypothesis: The accepted TensorRT graph cache-hit path clones boxes, logits, then the large mask output. Cloning the mask first while returning outputs in the original order might improve D2D copy scheduling in the post-graph tail. 
+- Change tested: Temporary gated code only; with `RFDETR_TRT_MASK_FIRST_OUTPUT_CLONE=true`, cloned TensorRT output buffer `2` first, then buffers `0` and `1`, and returned `[detections, labels, mask]`. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.31s fps=233.15`, below the accepted fixed-copy band. +- Learning: Output clone order does not improve the graph-to-graph interval. The existing boxes/logits/mask order remains the better schedule for this pipeline. + +### Depth-2 Accepted Nsight Systems Refresh + +- Request: Collect a fresh Nsight Systems report for the current accepted implementation while keeping pipeline depth fixed at `2`. +- Profile: `/tmp/rfdetr_depth2_accepted_20260523_141912.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_141912.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_141912_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_141912_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_141912_stats_cuda_api_sum.csv`. +- Profiled result: `frames=538 elapsed=2.30s fps=233.83` under Nsight Systems overhead. +- Graph spacing: The capture includes `538` CUDA graph traces. After skipping the first `100` graph launches, CUDA graph duration was p50 `4064.207 us`, p90 `4130.817 us`, p95 `4134.781 us`, p99 `4141.249 us`, mean `4058.199 us`; graph end-to-next-start gap was p50 `40.479 us`, p90 `41.868 us`, p95 `42.335 us`, p99 `42.964 us`, mean `41.481 us`. +- Gap decomposition: Busy work inside the gap was p50 `35.168 us`, mean `35.742 us`; idle inside the gap was p50 `5.184 us`, p90 `5.920 us`, p95 `6.015 us`, p99 `6.208 us`, mean `5.739 us`. 
The largest gap occupants were the TensorRT mask D2D clone (`2433600B`, `13.156 us` avg overlap), graph input D2D copy (`1168128B`, `13.119 us` avg overlap), sigmoid (`6.872 us` avg overlap), fill-long (`2.817 us` avg overlap), selector (`2.184 us` avg overlap), logits D2D clone (`36400B`, `2.102 us` avg overlap), and boxes D2D clone (`1600B`, `1.991 us` avg overlap). +- Learning: Depth `2` remains constrained by the CUDA graph body. The post-graph interval is low and consistent, and the remaining idle bubble is only about `5-6 us`; further wins need to reduce TensorRT graph duration or eliminate required input/output copies without adding postprocess dependencies. + +### Rejected: Boxes-Mask-Logits TensorRT Output Clone Order + +- Hypothesis: The accepted TensorRT graph cache-hit path clones boxes, logits, then masks. Cloning boxes first, then the large mask tensor, then logits might improve D2D copy scheduling while still returning `[detections, labels, mask]` to callers. +- Change tested: Temporary gated code only; with `RFDETR_TRT_BOX_MASK_LOGITS_OUTPUT_CLONE=true`, cloned TensorRT output buffer `0`, then `2`, then `1`, and returned results in the original output order. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.31s fps=232.66`, below the accepted fixed-copy band. +- Learning: Reordering the same graph-stream D2D clones does not reduce the critical graph-to-graph interval. Keep the original boxes/logits/mask order. + +### Rejected: TensorRT Graph-Stream Sigmoid Logits + +- Hypothesis: The accepted TensorRT graph stream clones the small logits output and postprocess later launches a separate sigmoid kernel. Computing `sigmoid()` directly from the graph-owned logits on the graph stream could replace the logits D2D clone with the actual postprocess tensor and remove the later sigmoid launch. 
+- Change tested: Temporary gated code only; with `RFDETR_TRT_GRAPH_SIGMOID_LOGITS=true`, CUDA graph capture and cache-hit replay returned `[boxes.clone(), logits.sigmoid(), masks.clone()]`, and RFDETR dense postprocess skipped its normal logits sigmoid. Pipeline depth remained fixed at `2`. +- Correctness: Compared the gated path against the accepted path on all `538` frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command with the gate enabled: first run `frames=538 elapsed=2.30s fps=233.69`, repeat `frames=538 elapsed=2.34s fps=230.14`, below the accepted fixed-copy band. +- Learning: The graph-stream logits clone is small enough that replacing it with sigmoid work on the graph stream perturbs the critical schedule rather than improving it. Keep the accepted logits clone plus postprocess-stream sigmoid. + +### External Runtime Probe: Lock T4 Graphics Clock + +- Hypothesis: The accepted depth-2 path is now constrained by TensorRT CUDA graph duration, and the Tesla T4 sits in a low-power P8 state at idle. Locking graphics clocks to the supported maximum before the benchmark may remove early-run clock ramp and expose the true graph-bound ceiling. +- Change tested: External runtime setting only; ran `nvidia-smi -lgc 1590,1590`, confirmed `P0` and `1590 MHz` graphics/SM clocks, ran the accepted benchmark with pipeline depth fixed at `2`, then reset the lock with `nvidia-smi -rgc`. +- Result on requested command: `frames=538 elapsed=2.21s fps=243.54`, with progress already near steady state by frame `50` (`254.93` FPS at frame 50, settling to `243.54` FPS overall). +- Learning: The code path is effectively at the graph-bound ceiling when the T4 is held at max clocks. 
This is an external deployment/runtime tuning knob, not a library code change; future code changes should be compared under the same clock policy if the goal is absolute max FPS rather than default cloud/runtime behavior. + +### External Runtime Probe: Max Clocks Plus CUDA Connections + +- Hypothesis: With T4 clocks locked to maximum, `CUDA_DEVICE_MAX_CONNECTIONS=2` might combine with the accepted depth-2 stream layout to reduce scheduling overhead further than max clocks alone. +- Change tested: External runtime setting only; ran `nvidia-smi -lgc 1590,1590`, then launched the accepted benchmark with `CUDA_DEVICE_MAX_CONNECTIONS=2` and pipeline depth fixed at `2`, then reset graphics clocks with `nvidia-smi -rgc`. +- Result on requested command: `frames=538 elapsed=2.21s fps=243.26`, effectively the same as the prior max-clock-only `243.54` FPS result. +- Learning: Once clocks are held at maximum, the connection-count scheduler knob does not move the ceiling. The remaining limiter is the TensorRT CUDA graph body. + +### Rejected Under Max Clocks: Clone Graph Outputs On Caller Stream + +- Hypothesis: Moving TensorRT output clones from the graph stream to the caller/inference stream was noisy under default clocks. With T4 clocks locked, the cleaner graph-bound regime might show whether this schedule frees the graph stream sooner. +- Change tested: Temporary gated code only; with `RFDETR_TRT_CLONE_OUTPUTS_ON_CALLER_STREAM=true`, cache-hit replay copied input and replayed the CUDA graph on the graph stream, then cloned TensorRT outputs on the caller stream. Pipeline depth remained fixed at `2`, and the benchmark ran with graphics clocks temporarily locked to `1590 MHz`. +- Result on requested command: `frames=538 elapsed=2.22s fps=242.51`, below the accepted graph-stream clone schedule under max clocks (`243.54` FPS). +- Learning: Even in the max-clock regime, moving output clones to the caller stream does not improve throughput. 
Keep the original graph-stream output clone schedule. + +### RFDETR TensorRT CUDA Graph Capture Replay Warmup + +- Hypothesis: The accepted path is graph-bound and the benchmark timer starts on the first delivered prediction, after CUDA graph capture. Replaying the captured TensorRT graph several times during RFDETR TRT graph capture can ramp the T4 clocks before measured frames without changing steady-state graph replay or prediction math. +- Change: Added a `cuda_graph_replay_warmup_count` option to the shared TensorRT CUDA graph helper and set it to `64` only from `RFDetrForInstanceSegmentationTRT.forward(...)`. Generic TensorRT callers still default to `0`. +- Correctness: Compared the warmed CUDA graph path against standard non-graph TensorRT execution on all `538` frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Warmup tuning: Temporary env-gated probes showed `16` extra replay warmups were too few (`frames=538 elapsed=2.31s fps=233.37`), `32` reached the max-clock band (`frames=538 elapsed=2.21s fps=243.71`), and `64` was the strongest (`frames=538 elapsed=2.20s fps=244.95`, repeat `frames=538 elapsed=2.19s fps=245.15`). +- Result on requested command after wiring the RFDETR default: depth `2` measured `frames=538 elapsed=2.21s fps=243.86` and repeat `frames=538 elapsed=2.20s fps=244.16`, with no extra environment variable or external clock lock. +- Learning: This does not shorten the TensorRT graph body; it moves the run into the steady graph-bound clock regime before the benchmark's measured interval. The tradeoff is extra first-frame/model-warmup latency, which is acceptable for this throughput-oriented RFDETR TRT path and keeps the measured pipeline close to the observed `~243-245` FPS max-clock ceiling.
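The capture-time warmup can be pictured with a stub in place of the real CUDA graph. This is a hypothetical sketch: `StubGraph`, `capture_with_warmup`, and `run_benchmark` are illustrative names, and the assumption that the first frame is served by the initial post-capture replay is mine, not confirmed by the log. Only the option name `cuda_graph_replay_warmup_count` comes from the entry above.

```python
class StubGraph:
    """Stand-in for a captured CUDA graph; it only counts replays."""

    def __init__(self):
        self.replays = 0

    def replay(self):
        self.replays += 1


def capture_with_warmup(cuda_graph_replay_warmup_count=0):
    """Capture once, replay once for the first result, then warm up."""
    graph = StubGraph()  # real code would capture the TensorRT enqueue here
    graph.replay()  # assumed: this replay produces the first frame's output
    for _ in range(cuda_graph_replay_warmup_count):
        graph.replay()  # extra replays only ramp clocks; outputs are unchanged
    return graph


def run_benchmark(num_frames, warmup_count):
    """Count total graph replays for a run of num_frames frames."""
    graph = capture_with_warmup(warmup_count)  # handles the first frame
    for _ in range(num_frames - 1):
        graph.replay()  # cache-hit path: one replay per remaining frame
    return graph.replays


total_default = run_benchmark(num_frames=538, warmup_count=0)
total_warmed = run_benchmark(num_frames=538, warmup_count=64)
print(total_default, total_warmed)
```

In this stub the warmup adds `64` replays at capture time and nothing per frame, which is why the cost shows up as first-frame latency rather than steady-state throughput.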
+ +### Rejected: Higher RFDETR TensorRT Graph Replay Warmup Counts + +- Hypothesis: The accepted `64` replay warmup may still under-warm the T4 for the measured interval; increasing capture-time graph replays to `96` or `128` could raise the steady benchmark closer to the best observed warmup run. +- Change tested: Temporary code only; changed `RFDetrForInstanceSegmentationTRT.forward(...)` from `cuda_graph_replay_warmup_count=64` to `128`, then to `96`. Pipeline depth remained fixed at `2`. +- Correctness: Prediction math and graph topology are unchanged; only extra pre-measurement replays of the already captured TensorRT graph are added. +- Result on requested command: `128` measured `frames=538 elapsed=2.20s fps=244.73` and repeat `frames=538 elapsed=2.20s fps=244.12`; `96` measured `frames=538 elapsed=2.20s fps=244.64`. These are within the accepted `64`-warmup band and do not justify the extra startup latency. +- Learning: `64` replay warmups are enough to reach the steady graph-bound clock regime. Keep the accepted `64` setting. + +### Profile: Accepted Warmed RFDETR Graph Gap + +- Request: Capture Nsight Systems evidence for the accepted `64` replay-warmup checkpoint while keeping pipeline depth fixed at `2`. +- Profile: `/tmp/rfdetr_depth2_warm64_20260523_145345.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_warm64_20260523_145345.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_warm64_20260523_145345_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_warm64_20260523_145345_stats_cuda_gpu_mem_time_sum.csv`, `/tmp/rfdetr_depth2_warm64_20260523_145345_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.24s fps=240.45`. +- Graph spacing: The capture includes `602` CUDA graph traces: `64` capture warmup replays plus `538` frame replays. 
After skipping the `64` warmups and the next `100` frame launches, CUDA graph duration was p50 `4069.806 us`, p90 `4137.469 us`, p95 `4142.621 us`, p99 `4149.969 us`, mean `4051.976 us`; graph end-to-next-start gap was p50 `40.191 us`, p90 `41.644 us`, p95 `41.958 us`, p99 `42.399 us`, mean `40.311 us`. +- Gap decomposition: Busy work inside the gap was p50 `34.912 us`, mean `35.067 us`; idle inside the gap was p50 `5.184 us`, p90 `5.952 us`, p95 `6.080 us`, p99 `6.228 us`, mean `5.244 us`. The largest gap occupants remain input D2D copy (`1168128B`, `13.117 us` avg overlap), mask D2D clone (`2433600B`, `13.095 us` avg overlap), sigmoid (`7.019 us` avg overlap), fill-long (`2.787 us` avg overlap), logits D2D clone (`36400B`, `2.098 us` avg overlap), boxes D2D clone (`1600B`, `1.988 us` avg overlap), fill-int (`1.936 us` avg overlap), and selector (`1.888 us` avg overlap). +- Learning: The warmup checkpoint does not materially change the steady graph body or graph-to-graph gap; it moves the measured run into the same graph-bound steady-clock regime that external max-clock testing exposed. Further code wins still need TensorRT graph-duration reduction or a correctness-compatible engine/tactic change. + +### RFDETR TensorRT Clone Result After Capture Warmup + +- Hypothesis: The accepted capture path cloned the first returned TensorRT outputs before the `64` extra graph warmup replays. Moving that clone after the warmup replays should return equivalent outputs while materializing the first result after the graph has ramped clocks. +- Change: In `_capture_cuda_graph(...)`, kept the initial post-capture replay, ran the configured capture warmup replays, then cloned `output_buffers` for the returned first result. +- Correctness: Compared the warmed CUDA graph path against standard non-graph TensorRT execution on all `538` frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. 
+- Result on requested command: depth `2` measured `frames=538 elapsed=2.19s fps=245.17` and repeat `frames=538 elapsed=2.20s fps=244.67`, above the prior accepted `64` warmup default band. +- Learning: Keeping all first-result materialization after graph warmup gives a small but useful improvement without changing steady-state replay semantics. The path is still graph-bound and clock-regime sensitive. + +### Rejected: Remove Pre-Replay Capture Clone + +- Hypothesis: After moving the first returned result clone after the capture warmup replays, the earlier post-capture clone is overwritten and might be removable as startup-only work. +- Change tested: Temporary code only; removed the first `results = [buf.clone() ...]` and synchronize immediately after CUDA graph capture, keeping only the post-warmup replay result clone. Pipeline depth remained fixed at `2`. +- Correctness: The returned results still come from a replay of the same captured graph and input; prediction math is unchanged. +- Result on requested command: `frames=538 elapsed=2.20s fps=244.59`, below the accepted clone-after-warmup best and not enough to justify removing the extra startup work. +- Learning: The overwritten clone likely contributes a small amount of useful warmup before the measured interval. Keep the accepted capture sequence. + +### Rejected: Skip Pre-Capture TensorRT Warmup Enqueue + +- Hypothesis: Since RFDETR now replays the captured CUDA graph `64` times before measured frames, the separate `execute_async_v3(...)` warmup before CUDA graph capture might be redundant startup work. +- Change tested: Temporary code only; removed the pre-capture TensorRT warmup enqueue and its stream synchronization from `_capture_cuda_graph(...)`. Pipeline depth remained fixed at `2`. +- Correctness: The graph captured and replayed successfully; prediction math should be unchanged because steady execution still uses the same captured graph. 
+- Result on requested command: `frames=538 elapsed=2.20s fps=244.72`, below the accepted clone-after-warmup best and not enough to justify changing the capture sequence. +- Learning: The pre-capture enqueue is either useful TensorRT setup or useful GPU warmup for the measured interval. Keep the accepted capture sequence. + +### Rejected: Steady-State Copy Pattern Capture Warmup + +- Hypothesis: The capture-time warmup replays only the TensorRT graph body, while measured cache-hit frames do input D2D copy, graph replay, and output D2D copies. Replaying that whole copy/replay/copy pattern during capture warmup could warm the same copy engines and allocator paths before measured frames. +- Change tested: Temporary code only; after graph capture, allocated scratch output buffers once, then repeated `input_buffer.copy_(pre_processed_images)`, `cuda_graph.replay()`, and scratch `copy_` from each TensorRT output during the `64` warmup iterations before returning the post-warmup result clone. Pipeline depth remained fixed at `2`. +- Result on requested command: `frames=538 elapsed=2.19s fps=245.50`, then `frames=538 elapsed=2.20s fps=244.60`, then `frames=538 elapsed=2.21s fps=243.84`. +- Learning: Warming the full steady-state copy pattern is not a stable improvement and can regress the warmed graph-bound path. Keep the accepted warmup that replays only the captured TensorRT graph before cloning the first returned result. + +### Profile: Accepted Depth-2 Warmed Graph Refresh + +- Request: Collect a fresh Nsight Systems report on the accepted warmed implementation while keeping pipeline depth fixed at `2`. 
+- Profile: `/tmp/rfdetr_depth2_accepted_20260523_150944.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_20260523_150944.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_20260523_150944_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_20260523_150944_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_accepted_20260523_150944_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.36s fps=227.82`. +- Graph spacing: The capture includes `602` CUDA graph traces. After skipping the `64` warmups plus the next `100` frame launches, CUDA graph duration was p50 `4129.582 us`, p90 `4187.479 us`, p95 `4194.288 us`, p99 `4249.728 us`, mean `4134.538 us`; graph end-to-next-start gap was p50 `42.463 us`, p90 `43.839 us`, p95 `44.256 us`, p99 `44.872 us`, mean `42.602 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `37.791 us`, mean `37.704 us`; idle inside the gap was p50 `4.544 us`, mean `4.514 us`. The largest gap occupants were input D2D copy (`1168128B`, `13.143 us` avg overlap), mask D2D clone (`2433600B`, `13.131 us`), sigmoid (`6.945 us`), boxes D2D clone (`1600B`, `4.528 us` in this capture), fill-long (`2.723 us`), logits D2D clone (`36400B`, `2.105 us`), fill-int (`2.060 us`), and selector (`1.767 us`). +- Learning: This refresh shows the same desired shape: only about `4-5 us` median idle between graph replays, with the rest of the graph-to-graph interval being required GPU copy/postprocess work. Remaining FPS is still dominated by TensorRT graph duration. + +### Rejected: Reusable TensorRT Output Copy Buffers + +- Hypothesis: The accepted TensorRT CUDA graph cache-hit path allocates fresh cloned output tensors every frame. Reusing per-thread output copy buffers and filling them with `copy_` from graph-owned outputs could keep cloned-output lifetime isolation while reducing allocator/clone overhead, unlike the previously rejected borrowed-output path. 
+- Change tested: Temporary env-gated code only; with `RFDETR_TRT_REUSE_OUTPUT_COPY_BUFFERS=True`, each thread reused `empty_like` buffers for the TensorRT outputs and copied graph-owned outputs into those buffers on the graph stream. Pipeline depth remained fixed at `2`. +- Result on requested command with the gate enabled: `frames=538 elapsed=2.19s fps=246.00`, then `frames=538 elapsed=2.21s fps=243.87`, then `frames=538 elapsed=2.20s fps=244.26`. +- Learning: The first run beat the accepted band, but repeats did not. Reusable output copy buffers are too noise-sensitive in the current graph-bound regime and are not stable enough to checkpoint. Keep the accepted `buf.clone()` output path. + +### External Runtime Probe: Memory Plus Graphics Clock Lock + +- Hypothesis: The previous external max-clock probe locked graphics clocks only. Since the accepted path is TensorRT graph-bound and includes input/output copy traffic, also locking the T4 memory clock to the supported `5001 MHz` state might improve the graph-bound ceiling. +- Change tested: External runtime setting only; attempted `nvidia-smi -lmc 5001,5001`, then locked graphics clocks with `nvidia-smi -lgc 1590,1590`, ran the benchmark with pipeline depth fixed at `2`, and reset clock locks afterward. +- Result: The T4 runtime reported locked memory clocks are not supported. The run executed at `P0`, `1590 MHz` graphics clock, and `5000 MHz` memory clock, measuring `frames=538 elapsed=2.21s fps=243.26`, matching the earlier graphics-clock-only result. After reset and idle, the GPU returned to `P8`, `300 MHz` graphics, `405 MHz` memory. +- Learning: There is no separate memory-clock lock knob available in this environment, and graphics-clock lock remains only an external deployment/runtime tuning option. It does not change the accepted library code path. + +### TensorRT Accepted Engine Inspection + +- Request: Inspect the current accepted TensorRT plan directly before attempting more graph-body changes. 
+- Evidence: Engine inspector dump saved to `/tmp/rfdetr_accepted_engine_inspector.json` for the accepted shared blob `/tmp/cache/shared-blobs/bc173a2cfda9a10af2bc411885e9fec3`. +- Result: The accepted engine is `187,947,996` bytes, has `4` I/O tensors, `261` layers, `4` auxiliary streams, `18,289,152` bytes device memory, `ProfilingVerbosity.LAYER_NAMES_ONLY`, and tactic source mask `8`. Tensors are `input` float32 `(1,3,312,312)`, `dets` float32 `(1,100,4)`, `labels` float32 `(1,100,91)`, and mask output `4186` float32 `(1,100,78,78)`. Coarse layer-name counts from the inspector are `95` matmul/GEMM-like layers, `78` fused/cast layers, `14` conv layers, and `1` resize layer. +- Learning: The engine metadata matches the Nsight kernel evidence: remaining runtime is dominated by TensorRT GEMM/MHA-style graph body work. The plan is only layer-name verbose, so tactic-level detail is not available from the serialized engine inspector in this environment. + +### External Runtime Probe: Persistence Mode + +- Hypothesis: Enabling GPU persistence mode might reduce clock/context ramp effects for the warmed graph-bound benchmark. +- Change tested: External runtime setting only. Persistence mode was already enabled before the probe, so the benchmark effectively re-ran the accepted path under the existing setting. Pipeline depth remained fixed at `2`, and persistence mode was restored to the original enabled state afterward. +- Result on requested command: `frames=538 elapsed=2.19s fps=245.29`, within the accepted warmed band. +- Learning: Persistence mode was already active and is not a new optimization knob. The accepted code path remains responsible for the current measured throughput. + +### External Runtime Probe: Application Clocks + +- Hypothesis: `nvidia-smi -ac 5001,1590` might enforce the high memory and graphics application-clock targets more cleanly than graphics-only lock or unsupported memory lock commands. 
+- Change tested: External runtime setting only; set application clocks to `(MEM 5001, SM 1590)`, ran the accepted benchmark with pipeline depth fixed at `2`, then reset application clocks with `nvidia-smi -rac`. +- Result on requested command: `frames=538 elapsed=2.18s fps=246.36`, matching the current accepted warmed band rather than improving it. After reset and idle, the GPU returned to persistence enabled, `P8`, `300 MHz` graphics, `405 MHz` memory, with default application clocks `(MEM 5001, SM 585)`. +- Learning: Application clocks do not improve beyond the accepted warmed path in this environment. Keep clock tuning as an external diagnostic only, not a code or default runtime requirement. + +### Rejected: Captured Deferred Postprocess Graph + +- Hypothesis: The graph-to-graph gap still contains several small GPU launches after TensorRT replay: output copies, sigmoid, selector, query-index cast/fill, and limited mask resize. Reusing stable TensorRT output copy buffers and capturing the deferred sigmoid/selector/mask-resize postprocess sequence into a CUDA graph could reduce launch overhead while preserving cloned-output lifetime isolation. +- Change tested: Temporary env-gated code only; with `RFDETR_CAPTURE_DEFERRED_POSTPROCESS_GRAPH=True`, TensorRT cache-hit replay copied graph outputs into per-thread reusable buffers, and RFDETR deferred postprocess attempted to capture the sigmoid, fused selector, and fused mask resize operations keyed by those stable input pointers. The first attempt failed before any frames were processed because concurrent depth-2 worker graph captures tripped PyTorch's CUDA caching allocator assertion; a second attempt serialized first-time captures with a global lock. +- Result on requested command after serializing capture: `frames=538 elapsed=2.29s fps=235.36`, far below the accepted warmed band.
+- Learning: Capturing postprocess launch work adds too much startup/scheduling complexity and requires static output-buffer choreography that perturbs the current well-balanced depth-2 pipeline. Keep the accepted simple postprocess launch sequence and cloned TensorRT output tensors. + +### Nsight Compute TensorRT Top-Kernel Snapshot + +- Request: Use Nsight Compute on the dominant TensorRT graph kernels now that Nsight Systems shows the run is graph-body bound. +- Profile: `/tmp/rfdetr_ncu_trt_top_20260523_153512.ncu-rep`, exported text `/tmp/rfdetr_ncu_trt_top_20260523_153512_details.txt`, and raw CSV `/tmp/rfdetr_ncu_trt_top_20260523_153512_raw.csv`. Capture used `--set basic`, a regex for the top XMMA/GEMM/MHA TensorRT kernels, `--launch-skip 200`, and `--launch-count 6`. +- Result under NCU instrumentation: `frames=538 elapsed=5.49s fps=97.98`. The sampled top kernels are small-grid TensorRT kernels: representative `sm75_xmma_gemm_f16f16_f16f32...128x128x32` launches had grid size `72`, `240` registers/thread, `25%` theoretical occupancy, about `21.7%` achieved occupancy, `48%` SM throughput, and `15%` DRAM throughput; `_gemm_mha_v2...` had grid size `66`, `245` registers/thread, `25%` theoretical occupancy, `20.6%` achieved occupancy, `36%` SM throughput, and `6.5%` DRAM throughput. +- Learning: The remaining TensorRT graph body is dominated by many small, register/shared-memory-limited GEMM/MHA kernels. This points toward engine/tactic/export changes as the only likely large lever; custom postprocess and stream scheduling changes are now below the dominant cost. + +### Rejected: CUDA Module Loading Mode + +- Hypothesis: Changing `CUDA_MODULE_LOADING` before process startup might alter TensorRT module initialization/capture behavior and improve the warmed graph-bound run without changing outputs. 
+- Change tested: External process environment only; ran the accepted benchmark with `CUDA_MODULE_LOADING=EAGER`, then with `CUDA_MODULE_LOADING=LAZY`, keeping pipeline depth fixed at `2`. +- Result on requested command: `EAGER` measured `frames=538 elapsed=2.20s fps=245.08`; `LAZY` measured `frames=538 elapsed=2.20s fps=244.42`, both below the best accepted warmed band. +- Learning: CUDA module-loading mode is not a useful runtime knob for this already-warmed graph path. Keep the default environment. + +### Profile: Depth-2 Node-Level Graph Refresh + +- Request: Collect a fresh node-level Nsight Systems profile for the current accepted implementation while keeping pipeline depth fixed at `2`. +- Profile: `/tmp/rfdetr_depth2_nodegraph_20260523_155019.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_nodegraph_20260523_155019.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_nodegraph_20260523_155019_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_nodegraph_20260523_155019_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_nodegraph_20260523_155019_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.33s fps=231.16`. +- Graph spacing: The capture includes `602` CUDA graph launches and `239` TensorRT graph kernels per launch. After skipping the `64` capture warmups plus the next `100` frame launches, CUDA graph duration was p50 `4125.917 us`, p90 `4201.725 us`, p95 `4238.780 us`, p99 `4252.124 us`, mean `4133.067 us`; graph end-to-next-start gap was p50 `48.847 us`, p90 `50.464 us`, p95 `50.559 us`, p99 `51.647 us`, mean `48.804 us`. +- Gap decomposition: Largest gap occupants were graph input D2D copy (`1168128B`, `13.135 us` avg overlap), mask D2D clone (`2433600B`, `13.134 us`), selector kernels (`9.240 us` and `8.546 us` on alternating postprocess streams), boxes D2D clone (`1600B`, `4.513 us`), and small PyTorch vectorized kernels. 
The node-level trace has higher overhead than the lighter accepted profiles, but the shape is unchanged: TensorRT graph duration is the bottleneck and the inter-graph tail is short. +- Learning: The depth-2 pipeline has the requested low, consistent CUDA graph spacing. Further code wins still need a correct TensorRT engine/tactic change rather than more small postprocess scheduling tweaks. + +### Rejected: Version-Compatible TensorRT Rebuilds From Public ONNX + +- Hypothesis: The earlier local TensorRT opt0/1/2 rebuilds may have failed correctness because they were not built like the accepted plan. The accepted T4 FP16 plan requires `engine_host_code_allowed=True`, which is consistent with a `VERSION_COMPATIBLE` build that embeds TensorRT lean-runtime host code. Rebuilding the public ONNX with `VERSION_COMPATIBLE` might match the accepted package while allowing tactic-level tuning. +- Change tested: Built `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc/engine.plan` from `/tmp/rfdetr_onnx_pkg_5362b72bfb9f01d2e0b8cba2048d932c/weights.onnx` with TensorRT `10.12.0.36`, static input shape `1x3x312x312`, FP16 enabled, workspace `4 GiB`, `builder_optimization_level=3`, and `BuilderFlag.VERSION_COMPATIBLE`. Build completed in `117.28s` and produced a `187,854,900` byte plan, close to the accepted plan size. +- Correctness result: Compared the rebuilt FP16 version-compatible plan against the accepted T4 FP16 plan over all `538` benchmark frames with standard non-graph TensorRT forward and dense postprocess. It failed the required gate: `bad_counts=9`, `bad_classes=8`, `bad_masks=336`, `bad_boxes_gt5=18`, `max_box_delta=251.0`, `max_conf_delta=0.10457384586334229`. +- Follow-up: Built `/tmp/rfdetr_trt_rebuild_t4_fp32_opt3_vc/engine.plan` with the same settings but without FP16. Build completed in `24.18s` and produced a `248,398,812` byte plan. 
It also failed correctness against the accepted plan: `bad_counts=7`, `bad_classes=8`, `bad_masks=329`, `bad_boxes_gt5=19`, `max_box_delta=251.0`, `max_conf_delta=0.10576367378234863`. +- Metadata check: Roboflow package metadata lists six public packages for `rfdetr-seg-nano` / `coco-dataset-vdnr1/41`: L4 FP32 TRT, L4 FP16 TRT, T4 FP32 TRT, T4 FP16 TRT, ONNX FP32, and Torch FP32. The only T4 FP16 package is the accepted `c70f32369a54d61e06ef4e6b56c82524`; the public ONNX package does not rebuild into behavior-equivalent T4 TRT plans in this runtime. +- Learning: Local TensorRT rebuilds from the available ONNX are not safe optimization candidates because both FP16 and FP32 version-compatible builds change predictions substantially. Keep the accepted official T4 FP16 engine; engine-level gains would require an ONNX/export source that is known to match the accepted TRT plan or a new official model package, not ad hoc local rebuilds from this ONNX artifact. + +### Rejected: PyTorch cudaMallocAsync Allocator Backend + +- Hypothesis: The accepted depth-2 path still performs per-frame PyTorch CUDA allocations around TensorRT output clones, sigmoid, selector outputs, and resized mask tensors. Switching PyTorch to the CUDA async allocator with `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` might reduce allocator synchronization or fragmentation without changing model math. +- Change tested: External process environment only; launched the requested benchmark with `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync`, `PYTHONPATH=/app/inference_models`, and pipeline depth fixed at `2`. +- Correctness: This does not change computation or tensor values, only allocator backend behavior. +- Result on requested command: `frames=538 elapsed=2.20s fps=245.00`, then `frames=538 elapsed=2.21s fps=243.47`, then `frames=538 elapsed=2.20s fps=244.54`. +- Learning: The async allocator can land inside the accepted warmed band, but it is not a stable improvement over the default allocator. 
Keep the default PyTorch CUDA allocator and do not require an external allocator environment variable for the benchmark. + +### Rejected: ONNX Backend Alternative + +- Hypothesis: The workflow script can force `--backend onnx`, and the environment config lists `TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider`. ONNX Runtime might build or dispatch a faster GPU path than the native `inference-models` TensorRT package. +- Change tested: Ran `PYTHONPATH=/app/inference_models python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4 --pipeline_depth 2 --backend onnx`. +- Result: The run failed before processing frames (`frames=0`) with ONNX Runtime I/O binding error: `There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]`. +- Learning: The current ONNX backend path is not a viable drop-in for this CUDA-tensor workflow fast path. Keep optimizing the accepted native TensorRT backend; do not switch benchmark backend. + +### Rejected: Torch Package As Equivalent Export Source + +- Hypothesis: The public Torch FP32 package `8b8da2fe824240522a39f3cde41aafae` might be a behavior-equivalent source for exporting ONNX and rebuilding TensorRT plans, unlike the public ONNX package that failed correctness. +- Change tested: Downloaded and loaded the Torch package explicitly with `AutoModel.from_pretrained(..., backend="torch", model_package_id="8b8da2fe824240522a39f3cde41aafae")`, then compared its postprocessed predictions against the accepted T4 FP16 TensorRT package `c70f32369a54d61e06ef4e6b56c82524` on benchmark video frames. Tested plain Torch, `model.export()` FP32, and `model.export()` FP16 paths. Pipeline benchmarks remain fixed at depth `2`; this was a correctness/provenance probe only. 
+- Result: Plain Torch and exported FP32 matched each other but failed the accepted correctness gate over the first `80` frames: `bad_counts=1`, `bad_classes=0`, `bad_masks=72`, `bad_boxes_gt5=3`, `max_box_delta=184.7200927734375`, `max_conf_delta=0.10576367378234863`. Exported FP16 was worse: `bad_counts=4`, `bad_classes=0`, `bad_masks=69`, `bad_boxes_gt5=3`, `max_box_delta=151.251220703125`, `max_conf_delta=0.2307741641998291`. +- Learning: The Torch package is closer to the accepted TRT engine than the public ONNX rebuilds, but it is still not behavior-equivalent. Do not use this Torch package as a local export/rebuild source for optimization; a correct engine-level tuning path still requires an official package or export source known to match the accepted T4 FP16 plan. + +### Rejected: TensorRT Persistent Cache Limit + +- Hypothesis: TensorRT `IExecutionContext.persistent_cache_limit` might enable activation/persistent L2 caching for the captured graph body and reduce the remaining graph-bound latency without changing model outputs or package artifacts. +- Change tested: Temporary env-gated code only; set `graph_context.persistent_cache_limit` before CUDA graph warmup/capture, then ran the requested benchmark with pipeline depth fixed at `2` for limits `1 MiB`, `4 MiB`, and `8 MiB`. +- Result on requested command: The T4 runtime rejected every nonzero value with `IExecutionContext::setPersistentCacheLimit: Error Code 3 ... size ... is larger than cudaDeviceProp.persistingL2CacheMaxSize(0 bytes)`. Runs measured `242.80`, `242.53`, and `243.44` FPS, below the accepted warmed band. +- Learning: This GPU exposes no persisting L2 cache budget to TensorRT, so `persistent_cache_limit` is not an available optimization knob here. Removed the temporary hook. + +### Rejected: Detection-Limited Deferred Mask Allocation + +- Hypothesis: The deferred mask resize path only launches work for the first `7` detections, but still allocates a `100 x H x W` bool tensor. 
Allocating only `7 x H x W` rows when `deferred_mask_resize_detection_limit=7` could reduce allocator pressure and memory footprint between TensorRT graph launches while preserving the recovery path for frames with more detections. +- Change tested: Temporary code only; moved detection-limit normalization before output allocation in `fused_resize_selected_masks(...)` and allocated the output tensor with `detection_limit` rows. Pipeline depth remained fixed at `2`. +- Result on requested command: `frames=538 elapsed=2.21s fps=243.80`, then `frames=538 elapsed=2.20s fps=244.45`, below the accepted warmed band. +- Learning: The allocation size is not the bottleneck; the gap is dominated by TensorRT output copies and graph-body latency. Keep the fixed `100` row allocation that preserves the established conversion-buffer behavior. + +### Profile: Depth-2 Accepted Refresh After Provenance Probes + +- Request: Capture a fresh Nsight Systems report for the current accepted implementation after the engine-provenance and postprocess allocation probes. The command used `PYTHONPATH=/app/inference_models` and pipeline depth stayed fixed at `2`. +- Profile: `/tmp/rfdetr_depth2_accepted_refresh_20260523_161926.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_accepted_refresh_20260523_161926.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_accepted_refresh_20260523_161926_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_accepted_refresh_20260523_161926_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_accepted_refresh_20260523_161926_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.29s fps=235.10`. +- Graph spacing: The capture includes `602` CUDA graph traces. 
After skipping the `64` warmups plus the next `100` frame launches, CUDA graph duration was p50 `4097.520 us`, p90 `4111.583 us`, p95 `4115.935 us`, p99 `4124.383 us`, mean `4070.625 us`; graph end-to-next-start gap was p50 `40.384 us`, p90 `41.535 us`, p95 `41.727 us`, p99 `42.336 us`, mean `41.097 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.136 us`, mean `35.873 us`; idle inside the gap was p50 `5.119 us`, mean `5.224 us`. The largest gap occupants were the mask output D2D clone (`2433600B`, `13.149 us` avg overlap), next-frame input D2D copy (`1168128B`, `13.082 us`), sigmoid (`6.782 us`), selector (`2.167 us`), fill-int (`2.100 us`), logits D2D clone (`36400B`, `2.098 us`), and boxes D2D clone (`1600B`, `1.987 us`). +- Learning: The depth-2 pipeline is already very close to graph-body-bound: median idle between TensorRT graph replays is about `5 us`, while the graph body is about `4.1 ms`. Remaining large improvements still require a correct TensorRT graph-body/engine change or safe double-buffered output ownership, not depth changes or more CPU scheduling work. + +### Rejected: Direct Fused Postprocess On Graph-Owned TensorRT Outputs + +- Hypothesis: The largest graph-to-graph gap occupant is the full TensorRT mask-output D2D clone. Running the existing fused RFDETR selector and mask-resize directly on graph-owned TensorRT outputs before cloning the raw outputs could avoid copying the full `100 x 78 x 78` mask tensor and move closer to pure TensorRT graph-body bottlenecking. +- Change tested: Temporary env-gated code only; added an `output_processor` callback to the TensorRT CUDA-graph replay path and an RFDETR `forward_post_process(...)` path used by the workflow when `RFDETR_TRT_FUSED_GRAPH_POSTPROCESS=True`. Pipeline depth remained fixed at `2`. 
+- Correctness: A single-process full-video comparison against the accepted cloned-output path passed over all `538` frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`, with `max_count=7`. +- Result: The unsafe first workflow run appeared fast (`frames=538 elapsed=2.05s fps=263.03`) but repeated depth-2 workflow runs hit CUDA illegal-memory-access failures during CPU conversion, exposing a missing stream/lifetime handoff. Materializing selected masks without deferred count was stable but slow (`220.33 FPS`). Adding the required default-stream wait made the deferred path stable but not faster: `244.20`, `243.62`, and `243.88` FPS. +- Learning: The apparent `263 FPS` result was an invalid asynchronous/lifetime artifact. Once graph-owned outputs are handed off safely to CPU conversion, the path returns to the accepted throughput band. Removed the temporary callback/direct-postprocess code; a future safe win would need real double-buffered graph output ownership or conversion under a redesigned pipeline contract, not borrowing the single graph state's raw outputs. + +### Rejected: CUDA Device Max Connections + +- Hypothesis: The accepted TensorRT plan uses `4` auxiliary streams. Setting `CUDA_DEVICE_MAX_CONNECTIONS` before process startup might change CUDA work-queue mapping enough to improve TensorRT aux-stream scheduling or graph replay spacing. +- Change tested: External process environment only; ran the requested benchmark with `CUDA_DEVICE_MAX_CONNECTIONS` set to `1`, `2`, `8`, and `32`. Pipeline depth remained fixed at `2`. +- Correctness: This setting does not change model math or tensor values. +- Result on requested command: `1` measured `243.06` FPS, `2` measured `244.31` FPS, `8` measured `243.64` FPS, and `32` measured `243.07` FPS. +- Learning: CUDA connection count does not improve the accepted warmed graph path. Keep the default CUDA scheduling environment. 
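
An environment-variable sweep like the `CUDA_DEVICE_MAX_CONNECTIONS` probe above can be scripted so every value runs the same command and yields a comparable result. The sketch below is an assumption-laden illustration rather than the tooling used for this log: `parse_result`, `env_with_connections`, and `sweep` are hypothetical helpers, and it assumes the benchmark prints a summary line in the `frames=... elapsed=...s fps=...` form quoted throughout this log.

```python
import os
import re
import subprocess

# Matches summary lines such as "frames=538 elapsed=2.21s fps=243.26".
RESULT_RE = re.compile(r"frames=(\d+)\s+elapsed=([\d.]+)s\s+fps=([\d.]+)")


def parse_result(text):
    """Return (frames, elapsed_s, fps) from benchmark output, or None."""
    match = RESULT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def env_with_connections(count, base=None):
    """Copy an environment and set CUDA_DEVICE_MAX_CONNECTIONS in the copy.

    The variable has to be in place before the benchmark process starts,
    so it is passed to the child process instead of being set mid-run.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(count)
    return env


def sweep(command, values=(1, 2, 8, 32)):
    """Run `command` once per value and collect the parsed summary lines."""
    results = {}
    for count in values:
        proc = subprocess.run(
            command,
            env=env_with_connections(count),
            capture_output=True,
            text=True,
            check=False,
        )
        results[count] = parse_result(proc.stdout + proc.stderr)
    return results


print(parse_result("frames=538 elapsed=2.21s fps=243.26"))
```

Because single runs in this regime differ by about 1 FPS, a sweep like this is better at ruling a knob out, as happened here, than at confirming a small gain; repeats per value would be needed for the latter.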
+ +### Rejected: Two-State Direct Postprocess Graph Output Pool + +- Hypothesis: The earlier direct-postprocess-on-graph-output experiment failed because a single shared graph output buffer could be overwritten while another depth-2 worker still needed it. Capturing a shared two-state CUDA graph pool before the first output could let one state feed fused postprocess while the other state replays the next frame, avoiding the full raw mask-output clone without unsafe borrowing. +- Change tested: Temporary env-gated code only; extended the TensorRT CUDA graph cache to hold a two-state pool when `RFDETR_TRT_THREAD_LOCAL_DIRECT_POSTPROCESS=True`, ran fused RFDETR postprocess on a separate postprocess stream, and made each producer graph stream wait for its postprocess stream before that state could be reused. Pipeline depth remained fixed at `2`. +- Correctness: After fixing the producer-stream handoff, the full-video comparison against the accepted cloned-output path passed over all `538` frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`, with `max_count=7`. +- Result on requested command: An intermediate broken stream-wait version produced an invalid `285.71 FPS` but failed correctness badly (`bad_counts=76`, `bad_classes=77`, `bad_masks=537`, `bad_boxes_gt5=175`). The corrected version was stable and correct but not faster: `243.79`, `242.70`, and `242.31` FPS. +- Learning: Safely borrowing TensorRT graph outputs requires enough stream ordering that the raw-output clone removal no longer improves the depth-2 workflow. Removed the temporary graph-pool/direct-postprocess code. + +### Rejected: Exact-Count D2H Detection Copy + +- Hypothesis: The RFDETR fast conversion path always copies `7` mask rows from CUDA to pinned CPU memory, even when a frame has fewer valid detections. Copying the count first and then copying only `valid_count` rows could reduce D2H bytes for common 4-5 detection frames. 
+- Change tested: Temporary env-gated code only; with `RFDETR_EXACT_D2H_DETECTION_COPY=True`, copied and synchronized the count before copying xyxy/confidence/class/mask rows, then copied only the valid rows. Pipeline depth remained fixed at `2`. +- Correctness: This changes only how many already-resized selected rows are copied to CPU, not model math. +- Result on requested command: `frames=538 elapsed=2.80s fps=192.39`, then `frames=538 elapsed=2.82s fps=190.70`. +- Learning: The extra count synchronization is far more expensive than the saved mask-copy bytes. Keep the accepted single synchronization that copies the fixed 7-row pinned buffers. + +### Rejected: Concurrent TensorRT CUDA Graph Replay Pool + +- Hypothesis: The accepted TRT forward path serializes graph replay with a Python model lock. Since NCU showed many small register/shared-memory-limited TensorRT kernels, allowing two depth-2 workers to submit independent CUDA graph states on separate streams might overlap low-occupancy kernels and improve hardware utilization. +- Change tested: Temporary env-gated code only; with `RFDETR_TRT_CONCURRENT_GRAPHS=True`, first tried per-thread graph caches without the model lock, which failed because concurrent CUDA graph capture is not permitted. Then serialized capture and tested a shared two-cache graph pool with thread-local caller streams and lock-free round-robin replay. Pipeline depth remained fixed at `2`. +- Result on requested command: Concurrent capture failed before frames with `operation not permitted when stream is capturing`. The serialized-capture per-thread version completed but measured `220.23` FPS. The shared two-cache replay pool completed but measured `219.47` FPS. +- Learning: Concurrent graph submission does not help this TensorRT plan on T4. The required extra streams/contexts/capture and postprocess handoff cost more than any possible overlap among small TensorRT kernels. Keep the accepted single serialized graph replay path. 
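The rejected results in the two entries above are easier to compare when converted from FPS to average time per frame and set against the TensorRT graph body. The sketch below does that arithmetic; it uses the p50 graph duration (`4097.520 us`) from the depth-2 profile earlier in this log, which was measured under the profiler in a different run, so the differences are rough estimates rather than measurements. `frame_time_ms` is a hypothetical helper name.

```python
def frame_time_ms(fps: float) -> float:
    """Average wall-clock time per frame implied by an end-to-end FPS figure."""
    return 1000.0 / fps


# p50 CUDA graph duration from the depth-2 profile earlier in this log.
GRAPH_BODY_MS = 4097.520 / 1000.0

# FPS values copied from the two rejected experiments above.
measured_fps = {
    "exact-count D2H copy": 192.39,
    "serialized-capture per-thread graphs": 220.23,
    "shared two-cache replay pool": 219.47,
}

for name, fps in measured_fps.items():
    per_frame = frame_time_ms(fps)
    extra = per_frame - GRAPH_BODY_MS
    print(f"{name}: {per_frame:.3f} ms/frame, {extra:.3f} ms above the graph body")
```

Read this way, the exact-count copy spends roughly `1.1 ms` per frame beyond the graph body and the concurrent variants roughly `0.44` to `0.46 ms`, which is far more than the tens of microseconds of gap the accepted path shows in the profiles.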
+ +### Rejected: Fused Workflow SV Detection Conversion + +- Hypothesis: After GPU postprocess fusion, the RFDETR workflow fast path still performs several Python passes to convert inference-model detections to `sv.Detections`, attach prediction type metadata, apply the no-op class filter, and attach parent metadata. Fusing these metadata writes for the no-class-filter benchmark path could reduce CPU-side packaging enough to improve the depth-2 producer/consumer balance. +- Change tested: Temporary env-gated code only; with `RFDETR_FUSED_SV_CONVERSION=True`, the RFDETR workflow fast path copied the existing pinned detection tensors and built final `sv.Detections` metadata in one helper, preserving prediction type, image dimensions, inference ID, detection IDs, root-parent metadata, and parent metadata. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: Baseline with the accepted path measured `frames=538 elapsed=2.21s fps=243.78`; the fused conversion branch measured `frames=538 elapsed=2.21s fps=242.91`. +- Learning: CPU-side detection packaging is no longer a useful lever for this benchmark. The accepted depth-2 path is constrained by the CUDA graph body plus required GPU copy/postprocess tail, so keep the generic metadata helpers and focus further work on TensorRT graph-body or safe GPU-output ownership changes. + +### Rejected: Explicit TensorRT Optimization Profile Async + +- Hypothesis: Even though the accepted RFDETR TensorRT plan has one optimization profile, explicitly calling `IExecutionContext.set_optimization_profile_async(0, graph_stream)` before setting the static input shape might alter context setup or graph-capture scheduling enough to reduce TensorRT graph replay latency. 
+- Change tested: Temporary env-gated code only; with `RFDETR_TRT_SET_PROFILE_ASYNC=true`, `_capture_cuda_graph(...)` created the graph stream before shape binding, selected profile `0` asynchronously on that stream, synchronized, then proceeded with the accepted warmup and CUDA graph capture. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Evidence: The runtime inspection showed `num_optimization_profiles=1` and `active_optimization_profile=0` before the change, so this was expected to be low probability. +- Result on requested command: Accepted baseline in the same session measured `frames=538 elapsed=2.22s fps=242.41`; the explicit-profile gate measured `frames=538 elapsed=2.22s fps=242.52` and repeat `frames=538 elapsed=2.21s fps=243.16`. +- Learning: Explicit profile selection is a no-op for this already-active single-profile engine and does not move the depth-2 ceiling. Keep the simpler accepted graph-capture path. + +### Rejected: Int32 RFDETR Class Mapping Tensor + +- Hypothesis: `prepare_class_remapping(...)` stores the RFDETR class mapping table as `int64`, but the fused Triton selector stores output class IDs as `int32`. Building the mapping tensor as `int32` could reduce selector load width/codegen without changing class IDs. +- Change tested: Temporary code only; changed `ClassesReMapping.class_mapping` from `torch.int64` to `torch.int32` while keeping `remaining_class_ids` as `int64`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: The first pass measured `frames=538 elapsed=2.31s fps=232.91`, and a warmed repeat measured `frames=538 elapsed=2.21s fps=243.11`, below the clean accepted sanity run in the same session at `frames=538 elapsed=2.21s fps=243.49`. +- Learning: Class-map load width is not a limiter for the fused selector. The original `int64` mapping is safer for the generic PyTorch fallback paths and remains at least as fast end to end. 
+ +### Rejected: GPU Exclusive Process Compute Mode + +- Hypothesis: The accepted depth-2 run is TensorRT graph-body bound and sensitive to GPU scheduling. Switching the T4 from default compute mode to `EXCLUSIVE_PROCESS` before launching the benchmark might reduce context scheduling overhead or background interference. +- Change tested: External runtime setting only; confirmed and killed a stale benchmark process that was holding a CUDA context, set `nvidia-smi -c EXCLUSIVE_PROCESS`, ran the requested benchmark with pipeline depth fixed at `2`, then restored compute mode with `nvidia-smi -c DEFAULT`. +- Result on requested command: `frames=538 elapsed=2.22s fps=242.77`, below the accepted warmed band. +- Learning: Compute-mode isolation does not improve this single-process workload. Keep the default compute mode; the remaining limiter is still the TensorRT CUDA graph body plus the small required post-graph tail. + +### Rejected: CPU Affinity And Nice Priority + +- Hypothesis: The depth-2 pipeline still depends on a CPU producer thread keeping the TensorRT graph fed. Pinning the process to one hardware thread per physical core or raising process priority could reduce CPU scheduling jitter enough to tighten the graph-to-graph launch cadence. +- Change tested: External runtime settings only; after confirming no other CUDA process was active, ran the requested benchmark with `taskset -c 0-3` on the 4-core/8-thread VM, then with `nice -n -20`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: Clean default baseline after stale-process cleanup measured `frames=538 elapsed=2.21s fps=243.99`. `taskset -c 0-3` measured `244.44` then `243.88` FPS. `nice -n -20` measured `243.46` FPS. +- Learning: CPU scheduler tuning is not a stable improvement. The small first affinity result is within normal warmed-path noise, and priority does not help. Keep the default CPU scheduling for the benchmark. 
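Several entries, including the affinity result above, hinge on whether a small FPS difference is real or within run-to-run noise. One way to make that call explicit is sketched below. This is an illustrative rule, not the criterion used in this log: `relative_delta_pct` and `is_stable_improvement` are hypothetical helpers, and the `0.5%` threshold is an assumed noise margin.

```python
def relative_delta_pct(candidate_fps: float, baseline_fps: float) -> float:
    """Percent change of a candidate run relative to the baseline run."""
    return 100.0 * (candidate_fps - baseline_fps) / baseline_fps


def is_stable_improvement(candidate_runs, baseline_fps, min_gain_pct=0.5):
    """True only if every repeat beats the baseline by at least min_gain_pct.

    min_gain_pct is an assumed noise threshold, not a value from this log.
    """
    return all(
        relative_delta_pct(fps, baseline_fps) >= min_gain_pct
        for fps in candidate_runs
    )


BASELINE_FPS = 243.99  # clean default baseline from the entry above
taskset_runs = [244.44, 243.88]
nice_runs = [243.46]

for label, runs in (("taskset -c 0-3", taskset_runs), ("nice -n -20", nice_runs)):
    deltas = [round(relative_delta_pct(fps, BASELINE_FPS), 3) for fps in runs]
    print(label, deltas, is_stable_improvement(runs, BASELINE_FPS))
```

With these numbers the best affinity run is about `+0.18%` over the baseline and its repeat is slightly negative, so neither setting passes the check under the assumed threshold.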
+ +### External Runtime Refresh: Application Clocks After Cleanup + +- Hypothesis: After removing a stale benchmark process that had been holding a CUDA context, the previous max application-clock diagnostic might expose a higher ceiling for the accepted warmed path. +- Change tested: External runtime setting only; set application clocks to `(MEM 5001, SM 1590)`, ran the requested benchmark with pipeline depth fixed at `2`, then reset application clocks with `nvidia-smi -rac`. +- Result on requested command: `frames=538 elapsed=2.20s fps=244.24`, compared with the same-session default-clock clean baseline of `frames=538 elapsed=2.21s fps=243.99`. +- Learning: Application clocks still do not move throughput beyond the accepted warmed band. The current code-level CUDA graph warmup already reaches the practical graph-bound clock regime for the measured interval. + +### Rejected: Python Optimize Mode + +- Hypothesis: Launching the benchmark with `PYTHONOPTIMIZE=1` could remove Python assert/debug overhead in the workflow and model stack without changing model math or TensorRT execution. +- Change tested: External interpreter setting only; ran the requested benchmark with `PYTHONOPTIMIZE=1` and pipeline depth fixed at `2`. +- Result on requested command: `frames=538 elapsed=2.20s fps=244.26`, then `frames=538 elapsed=2.21s fps=243.36`, compared with the same-session default baseline of `frames=538 elapsed=2.21s fps=243.99`. +- Learning: The Python optimize-mode results are within normal warmed-path noise, so it is not a stable throughput lever. Keep the default interpreter mode. + +### Profile: Clean Depth-2 Graph Gap After Runtime Cleanup + +- Request: Refresh Nsight Systems evidence for the current accepted implementation after removing stale CUDA processes and rejecting the recent runtime scheduling probes. Pipeline depth remained fixed at `2`.
+- Profile: `/tmp/rfdetr_depth2_clean_after_cleanup_20260523_173101.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_clean_after_cleanup_20260523_173101.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_clean_after_cleanup_20260523_173101_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_clean_after_cleanup_20260523_173101_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_clean_after_cleanup_20260523_173101_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.28s fps=235.45`. +- Graph spacing: The capture includes `602` CUDA graph traces. After skipping the `64` capture warmups plus the next `100` frame launches, CUDA graph duration was p50 `4068.479 us`, p90 `4134.321 us`, p95 `4139.256 us`, p99 `4145.568 us`, mean `4072.660 us`; graph end-to-next-start gap was p50 `40.448 us`, p90 `41.805 us`, p95 `42.207 us`, p99 `42.856 us`, mean `40.615 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.136 us`, mean `35.363 us`; idle inside the gap was p50 `5.152 us`, mean `5.237 us`. +- Learning: The current accepted depth-2 path is still effectively TensorRT CUDA-graph-body bound. The post-graph interval is low and consistent, and only about `5 us` of it is idle; further wins need a correctness-compatible TensorRT graph-duration reduction or a way to remove required input/output copy work without adding dependencies. + +### Rejected: Explicit PyTorch CUDA Graph Pool Handle + +- Hypothesis: Passing an explicit `torch.cuda.graph_pool_handle()` to the TensorRT CUDA graph capture might reduce graph memory-pool setup overhead or produce a slightly better replay object. +- Change tested: Temporary env-gated code only; with `RFDETR_TRT_GRAPH_POOL_HANDLE=true`, `_capture_cuda_graph(...)` passed a fresh graph-pool handle to `torch.cuda.graph(...)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. 
+- Result on requested command: `frames=538 elapsed=2.21s fps=243.58`, below the recent clean default baseline of `frames=538 elapsed=2.21s fps=243.99`. +- Learning: The captured TensorRT graph body does not benefit from a PyTorch graph-pool handle. The graph contains TensorRT work, not PyTorch allocations that would use the pool. Keep the accepted `torch.cuda.graph(cuda_graph, stream=stream)` capture. + +### Profile: User-Requested Depth-2 Graph-Bound Refresh + +- Request: Capture a fresh Nsight Systems report for the current accepted implementation while keeping the workflow pipeline depth fixed at `2`. +- Sanity run before profiling: the requested command measured `frames=538 elapsed=2.20s fps=244.55`. +- Profile: `/tmp/rfdetr_depth2_user_refresh_20260523_173919.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_user_refresh_20260523_173919.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_user_refresh_20260523_173919_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_user_refresh_20260523_173919_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_user_refresh_20260523_173919_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.30s fps=234.31`. +- Graph spacing: The capture includes `602` CUDA graph traces. After skipping the `64` capture warmups plus the next `100` frame launches, CUDA graph duration was p50 `4073.664 us`, p90 `4137.858 us`, p95 `4141.692 us`, p99 `4148.619 us`, mean `4080.681 us`; graph end-to-next-start gap was p50 `40.543 us`, p90 `41.920 us`, p95 `42.303 us`, p99 `42.866 us`, mean `40.656 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.151 us`, mean `35.275 us`; idle inside the gap was p50 `5.296 us`, mean `5.425 us`. 
The largest gap occupants were the next-frame input D2D copy (`1168128B`, `13.149 us` avg overlap), TensorRT mask D2D clone (`2433600B`, `13.093 us`), sigmoid (`6.942 us`), fill-long (`2.832 us`), logits D2D clone (`36400B`, `2.114 us`), selector (`2.042 us`), boxes D2D clone (`1600B`, `1.997 us`), and fill-int (`1.906 us`). +- Learning: The requested depth-2 run is already tightly graph-bound. Each TensorRT CUDA graph ends about `40-43 us` before the next graph starts, and only about `5 us` of that gap is idle. The remaining per-frame time is dominated by the TensorRT graph body plus required input/output copies; depth `3` was not tested. + +### Rejected: Masked BBox Loads In Fused Selector + +- Hypothesis: `_select_topk_boxes_kernel` still loads bbox coordinates and computes box geometry for top entries that may be discarded as background/no-object or below threshold. Masking the bbox `tl.load(...)` operations with the existing `keep` predicate could reduce selector memory work without changing outputs. +- Change tested: Temporary code only; changed the four bbox coordinate loads inside `_select_topk_boxes_kernel` to `tl.load(..., mask=keep, other=0.0)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: first run after the Triton codegen change measured `frames=538 elapsed=2.30s fps=234.07`; warmed repeat measured `frames=538 elapsed=2.21s fps=243.99`, below the same-session clean run of `frames=538 elapsed=2.20s fps=244.55`. +- Learning: Predicating these scalar bbox loads does not improve the depth-2 graph-bound path. Any saved load work is below noise or offset by changed Triton codegen. Reverted to the simpler unconditional bbox loads. + +### Rejected: NVIDIA TF32 Override Runtime Knob + +- Hypothesis: The remaining TensorRT CUDA graph body includes FP32/Tensor-Core kernels. Setting `NVIDIA_TF32_OVERRIDE` before process startup might alter TF32 dispatch and improve the graph-bound ceiling without code changes.
+- Change tested: External process environment only; ran the requested benchmark with `NVIDIA_TF32_OVERRIDE=0`, then with `NVIDIA_TF32_OVERRIDE=1`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result: Same-session default baseline measured `frames=538 elapsed=2.19s fps=245.20`. Both override runs failed before processing frames: TensorRT/Myelin reported `Inconsistent setting of NVIDIA_TF32_OVERRIDE env var at build -1 and at execution 0` for `0`, and the same build/execution mismatch plus `NVIDIA_TF32_OVERRIDE set to unrecognized value: "1"` for `1`. +- Learning: The accepted serialized T4 FP16 engine must run with the build-time/default TF32 override state. This is not a viable runtime tuning knob for the packaged engine. + +### Rejected: CUDA Cache Disable Runtime Knob + +- Hypothesis: The accepted TensorRT graph body may depend on CUDA/Myelin module-cache behavior. Launching with `CUDA_CACHE_DISABLE=1` could expose whether the CUDA code cache is adding runtime overhead or changing warmed graph scheduling. +- Change tested: External process environment only; ran the requested benchmark with `CUDA_CACHE_DISABLE=1`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: `frames=538 elapsed=2.20s fps=244.53`, compared with the same-session default baseline of `frames=538 elapsed=2.19s fps=245.20`. +- Learning: Disabling the CUDA cache is not useful for the accepted warmed graph path. The serialized engine already reaches the same graph-bound steady-state behavior with the default cache policy. + +### Rejected: RFDETR Normalization Lookup Table + +- Hypothesis: RFDETR CPU preprocessing still performs per-pixel uint8-to-float normalization with one multiply and one add per channel. Precomputing a `3 x 256` float32 normalization lookup table and filling each output channel with `np.take(..., out=...)` could reduce CPU producer work while preserving exact model inputs. 
+- Change tested: Temporary code only; changed `_pil_image_to_normalized_tensor(...)` to fetch normalized float32 values from a thread-local LUT instead of using `np.multiply(...)` plus in-place bias. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: The LUT values matched the existing float32 multiply/add formula exactly for all `256` possible uint8 inputs across all three channels (`array_equal=True`, `max_diff=0.0`). +- Result on requested command: first LUT run measured `frames=538 elapsed=2.20s fps=244.65`; repeat measured `frames=538 elapsed=2.21s fps=242.98`, below the recent same-session default baseline of `frames=538 elapsed=2.19s fps=245.20`. +- Learning: The LUT gather path is not faster end to end. The existing vectorized multiply/add normalization is already efficient and better for the current depth-2 producer/GPU balance. Reverted to the accepted ufunc normalization path. + +### TensorRT Accepted Engine Layer-Time Snapshot + +- Request: Map the remaining TensorRT graph-body bottleneck from kernel names back to TensorRT layer names to see whether a targeted plugin or replacement layer is plausible. +- Profile: Ran the accepted engine directly with TensorRT `IProfiler` on `50` non-graph executions after warmup. Raw JSON summary is `/tmp/rfdetr_trt_layer_profile_20260523_accepted.json`. +- Result: The profile reported `261` layers. Summed reported layer time was `5.925 ms` per execute under profiler instrumentation. Coarse layer groups by summed average time were matmul/FC `2.857 ms` across `72` layers, attention/MHA `1.185 ms` across `37` layers, fused elementwise/shape layers `1.100 ms` across `85` layers, convolutions `0.555 ms` across `14` layers, other layers `0.210 ms`, and resize `0.019 ms`. 
+- Top layers: the largest individual layers were twelve repeated Myelin MHA layers (`_gemm_mha_v2_myl2_*`) at about `0.066-0.069 ms` each, followed by many backbone encoder MLP `fc2` MatMul layers at about `0.054-0.059 ms` each. The segmentation-head convolutions and resize were smaller. +- Learning: The remaining graph body is distributed across many small Myelin-generated transformer MHA/MLP matmul layers, not a single bad layer. A custom patch would need a broad correct transformer/tactic/export change; there is no obvious one-layer plugin target left in the accepted serialized engine. + +### Rejected: cuBLAS Workspace Runtime Config + +- Hypothesis: The accepted TensorRT plan is Myelin-heavy but still contains many GEMM/MHA-style kernels. Setting `CUBLAS_WORKSPACE_CONFIG` before process startup might alter cuBLAS/cuBLASLt workspace behavior for any library-backed tactics and improve graph-bound throughput. +- Change tested: External process environment only; ran the requested benchmark with `CUBLAS_WORKSPACE_CONFIG=:4096:8`, then with `CUBLAS_WORKSPACE_CONFIG=:16:8`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: `:4096:8` measured `frames=538 elapsed=2.20s fps=244.86`; `:16:8` measured `frames=538 elapsed=2.21s fps=243.95`, both below the recent same-session default baseline of `frames=538 elapsed=2.19s fps=245.20`. +- Learning: cuBLAS workspace configuration is not a useful runtime knob for this serialized TensorRT/Myelin plan. Keep the default library workspace behavior. + +### Rejected: PyTorch Native Allocator Split Size + +- Hypothesis: The accepted depth-2 path still allocates fixed-size PyTorch CUDA tensors around TensorRT output clones and fused postprocess. Setting `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:...` could change native caching-allocator block splitting enough to reduce allocator overhead or fragmentation while preserving tensor values. 
+- Change tested: External process environment only; first tried `max_split_size_mb:16`, then tried valid `max_split_size_mb:64`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result: `max_split_size_mb:16` failed before frames because PyTorch rejected it (`CachingAllocator option max_split_size_mb too small, must be > 20`). `max_split_size_mb:64` ran successfully but measured `frames=538 elapsed=2.20s fps=244.50`, below the recent default warmed baseline of `frames=538 elapsed=2.19s fps=245.20`. +- Learning: Native allocator split-size tuning does not improve this graph-bound run. The default allocator policy remains the best tested option; the earlier `cudaMallocAsync` allocator probe was also not stable enough to keep. + +### Profile: Depth-2 Graph-Bound Refresh + +- Request: Capture a fresh Nsight Systems report for the current accepted implementation while keeping the workflow pipeline depth fixed at `2`. Depth `3` was not tested. +- Sanity run before profiling: the requested command measured `frames=538 elapsed=2.20s fps=244.65`. +- Profile: `/tmp/rfdetr_depth2_graphbound_refresh_20260523_180532.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphbound_refresh_20260523_180532.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphbound_refresh_20260523_180532_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphbound_refresh_20260523_180532_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphbound_refresh_20260523_180532_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.29s fps=234.56`. +- Graph spacing: The capture includes `602` CUDA graph traces: `64` capture warmup replays plus `538` frame replays. 
After skipping the `64` warmups plus the next `100` frame launches, CUDA graph duration was p50 `4071.070 us`, p90 `4135.303 us`, p95 `4138.343 us`, p99 `4144.114 us`, mean `4081.231 us`; graph end-to-next-start gap was p50 `40.479 us`, p90 `41.932 us`, p95 `42.214 us`, p99 `43.034 us`, mean `40.548 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.135 us`, mean `35.202 us`; idle inside the gap was p50 `5.152 us`, mean `5.242 us`. The largest gap occupants were the TensorRT mask D2D clone (`2433600B`, `13.210 us` avg overlap), next-frame input D2D copy (`1168128B`, `13.098 us`), sigmoid (`7.066 us`), fill-long (`2.801 us`), logits D2D clone (`36400B`, `2.106 us`), boxes D2D clone (`1600B`, `2.009 us`), fill-int (`1.820 us`), and selector (`1.684 us`). +- Learning: The current depth-2 run is shaped as requested: CUDA graph end is consistently close to the next CUDA graph start, with only about `5 us` median idle between graph replays. The remaining throughput ceiling is still the TensorRT CUDA graph body plus the required input/output copy and fused postprocess tail; previously tested safe ownership schemes that try to remove those copies return to the accepted band once synchronization is correct. + +### Rejected: Redundant Selector Mask Cleanup + +- Hypothesis: `_select_topk_boxes_kernel` loads logits with `mask=valid_offsets` and `other=-inf`, then immediately applies the same validity mask with `tl.where(...)`. Removing the redundant `tl.where(...)` could simplify Triton selector codegen without changing scores. +- Change tested: Temporary code only; removed the second `scores = tl.where(valid_offsets, scores, -inf)` from `fused_postprocess.py`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: The masked `tl.load(..., other=-inf)` already produces the same value for invalid offsets, so this cleanup is semantically equivalent for the selector. 
+- Result on requested command: first compile-cold run measured `frames=538 elapsed=2.29s fps=234.72`; warmed repeats measured `frames=538 elapsed=2.21s fps=243.89` and `frames=538 elapsed=2.21s fps=243.60`, below the immediately prior clean sanity run of `frames=538 elapsed=2.20s fps=244.65`. +- Learning: This codegen simplification does not improve the graph-bound depth-2 run. The selector is too small relative to TensorRT replay and required copy traffic, and small Triton schedule changes can land below the accepted warmed band. Reverted to the accepted selector implementation. + +### Rejected: X-Axis Specialized Mask Resize + +- Hypothesis: The benchmark resizes RFDETR masks from `78` columns to `312` columns, so the x-axis scale is exactly `4x` even though the original frame height is `176`. The prior fully-4x resize specialization did not match this video because the y-axis is not 4x. Specializing only x-coordinate interpolation could reduce arithmetic in `_resize_selected_masks_kernel(...)`. +- Change tested: Temporary code only; added `_resize_selected_masks_x4_kernel(...)` that replaces x-axis floor/divide work with a `out_x % 4` mapping while keeping the generic y-axis bilinear math. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: CUDA smoke check compared generic and x-specialized kernels on random `(100, 78, 78)` masks resized to `(176, 312)` for seven selected detections: `equal=True`, `diff=0`. +- Result on requested command: warmed depth-2 runs measured `frames=538 elapsed=2.20s fps=244.48`, then `frames=538 elapsed=2.21s fps=243.91`, then `frames=538 elapsed=2.21s fps=243.37`, below the latest clean accepted sanity run of `frames=538 elapsed=2.20s fps=244.65`. +- Learning: The x-axis arithmetic is not the limiter for the current graph-bound run. The extra Triton variant and changed codegen do not produce a stable end-to-end improvement, so the generic resize kernel remains the accepted path. 
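The `out_x % 4` idea in the x-axis specialization above can be checked on the CPU without Triton. The sketch below assumes half-pixel-center bilinear sampling (the convention used by `torch.nn.functional.interpolate(..., align_corners=False)`); the log does not show which convention `_resize_selected_masks_kernel(...)` uses, so treat this as an illustration of the technique rather than a copy of the kernel. `generic_x`, `x4_x`, and `X4_TABLE` are hypothetical names.

```python
import math

IN_W, OUT_W = 78, 312  # mask columns before and after resize in this benchmark

# For out_x = 4*k + r the source coordinate is k + r/4 - 0.375, so the left
# neighbour offset and interpolation weight depend only on r = out_x % 4.
X4_TABLE = {0: (-1, 0.625), 1: (-1, 0.875), 2: (0, 0.125), 3: (0, 0.375)}


def generic_x(out_x: int):
    """Generic path: scale, clamp at the left edge, floor, take the fraction."""
    src = max((out_x + 0.5) * (IN_W / OUT_W) - 0.5, 0.0)
    x0 = math.floor(src)
    x1 = min(x0 + 1, IN_W - 1)
    return x0, x1, src - x0


def x4_x(out_x: int):
    """Specialized path: integer divide/modulo plus a four-entry table."""
    k, r = divmod(out_x, 4)
    dx, weight = X4_TABLE[r]
    x0 = k + dx
    if x0 < 0:  # left-edge clamp, matching max(src, 0.0) in the generic path
        x0, weight = 0, 0.0
    x1 = min(x0 + 1, IN_W - 1)
    return x0, x1, weight


# Columns where the two paths disagree on neighbours or weight.
mismatches = [x for x in range(OUT_W) if generic_x(x) != x4_x(x)]
print("mismatching columns:", mismatches)
```

Under the stated sampling assumption the two paths agree on every output column, which is consistent with the `equal=True` smoke check above. The same table trick does not apply to the y-axis here because `176 / 78` is not an integer scale.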
+ +### Rejected: Skip Single-Image Batch Contiguous Call + +- Hypothesis: In RFDETR preprocessing, `tensors[0].unsqueeze(0)` is already contiguous for the accepted single-image CHW tensor. Removing the following `.contiguous()` could avoid a tiny PyTorch dispatch in the CPU producer path without changing the TensorRT input. +- Change tested: Temporary code only; changed the single-image branch in `pre_process_network_input(...)` from `tensors[0].unsqueeze(0).contiguous()` to `tensors[0].unsqueeze(0)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Local tensor checks confirmed both normal and pinned CHW tensors remain contiguous after `unsqueeze(0)` with expected NCHW strides; no model math or preprocessing pixels change. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.37` and `frames=538 elapsed=2.20s fps=244.05`, below the latest clean accepted sanity run of `frames=538 elapsed=2.20s fps=244.65`. +- Learning: The extra `.contiguous()` is effectively a no-op in this path and is not the current producer limiter. Removing it did not improve throughput, so the accepted explicit contiguous call remains. + +### Rejected: Yield Before RFDETR Postprocess + +- Hypothesis: In the RFDETR TRT workflow fast path, the same worker thread calls `forward(...)` and immediately launches deferred GPU postprocess. With pipeline depth fixed at `2`, yielding once after `forward(...)` might let the other worker acquire the model lock and enqueue the next TensorRT CUDA graph before selector/resize work, reducing the already-small graph-to-graph tail. +- Change tested: Temporary code only; inserted `time.sleep(0)` between `model._model.forward(...)` and `model._model.post_process(...)` in the RFDETR TRT workflow fast path. Pipeline depth remained fixed at `2`; depth `3` was not tested. 
+- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.43` and `frames=538 elapsed=2.21s fps=243.90`, within noise but below the accepted warmed ceiling. +- Learning: Python scheduler yielding does not improve the depth-2 balance. The current path already hands off quickly enough, and explicit yielding adds variance without reducing the TensorRT graph-body bottleneck. Reverted to the accepted immediate postprocess launch. + +### Rejected: Triton Selector Max Return Indices + +- Hypothesis: `_select_topk_boxes_kernel` computes the top offset with an equality mask plus `tl.min(...)` after `tl.max(scores, axis=0)`. Triton's `tl.max(..., return_indices=True, return_indices_tie_break_left=True)` can return the max value and lane index together, potentially reducing selector reduction work. +- Change tested: Temporary code only; replaced the selector loop's `tl.max` plus equality-mask tie-break with `tl.max(..., return_indices=True, return_indices_tie_break_left=True)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: the first compile-cold run measured `frames=538 elapsed=2.29s fps=235.06`; warmed repeat measured `frames=538 elapsed=2.21s fps=243.95`, below the accepted warmed ceiling. +- Learning: Triton's max-with-index codegen is not better for this single-block selector on T4. The existing explicit equality-mask/tie-break sequence remains the faster full-pipeline path. + +### Rejected: Single-Image Deferred Fused Postprocess Shortcut + +- Hypothesis: The benchmark always calls RFDETR dense postprocess with batch size `1` and `defer_fused_postprocess_count=True`. Trying the fused postprocess directly on `logits[0]` before the generic batch sigmoid/loop could remove small Python and tensor-wrapper overhead while preserving the same fused selector and mask resize behavior. 
+- Change tested: Temporary code only; added an early single-image branch in `post_process_instance_segmentation_results(...)` that called `_try_fused_instance_segmentation_post_process(...)` with `torch.sigmoid(logits[0])` and returned `[fused_result]` when supported. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.43` and `frames=538 elapsed=2.21s fps=243.63`, below the accepted warmed ceiling. +- Learning: The generic batch wrapper and loop are not limiting the current depth-2 run. The added branch changes bytecode/scheduling enough to add variance without reducing the TensorRT graph-body bottleneck. Reverted to the generic postprocess flow. + +### Rejected: Combined Small TensorRT Output Copy + +- Hypothesis: The accepted TensorRT CUDA graph path clones the small boxes and logits outputs separately before cloning the large mask tensor. Allocating one small flat device tensor, copying boxes and logits into views, and returning those views could reduce small-output allocation overhead without borrowing graph-owned outputs or changing the large mask clone. +- Change tested: Temporary code only; in the CUDA graph cache-hit path, when three same-dtype outputs were present, allocated one flat tensor for the first two outputs, copied the graph-owned first and second output buffers into shaped views, and cloned the third output normally. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.97` and `frames=538 elapsed=2.20s fps=244.21`, not a stable improvement over the accepted warmed ceiling. +- Learning: The two small output clones are below the limiter, and replacing them with manual view copies changes allocation/scheduling enough to add variance. Keep the simpler per-output clone path. 
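A back-of-envelope check of why merging the two small clones cannot matter much. This is a minimal sketch, assuming fp32 outputs shaped `1x100x4` (boxes) and `1x100x91` (logits). Those shapes are an assumption, chosen because they reproduce the `1600B` and `36400B` clone sizes reported in the Nsight Systems entries of this log; `2433600B` is the mask clone size reported there. `OutputSpec` and `pack_offsets` are hypothetical names and not part of the code base.

```python
from dataclasses import dataclass
from math import prod

FP32_BYTES = 4  # assumption: the small TensorRT outputs are fp32


@dataclass(frozen=True)
class OutputSpec:
    """Hypothetical description of one small TensorRT output."""

    name: str
    shape: tuple

    @property
    def nbytes(self) -> int:
        return prod(self.shape) * FP32_BYTES


def pack_offsets(specs):
    """Return ({name: (byte_offset, nbytes)}, total_bytes) for outputs packed back to back."""
    layout, offset = {}, 0
    for spec in specs:
        layout[spec.name] = (offset, spec.nbytes)
        offset += spec.nbytes
    return layout, offset


small_outputs = [
    OutputSpec("boxes", (1, 100, 4)),
    OutputSpec("logits", (1, 100, 91)),
]
layout, flat_bytes = pack_offsets(small_outputs)

MASK_CLONE_BYTES = 2433600  # mask D2D clone size reported in the profiles
small_share = flat_bytes / MASK_CLONE_BYTES

print(layout)
print(flat_bytes, f"{small_share:.2%}")
```

Under these assumptions the two small outputs total 38000 bytes, under 2% of the mask clone, which is consistent with the observation that they sit below the limiter.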
+ +### Profile: Depth-2 Graph Spacing Refresh + +- Request: Capture another Nsight Systems report for the current accepted implementation while keeping workflow pipeline depth fixed at `2`. Depth `3` was not tested. +- Sanity run before profiling: the requested command measured `frames=538 elapsed=2.20s fps=244.60`. +- Profile: `/tmp/rfdetr_depth2_graphspacing_20260523_183938.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphspacing_20260523_183938.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphspacing_20260523_183938_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphspacing_20260523_183938_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphspacing_20260523_183938_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.33s fps=230.93`. +- Graph spacing: The capture includes `602` CUDA graph traces: `64` capture warmup replays plus `538` frame replays. After skipping the `64` warmups plus the next `100` frame launches, CUDA graph duration was p50 `4073.855 us`, p90 `4134.462 us`, p95 `4136.539 us`, p99 `4142.646 us`, mean `4075.027 us`; graph end-to-next-start gap was p50 `40.575 us`, p90 `41.996 us`, p95 `42.303 us`, p99 `42.888 us`, mean `40.840 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.200 us`, mean `35.656 us`; idle inside the gap was p50 `5.328 us`, mean `5.500 us`. The largest gap occupants were next-frame input D2D copy (`1168128B`, `13.120 us` avg overlap), TensorRT mask D2D clone (`2433600B`, `13.110 us`), sigmoid (`6.726 us`), fill-long (`2.815 us`), selector (`2.180 us`), fill-int (`2.173 us`), logits D2D clone (`36400B`, `2.105 us`), and boxes D2D clone (`1600B`, `1.998 us`). +- Learning: The current depth-2 run is still shaped as requested. The median graph-to-graph tail is roughly `1%` of the TensorRT graph body, and only about `5 us` of that tail is idle after required copies and fused postprocess work. 
The run is effectively bottlenecked by the TensorRT CUDA graph forward pass plus the narrow postprocess/copy tail; further large gains likely require changing the TensorRT engine/export/tactics rather than Python-side pipeline depth. + +### Rejected: Suppress Benchmark Progress Prints + +- Hypothesis: The benchmark sink flushes progress text every `50` frames. Suppressing intermediate progress prints while preserving the final FPS line could remove occasional result-path I/O stalls and keep the depth-2 pipeline fed more consistently. +- Change tested: Temporary benchmark-harness code only; set `PROGRESS_EVERY = 0` in `development/stream_interface/rfdetr_nano_seg_trt_workflow.py` and guarded the progress-print branch. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: This does not affect preprocessing, TensorRT execution, postprocess, prediction materialization, or final benchmark output; it only changes intermediate console logging. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.80` and `frames=538 elapsed=2.21s fps=243.19`, not a stable improvement over the accepted warmed band. +- Learning: Intermediate progress printing is not a meaningful limiter for the current graph-bound benchmark. Keep the progress output at every `50` frames for observability. + +### Current Package Metadata Recheck + +- Request: Re-check official Roboflow package metadata through the local `inference_models` provider path before spending more time on engine-body tuning. +- Result: The provider resolves `rfdetr-seg-nano` to `coco-dataset-vdnr1/41` and still returns six public packages: L4 TRT FP32 `3e3ddd85586b43e4fac6d319fb2927fd`, ONNX FP32 `5362b72bfb9f01d2e0b8cba2048d932c`, L4 TRT FP16 `89d1f41e2af4f4f3ffcdfb77e774d26a`, Torch FP32 `8b8da2fe824240522a39f3cde41aafae`, T4 TRT FP32 `bbc2cc23adf6f5e71a9241956081da96`, and T4 TRT FP16 `c70f32369a54d61e06ef4e6b56c82524`. 
+- Learning: There is no new official T4-compatible TensorRT package available through the current metadata. The accepted T4 FP16 package remains the only official package that has passed the benchmark correctness gate; further graph-body improvements still require a behavior-equivalent export source or a new official T4 package. + +### Rejected: Static Batch TensorRT Wrapper Shortcut + +- Hypothesis: RFDETR TRT uses a static batch size of `1`, and the benchmark always sends exactly one frame. Short-circuiting `_infer_from_trt_engine_with_batch_size_boundaries(...)` directly into `_execute_trt_engine(...)` for the exact static-batch case could remove small Python branch/remainder overhead before graph replay. +- Change tested: Temporary code only; added an early return when `min_batch_size == max_batch_size == pre_processed_images.shape[0]`, preserving the same tensor and the same `_execute_trt_engine(...)` call. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: This does not change model inputs, TensorRT execution, postprocess, or predictions; it only bypasses generic padding bookkeeping when no padding is needed. +- Result on requested command: same-session clean baseline measured `frames=538 elapsed=2.21s fps=243.90`; shortcut runs measured `frames=538 elapsed=2.21s fps=243.91` and `frames=538 elapsed=2.21s fps=243.00`. +- Learning: Generic static-batch bookkeeping is below the current limiter. The accepted helper structure remains clearer and at least as fast in the graph-bound depth-2 run. + +### Profile: Main-Thread CPU Refresh + +- Request: Refresh CPU-side evidence for the current accepted implementation while keeping workflow pipeline depth fixed at `2`. Depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_current_20260523_1856.prof`, captured with Python `cProfile` around the requested benchmark command. +- Result under profiling: `frames=538 elapsed=2.20s fps=244.61`, inside the accepted warmed band.
+- Findings: Standard `cProfile` mostly captured startup/import and the main result-dispatch queue, not the worker-thread CUDA hot path. The main-thread sink itself was negligible (`538` calls, about `0.005 s` cumulative), and the top cumulative runtime after import was queue waiting in `_dispatch_inference_results`. +- Learning: The remaining limiter is not main-thread result sink work. This profile is consistent with the Nsight Systems evidence that the run is constrained by the TensorRT CUDA graph body plus the short GPU copy/postprocess tail; worker-thread CPU hot-path work is already mostly hidden by the depth-2 pipeline. + +### Profile: All-Thread Yappi CPU/Wall Refresh + +- Request: Refresh all-thread CPU-side evidence for the accepted path after the main-thread-only `cProfile` run. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Profiles: `/tmp/rfdetr_depth2_yappi_wall_20260523_1902.pstat`, `/tmp/rfdetr_depth2_yappi_wall_20260523_1902.callgrind`, `/tmp/rfdetr_depth2_yappi_cpu_20260523_1906.pstat`, and `/tmp/rfdetr_depth2_yappi_cpu_20260523_1906.callgrind`. +- Result under profiler: Yappi wall-clock profiling measured `frames=538 elapsed=2.22s fps=242.75`. Yappi CPU-clock profiling has much higher overhead for this workload and measured `frames=538 elapsed=3.10s fps=173.75`; use it only for relative CPU attribution. +- Findings: Worker wall time remains dominated by CUDA waits: RFDETR fast path wall time was about `8.92 s` across both workers, while true CPU-time attribution in the same area was much smaller. The largest remaining true CPU self-time was preprocessing normalization (`_pil_image_to_normalized_tensor(...)` around `0.300 s` under the CPU profiler), followed by fixed pinned conversion bookkeeping (`_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` around `0.071 s` self-time). Fused selector/postprocess wall time is mostly GPU wait, not Python.
+- Learning: The all-thread CPU evidence agrees with Nsight Systems: the depth-2 pipeline hides worker CPU well enough that the remaining FPS ceiling is TensorRT graph replay and required GPU copy/postprocess work. Any CPU micro-optimization must be very low-risk and measured end-to-end, because true Python self-time is now a small fraction of frame time. + +### Rejected: Broadcast RFDETR Normalization + +- Hypothesis: The current RFDETR PIL preprocessing writes normalized CHW channels with three separate NumPy multiply/add loops. A single broadcasted NumPy multiply over a channel-reordered CHW source could reduce preprocessing CPU self-time, which the Yappi CPU profile identified as the largest remaining local CPU function. +- Change tested: Temporary code only; replaced the per-channel loop in `_pil_image_to_normalized_tensor(...)` with `np.moveaxis(image_array[:, :, channel_order], 2, 0)`, one broadcasted `np.multiply(..., out=normalized)`, and one broadcasted bias add. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: A local micro-check on resized uint8 inputs showed exact normalized tensor equality versus the accepted per-channel loop (`max diff 0.0`) for the benchmark channel-swap case. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.21s fps=243.84` and `frames=538 elapsed=2.20s fps=244.57`, inside noise but not a stable improvement over accepted warmed runs. +- Learning: The broadcast form is only marginally faster in isolation and adds a per-frame channel-reorder temporary. In the full depth-2 pipeline, it does not improve throughput. Keep the accepted direct per-channel writes into the pinned CHW buffer. + +### Profile: Depth-2 Graphbound Refresh + +- Request: Capture another Nsight Systems report for user analysis while keeping workflow pipeline depth fixed at `2`. Depth `3` was not tested. 
+- Profile: `/tmp/rfdetr_depth2_graphbound_20260523_190956.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphbound_20260523_190956.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphbound_20260523_190956_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphbound_20260523_190956_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphbound_20260523_190956_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.32s fps=231.58`. +- Graph spacing: The capture includes `602` CUDA graph traces: `64` capture warmup replays plus `538` frame replays. After skipping the `64` warmups plus the next `100` frame launches, CUDA graph duration was p50 `4068.463 us`, p90 `4132.048 us`, p95 `4135.371 us`, p99 `4140.723 us`, mean `4069.635 us`; graph end-to-next-start gap was p50 `40.544 us`, p90 `41.996 us`, p95 `42.374 us`, p99 `43.007 us`, mean `40.722 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.200 us`, mean `35.294 us`; idle inside the gap was p50 `5.328 us`, mean `5.381 us`. The largest gap occupants were next-frame input D2D copy (`1168128B`, `13.133 us` avg overlap), TensorRT mask D2D clone (`2433600B`, `13.133 us`), sigmoid (`6.933 us`), fill-long (`2.846 us`), logits D2D clone (`36400B`, `2.110 us`), boxes D2D clone (`1600B`, `1.995 us`), fill-int (`1.947 us`), and selector (`1.804 us`). +- Learning: The refreshed depth-2 profile matches the accepted graphbound shape. The graph-to-graph gap remains about `1%` of the TensorRT graph body, with roughly `5 us` median idle after required ownership copies and fused postprocess work. The practical limiter is still the TensorRT CUDA graph forward pass plus the narrow GPU copy/postprocess tail, not pipeline depth. + +### Rejected: CUDA Device Max Connections Runtime Knob + +- Hypothesis: The accepted depth-2 path uses separate CUDA streams for preprocessing, TensorRT graph replay, postprocess, and D2H conversion. 
Changing `CUDA_DEVICE_MAX_CONNECTIONS` before process startup might alter stream work-queue scheduling enough to tighten the graph-to-graph cadence. +- Change tested: External process environment only; ran the requested benchmark with `CUDA_DEVICE_MAX_CONNECTIONS=1`, then with `CUDA_DEVICE_MAX_CONNECTIONS=32`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: same-session default baseline measured `frames=538 elapsed=2.20s fps=244.42`; `CUDA_DEVICE_MAX_CONNECTIONS=1` measured `frames=538 elapsed=2.36s fps=227.71`; `CUDA_DEVICE_MAX_CONNECTIONS=32` measured `frames=538 elapsed=2.21s fps=243.16`. +- Learning: Reducing the number of device work queues is actively harmful for this overlapped depth-2 schedule, and increasing it does not improve the accepted graph-bound cadence. Keep the default CUDA connection setting. + +### Rejected: Skip No-Op RFDETR Numpy Preprocessing Helper + +- Hypothesis: The accepted RFDETR package has no static crop, grayscale, contrast, or two-step resize, so `_pre_process_numpy(...)` calls `apply_pre_processing_to_numpy_image(...)` only to return the same image and a zero static-crop offset. A guarded fast path for this no-op case could reduce CPU preprocessing overhead without changing pixels or metadata. +- Change tested: Temporary code only; added `_can_skip_numpy_pre_processing(...)` in `rfdetr/pre_processing.py` and bypassed the generic helper when all numpy preprocessing operations were inactive after overrides. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared the fast path against the generic helper on `16` frames from `vehicles_312px.mp4`; normalized tensor max diff was `0.0` and preprocessing metadata matched exactly. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.21s fps=243.65` and `frames=538 elapsed=2.21s fps=243.64`, below the same-session default baseline of `frames=538 elapsed=2.20s fps=244.42`. 
+- Learning: This helper call is not a measurable limiter in the current graph-bound run. The added branch slightly worsens scheduling/noise, so the generic helper remains the accepted path. + +### Profile: Triton Selector Kernel NCU Snapshot + +- Request: Gather lower-level evidence for the remaining custom Triton selector work while keeping the benchmark command at pipeline depth `2`. Depth `3` was not tested. +- Profiles: Initial launch-shape-only report `/tmp/rfdetr_selector_kernel_depth2_20260523_192029.ncu-rep`; explicit-counter report `/tmp/rfdetr_selector_kernel_metrics_20260523_192216.ncu-rep`; details text `/tmp/rfdetr_selector_kernel_metrics_20260523_192216_details.txt`. +- Result under profiler: The explicit-counter NCU run measured `frames=538 elapsed=7.26s fps=74.06`, which is profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled `_select_topk_boxes_kernel` launches as a single Triton program with CUDA launch shape `(1, 1, 1)x(256, 1, 1)`. Across five sampled launches, NCU reported mean `gpu__time_duration.avg=24.766 us`, mean DRAM read `53.914 KB`, mean DRAM write `12.8 B`, fixed L1 global-load traffic `42.820 KB`, fixed L1 global-store traffic `928 B`, and fixed `15811` SMSP instructions. +- Learning: This supports the earlier selector experiments: the kernel is under-parallelized by shape, but it is also very small and mostly reads the `100x91` score matrix. Prior attempts to improve occupancy with top-2-per-query, different warp counts, max-iteration caps, and raw-logit selection all lost end to end because the extra launch/codegen/traffic costs outweighed the tiny selector tail. Future selector work should only proceed if it also removes another launch or required copy, not as standalone selector tuning. + +### Rejected: Glibc Arena Limit Runtime Knob + +- Hypothesis: The remaining CPU-side work allocates small NumPy arrays, UUID strings, and metadata objects while two workflow workers are active. 
Setting `MALLOC_ARENA_MAX=1` before process startup might reduce glibc arena overhead or memory-management variance enough to improve the depth-2 materialization tail. +- Change tested: External process environment only; ran the requested benchmark with `MALLOC_ARENA_MAX=1`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: same-session default baseline measured `frames=538 elapsed=2.21s fps=243.57`; `MALLOC_ARENA_MAX=1` measured `frames=538 elapsed=2.20s fps=244.77`, then repeated at `frames=538 elapsed=2.21s fps=243.95`; default rerun measured `frames=538 elapsed=2.20s fps=244.01`. +- Learning: The first arena-limited run was noise rather than a stable allocator improvement. Host allocator tuning does not move the current graph-bound ceiling, and the target command should not require an external glibc allocator environment variable. + +### Rejected: OpenCV Single Thread Runtime Probe + +- Hypothesis: The benchmark source is an OpenCV-decoded video and the process default reports `cv2.getNumThreads() == 8`. Restricting OpenCV to one thread might reduce CPU scheduling contention between video decode and the two depth-2 workflow workers. +- Change tested: Temporary launcher only; called `cv2.setNumThreads(1)` before executing `development/stream_interface/rfdetr_nano_seg_trt_workflow.py` via `runpy`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: `cv2.setNumThreads(1)` measured `frames=538 elapsed=2.20s fps=244.91`; immediate default rerun measured `frames=538 elapsed=2.20s fps=245.00`. +- Learning: OpenCV thread-pool size is not a limiter in the accepted graph-bound run. Keep the normal OpenCV default rather than adding benchmark-specific thread configuration. 
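Several runtime-knob entries reject a change because one fast run did not repeat. One way to make that call mechanical is sketched below. The rule used here (every candidate run must beat every same-session baseline run) and the helper name `is_stable_improvement` are assumptions for illustration, not the criterion actually applied when writing this log. The FPS values are copied from the `MALLOC_ARENA_MAX` and OpenCV entries.

```python
def is_stable_improvement(baseline_fps, candidate_fps):
    """Return True only if the slowest candidate run beats the fastest baseline run."""
    if not baseline_fps or not candidate_fps:
        raise ValueError("need at least one run on each side")
    return min(candidate_fps) > max(baseline_fps)


# MALLOC_ARENA_MAX=1: default runs versus arena-limited runs.
malloc_arena = is_stable_improvement(
    baseline_fps=[243.57, 244.01],
    candidate_fps=[244.77, 243.95],
)

# cv2.setNumThreads(1): immediate default rerun versus single-thread run.
opencv_single_thread = is_stable_improvement(
    baseline_fps=[245.00],
    candidate_fps=[244.91],
)

print(malloc_arena, opencv_single_thread)
```

Both checks come out negative under this rule, matching the rejections recorded above. A stricter or looser rule (for example, comparing medians over more repeats) would be equally defensible; the point is only that a single best run is not enough evidence.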
+ +### Rejected: Capture TensorRT Graph On Caller Stream + +- Hypothesis: The TensorRT CUDA graph cache creates a dedicated graph stream, then each RFDETR cache hit waits that graph stream on the model inference stream and waits the inference stream back on the graph stream. Capturing and replaying the graph on the current inference stream could turn those into self-waits in the RFDETR path and remove small event edges without changing graph outputs. +- Change tested: Temporary code only; changed `_capture_cuda_graph(...)` to use `torch.cuda.current_stream(device)` instead of creating a new `torch.cuda.Stream(device=device)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: This changed only stream placement for the same TensorRT graph capture/replay, input copy, and output clones. `py_compile` passed before the benchmark, and the run completed normally. +- Result on requested command: `frames=538 elapsed=2.20s fps=244.33`, below the immediate accepted default run of `frames=538 elapsed=2.20s fps=245.00`. +- Learning: The dedicated TensorRT graph stream is part of the stable overlap schedule. Collapsing graph replay onto the caller inference stream does not reduce the graph-bound tail and slightly underperforms, so the accepted dedicated graph stream remains. + +### Rejected: Python Malloc Runtime Knob + +- Hypothesis: Prediction materialization still creates many small Python objects and NumPy arrays. Running the process with `PYTHONMALLOC=malloc` might reduce small-object allocator contention or interact better with glibc under the two-worker depth-2 pipeline. +- Change tested: External process environment only; ran the requested benchmark with `PYTHONMALLOC=malloc`. Pipeline depth remained fixed at `2`; depth `3` was not tested. 
+- Result on requested command: `PYTHONMALLOC=malloc` first measured `frames=538 elapsed=2.19s fps=245.44`, but repeated at `frames=538 elapsed=2.21s fps=243.90`; immediate default rerun measured `frames=538 elapsed=2.21s fps=243.77`. +- Learning: The first allocator result was not repeatable and stays inside normal run-to-run variance. Python allocator selection is not a stable improvement, and the target command should keep the default allocator. + +### Profile: Depth-2 Graphbound Refresh + +- Request: Capture another Nsight Systems report for user analysis while keeping workflow pipeline depth fixed at `2`. Depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_graphbound_20260523_194100.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphbound_20260523_194100.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphbound_20260523_194100_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphbound_20260523_194100_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphbound_20260523_194100_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.26s fps=238.29`. +- Graph spacing: The capture includes `602` CUDA graph traces: `64` capture warmup replays plus `538` frame replays. After skipping the `64` warmups plus the next `100` frame launches, CUDA graph duration was p50 `4076.798 us`, p90 `4139.707 us`, p95 `4144.201 us`, p99 `4155.937 us`, mean `4065.573 us`; graph end-to-next-start gap was p50 `40.863 us`, p90 `42.201 us`, p95 `42.477 us`, p99 `42.879 us`, mean `40.945 us`. +- Gap decomposition over the first `100` stable post-settling gaps: busy work inside the gap was p50 `35.455 us`, mean `35.648 us`; idle inside the gap was p50 `5.200 us`, mean `5.220 us`. 
The largest gap occupants were TensorRT mask D2D clone (`2433600B`, `13.457 us` avg overlap), next-frame input D2D copy (`1168128B`, `13.102 us`), sigmoid (`6.598 us`), fill-long (`2.800 us`), selector (`2.393 us`), logits D2D clone (`36400B`, `2.099 us`), fill-int (`2.063 us`), and boxes D2D clone (`1600B`, `1.991 us`). +- Learning: The refreshed report still matches the target shape: the graph-to-graph idle bubble is about `5 us`, while the TensorRT CUDA graph body is about `4.07 ms`. The remaining throughput ceiling is dominated by the TensorRT graph forward pass plus required ownership copies and a narrow postprocess tail. + +### Rejected: Two-Phase Limited D2H Detection Copy + +- Hypothesis: The fixed RFDETR workflow conversion copies `7` masks to pinned CPU buffers every frame, while the benchmark detection distribution averages only `3.54` detections per frame (`1`: 15 frames, `2`: 104, `3`: 164, `4`: 145, `5`: 74, `6`: 14, `7`: 22). Copying the GPU count first, synchronizing, then copying only `valid_count` rows could reduce D2H bytes enough to improve the CPU materialization boundary. +- Change tested: Temporary code only; in `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)`, copied and synchronized `valid_count` before copying boxes, confidences, class IDs, and masks for only the valid rows. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result on requested command: `frames=538 elapsed=2.80s fps=192.19`, a severe regression from the accepted warmed band. +- Learning: The saved D2H bytes do not compensate for introducing a second synchronization at the conversion boundary. The accepted single-sync fixed 7-row copy is the better depth-2 schedule because it keeps the GPU/CPU pipeline from stalling on an early count readback. 
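The arithmetic behind the two-phase hypothesis can be reproduced from the numbers in this log. The sketch below is illustrative only: it assumes the `384384B` fixed-mask Device-to-Host copy reported in the Nsight Systems diagnostic covers exactly `7` equally sized mask rows, and it uses `2.20s` as a typical accepted elapsed time from other entries. The variable names are hypothetical.

```python
# Detections per frame -> number of frames, copied from the hypothesis above.
histogram = {1: 15, 2: 104, 3: 164, 4: 145, 5: 74, 6: 14, 7: 22}

frames = sum(histogram.values())
detections = sum(count * n_frames for count, n_frames in histogram.items())
mean_detections = detections / frames

FIXED_ROWS = 7
FIXED_MASK_COPY_BYTES = 384384  # assumption: covers exactly FIXED_ROWS mask rows
bytes_per_row = FIXED_MASK_COPY_BYTES // FIXED_ROWS

# Fraction of mask D2H bytes a valid-count-only copy would skip on average.
saved_fraction = 1 - mean_detections / FIXED_ROWS

# Extra wall time per frame observed with the second synchronization,
# using 2.20 s as a typical accepted elapsed time (assumption).
extra_ms_per_frame = (2.80 - 2.20) / frames * 1e3

print(frames, detections, round(mean_detections, 2))
print(bytes_per_row, round(saved_fraction, 2), round(extra_ms_per_frame, 2))
```

Under these assumptions the two-phase copy skips roughly half of the mask bytes per frame, yet the run costs about one extra millisecond per frame, which supports the conclusion that the added synchronization outweighs the saved transfer.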
+ +### Diagnostic: Depth-2 TensorRT Graph Visibility In Nsight Systems + +- Hypothesis: The latest depth-2 Nsight Systems SQLite might reveal idle bubbles inside the captured TensorRT CUDA graph body, not just the gap between graph replays. If visible graph-internal idle were large, it could justify another TensorRT stream/tactic scheduling experiment. +- Analysis: Reused `/tmp/rfdetr_depth2_graphbound_20260523_194100.sqlite`, skipped the `64` capture warmup replays plus `100` settling frame launches, and measured GPU activities overlapping the remaining `438` CUDA graph trace intervals. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Result: CUDA graph duration was p50 `4076.798 us`, mean `4065.573 us`. Nsight Systems did not expose the captured TensorRT graph's steady-state internal kernels as normal kernel rows; the visible overlapping non-graph work inside graph intervals was only p50 `280.651 us`, mean `280.587 us`, with top visible occupants Host-to-Device input copy `1168128B` (`188.948 us`/graph avg overlap), fixed-mask Device-to-Host copy `384384B` (`59.230 us`), `_resize_selected_masks_kernel` (`13.945 us`), and `_select_topk_boxes_kernel` (`10.508 us`). The visible overlap factor was `1.000`, so those exposed non-graph events are not meaningfully concurrent with one another. +- Learning: The profile confirms that depth-2 is already hiding preprocessing H2D and prediction D2H/postprocess under the TensorRT graph replay. It does not provide graph-internal TensorRT node timing; for the graph body, the better evidence remains the separate TensorRT layer profiler and NCU kernel snapshots, which showed the cost distributed across many small Myelin MHA/MLP kernels rather than a single obvious plugin target. 
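The "overlap factor" quoted above can be read as summed activity time divided by the union of busy time, so a value of `1.000` means the activities never run concurrently. That reading is an assumption based on how the metric is described in this log; the actual analysis script is not shown. The sketch below uses hypothetical `(start, end)` intervals in microseconds, and `union_busy_time` and `overlap_factor` are illustrative names.

```python
def union_busy_time(intervals):
    """Total time covered by at least one (start, end) interval."""
    busy = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged span before opening a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current span.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy


def overlap_factor(intervals):
    """Summed activity time divided by union busy time (0.0 if there is no activity)."""
    summed = sum(end - start for start, end in intervals)
    union = union_busy_time(intervals)
    return summed / union if union else 0.0


# Hypothetical activities: back-to-back work versus two overlapping copies.
serial = [(0.0, 10.0), (10.0, 25.0), (30.0, 40.0)]
concurrent = [(0.0, 10.0), (5.0, 15.0)]

print(overlap_factor(serial), overlap_factor(concurrent))
```

Idle time inside an envelope follows from the same helper as the envelope duration minus `union_busy_time(...)`.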
+ +### Profile: Depth-2 CUDA Graph Node Trace + +- Hypothesis: Nsight Systems `--cuda-graph-trace=node` can expose steady-state TensorRT graph node kernels, giving better evidence for any remaining graph-body scheduling or plugin target than the default graph-envelope reports. +- Profile: `/tmp/rfdetr_depth2_graphnode_20260523_194903.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphnode_20260523_194903.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphnode_20260523_194903_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphnode_20260523_194903_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphnode_20260523_194903_stats_cuda_api_sum_cuda_api_sum.csv`. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Result under profiler: `frames=538 elapsed=2.33s fps=230.56`. +- Graph-node findings: The trace shows `602` inferred graph replays with `242` graph nodes each. After skipping `64` warmup replays plus `100` settling launches, graph-node envelope duration was p50 `4131.245 us`, p90 `4193.670 us`, p95 `4197.040 us`, mean `4137.898 us`; union GPU-busy time inside the graph was p50 `3916.356 us`, mean `3921.581 us`; internal no-activity idle was p50 `215.837 us`, p90 `221.781 us`, p95 `223.043 us`, mean `216.316 us`. Mean graph busy fraction was `94.77%`, summed-activity/union-busy overlap factor was only `1.0405`, and each replay used `6` streams. +- Top graph-node occupants per replay were distributed across many repeated TensorRT/Myelin kernels: `25` FP16 GEMM nodes at `808.390 us/replay`, `12` MHA nodes at `585.925 us/replay`, `12` fused FP16/FP32 GEMM nodes at `481.064 us/replay`, `12` smaller GEMM nodes at `221.701 us/replay`, plus smaller convolution, layernorm/GELU, transpose, and Myelin fusion nodes. The only visible Myelin TopK node was about `14.588 us/replay`. 
+- Learning: The graph body is now directly visible and confirms the earlier layer/NCU evidence: most time is spread across many small transformer GEMM/MHA/Myelin nodes with modest internal idle and little multi-stream overlap. There is no single large graph node to replace with a custom kernel; further graph-body gains require a correctness-equivalent TensorRT engine/export/tactic change rather than another Python scheduling tweak. + +### Profile: Graph-Node MHA Nsight Compute Snapshot + +- Hypothesis: The top repeated graph-node MHA kernel from the node trace may expose a hardware-level limitation that suggests a TensorRT tactic or custom-kernel target. +- Profiles: Launch-shape-only attempt `/tmp/rfdetr_trt_mha_graphnode_ncu_20260523_195236.ncu-rep` and useful basic-metrics report `/tmp/rfdetr_trt_mha_graphnode_basic_ncu_20260523_195342.ncu-rep` with details in `/tmp/rfdetr_trt_mha_graphnode_basic_ncu_20260523_195342_details.txt`. The NCU command matched `_gemm_mha_v2_0x7daddb359f728ff2e600188f192f4549`, used graph profiling mode `node`, skipped `900` matching launches, collected `3` launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.33s fps=100.84`, not comparable to normal benchmark FPS. +- Findings: The sampled MHA node launches as `(1, 6, 11)x(128, 1, 1)` with grid size `66`, block size `128`, `245` registers/thread, `24.58 KiB` dynamic shared memory per block, `0.82` waves/SM, theoretical occupancy `25%`, achieved occupancy about `20.54%`, achieved active warps/SM about `6.57`, compute throughput about `36.04%`, memory throughput about `25.47%`, DRAM throughput about `6.45%`, and duration about `106.33 us` under NCU replay. +- Learning: The top MHA node is small-grid and limited by registers/shared memory, with low DRAM pressure. This is consistent with TensorRT/Myelin tactic limitations on T4 rather than a standalone memory optimization. 
A replacement would need to cover repeated MHA/GEMM structure across the transformer, not one isolated postprocess-style kernel. + +### Diagnostic: Accepted Engine Tactic Visibility + +- Hypothesis: The accepted TensorRT engine inspector might expose tactic IDs or implementation details for the top MHA/GEMM graph nodes, allowing a targeted tactic-level runtime change without rebuilding from an incompatible ONNX source. +- Analysis: Rechecked `/tmp/rfdetr_accepted_engine_inspector.json` and `/tmp/rfdetr_trt_layer_profile_20260523_accepted.json` after the CUDA graph node trace. The accepted engine inspector only contains layer names and bindings because the serialized engine was built with `ProfilingVerbosity.LAYER_NAMES_ONLY`; it does not expose tactic IDs, implementation alternatives, or detailed tensor formats. +- Findings: The direct TensorRT layer profiler still maps the largest per-execute layer times to twelve repeated `_gemm_mha_v2_myl2_*` layers at about `0.066-0.069 ms` each, followed by many encoder `mlp/fc2/MatMul_myl2_*` and Myelin FC layers around `0.054-0.059 ms` each. Coarse grouping of the `261` profiled layers attributes the largest totals to MHA/attention and MatMul/FC families rather than a single plugin-sized outlier. +- Learning: There is no tactic-level runtime knob visible from the accepted plan. A real graph-body optimization needs either a new official engine built with detailed profiling metadata, the exact correctness-equivalent export source, or an offline rebuild path that passes the all-frame class/box/mask correctness gate; the public ONNX/Torch packages tested so far do not satisfy that gate. + +### Profile: Graph-Node Top GEMM Nsight Compute Snapshot + +- Hypothesis: The largest aggregate graph-node kernel family, `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt`, might show whether the remaining TensorRT graph body is compute-, memory-, or launch-shape-limited. 
+- Profile: `/tmp/rfdetr_trt_topgemm_graphnode_basic_ncu_20260523_195831.ncu-rep` with details in `/tmp/rfdetr_trt_topgemm_graphnode_basic_ncu_20260523_195831_details.txt`. The NCU command matched the GEMM kernel, used graph profiling mode `node`, skipped `1800` matching launches, collected `3` launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.25s fps=102.51`, not comparable to normal benchmark FPS. +- Findings: The sampled GEMM launches had grid sizes `36`, `54`, and `72` with block size `128`, `166` registers/thread, `16.38 KiB` dynamic shared memory per block, waves/SM `0.30-0.60`, theoretical occupancy `37.5%`, achieved occupancy `13.90-21.35%` (mean `17.71%`), achieved active warps/SM mean `5.67`, compute throughput mean `48.14%`, DRAM throughput mean `24.78%`, and duration mean `54.91 us` under NCU replay. +- Learning: The top GEMM family is also small-grid and occupancy-limited, though less register-heavy than the MHA node. Along with the MHA snapshot, this points to TensorRT tactic/export structure as the remaining graph-body limiter; replacing a single postprocess-style kernel cannot address enough of the repeated GEMM/MHA body. + +### Rejected: Three Explicit TensorRT Aux Streams + +- Hypothesis: Prior tests rejected zero, one, two, and four explicit TensorRT auxiliary streams during CUDA graph capture. Since the graph-node trace shows six streams active during replay but low overlap, forcing exactly three persistent aux streams might produce a slightly better balance between TensorRT internal parallelism and scheduling overhead. +- Change tested: Temporary code only; added three persistent `torch.cuda.Stream` objects to the TensorRT CUDA graph state, called `graph_context.set_aux_streams(...)` before the pre-capture warmup and CUDA graph capture, and kept pipeline depth fixed at `2`; depth `3` was not tested. 
+- Correctness: `py_compile` passed and the benchmark completed normally. The change only altered TensorRT auxiliary stream scheduling for the same captured engine and output tensors. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.73` and `frames=538 elapsed=2.21s fps=243.50`; after reverting to the accepted default aux-stream behavior, the same-session baseline measured `frames=538 elapsed=2.21s fps=243.71`. +- Learning: With three explicit aux streams the results stay within the noise of the accepted warmed band; there is no stable gain. Keep TensorRT's default auxiliary stream scheduling. + +### Rejected: Defer TensorRT Graph Output Wait To RFDETR Postprocess + +- Hypothesis: The TensorRT graph cache hit path currently queues input copy, CUDA graph replay, and output clones on the dedicated graph stream, then makes the caller inference stream wait for that graph stream. In the RFDETR deferred workflow path, postprocess could wait on the graph stream directly, potentially removing one event edge before the next depth-2 replay. +- Change tested: Temporary code only; added a `defer_cuda_graph_output_sync` flag to the TensorRT helper, used it only from RFDETR when `defer_cuda_stream_sync=True`, stored the graph stream in RFDETR thread-local state, and made postprocess wait on that stream instead of the inference stream. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed and the benchmark completed normally. +- Result on requested command: `frames=538 elapsed=2.20s fps=244.83`, inside the accepted band but not a clear improvement. A correctly launched Nsight Systems run with this temporary scheduling variant measured `frames=538 elapsed=2.29s fps=235.03`, below the latest accepted-path graph-bound profile.
+- Profile for rejected variant: `/tmp/rfdetr_depth2_deferout_20260523_201017.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_deferout_20260523_201017.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_deferout_20260523_201017_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_deferout_20260523_201017_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_deferout_20260523_201017_stats_cuda_api_sum.csv`. +- Learning: Moving the output wait out of the inference stream does not improve the depth-2 balance. The accepted event chain is not the current limiter, and preserving the simpler helper-level ownership handoff is better. + +### Profile: Depth-2 Graphbound Final Refresh + +- Request: Capture a fresh Nsight Systems report for user analysis after reverting the unsuccessful scheduling tweak. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Accepted-path sanity run before profiling: `frames=538 elapsed=2.20s fps=244.13`. +- Profile: `/tmp/rfdetr_depth2_graphbound_final_20260523_201236.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphbound_final_20260523_201236.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphbound_final_20260523_201236_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphbound_final_20260523_201236_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphbound_final_20260523_201236_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.32s fps=231.55`. +- Quick exported-data check: the profile contains `602` `cudaGraphLaunch` calls. After skipping `64` warmup launches plus `100` settling launches, graph-launch submit interval was p50 `4118.925 us`, p90 `4252.884 us`, p95 `4316.037 us`, mean `4119.233 us`. +- Learning: The fresh accepted report is consistent with the earlier graph-bound traces. Depth `2` keeps the run shaped around the TensorRT CUDA graph replay cadence; remaining visible work is the narrow fused postprocess/copy tail rather than a CPU scheduling bubble. 
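
The interval summary in the quick exported-data check above can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the input is a plain list of `cudaGraphLaunch` start timestamps in microseconds, already extracted from the exported SQLite file (the query is not shown because the log does not record table or column names), and the linear-interpolation percentile is an assumption, since the log does not say which percentile method was used. The names `percentile` and `summarize_launch_intervals` are hypothetical helpers, not part of the workflow code.

```python
def percentile(sorted_values, q):
    """Linear-interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def summarize_launch_intervals(launch_starts_us, skip=164):
    """Summarize start-to-start intervals after dropping the first `skip` launches.

    skip=164 mirrors the 64 warmup launches plus 100 settling launches
    described in the log entry.
    """
    steady = launch_starts_us[skip:]
    intervals = sorted(b - a for a, b in zip(steady, steady[1:]))
    return {
        "count": len(intervals),
        "p50": percentile(intervals, 50),
        "p90": percentile(intervals, 90),
        "p95": percentile(intervals, 95),
        "mean": sum(intervals) / len(intervals),
    }


# Synthetic timestamps (not measured data): 602 launches spaced 4119 us apart.
starts = [i * 4119.0 for i in range(602)]
summary = summarize_launch_intervals(starts)
print(summary)
```

With real timestamps, the same function would be applied to the launch-ordered start times from the profile; the synthetic input here only demonstrates the skip-then-summarize shape of the calculation.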
+ +### Rejected: RFDETR Mask Resize Triton Block Size Variants + +- Hypothesis: The deferred fused mask resize kernel is one of the few remaining visible postprocess kernels in the graph-to-graph tail. Changing the Triton pixel block size from `256` to either `512` or `128` might improve occupancy or reduce scheduling overhead for the `7 x 312 x 312` limited resize workload. +- Change tested: Temporary code only; changed `fused_resize_selected_masks(...)` block size to `512`, then to `128`. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: This only changed the partitioning of the same bilinear mask threshold computation. `py_compile` passed after each variant and the benchmark completed normally. Classes and boxes are untouched by this kernel. +- Result on requested command: `512` measured `frames=538 elapsed=2.20s fps=244.14` and `frames=538 elapsed=2.20s fps=244.03`; `128` measured `frames=538 elapsed=2.21s fps=242.92`. After reverting to accepted `256`, the immediate same-session baseline measured `frames=538 elapsed=2.21s fps=242.96`, indicating the session was noisy/slow but neither variant showed a stable gain. +- Learning: The accepted `256` tile remains the best default. This kernel is too small relative to the TensorRT CUDA graph body for tiling-only changes to move end-to-end throughput reliably. + +### Accepted: RFDETR Mask Resize Two-Warp Launch + +- Hypothesis: The accepted deferred mask resize uses a `256`-pixel tile with `num_warps=4`. Since the fixed first-stage grid only covers up to `7` detections at `312x312`, the per-program vector work may be over-provisioned; `num_warps=2` could reduce scheduling/register pressure while preserving the same per-pixel math. +- Change: Changed `_resize_selected_masks_kernel` launch in `fused_resize_selected_masks(...)` from `num_warps=4` to `num_warps=2`. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. 
This kernel runs after class selection and box decoding, so class IDs and boxes are unchanged by construction; the mask computation is the same independent per-pixel bilinear threshold with only Triton launch geometry changed. +- Result on requested command: `num_warps=2` measured `frames=538 elapsed=2.20s fps=244.42`, `frames=538 elapsed=2.20s fps=244.51`, and `frames=538 elapsed=2.20s fps=244.02`. Same-session `num_warps=4` baselines measured `frames=538 elapsed=2.21s fps=243.50` and `frames=538 elapsed=2.21s fps=243.82`. +- Learning: The improvement is small and still in the noisy graph-bound band, but the same-session A/B favored two warps and the change is low-risk. Keep `num_warps=2` as the current accepted mask-resize launch geometry. + +### Rejected: Vectorized Workflow Class Name Mapping + +- Hypothesis: The RFDETR workflow fast path maps numeric `class_id` arrays to class-name strings with a Python list comprehension. Caching `model.class_names` as a NumPy object array and indexing it directly for in-range class IDs could reduce CPU materialization work while the depth-2 pipeline is feeding the GPU. +- Change tested: Temporary code only; added a thread-local cached class-name object array and used NumPy indexing when all class IDs were valid, falling back to the original per-element behavior for out-of-range IDs. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. A synthetic check over empty, in-range, and out-of-range class IDs matched the original mapping exactly. This path only changes display-name construction after model classes, boxes, and masks have already been materialized. +- Result on requested command: vectorized mapping measured `frames=538 elapsed=2.20s fps=244.55` and `frames=538 elapsed=2.21s fps=243.81`; after reverting to the original list comprehension, the same-session baseline measured `frames=538 elapsed=2.20s fps=244.04`. 
+- Learning: Class-name mapping is not a stable limiter in the accepted graph-bound run. Keep the simpler original mapping. + +### Rejected: RFDETR Mask Resize One-Warp Launch + +- Hypothesis: After accepting `num_warps=2` for the limited mask resize kernel, reducing further to `num_warps=1` might lower scheduling overhead for the same `256`-pixel tile and small `7 x 312 x 312` first-stage grid. +- Change tested: Temporary code only; changed `_resize_selected_masks_kernel` launch in `fused_resize_selected_masks(...)` from `num_warps=2` to `num_warps=1`. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. This only changes Triton launch geometry after class selection and box decoding; class IDs and boxes are unchanged by construction, and the mask math is the same. +- Result on requested command: `num_warps=1` measured `frames=538 elapsed=2.21s fps=243.91`, `frames=538 elapsed=2.21s fps=243.81`, and `frames=538 elapsed=2.21s fps=243.53`. After restoring the accepted `num_warps=2`, same-session runs measured `frames=538 elapsed=2.21s fps=243.28` and then `frames=538 elapsed=2.20s fps=244.52`. +- Learning: One warp is not a stable improvement and loses to the restored two-warp launch once the session returns to the accepted band. Keep `num_warps=2`. + +### Rejected: UUID Hex Strings For Workflow IDs + +- Hypothesis: The RFDETR workflow fast path creates one inference UUID per frame and one detection UUID per detection via `str(uuid.uuid4())`. Using `uuid.uuid4().hex` could avoid UUID string formatting with hyphens and reduce CPU materialization overhead without changing model predictions. +- Change tested: Temporary code only; changed frame-level inference IDs and per-detection IDs in the RFDETR workflow conversion path from `str(uuid.uuid4())` to `uuid.uuid4().hex`. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. 
The change only affects opaque workflow identifiers after prediction construction; classes, boxes, confidence, and masks are untouched. +- Result on requested command: `.hex` IDs measured `frames=538 elapsed=2.20s fps=244.40` and `frames=538 elapsed=2.21s fps=243.31`; after reverting to the original string UUID format, same-session baseline measured `frames=538 elapsed=2.21s fps=243.66`. +- Learning: UUID string formatting is not a stable limiter in the accepted graph-bound run. Keep the existing hyphenated UUID behavior. + +### Profile: Depth-2 Low-Bubble Nsight Systems Refresh + +- Request: Capture a fresh Nsight Systems report for user analysis while keeping the workflow pipeline depth fixed at `2`; depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_lowbubble_20260523_204849.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_lowbubble_20260523_204849.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_lowbubble_20260523_204849_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_lowbubble_20260523_204849_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_lowbubble_20260523_204849_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.29s fps=234.43`. +- Quick exported-data check: the profile contains `602` CUDA graph traces. After skipping `64` warmup launches plus `100` settling launches, graph duration was p50 `4066.704 us`, p90 `4131.826 us`, p95 `4134.941 us`, mean `4075.063 us`; graph end-to-next-start gap was p50 `40.543 us`, p90 `41.695 us`, p95 `42.156 us`, mean `40.703 us`; graph start-to-start interval was p50 `4107.584 us`, p90 `4172.229 us`, p95 `4175.436 us`, mean `4115.659 us`. +- Gap decomposition after the same skip: busy work inside the graph-to-graph gap was p50 `35.039 us`, p90 `36.857 us`, p95 `37.280 us`, mean `35.284 us`; idle gap time was p50 `5.312 us`, p90 `6.093 us`, p95 `6.272 us`, mean `5.419 us`. +- Non-profiled sanity run after restoring accepted code measured `frames=538 elapsed=2.20s fps=244.43`. 
+- Learning: The accepted depth-2 run is already tightly graph-paced. The next TensorRT CUDA graph starts about `40.5 us` after the previous graph ends, and only about `5.3 us` of that gap is idle; the bottleneck remains the CUDA graph forward body plus a narrow real GPU postprocess/copy tail. + +### Rejected: Cached False Preprocessing Overrides Object + +- Hypothesis: `_try_run_rfdetr_trt_fast_path(...)` constructs `PreProcessingOverrides(False, False, False)` for every frame. Reusing a module-level frozen dataclass instance could remove a small Python allocation in the CPU producer path without changing preprocessing flags. +- Change tested: Temporary code only; added a module-level `_RFDETR_PRE_PROCESSING_OVERRIDES` constant and passed it into RFDETR preprocessing. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. `PreProcessingOverrides` is a frozen dataclass with three boolean fields, so the reused object is immutable and semantically identical to constructing the same values per frame. +- Result on requested command: depth-2 runs measured `frames=538 elapsed=2.20s fps=244.19` and `frames=538 elapsed=2.21s fps=243.44`. +- Learning: This allocation is too small to matter in the current graph-bound path. The change did not improve throughput, so the explicit per-frame object construction was restored. + +### Rejected: Unrolled RFDETR Normalization Channel Writes + +- Hypothesis: CPU profiling showed `_pil_image_to_normalized_tensor(...)` as the largest remaining source of true CPU self-time. Unrolling the fixed three-channel normalization loop could remove minor Python loop/indexing overhead while preserving the exact same NumPy multiply/add operations. +- Change tested: Temporary code only; replaced the three-iteration `for output_channel, input_channel in enumerate(channel_order)` loop with three explicit `np.multiply(..., out=normalized[i])` calls followed by the same in-place bias adds.
Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. A deterministic local equivalence check over both channel orders and several uint8 sample images matched the original loop exactly with max diff `0`. +- Result on requested command: unrolled depth-2 runs measured `frames=538 elapsed=2.20s fps=245.02` and `frames=538 elapsed=2.21s fps=243.87`; after reverting to the accepted loop, the immediate same-session baseline measured `frames=538 elapsed=2.21s fps=243.77`. +- Learning: The first run was noise rather than a stable gain. Loop overhead in the normalization helper is not large enough to move the graph-bound steady state, so the simpler accepted loop remains. + +### Diagnostic: Accepted TensorRT Graph-Only Ceiling + +- Hypothesis: The accepted full workflow appears tightly graph-paced, so measuring the serialized TensorRT plan outside the Python workflow can bound the remaining useful optimization headroom. +- Diagnostic: `trtexec` was not installed, so a TensorRT Python harness loaded `/tmp/cache/shared-blobs/bc173a2cfda9a10af2bc411885e9fec3`, created one execution context for static input shape `(1, 3, 312, 312)`, bound the three output tensors, warmed the context, then timed both direct `execute_async_v3(...)` and a CUDA graph containing only `execute_async_v3(...)`. This diagnostic excludes preprocessing, input D2D copy into the graph buffer, output clones, sigmoid/selector/mask resize, D2H prediction copies, and workflow CPU materialization. Pipeline depth was not varied; depth `3` was not tested. +- Result: Direct `execute_async_v3(...)` measured `4.311606 ms` per enqueue (`231.93 fps`). CUDA graph replay-only measured `4.052506 ms` per replay (`246.76 fps`). +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.72` immediately after the diagnostic. 
+- Learning: The current accepted workflow is within roughly `1.2%` of the pure TensorRT graph replay ceiling for this serialized plan on the observed T4. The remaining non-engine overhead is only a few dozen microseconds per frame, consistent with Nsight's roughly `40 us` graph-to-graph tail. Meaningful additional FPS now requires a faster correctness-equivalent TensorRT engine/tactic/export; Python, D2H, and postprocess micro-tweaks have very little headroom left. + +### Rejected: TensorRT Runtime Max Threads Sweep + +- Hypothesis: TensorRT `Runtime.max_threads` can be set before deserializing the engine. If it affected graph execution context internals or host-side graph launch preparation, tuning it might improve the accepted CUDA graph replay ceiling without changing model math. +- Diagnostic: Ran the same accepted-engine graph-only harness with `runtime.max_threads` set to `1`, `2`, `4`, `8`, and `16` before deserializing `/tmp/cache/shared-blobs/bc173a2cfda9a10af2bc411885e9fec3`. This measured only a CUDA graph containing `execute_async_v3(...)`; pipeline depth was not varied and depth `3` was not tested. +- Result: `max_threads=1` measured `4.054766 ms` (`246.62 fps`), `2` measured `4.076955 ms` (`245.28 fps`), `4` measured `4.087345 ms` (`244.66 fps`), `8` measured `4.094583 ms` (`244.23 fps`), and `16` measured `4.104315 ms` (`243.65 fps`). The default runtime value is already `1`. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.73` after the diagnostic. +- Learning: TensorRT runtime thread count is not an optimization lever for this accepted plan. The default is already the fastest measured setting for the graph-only ceiling, and larger values regress slightly. No code change was kept. 
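
The millisecond and FPS figures in the sweep above are related by a simple reciprocal, and the remaining headroom follows from comparing the workflow FPS against the graph-only ceiling. The sketch below redoes that arithmetic using only numbers quoted in this entry; `fps_from_ms` and `headroom_percent` are hypothetical helper names introduced for illustration.

```python
def fps_from_ms(ms_per_replay):
    """Convert a per-replay latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_replay


def headroom_percent(workflow_fps, ceiling_fps):
    """Share of the graph-only ceiling that the full workflow does not reach."""
    return (ceiling_fps - workflow_fps) / ceiling_fps * 100.0


# Graph-only replay latency per `runtime.max_threads` value, as logged above.
sweep_ms = {1: 4.054766, 2: 4.076955, 4: 4.087345, 8: 4.094583, 16: 4.104315}
sweep_fps = {threads: fps_from_ms(ms) for threads, ms in sweep_ms.items()}
best_threads = max(sweep_fps, key=sweep_fps.get)

# Full-workflow sanity run logged after the sweep.
workflow_fps = 243.73
headroom = headroom_percent(workflow_fps, sweep_fps[best_threads])

for threads, fps in sweep_fps.items():
    print(f"max_threads={threads:>2}: {fps:.2f} fps")
print(f"best max_threads={best_threads}, workflow headroom={headroom:.2f}%")
```

The comparison is only approximate, because the graph-only harness and the full workflow were measured in separate runs within the same session.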
+ +### Rejected: Explicit TensorRT Shape Inference Before Graph Capture + +- Hypothesis: Calling `IExecutionContext.infer_shapes()` after setting the static input shape and tensor addresses, before the warmup enqueue and CUDA graph capture, might finalize TensorRT shape state earlier and produce a slightly faster captured graph. +- Diagnostic: Used the accepted-engine graph-only harness with three modes: baseline, `infer_shapes()`, and `infer_shapes()` plus `get_tensor_strides(...)` inspection for all I/O tensors. This measured only CUDA graph replay of `execute_async_v3(...)`; pipeline depth was not varied and depth `3` was not tested. +- Correctness: `infer_shapes()` returned an empty list of missing tensors, and tensor strides matched the accepted static layout: input `(292032, 97344, 312, 1)`, boxes `(400, 4, 1)`, logits `(9100, 91, 1)`, masks `(608400, 6084, 78, 1)`. +- Result: Baseline graph replay measured `4.055898 ms` (`246.55 fps`), `infer_shapes()` measured `4.075415 ms` (`245.37 fps`), and `infer_shapes()` plus stride inspection measured `4.087416 ms` (`244.65 fps`). +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.05` after the diagnostic. +- Learning: TensorRT already has the static shape state it needs before capture. Explicit shape inference does not improve the captured graph and slightly regresses the graph-only ceiling, so no runtime change was kept. + +### Diagnostic: TensorRT Helper Operation Ceiling + +- Hypothesis: The accepted workflow is close to TensorRT graph-only speed because depth-2 scheduling overlaps part of the helper/postprocess tail. Measuring graph replay with the helper's input D2D copy and output clone operations in isolation can bound how much benefit remains from further clone/copy tuning. +- Diagnostic: Used the accepted engine and one captured `execute_async_v3(...)` CUDA graph. 
Timed four single-stream modes with CUDA events: graph replay only, input-buffer D2D copy plus graph replay, graph replay plus three output clones, and input-buffer D2D copy plus graph replay plus three output clones. This diagnostic excludes preprocessing, postprocess kernels, D2H prediction copies, and workflow CPU materialization; pipeline depth was not varied and depth `3` was not tested. +- Result: graph only measured `4.050621 ms` (`246.88 fps`), input copy plus graph measured `4.081056 ms` (`245.03 fps`), graph plus output clones measured `4.103890 ms` (`243.67 fps`), and input copy plus graph plus output clones measured `4.129317 ms` (`242.17 fps`). +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.10` after the diagnostic. +- Learning: Input copy and output clones are real costs, but the accepted two-frame schedule overlaps enough surrounding work that the full workflow can run slightly faster than a single-stream serialized helper loop. Prior clone/borrow/copy rewrites lost this overlap. Further copy/clone tuning is unlikely to beat the current schedule unless it preserves the same decoupling while reducing the TensorRT graph body or replacing the serialized plan. + +### Diagnostic: TensorRT Temporary Allocator Probe + +- Hypothesis: `IExecutionContext.temporary_allocator` could reveal or control hidden TensorRT temporary allocations during the warmup/capture/replay path. If TensorRT was allocating temporary device buffers around graph capture, a preallocated allocator might reduce graph setup or replay jitter. +- Diagnostic: Used a temporary Python harness against the accepted engine and attached a logging `trt.IGpuAllocator` to the CUDA graph execution context. The allocator returned `None` for allocations so any real TensorRT temporary allocation would be visible and fail fast; no workflow code was changed. Pipeline depth stayed fixed at `2`; depth `3` was not tested. 
+- Result: The allocator received `0` callbacks after the pre-capture warmup enqueue, `0` callbacks after CUDA graph capture, and `0` callbacks after ten graph replays. +- Learning: TensorRT is not using the per-context temporary allocator in this static RFDETR graph path. There is no useful temporary-allocation hook to optimize, and adding a custom allocator would only add complexity. + +### Profile: Depth-2 Graph-Paced Nsight Systems Refresh + +- Request: Capture a fresh Nsight Systems report for user analysis while keeping workflow pipeline depth fixed at `2`. Depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_graphpaced_20260523_211726.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphpaced_20260523_211726.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphpaced_20260523_211726_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphpaced_20260523_211726_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphpaced_20260523_211726_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.30s fps=234.28`. +- Graph spacing: The capture contains `602` CUDA graph traces. After skipping `64` capture warmup replays plus `100` settling launches, graph duration was p50 `4070.800 us`, p90 `4134.482 us`, p95 `4137.841 us`, mean `4075.039 us`; graph end-to-next-start gap was p50 `40.383 us`, p90 `41.548 us`, p95 `41.856 us`, mean `40.500 us`; graph start-to-start interval was p50 `4111.038 us`, p90 `4175.217 us`, p95 `4178.257 us`, mean `4115.560 us`. +- Gap decomposition after the same skip: busy work inside the graph-to-graph gap was p50 `34.943 us`, p90 `36.319 us`, p95 `37.036 us`, mean `35.086 us`; true idle was p50 `5.408 us`, p90 `6.048 us`, p95 `6.272 us`, mean `5.414 us`. 
+- Gap occupants: the largest clipped occupants were next-frame input Device-to-Device copy (`1168128B`, `13.130 us/gap`), TensorRT mask Device-to-Device clone (`2433600B`, `13.114 us/gap`), sigmoid (`6.902 us/gap`), fill-long (`2.812 us/gap`), logits Device-to-Device clone (`36400B`, `2.105 us/gap`), boxes Device-to-Device clone (`1600B`, `2.000 us/gap`), fill-int (`1.947 us/gap`), and `_select_topk_boxes_kernel` (`1.660 us/gap`). +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.29` immediately after the profile. +- Learning: The run is already graph-paced with only about `5.4 us` median idle between TensorRT graph replays. The consistent limiter is still the approximately `4.075 ms` TensorRT CUDA graph body plus unavoidable input/output ownership copies and a narrow fused-postprocess tail; CPU work is not blocking the next model forward in the steady state. + +### Diagnostic: Official RFDETR Package Metadata Refresh + +- Hypothesis: Since the accepted workflow is within about `1.2%` of the packaged TensorRT graph-only ceiling, a newer official T4 TensorRT package would be the cleanest way to improve the graph body without risking prediction drift from local rebuilds. +- Diagnostic: Queried the current Roboflow weights-provider metadata through the local `inference_models` provider path for `rfdetr-seg-nano`. This was a metadata-only check; the workflow benchmark configuration remains fixed at pipeline depth `2`, and depth `3` was not tested. +- Result: The provider still resolves `rfdetr-seg-nano` to `coco-dataset-vdnr1/41` and exposes the same six public packages: L4 TRT FP32 `3e3ddd85586b43e4fac6d319fb2927fd`, ONNX FP32 `5362b72bfb9f01d2e0b8cba2048d932c`, L4 TRT FP16 `89d1f41e2af4f4f3ffcdfb77e774d26a`, Torch FP32 `8b8da2fe824240522a39f3cde41aafae`, T4 TRT FP32 `bbc2cc23adf6f5e71a9241956081da96`, and T4 TRT FP16 `c70f32369a54d61e06ef4e6b56c82524`. 
+- Sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.01` after the metadata probe. +- Learning: There is no newer official T4 TensorRT artifact available from the provider. The accepted T4 FP16 package remains the only official engine that has passed the all-frame class/box/mask correctness gate; further graph-body gains still need a behavior-equivalent export source or a new official package. + +### Rejected: TensorRT Input-Consumed Event + +- Hypothesis: `IExecutionContext.set_input_consumed_event(...)` could let TensorRT expose input-buffer consumption earlier or alter graph scheduling enough to reduce the accepted graph-only replay duration. +- Diagnostic: Temporary graph-only harness only; created a CUDA event, attached it to the TensorRT execution context before warmup and CUDA graph capture, then compared `execute_async_v3(...)` graph replay timing against two baseline graph-only captures. Pipeline depth was not varied and depth `3` was not tested. +- Result: First baseline graph replay measured `4.068339 ms` mean over five 1000-replay batches. With the input-consumed event attached, graph replay measured `4.110596 ms` mean. A later same-session baseline measured `4.137564 ms` mean, showing session drift but no event-driven improvement. A normal accepted depth-2 sanity run afterward measured `frames=538 elapsed=2.22s fps=242.75`. +- Learning: The input-consumed event does not reduce the static TensorRT graph body and may add event work or capture complexity. Keep the accepted graph context without an input-consumed event. + +### Rejected: TensorRT Device-Memory Size Refresh On Static Context + +- Hypothesis: Calling `IExecutionContext.update_device_memory_size_for_shapes()` after static input shape and tensor address binding might force TensorRT to finalize activation-memory planning before CUDA graph capture and reduce graph replay jitter, even without switching to external device memory. 
+- Diagnostic: Temporary graph-only harness only; compared accepted `execute_async_v3(...)` CUDA graph replay with and without calling `update_device_memory_size_for_shapes()` before warmup/capture. Pipeline depth was not varied and depth `3` was not tested. +- Result: Baseline graph-only replay measured `4.043234 ms` mean over five 1000-replay batches. Calling `update_device_memory_size_for_shapes()` on the accepted static-allocation context emitted a TensorRT API usage error requiring `ExecutionContextAllocationStrategy.USER_MANAGED`, returned `0`, and measured `4.077969 ms` mean. A later baseline measured `4.093944 ms` mean. A normal accepted depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.53`. +- Learning: `update_device_memory_size_for_shapes()` is not a valid optimization knob for the accepted TensorRT context allocation mode. The USER_MANAGED/external-memory path was already rejected, so keep the default static context and do not call the refresh method. + +### Profile: Two-Warp Mask Resize Nsight Compute Snapshot + +- Hypothesis: The accepted two-warp mask resize launch improved end-to-end throughput in A/B tests, but the lower-level launch shape had not been re-profiled after the warp-count change. An Nsight Compute snapshot could show whether another safe Triton launch-shape change is still justified. +- Profile: `/tmp/rfdetr_resize_twowarp_ncu_20260523_213404.ncu-rep`, details text `/tmp/rfdetr_resize_twowarp_ncu_20260523_213404_details.txt`, raw CSV `/tmp/rfdetr_resize_twowarp_ncu_20260523_213404_raw.csv`. Capture used `--set basic`, `--kernel-name 'regex:_resize_selected_masks_kernel'`, `--launch-skip 100`, `--launch-count 3`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.02s fps=38.38`, profiling overhead only and not comparable to normal benchmark FPS. 
+- Findings: The sampled launches use `(7, 215, 1)x(64, 1, 1)`, grid size `1505`, block size `64`, `46` registers/thread, no shared memory, `2.35` waves/SM, theoretical occupancy `100%`, achieved occupancy about `77.96-78.75%`, achieved active warps/SM about `24.95-25.20`, compute throughput about `51.35-51.54%`, memory throughput about `39.92-40.06%`, DRAM throughput about `3.94-4.06%`, and duration about `12.83-12.96 us` under NCU replay. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.07` after the profile. +- Learning: The two-warp launch does reduce the block size to `64` threads while maintaining high occupancy. NCU's main warning is the partial-wave tail from the fixed `1505`-block grid, but prior block-size and detection-limit variants already lost end to end. Keep the accepted `block_size=256`, `num_warps=2` resize geometry unless a future change can remove the whole launch or fuse it with another required operation. + +### Rejected: Countless Limited Mask Resize + +- Hypothesis: In the workflow fast path, CPU conversion slices dense masks by the copied `valid_count`, so mask rows beyond that count are ignored. A limited resize kernel that always computes the fixed `7` rows without loading `count` or predicating on `det_index < count` could simplify Triton codegen enough to offset doing extra ignored rows. +- Change tested: Temporary code only; added `_resize_limited_masks_without_count_kernel(...)` and used it when `detection_limit < MAX_RFDETR_DETECTIONS`. The original count-aware resize remained for full-capacity recovery. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed. For the benchmark path, rows beyond the copied GPU count are not exposed because `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` copies the count and slices all returned NumPy arrays by `valid_count`; `query_indices` is zero-filled before selector writes. 
+- Result on requested command: countless limited resize measured `frames=538 elapsed=2.20s fps=244.24` and `frames=538 elapsed=2.21s fps=243.83`. After reverting to the accepted count-aware resize, the same-session run measured `frames=538 elapsed=2.20s fps=244.13`. +- Learning: Removing the count predicate does not produce a stable improvement and can waste work on ignored rows. Keep the simpler count-aware limited resize kernel. + +### Profile: Current Depth-2 Graph-Node Nsight Systems Refresh + +- Request: Capture a new Nsight Systems report that shows the current accepted CUDA graph pacing after the fused GPU postprocess and two-warp mask resize work. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_graphnode_current_20260523_214214.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphnode_current_20260523_214214.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphnode_current_20260523_214214_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphnode_current_20260523_214214_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphnode_current_20260523_214214_stats_cuda_api_sum.csv`. +- Result under profiler: `frames=538 elapsed=2.33s fps=230.47`. +- Graph-node structure: The trace contains `602` TensorRT graph replays with `242` graph nodes per replay: `239` kernels and `3` graph memsets. After skipping `64` capture warmup replays plus `100` settling launches, each replay used `6` CUDA streams. +- Graph-node timing after the same skip: envelope duration was p50 `4125.677 us`, p90 `4185.984 us`, p95 `4197.207 us`, mean `4133.287 us`; union GPU-busy time inside the graph envelope was p50 `3910.607 us`, p90 `3969.418 us`, p95 `3976.717 us`, mean `3917.447 us`; internal no-activity idle was p50 `215.418 us`, p90 `219.535 us`, p95 `223.075 us`, mean `215.840 us`. The mean graph busy fraction was `94.78%`, with mean overlap factor `1.0406`. 
+- Graph-to-graph spacing after the same skip: start-to-start interval was p50 `4168.413 us`, p90 `4228.188 us`, p95 `4239.625 us`, mean `4175.957 us`; graph end-to-next-start gap was p50 `42.783 us`, p90 `44.140 us`, p95 `44.422 us`, mean `42.937 us`. +- Gap decomposition: non-graph GPU work inside the graph-to-graph gap was p50 `38.175 us`, p90 `39.712 us`, p95 `40.389 us`, mean `38.253 us`; true idle was p50 `4.576 us`, p90 `5.280 us`, p95 `5.472 us`, mean `4.684 us`. +- Gap occupants: the largest clipped occupants were TensorRT mask Device-to-Device clone (`2433600B`, `13.196 us/gap`), next-frame input Device-to-Device copy (`1168128B`, `13.178 us/gap`), four vectorized elementwise kernels in aggregate (`11.570 us/gap`), boxes Device-to-Device clone (`1600B`, `4.613 us/gap`), `_select_topk_boxes_kernel` (`2.206 us/gap`), and logits Device-to-Device clone (`36400B`, `2.101 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`808.869 us/replay`), 12 Myelin MHA nodes (`586.449 us/replay`), 12 fused FP16/FP32 GEMM nodes (`481.233 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.621 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.588 us/replay`). +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.35` immediately after the profile. +- Learning: The current accepted workflow is already bottlenecked by the TensorRT CUDA graph body, with the next graph starting about `43 us` after the previous graph-node envelope ends and only about `4.6 us` median true idle. The remaining gap is almost entirely real GPU copy/postprocess work needed to hand off graph outputs and prepare the next input; CPU work is not blocking the next forward pass in steady state. 
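The mean values reported in this profile can be cross-checked with simple arithmetic. The following is only a sketch: the variable names are hypothetical, the inputs are the means quoted above, and it assumes per-replay means can be combined additively, which holds only approximately for the start-to-start interval.

```python
# Mean values (microseconds) quoted in the depth-2 graph-node profile above.
envelope_us = 4133.287          # graph-node envelope duration
busy_us = 3917.447              # union GPU-busy time inside the envelope
gap_us = 42.937                 # graph end to next graph start
gap_gpu_work_us = 38.253        # non-graph GPU work inside the gap
gap_true_idle_us = 4.684        # no GPU activity inside the gap
start_to_start_us = 4175.957    # graph start-to-start interval

# Idle time inside the envelope, and the ratio of mean busy time to mean envelope.
internal_idle_us = envelope_us - busy_us
busy_ratio = busy_us / envelope_us

# The gap should decompose into real GPU work plus true idle.
gap_sum_us = gap_gpu_work_us + gap_true_idle_us

# Replay rate implied by the steady-state start-to-start interval under the profiler.
implied_rate = 1e6 / start_to_start_us

print(f"internal idle:  {internal_idle_us:.3f} us")
print(f"busy ratio:     {busy_ratio:.2%}")
print(f"gap work+idle:  {gap_sum_us:.3f} us (reported gap {gap_us:.3f} us)")
print(f"envelope + gap: {envelope_us + gap_us:.3f} us (reported {start_to_start_us:.3f} us)")
print(f"implied rate:   {implied_rate:.1f} replays/s")
```

The implied steady-state rate (about 239 replays/s) sits between the whole-run FPS measured under the profiler and the non-profiled sanity check, which is what one would expect if warmup and profiler overhead account for the difference.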
+ +### Rejected: Early Return In Limited Mask Resize + +- Hypothesis: The accepted limited mask resize launches the fixed seven detection rows and predicates memory operations with `det_index < count`, but still does coordinate arithmetic for invalid rows when a frame has fewer than seven detections. Adding an early device-side return for `det_index >= count` could skip that arithmetic without changing valid mask pixels. +- Change tested: Temporary code only; in `_resize_selected_masks_kernel(...)`, loaded `count` first, returned immediately when `det_index >= count`, simplified `valid` to only check pixel bounds, and loaded `query_indices[det_index]` unmasked. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: `py_compile` passed and Triton JIT compilation succeeded during the benchmark. For valid detection rows the interpolation and threshold math was unchanged; invalid rows are ignored by the fixed-count CPU conversion because it slices all returned arrays by the copied GPU `valid_count`. +- Result on requested command: early-return runs measured `frames=538 elapsed=2.21s fps=243.57` and `frames=538 elapsed=2.21s fps=243.13`. After reverting to the accepted predicated kernel, same-session accepted runs measured `frames=538 elapsed=2.22s fps=242.64` and `frames=538 elapsed=2.21s fps=243.39`, showing some session drift but no stable improvement from the branch. +- Learning: The extra runtime branch in the Triton resize kernel does not tighten end-to-end depth-2 throughput. The invalid-row arithmetic is either hidden or cheaper than the branch/codegen side effects, so keep the accepted predicated count-aware resize kernel. + +### Profile: TensorRT H1688 GEMM Graph-Node Nsight Compute Snapshot + +- Hypothesis: The current graph-node profile shows `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` as the next largest unprofiled TensorRT graph-body family after the MHA and top GEMM groups. 
Profiling it can show whether there is an obvious custom-kernel replacement opportunity or whether it shares the same packaged-engine small-grid limitation. +- Profile: `/tmp/rfdetr_trt_h1688gemm_graphnode_basic_ncu_20260523_215113.ncu-rep`, details text `/tmp/rfdetr_trt_h1688gemm_graphnode_basic_ncu_20260523_215113_details.txt`, raw CSV `/tmp/rfdetr_trt_h1688gemm_graphnode_basic_ncu_20260523_215113_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1`, skipped `900` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.23s fps=37.82`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use `(2, 48, 1)x(128, 1, 1)`, grid size `96`, block size `128`, `194` registers/thread, `32.768 KB` static shared memory per block, and `1.20` waves/SM. Theoretical occupancy is `25%`, achieved occupancy is about `22.34-22.52%`, and active warps/SM are about `7.15-7.21`. +- Throughput: Duration was about `77.66-78.30 us` under NCU replay. Compute throughput was about `45.19-46.04%`, memory throughput about `31.26-31.85%`, DRAM throughput about `29.12-29.70%`, L1/TEX throughput about `62.54-63.62%`, and L2 throughput about `22.72-22.92%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.07` after the profile in the same drifted session. +- Learning: This TensorRT graph-body family is also occupancy-limited by registers and shared memory, with a small fixed grid and only about `1.2` waves/SM. It reinforces that the remaining large costs are inside the serialized TensorRT plan/tactics rather than the custom postprocess kernels; replacing a single small postprocess kernel cannot recover this repeated graph-body time.
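The occupancy and waves/SM figures in these NCU findings can be approximated from the launch shape alone. The sketch below is a simplified estimate, not NCU's exact calculation: the function names are hypothetical, the per-SM limits are the assumed Turing (compute capability 7.5) values for a T4 with 40 SMs, register and shared-memory allocation granularity is ignored, and the reported shared-memory figure is read as 32768 bytes.

```python
# Assumed per-SM limits for a Tesla T4 (Turing, compute capability 7.5).
NUM_SMS = 40
MAX_WARPS_PER_SM = 32
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536
SHARED_BYTES_PER_SM = 64 * 1024
WARP_SIZE = 32


def blocks_per_sm(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Simplified resident-block estimate; ignores allocation granularity."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    limits = [
        MAX_BLOCKS_PER_SM,
        MAX_WARPS_PER_SM // warps_per_block,
        REGISTERS_PER_SM // (regs_per_thread * threads_per_block),
    ]
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_SM // shared_bytes_per_block)
    return min(limits)


def launch_shape(grid_blocks, threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Return (theoretical occupancy, waves per SM) for one launch."""
    resident = blocks_per_sm(threads_per_block, regs_per_thread, shared_bytes_per_block)
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    occupancy = resident * warps_per_block / MAX_WARPS_PER_SM
    waves = grid_blocks / (NUM_SMS * resident)
    return occupancy, waves


# H1688 GEMM launch from the findings above: grid 96, block 128,
# 194 registers/thread, 32768 bytes of static shared memory per block.
occupancy, waves = launch_shape(96, 128, 194, 32768)
print(f"theoretical occupancy {occupancy:.0%}, waves/SM {waves:.2f}")
```

Under these assumptions both the register limit and the shared-memory limit allow two resident blocks per SM, which reproduces the reported `25%` theoretical occupancy and `1.20` waves/SM for this kernel family.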
+ +### Profile: TensorRT Myelin GELU/Transpose Nsight Compute Snapshot + +- Hypothesis: The current graph-node profile shows the Myelin fused `__myl_DivCastErfCastAddMulMulReshTranAdd_*` family at about `150 us/replay` aggregated across four graph nodes. Profiling it can distinguish a memory-bound TensorRT elementwise/transpose cost from the occupancy-limited GEMM/MHA tactic costs already observed. +- Profile: `/tmp/rfdetr_trt_myelin_erf_graphnode_basic_ncu_20260523_215453.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_erf_graphnode_basic_ncu_20260523_215453_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_erf_graphnode_basic_ncu_20260523_215453_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_DivCastErfCastAddMulMulReshTranAdd_.*`, skipped `1200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.43s fps=37.29`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `761`, block size `256`, `33` registers/thread, no static or dynamic shared memory, and `4.76` waves/SM. Achieved occupancy is about `88.49-89.82%`, with active warps/SM about `28.32-28.74`. +- Throughput: Duration was about `41.02-41.41 us` under NCU replay. Compute throughput was about `61.02-61.61%`, memory throughput and DRAM throughput were about `79.94-80.49%`, L1/TEX throughput about `26.53-26.77%`, and L2 throughput about `25.92-26.14%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.63` after the profile. +- Learning: Unlike the GEMM/MHA/H1688 families, this Myelin fused elementwise/transpose kernel is already high-occupancy and DRAM-bound. A custom replacement would need to reduce or eliminate global memory traffic at the model graph/export level; standalone kernel tuning is unlikely to beat TensorRT here. 
+ +### Rejected: Per-Result Pinned Host Views Without NumPy Copies + +- Hypothesis: The accepted fixed-count RFDETR conversion path copies GPU tensors into reusable pinned CPU buffers, synchronizes once, then copies each NumPy view again with `.copy()` so queued predictions are not backed by reusable storage. Allocating per-result pinned host tensors and returning NumPy views backed by those tensors could remove the extra host copies while preserving safe result ownership. +- Change tested: Temporary code only; in `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)`, replaced the thread-local reusable conversion buffers with fresh pinned tensors for `xyxy`, confidence, class IDs, masks, and count, then returned `tensor.numpy()` views without `.copy()`. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness/safety: `py_compile` passed. PyTorch-backed NumPy views keep their base CPU tensor alive after the local tensor variable is released, so this avoids the unsafe case of returning views into reusable thread-local buffers that can be overwritten while predictions wait in the dispatch queue. +- Result on requested command: per-result pinned view runs measured `frames=538 elapsed=2.20s fps=244.13` and `frames=538 elapsed=2.21s fps=243.24`. After reverting to the accepted reusable pinned buffer plus NumPy-copy path, the same-session run measured `frames=538 elapsed=2.21s fps=243.23`. +- Learning: Removing the final host copies does not compensate for per-frame pinned allocation cost and possible allocator/cache effects. Keep the accepted reusable pinned staging buffers with owned NumPy copies; they preserve queued-result safety without adding host allocation churn. + +### Profile: TensorRT Myelin LayerNorm Nsight Compute Snapshot + +- Hypothesis: The current graph-node profile shows the Myelin fused `__myl_MulAddCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_*` family at about `148 us/replay` aggregated across about twenty graph nodes. 
Profiling it can show whether this repeated layernorm-style work is memory-bound, partial-wave limited, or a plausible custom-kernel target. +- Profile: `/tmp/rfdetr_trt_myelin_layernorm_graphnode_basic_ncu_20260523_220225.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_layernorm_graphnode_basic_ncu_20260523_220225_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_layernorm_graphnode_basic_ncu_20260523_220225_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_MulAddCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_.*`, skipped `1200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.18s fps=103.93`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `170`, block size `128`, `39` registers/thread, `128 B` dynamic shared memory, and only `0.53` waves/SM. Achieved occupancy is about `53.20-55.56%`, with active warps/SM about `17.03-17.78`. +- Throughput: Duration was about `12.10-12.42 us` under NCU replay. Compute throughput was about `21.59-22.03%`, memory/DRAM throughput about `39.74-40.26%`, L1/TEX throughput about `32.52-33.11%`, and L2 throughput about `20.40-20.91%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.13` after the profile. +- Learning: This repeated Myelin layernorm-style family is partial-wave limited and modestly memory-bound, but each individual launch is only about `12 us` under NCU and the cost is spread across many nodes. A custom replacement would need to remove or fuse many repeated model-graph nodes; replacing one kernel family outside a rebuilt TensorRT graph is not practical in the accepted serialized engine. 
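To put a per-family cost such as roughly `148 us/replay` in context, an upper bound on the achievable gain can be sketched as follows. This is an optimistic estimate under stated assumptions: it treats the family's aggregated time as fully on the critical path (no overlap with other graph nodes) and assumes removing it shortens the graph start-to-start interval one-for-one. The interval is the mean start-to-start value from the depth-2 Nsight Systems refresh earlier in this log; the function and dictionary names are hypothetical.

```python
def max_speedup(interval_us, removed_us):
    """Upper bound on throughput gain if `removed_us` vanished from each interval."""
    if not 0 <= removed_us < interval_us:
        raise ValueError("removed_us must be in [0, interval_us)")
    return interval_us / (interval_us - removed_us)


# Mean graph start-to-start interval from the depth-2 Nsight Systems refresh.
START_TO_START_US = 4175.957

# Approximate aggregated per-replay costs quoted in this log (microseconds).
families = {
    "Myelin layernorm-style": 148.0,
    "Myelin GELU/transpose": 150.0,
}

for name, cost_us in families.items():
    share = cost_us / START_TO_START_US
    bound = max_speedup(START_TO_START_US, cost_us)
    print(f"{name}: {share:.1%} of interval, at most {bound - 1:.1%} faster")
```

Even in this best case, eliminating either family entirely would yield less than a `4%` throughput gain, which is consistent with the conclusion that a single-family replacement is not a practical target.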
+ +### Profile: TensorRT 128x64 FP16/FP32 GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node profile shows `sm75_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize128x64x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` as the largest still-unprofiled TensorRT graph-body family, at about `225 us/replay` aggregated across about twelve graph nodes. Profiling it can confirm whether this smaller GEMM family shares the low-occupancy small-grid limitation seen in the other TensorRT GEMM/MHA families. +- Profile: `/tmp/rfdetr_trt_gemm128x64_graphnode_basic_ncu_20260523_220525.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x64_graphnode_basic_ncu_20260523_220525_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm128x64_graphnode_basic_ncu_20260523_220525_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `1200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.26s fps=37.73`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `36`, block size `128`, `186` registers/thread, `24.576 KB` dynamic shared memory per block, and only `0.45` waves/SM. Achieved occupancy is about `13.72-13.96%`, with active warps/SM about `4.39-4.47`. +- Throughput: Duration was about `32.06-32.54 us` under NCU replay. Compute throughput was about `31.56-31.63%`, memory throughput about `27.57-27.63%`, DRAM throughput about `12.92-13.17%`, L1/TEX throughput about `55.14-55.26%`, and L2 throughput about `19.41-19.73%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.07` after the profile. +- Learning: This smaller GEMM family is also severely small-grid/occupancy limited, even more than the larger profiled GEMMs.
The repeated transformer body remains constrained by TensorRT tactic/export structure on T4; any large gain would need a correctness-equivalent engine/export that changes many GEMM/MHA tactics together, not a local postprocess or single-kernel patch. + +### Profile: TensorRT 32x32 TN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node profile shows `sm75_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize32x32x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` at about `135 us/replay` aggregated across about sixteen graph nodes. Profiling it can show whether these very small GEMMs are launch-shape limited enough to explain part of the graph-body idle and low occupancy. +- Profile: `/tmp/rfdetr_trt_gemm32x32_tn_graphnode_basic_ncu_20260523_220826.ncu-rep`, details text `/tmp/rfdetr_trt_gemm32x32_tn_graphnode_basic_ncu_20260523_220826_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm32x32_tn_graphnode_basic_ncu_20260523_220826_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `1200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.05s fps=38.29`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `32`, block size `128`, `63` registers/thread, `8.192 KB` dynamic shared memory per block, and only `0.10` waves/SM. Achieved occupancy is about `13.18-13.53%`, with active warps/SM about `4.22-4.33`. +- Throughput: Duration ranged from `12.74 us` to `17.12 us` under NCU replay. Compute throughput was about `10.64-14.31%`, memory throughput about `14.14-19.07%`, DRAM throughput about `7.07-11.38%`, L1/TEX throughput about `28.29-38.14%`, and L2 throughput about `9.24-13.13%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.71` after the profile.
+- Learning: This very small GEMM family is dominated by partial-wave occupancy, with only `32` blocks for `40` T4 SMs. The evidence continues to point at TensorRT export/tactic structure: the graph body contains many tiny transformer GEMM nodes that cannot fill the GPU independently. Improving them would require a different correct engine or broader graph fusion, not a local wrapper or postprocess patch. + +### Profile: TensorRT Depthwise HMMA Convolution Nsight Compute Snapshot + +- Hypothesis: The current graph-node profile shows `sm50_xmma_convolution_depthwiseHMMA_FP16NHWCx8_TR3_TS3_STRIDEH1_STRIDEW1_execute_kernel_trt` at about `132 us/replay` aggregated across four graph nodes. Profiling it can check whether the convolution portion has the same small-grid issue as the GEMM/MHA nodes or a different memory/occupancy profile. +- Profile: `/tmp/rfdetr_trt_depthwise_conv_graphnode_basic_ncu_20260523_221124.ncu-rep`, details text `/tmp/rfdetr_trt_depthwise_conv_graphnode_basic_ncu_20260523_221124_details.txt`, raw CSV `/tmp/rfdetr_trt_depthwise_conv_graphnode_basic_ncu_20260523_221124_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.15s fps=38.02`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `1521`, block size `256`, `61` registers/thread, no shared memory, and `9.51` waves/SM. Achieved occupancy is about `85.14-85.65%`, with active warps/SM about `27.25-27.41`. +- Throughput: Duration was about `65.15-65.89 us` under NCU replay. Compute and memory throughput were both about `49.22-49.37%`, DRAM throughput about `38.37-38.56%`, L1/TEX throughput about `85.49-85.77%`, and L2 throughput about `25.47-25.71%`. 
+- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.05` after the profile. +- Learning: Unlike the repeated transformer GEMM/MHA kernels, this depthwise convolution fills the T4 well and is mainly L1/TEX and memory-pipeline pressured. It is not the dominant architectural bottleneck; the larger opportunity remains the many small-grid transformer GEMM/MHA nodes in the serialized TensorRT graph. + +### Profile: TensorRT Fprop Implicit GEMM Nsight Compute Snapshot + +- Hypothesis: After the depthwise convolution profile, the current graph-node trace shows `sm75_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage1_warpsize2x2x1_g1_tensor16x8x8_t1r3s3_execute_kernel_trt` as the next largest unprofiled TensorRT graph-body family, at about `139 us/replay` aggregated across six graph nodes. Profiling it can show whether this convolution-lowered GEMM is another small-grid TensorRT tactic or a memory-bound kernel with local tuning potential. +- Profile: `/tmp/rfdetr_trt_fprop_implicit_gemm_graphnode_basic_ncu_20260523_221615.ncu-rep`, details text `/tmp/rfdetr_trt_fprop_implicit_gemm_graphnode_basic_ncu_20260523_221615_details.txt`, raw CSV `/tmp/rfdetr_trt_fprop_implicit_gemm_graphnode_basic_ncu_20260523_221615_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `900` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.26s fps=37.72`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `22`, block size `128`, `120` registers/thread, `16.384 KB` dynamic shared memory per block, and only `0.14` waves/SM. Theoretical occupancy is `50%`, achieved occupancy is about `12.89-12.99%`, and active warps/SM are about `4.13-4.16`.
+- Throughput: Duration was about `42.27-42.85 us` under NCU replay. Compute throughput was about `21.48-21.73%`, memory throughput about `21.88-22.14%`, DRAM throughput about `5.41-5.65%`, L1/TEX throughput about `46.31-46.50%`, and L2 throughput about `11.46-11.61%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.37` after the profile. +- Learning: This fprop implicit GEMM is another TensorRT-internal small-grid tactic: only `22` CTAs for `40` T4 SMs, low achieved occupancy, and low DRAM pressure. It reinforces that the remaining graph-body loss comes from many serialized small TensorRT nodes; local postprocess or wrapper changes cannot fill these bubbles without a correctness-equivalent engine/export that changes the internal tactic structure. + +### Profile: TensorRT 64x64 TN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x64x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` at about `105 us/replay` aggregated across seven graph nodes. Profiling it can show whether this mid-sized GEMM is more compute-bound than the 32x32 TN and 128x64 GEMM families, or whether it also suffers from partial-wave occupancy. +- Profile: `/tmp/rfdetr_trt_gemm64x64_tn_graphnode_basic_ncu_20260523_221853.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x64_tn_graphnode_basic_ncu_20260523_221853_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm64x64_tn_graphnode_basic_ncu_20260523_221853_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `1200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.44s fps=37.27`, profiling overhead only and not comparable to normal benchmark FPS. 
+- Findings: The sampled launches use grid size `32`, block size `128`, `110` registers/thread, `16.384 KB` dynamic shared memory per block, and only `0.20` waves/SM. Theoretical occupancy is `50%`, achieved occupancy is about `13.38-13.43%`, and active warps/SM are about `4.28-4.30`. +- Throughput: Duration was about `26.66-27.17 us` under NCU replay. Compute throughput was about `22.74-23.29%`, memory and DRAM throughput about `27.03-27.19%`, L1/TEX throughput about `47.04-48.20%`, and L2 throughput about `17.02-17.36%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.02` after the profile. +- Learning: The 64x64 TN GEMM still launches fewer CTAs than the T4 has SMs and reaches only about `13.4%` achieved occupancy. This is consistent with the other TensorRT transformer GEMM profiles: the remaining model-forward bubbles are caused by many small serialized TensorRT tactics, not by CPU pacing or the fused GPU postprocess. + +### Rejected: TensorRT Persistent Cache Limit On T4 + +- Hypothesis: `IExecutionContext.persistent_cache_limit` might let TensorRT use persisting L2 activation cache for the repeated RFDETR graph replay and reduce the CUDA graph body without changing model outputs or pipeline scheduling. +- Diagnostic: Temporary graph-only harness only; captured a fresh TensorRT `execute_async_v3(...)` CUDA graph for each requested cache limit and timed five batches of `1000` graph replays. Tested limits were `0`, `262144`, `1048576`, `4194304`, `8388608`, and `16777216` bytes. Pipeline depth was not varied and depth `3` was not tested. +- Result: The T4 reports `cudaDeviceProp.persistingL2CacheMaxSize` as `0` bytes, so every nonzero limit emitted a TensorRT API usage error and the context's actual `persistent_cache_limit` remained `0`.
The valid zero-limit baseline measured `4.089558 ms` graph-only mean (`244.53 fps`) in this session; nonzero attempted settings measured slower while still reporting `actual=0`, consistent with session drift and error overhead rather than a usable cache effect. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.22s fps=242.29` after the diagnostic in the same drifted session. +- Learning: Persistent L2 caching is not available on this T4 path, so `persistent_cache_limit` cannot reduce the RFDETR TensorRT graph body. Keep the execution context cache limit at the TensorRT default and do not add this knob to the runtime helper. + +### Profile: TensorRT 128x128 Split-K GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_execute_split_k_kernel_trt` at about `100 us/replay` aggregated across thirteen graph nodes. Profiling it can show whether TensorRT's split-k variant improves occupancy enough to be a useful clue for rebuilding/tuning the larger GEMM families. +- Profile: `/tmp/rfdetr_trt_gemm128x128_splitk_graphnode_basic_ncu_20260523_222610.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x128_splitk_graphnode_basic_ncu_20260523_222610_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm128x128_splitk_graphnode_basic_ncu_20260523_222610_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `1200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.14s fps=38.06`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `72`, block size `128`, `66` registers/thread, `8.704 KB` dynamic shared memory per block, and only `0.26` waves/SM.
Achieved occupancy is about `24.77-25.01%`, with active warps/SM about `7.93-8.00`. +- Throughput: Duration was about `18.59-19.39 us` under NCU replay. Compute throughput was about `10.45-10.63%`, memory and DRAM throughput about `50.90-53.20%`, L1/TEX throughput about `58.70-61.30%`, and L2 throughput about `17.35-18.06%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.62` after the profile. +- Learning: TensorRT's split-k variant improves occupancy relative to the tiniest GEMMs but is still partial-wave limited and more memory-pressure-bound than compute-bound. It supports the same conclusion: the remaining forward graph is constrained by many serialized small TensorRT tactics, and useful gains require a different correct TensorRT export/tactic mix rather than local postprocess tuning. + +### Diagnostic: TensorRT Engine Inspector Metadata + +- Hypothesis: TensorRT's engine inspector might expose layer-to-tactic metadata for the packaged RFDETR engine, giving a path to target specific tactic choices or replace a small set of TensorRT graph-body nodes with custom kernels. +- Diagnostic: Loaded the accepted `rfdetr-seg-nano` TRT engine through the local `inference_models` path and exported engine inspector output to `/tmp/rfdetr_trt_engine_inspector_json_20260523_2228.txt` and `/tmp/rfdetr_trt_engine_inspector_oneline_20260523_2228.txt`. This was metadata-only; pipeline depth was not varied and depth `3` was not tested. +- Result: The engine reports TensorRT `10.12.0.36`, `261` layers, `4` I/O tensors, and `ProfilingVerbosity.LAYER_NAMES_ONLY`. I/O tensors are `input` FLOAT `(1, 3, 312, 312)`, `dets` FLOAT `(1, 100, 4)`, `labels` FLOAT `(1, 100, 91)`, and mask output `4186` FLOAT `(1, 100, 78, 78)`, all LINEAR format. 
+- Inspector output: The inspector exports layer names such as backbone attention MatMul/MHA and Myelin fused layers, but it does not include tactic IDs, tactic parameters, or enough per-layer implementation detail to drive tactic-level patching from the packaged plan alone. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.93` after the diagnostic. +- Learning: The packaged engine confirms the structural source of the observed graph-body kernels, but it was not built with detailed profiling verbosity. Further TensorRT-forward gains likely require a correctness-equivalent ONNX/export plus a fresh verbose engine build/tactic search, not a local runtime tweak to the existing plan. + +### Profile: TensorRT Myelin Transpose/LayerNorm Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_TranCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_*` at about `85 us/replay` aggregated across four graph nodes. Profiling it can distinguish high-occupancy memory-bound Myelin work from the repeated small-grid GEMM/MHA tactics. +- Profile: `/tmp/rfdetr_trt_myelin_trancast_layernorm_graphnode_basic_ncu_20260523_223005.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_trancast_layernorm_graphnode_basic_ncu_20260523_223005_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_trancast_layernorm_graphnode_basic_ncu_20260523_223005_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_TranCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_.*`, skipped `600` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.34s fps=37.51`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `1521`, block size `(32, 4, 1)`, `26` registers/thread, `128 B` dynamic shared memory per block, and `4.75` waves/SM. 
Achieved occupancy is about `91.29-92.44%`, with active warps/SM about `29.21-29.58`. +- Throughput: Duration was about `27.10-27.20 us` under NCU replay. Compute throughput was about `44.63-44.93%`, memory and DRAM throughput about `73.35-73.72%`, and L2 throughput about `26.52-26.60%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.22s fps=242.48` after the profile. +- Learning: This Myelin family fills the GPU well and is memory-bandwidth pressured, unlike the small TensorRT GEMM/MHA families. Replacing it outside the TensorRT graph is unlikely to help unless a rebuilt export can remove or fuse the memory traffic with adjacent graph operations. + +### Profile: TensorRT 128x64x32 GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x64x32_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` at about `64 us/replay` aggregated across four graph nodes. Profiling it can confirm whether the remaining medium-sized GEMM families are still partial-wave limited. +- Profile: `/tmp/rfdetr_trt_gemm128x64x32_graphnode_basic_ncu_20260523_223256.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x64x32_graphnode_basic_ncu_20260523_223256_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm128x64x32_graphnode_basic_ncu_20260523_223256_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `600` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.26s fps=37.74`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `32`, block size `128`, `110` registers/thread, `12.288 KB` dynamic shared memory per block, and only `0.20` waves/SM. Achieved occupancy is about `13.78-13.94%`, with active warps/SM about `4.41-4.46`.
+- Throughput: Duration was about `25.50-25.98 us` under NCU replay. Compute throughput was about `24.90-25.65%`, memory throughput about `22.73-23.42%`, DRAM throughput about `20.78-20.91%`, and L2 throughput about `13.86-14.12%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.60` after the profile. +- Learning: This GEMM is another underfilled TensorRT tactic, with only `32` CTAs for `40` T4 SMs and about `13.8%` achieved occupancy. It further narrows the remaining opportunity to TensorRT tactic/export structure rather than postprocess, CPU scheduling, or pipeline depth. + +### Profile: TensorRT Indexed Fprop Implicit GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x64_stage1_warpsize4x1x1_g1_tensor16x8x8_alignc8_execute_kernel_trt` as a single TensorRT graph node around `48 us/replay`. Profiling it can show whether this indexed convolution tactic is occupancy-limited like the other TensorRT GEMMs or limited by indexed memory access. +- Profile: `/tmp/rfdetr_trt_indexed_fprop_graphnode_basic_ncu_20260523_223528.ncu-rep`, details text `/tmp/rfdetr_trt_indexed_fprop_graphnode_basic_ncu_20260523_223528_details.txt`, raw CSV `/tmp/rfdetr_trt_indexed_fprop_graphnode_basic_ncu_20260523_223528_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.12s fps=38.11`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `72`, block size `128`, `126` registers/thread, `4.096 KB` dynamic shared memory per block, and `0.45` waves/SM.
Achieved occupancy is about `22.60-22.81%`, with active warps/SM about `7.23-7.30`. +- Throughput: Duration was about `83.30-83.49 us` under NCU replay. Compute throughput was about `36.28-36.48%`, memory throughput about `40.33-40.58%`, DRAM throughput about `16.66-16.72%`, and L2 throughput about `33.66-33.73%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.00` after the profile. +- Learning: The indexed fprop node has more CTAs than the smallest GEMMs but is still low-occupancy and moderately memory/L2 pressured. It remains a TensorRT-forward tactic issue; the surrounding pipeline is already keeping the CPU and postprocess off the critical path. + +### Profile: TensorRT 128x128 FP16/FP32 NT GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_nt_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_aligna4_alignc4_execute_kernel_trt` as a single TensorRT graph node around `45 us/replay`. Profiling it can show whether this FP32-accumulating NT GEMM is limited by compute throughput, register pressure, or small-grid occupancy. +- Profile: `/tmp/rfdetr_trt_gemm128x128_nt_f16f32_graphnode_basic_ncu_20260523_223846.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x128_nt_f16f32_graphnode_basic_ncu_20260523_223846_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm128x128_nt_f16f32_graphnode_basic_ncu_20260523_223846_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.25s fps=37.75`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `48`, block size `128`, `226` registers/thread, `17.408 KB` dynamic shared memory per block, and `0.60` waves/SM.
Achieved occupancy is about `16.31-16.64%`, with active warps/SM about `5.22-5.32`. +- Throughput: Duration was about `61.76-62.75 us` under NCU replay. Compute throughput was about `27.42-27.58%`, memory throughput about `30.14-30.30%`, DRAM throughput about `11.88-12.18%`, and L2 throughput about `14.61-14.85%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.57` after the profile. +- Learning: This FP32-accumulating NT GEMM is register-heavy and still small-grid limited, with only `48` CTAs and about `16.5%` achieved occupancy. The remaining isolated TensorRT GEMMs continue to reinforce that local wrapper/postprocess work is no longer the limiting surface. + +### Profile: TensorRT 64x32 TN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x32x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` at about `39 us/replay` aggregated across four graph nodes. Profiling it can show whether the small 64x32 GEMM tail has the same underfilled launch geometry as the larger TN GEMMs. +- Profile: `/tmp/rfdetr_trt_gemm64x32_tn_graphnode_basic_ncu_20260523_224056.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x32_tn_graphnode_basic_ncu_20260523_224056_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm64x32_tn_graphnode_basic_ncu_20260523_224056_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `400` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.09s fps=38.19`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `32`, block size `128`, `83` registers/thread, `12.288 KB` dynamic shared memory per block, and only `0.16` waves/SM.
Achieved occupancy is about `13.59-13.73%`, with active warps/SM about `4.35-4.40`. +- Throughput: Duration was about `15.94-16.19 us` under NCU replay. Compute throughput was about `12.82-12.89%`, memory throughput about `16.95-17.05%`, DRAM throughput about `9.96-10.34%`, and L2 throughput about `10.53-10.71%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.22` after the profile. +- Learning: This tail GEMM is again partial-wave limited, with only `32` CTAs and about `13.7%` achieved occupancy. Even the remaining smaller TensorRT graph-body kernels follow the same pattern: many individually short but serialized, underfilled GEMMs. + +### Profile: TensorRT 128x128x64 NN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x128x64_stage1_warpsize2x2x1_tensor16x8x8_aligna4_alignc4_execute_kernel_trt` as a single TensorRT graph node around `39 us/replay`. Profiling it can show whether increasing the K tile changes the occupancy pattern seen in the other 128x128 GEMMs. +- Profile: `/tmp/rfdetr_trt_gemm128x128x64_nn_graphnode_basic_ncu_20260523_224352.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x128x64_nn_graphnode_basic_ncu_20260523_224352_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm128x128x64_nn_graphnode_basic_ncu_20260523_224352_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.15s fps=38.02`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `48`, block size `128`, `228` registers/thread, `32.768 KB` dynamic shared memory per block, and `0.60` waves/SM.
Achieved occupancy is about `16.09-16.30%`, with active warps/SM about `5.15-5.22`. +- Throughput: Duration was about `69.79-70.59 us` under NCU replay. Compute throughput was about `25.86-25.99%`, memory and DRAM throughput about `24.90-25.05%`, and L2 throughput about `11.36-11.51%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.02` after the profile. +- Learning: The larger K tile does not escape the same TensorRT tactic constraint: high register/shared-memory use, a `48`-CTA grid, and about `16%` achieved occupancy. This further supports focusing any future large gain attempt on a new engine/export/tactic search rather than local runtime scheduling. + +### Profile: TensorRT 64x32 TT GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_tt_n_tilesize64x32x64_stage1_warpsize2x2x1_tensor16x8x8_aligna4_alignc4_execute_kernel_trt` at about `38 us/replay` aggregated across four graph nodes. Profiling it can show whether the TT tail is another partial-wave GEMM or has a different memory pattern. +- Profile: `/tmp/rfdetr_trt_gemm64x32_tt_graphnode_basic_ncu_20260523_224618.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x32_tt_graphnode_basic_ncu_20260523_224618_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm64x32_tt_graphnode_basic_ncu_20260523_224618_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `400` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.03s fps=38.35`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `16`, block size `128`, `80` registers/thread, `12.288 KB` dynamic shared memory per block, and only `0.08` waves/SM.
Achieved occupancy is about `13.46-13.60%`, with active warps/SM about `4.31-4.35`. +- Throughput: Duration was about `16.96-17.09 us` under NCU replay. Compute throughput was about `9.61-9.64%`, memory throughput about `12.54-12.56%`, DRAM throughput about `5.60-5.82%`, and L2 throughput about `5.76-5.82%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.15` after the profile. +- Learning: This TT tail is one of the clearest small-grid cases: only `16` CTAs for `40` T4 SMs and `0.08` waves/SM. The remaining graph-body tail is dominated by many such short, underfilled TensorRT kernels. + +### Rejected: CUDA Device Max Connections Sweep + +- Hypothesis: The accepted TensorRT CUDA graph uses multiple internal streams per replay. Changing `CUDA_DEVICE_MAX_CONNECTIONS` before CUDA context creation might alter stream-to-hardware-queue mapping and reduce graph-body bubbles or graph-to-graph scheduling jitter. +- Diagnostic: Ran the accepted workflow command in fresh processes with `CUDA_DEVICE_MAX_CONNECTIONS=1`, `8`, and `32`, always with pipeline depth fixed at `2`; depth `3` was not tested. An initial concurrent launch of the three variants was discarded because the processes contended for the same GPU and produced invalid FPS. +- Result: Same-session default baseline measured `frames=538 elapsed=2.21s fps=243.78`. Sequential valid runs measured `243.65 fps` for `CUDA_DEVICE_MAX_CONNECTIONS=1`, `243.29 fps` for `8`, and `242.67 fps` for `32`. A default rerun afterward measured `frames=538 elapsed=2.21s fps=243.47`. +- Learning: CUDA connection count does not improve the accepted depth-2 workflow and can slightly regress it. The remaining graph-body gaps are not recoverable through this stream scheduling environment knob; keep the default CUDA connection behavior. 
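The sequential sweep described above can be sketched as follows. This is a minimal illustration, not the harness that was actually used: `parse_result`, `run_once`, and `sweep` are hypothetical helper names, the command and `PYTHONPATH` are taken from the benchmark command at the top of this log, and any extra flags carried by the accepted depth-2 command are not shown. Each variant runs in its own fresh process, one at a time, because the variable must be set before CUDA context creation and concurrent runs contend for the same GPU.

```python
import os
import re
import subprocess

# Matches the benchmark summary line, e.g. "frames=538 elapsed=2.21s fps=243.78".
RESULT_RE = re.compile(r"frames=(\d+) elapsed=([\d.]+)s fps=([\d.]+)")

# Benchmark command from the top of this log (assumption: no extra flags needed).
BENCH_CMD = [
    "python",
    "development/stream_interface/rfdetr_nano_seg_trt_workflow.py",
    "--video_reference",
    "vehicles_312px.mp4",
]


def parse_result(output: str) -> dict:
    """Extract frames, elapsed seconds, and FPS from benchmark output."""
    match = RESULT_RE.search(output)
    if match is None:
        raise ValueError("no benchmark summary line found")
    return {
        "frames": int(match.group(1)),
        "elapsed_s": float(match.group(2)),
        "fps": float(match.group(3)),
    }


def run_once(max_connections=None) -> dict:
    """Run the benchmark in a fresh process; None keeps the CUDA default."""
    env = dict(os.environ, PYTHONPATH="/app/inference_models")
    if max_connections is None:
        env.pop("CUDA_DEVICE_MAX_CONNECTIONS", None)
    else:
        env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(max_connections)
    done = subprocess.run(
        BENCH_CMD, env=env, capture_output=True, text=True, check=True
    )
    return parse_result(done.stdout + done.stderr)


def sweep(values=(None, 1, 8, 32)) -> dict:
    """Run the variants strictly one after another so they never share the GPU."""
    return {value: run_once(value)["fps"] for value in values}
```

Calling `sweep()` on the benchmark machine would return a mapping from each setting to its measured FPS; repeating the default (`None`) run at the end, as the log does, helps separate session drift from a real effect.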
+ +### Diagnostic: TensorRT Runtime Layer Profiler Probe + +- Hypothesis: TensorRT's runtime `IProfiler` could expose per-layer times for the existing packaged plan, giving a layer-name mapping for the Nsight graph-node kernels without rebuilding a verbose engine. +- Diagnostic: Temporary Python harness only; attached a custom `trt.IProfiler` to the accepted execution context and ran non-CUDA-graph `execute_async_v3(...)` through `model.forward(..., use_cuda_graph=False)`. Tested both reporting modes: `enqueue_emits_profile=True` and `enqueue_emits_profile=False` followed by `report_to_profiler()`. Pipeline depth was not varied and depth `3` was not tested. +- Result: With `enqueue_emits_profile=True`, TensorRT completed the forwards but reported `0` layer records; the exported placeholder was `/tmp/rfdetr_trt_runtime_profiler_layers_20260523_2250.txt`. With `enqueue_emits_profile=False`, `report_to_profiler()` returned `False` and TensorRT emitted `invalid resource handle`; the placeholder export was `/tmp/rfdetr_trt_runtime_profiler_layers_report_20260523_2251.txt`. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.91` after the diagnostic. +- Learning: Runtime profiling on the existing async context does not provide usable layer timing. The packaged plan was built with `LAYER_NAMES_ONLY` inspector verbosity and cannot supply tactic-level detail through this route; use Nsight graph-node evidence or rebuild a verbose/correct engine for deeper mapping. + +### Rejected: Detailed-Verbosity TensorRT Rebuild From Cached ONNX + +- Hypothesis: Rebuilding the cached ONNX with `ProfilingVerbosity.DETAILED` could expose tactic-level inspector metadata and potentially identify a viable TensorRT tactic/export direction. If the rebuilt engine matched accepted outputs and graph-only speed, it could become the basis for a deeper tactic search. 
+- Diagnostic: Built a temporary FP16 static-batch engine from `/tmp/cache/models-cache/coco-dataset-vdnr1-41-e9a19d93/5362b72bfb9f01d2e0b8cba2048d932c/weights.onnx` with a `(1, 3, 312, 312)` optimization profile, `12 GiB` workspace, and `ProfilingVerbosity.DETAILED`. The build wrote `/tmp/rfdetr_trt_rebuild_t4_fp16_detailed_diag/engine.plan` after about `107.5s`; inspector exports are `/tmp/rfdetr_rebuild_detailed_engine_inspector_json_20260523_2258.txt` and `/tmp/rfdetr_rebuild_detailed_engine_inspector_oneline_20260523_2258.txt`. Pipeline depth was not varied and depth `3` was not tested. +- Result: The detailed rebuild has `263` layers and detailed tactic metadata, while the accepted packaged plan has `261` layers and `LAYER_NAMES_ONLY`. Raw output comparison on the same preprocessed image showed significant drift from the accepted plan: max absolute deltas were about `0.8629` for boxes, `2.5078` for logits, and `67.1992` for masks. Graph-only replay also did not improve: accepted plan averaged `4.087844 ms` (`244.63 fps`) while the detailed rebuild averaged `4.099385 ms` (`243.94 fps`). +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.95` after the diagnostic. +- Learning: A local detailed rebuild can expose tactic metadata, but this ONNX-to-TRT path does not preserve accepted raw outputs and is slightly slower in graph-only replay. It is useful for inspection only; do not replace the accepted engine with this rebuild. + +### Profile: TensorRT Myelin Fused Softmax/Slice Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_AddReshReshSlicSlicMaxSubExpSlicSlicAddDivMulSlicTranSlicTranGridCastReshMulReshGridCastEtc_*` at about `33 us/replay` aggregated across four graph nodes. Profiling it can show whether the remaining Myelin tail is memory-bound/high-occupancy like the other large Myelin kernels or underfilled like the GEMM tail. 
+- Profile: `/tmp/rfdetr_trt_myelin_softmax_tail_graphnode_basic_ncu_20260523_230442.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_softmax_tail_graphnode_basic_ncu_20260523_230442_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_softmax_tail_graphnode_basic_ncu_20260523_230442_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_AddReshReshSlicSlicMaxSubExpSlicSlicAddDivMulSlicTranSlicTranGridCastReshMulReshGridCastEtc_.*`, skipped `400` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.07s fps=38.23`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `100`, block size `256`, `33` registers/thread, no shared memory, and `0.62` waves/SM. Achieved occupancy is about `58.27-58.41%`, with active warps/SM about `18.64-18.69`. +- Throughput: Duration was about `15.17-15.87 us` under NCU replay. Compute throughput was about `30.16-30.48%`, memory throughput about `22.39-22.70%`, DRAM throughput about `9.73-11.18%`, and L2 throughput about `5.17-5.62%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.00` after the profile. +- Learning: This fused Myelin tail is not the same kind of severe small-grid problem as the GEMM tail. It is a short TensorRT-internal fused op with moderate occupancy and modest memory pressure; replacing it outside the TensorRT graph is unlikely to produce a meaningful end-to-end gain. + +### Profile: TensorRT H1688 TN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `trt_turing_h1688gemm_128x128_ldg8_tn_v1` as a single graph node around `33 us/replay`. Profiling it can show whether the tail H1688 TN tactic behaves like the previously profiled H1688 ReLU NN family or like the smallest GEMM tail. 
+- Profile: `/tmp/rfdetr_trt_h1688_tn_graphnode_basic_ncu_20260523_230657.ncu-rep`, details text `/tmp/rfdetr_trt_h1688_tn_graphnode_basic_ncu_20260523_230657_details.txt`, raw CSV `/tmp/rfdetr_trt_h1688_tn_graphnode_basic_ncu_20260523_230657_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:trt_turing_h1688gemm_128x128_ldg8_tn_v1`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.10s fps=38.14`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `36`, block size `128`, `190` registers/thread, `32.768 KB` static shared memory per block, and `0.45` waves/SM. Achieved occupancy is about `12.57-12.61%`, with active warps/SM about `4.02-4.03`. +- Throughput: Duration was about `58.62-58.66 us` under NCU replay. Compute throughput was about `44.31-44.61%`, memory throughput about `27.36-27.53%`, DRAM throughput about `25.95-26.19%`, and L2 throughput about `20.95-20.97%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.54` after the profile. +- Learning: This tail H1688 GEMM is also limited by high register/shared-memory usage and a small fixed grid. It matches the broader TensorRT-forward diagnosis: the remaining large opportunity is a different correct engine/tactic mix, not local postprocess or CPU pipeline work. + +### Profile: TensorRT Myelin Mean-Reduction Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_MulAddCastMeanSubMulMean_*` at about `30 us/replay` aggregated across three graph nodes. Profiling it can show whether this smaller layernorm-style Myelin tail is dominated by partial-wave occupancy or by memory traffic.
+- Profile: `/tmp/rfdetr_trt_myelin_mean_tail_graphnode_basic_ncu_20260523_231042.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_mean_tail_graphnode_basic_ncu_20260523_231042_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_mean_tail_graphnode_basic_ncu_20260523_231042_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_MulAddCastMeanSubMulMean_.*`, skipped `400` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.03s fps=38.35`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `170`, block size `(32, 4, 1)`, `32` registers/thread, `64 B` dynamic shared memory per block, and `0.53` waves/SM. Achieved occupancy is about `48.08-49.43%`, with active warps/SM about `15.39-15.82`. +- Throughput: Duration was about `14.91-15.62 us` under NCU replay. Compute throughput was about `14.38-14.90%`, memory and DRAM throughput about `48.88-49.57%`, and L2 throughput about `26.23-27.66%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.86` after the profile. +- Learning: This Myelin mean-reduction tail is modestly occupied and memory/DRAM pressured. Like the other Myelin layernorm-style kernels, it would need graph-level fusion or a different export to remove memory traffic; it is not a promising local postprocess/wrapper replacement target. + +### Profile: Current Depth-2 Graph-Node Nsight Systems Refresh + +- Request: Capture another current Nsight Systems report for timeline analysis while keeping pipeline depth fixed at `2`; depth `3` was not tested. 
+- Profile: `/tmp/rfdetr_depth2_graphnode_refresh_20260523_231415.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_graphnode_refresh_20260523_231415.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_graphnode_refresh_20260523_231415_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_graphnode_refresh_20260523_231415_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_graphnode_refresh_20260523_231415_stats_cuda_api_sum.csv`. +- Result under graph-node profiling overhead: `frames=538 elapsed=2.33s fps=230.72`. +- Graph-node structure: The trace contains `602` TensorRT graph replays with exactly `242` graph activities per replay: `239` kernels and `3` graph memsets. After skipping `64` capture warmup replays plus `100` settling launches, each replay used `6` CUDA streams. +- Graph-node timing after the same skip: envelope duration was p50 `4123.887 us`, p90 `4182.987 us`, p95 `4186.202 us`, mean `4132.353 us`; union GPU-busy time inside the graph envelope was p50 `3911.090 us`, p90 `3968.352 us`, p95 `3971.243 us`, mean `3918.704 us`; internal no-activity idle was p50 `213.532 us`, p90 `216.155 us`, p95 `217.728 us`, mean `213.650 us`. The mean graph busy fraction was `94.83%`, with mean overlap factor `1.0406`. +- Graph-to-graph spacing after the same skip: start-to-start interval was p50 `4166.366 us`, p90 `4225.431 us`, p95 `4227.996 us`, mean `4174.746 us`; graph end-to-next-start gap was p50 `42.336 us`, p90 `44.063 us`, p95 `44.287 us`, mean `42.354 us`. +- Gap decomposition: non-graph GPU work inside the graph-to-graph gap was p50 `37.951 us`, p90 `39.359 us`, p95 `40.006 us`, mean `37.913 us`; true idle was p50 `4.384 us`, p90 `5.056 us`, p95 `5.190 us`, mean `4.441 us`. 
+- Gap occupants: the largest clipped occupants were the next-frame input Device-to-Device copy (`1168128B`, `13.139 us/gap`), TensorRT mask Device-to-Device clone (`2433600B`, `13.128 us/gap`), four vectorized elementwise kernels in aggregate (`11.517 us/gap`), boxes Device-to-Device clone (`1600B`, `4.671 us/gap`), `_select_topk_boxes_kernel` (`2.156 us/gap`), and logits Device-to-Device clone (`36400B`, `2.103 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`809.182 us/replay`), 12 Myelin MHA nodes (`586.631 us/replay`), 12 fused FP16/FP32 GEMM nodes (`481.765 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.743 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.539 us/replay`). +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.46` after the profile. +- Learning: The fresh report reproduces the intended shape: the next TensorRT CUDA graph starts about `42 us` after the previous graph envelope ends, and only about `4-5 us` of that gap is true idle. CPU work is already overlapped with GPU forward replay in steady state; remaining FPS is constrained by TensorRT graph-body kernels plus the unavoidable small copy/postprocess work between replays. + +### Profile: TensorRT 64x32 FP16/FP32 TN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize64x32x64_stage1_warpsize2x2x1_tensor16x8x8_aligna2_alignc2_execute_kernel_trt` as a single graph node at about `36 us/replay`. Profiling it can check whether this FP32-accumulating tail differs from the already-profiled FP16 64x32 TN/TT tails or shares their small-grid limitation. 
+- Profile: `/tmp/rfdetr_trt_gemm64x32_f16f32_tn_graphnode_basic_ncu_20260523_231921.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x32_f16f32_tn_graphnode_basic_ncu_20260523_231921_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm64x32_f16f32_tn_graphnode_basic_ncu_20260523_231921_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:sm75_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize64x32x64.*`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=13.98s fps=38.48`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `33`, block size `128`, `112` registers/thread, `12.29 KB` dynamic shared memory per block, and only `0.21` waves/SM. Theoretical occupancy is `50%`, but achieved occupancy is only about `13.05-13.25%`, with active warps/SM about `4.18-4.24`. +- Throughput: Duration was about `28.22-28.61 us` under NCU replay. Compute throughput was about `18.08-18.43%`, memory throughput about `26.63-27.16%`, DRAM throughput about `8.62-8.82%`, and L2 throughput about `7.20-7.26%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.22s fps=242.45` after the profile in the same drifted session. +- Learning: This FP32-accumulating 64x32 TN tail has the same underfilled launch shape as the other TensorRT GEMM tails: fewer CTAs than T4 SMs, low achieved occupancy, and no sign of a local postprocess or CPU-pipeline bottleneck. It reinforces that further large gains need a correctness-equivalent TensorRT engine/tactic/export change that alters many graph-body GEMMs together. + +### Profile: TensorRT Myelin Transpose Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_Tran_0xda0f5a944af3ec9d0f264b0c175d90c9` as a single graph node at about `27 us/replay`.
Profiling it can distinguish whether this remaining tail is another small-grid occupancy problem or a memory-bandwidth-bound Myelin transform. +- Profile: `/tmp/rfdetr_trt_myelin_tran_tail_graphnode_basic_ncu_20260523_232308.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_tran_tail_graphnode_basic_ncu_20260523_232308_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_tran_tail_graphnode_basic_ncu_20260523_232308_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_Tran_0xda0f5a944af3ec9d0f264b0c175d90c9`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.05s fps=38.29`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `761`, block size `256`, `16` registers/thread, no dynamic or static shared memory, and `4.76` waves/SM. Theoretical occupancy is `100%`; NCU reported achieved occupancy of about `106.53-110.08%` and active warps/SM of about `34.09-35.23`. Both exceed the theoretical maximum, so they most likely reflect a counter artifact under graph-node replay and should be read as "effectively full" rather than as exact values. +- Throughput: Duration was about `24.77-25.06 us` under NCU replay. Compute throughput was about `22.07-22.44%`, memory and DRAM throughput about `91.42-92.34%`, and L2 throughput about `28.62-28.96%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.42` after the profile. +- Learning: This Myelin transpose tail is already well-filled and DRAM-bandwidth-bound, unlike the small-grid GEMM tails. Removing it would require graph-level fusion or a different TensorRT export/tactic layout that avoids the memory movement; it is not a local wrapper or pipeline scheduling issue. + +### Profile: TensorRT Myelin SiLU/Transpose Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_AddSqrtDivMulCastMulAddTranSilu_0x22714f29f052967a02878862fc5fe9d8` at about `25 us/replay` aggregated across five graph nodes.
Profiling it can show whether this smaller Myelin tail is another bandwidth-bound transform or a short compute/occupancy tail. +- Profile: `/tmp/rfdetr_trt_myelin_silu_tran_tail_graphnode_basic_ncu_20260523_232616.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_silu_tran_tail_graphnode_basic_ncu_20260523_232616_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_silu_tran_tail_graphnode_basic_ncu_20260523_232616_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_AddSqrtDivMulCastMulAddTranSilu_0x22714f29f052967a02878862fc5fe9d8`, skipped `500` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.20s fps=37.89`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `338`, block size `256`, `17` registers/thread, no dynamic or static shared memory, and `2.11` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `85.49-86.33%`, and active warps/SM are about `27.36-27.63`. +- Throughput: Duration was about `8.48-8.67 us` under NCU replay. Compute throughput was about `25.98-26.14%`, memory throughput about `16.94-17.28%`, DRAM throughput about `16.76-17.28%`, and L2 throughput about `7.71-7.86%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.94` after the profile. +- Learning: This repeated Myelin SiLU/transpose tail is short and well occupied; it is neither the severe small-grid GEMM problem nor a dominant memory-bandwidth node. It is another TensorRT-internal graph-body cost that would only be worth removing as part of a broader export/tactic/fusion change. + +### Profile: TensorRT Myelin TopK Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_Topk_0x570b8f0fdb8e2c6c3fd1a15f1595090d` as a single graph node at about `16 us/replay`. 
Profiling it can show whether this TensorRT-internal TopK is a small-grid scheduling tail and whether it is a plausible future custom/export target. +- Profile: `/tmp/rfdetr_trt_myelin_topk_tail_graphnode_basic_ncu_20260523_232914.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_topk_tail_graphnode_basic_ncu_20260523_232914_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_topk_tail_graphnode_basic_ncu_20260523_232914_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_Topk_0x570b8f0fdb8e2c6c3fd1a15f1595090d`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.02s fps=38.37`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `1`, block size `256`, `99` registers/thread, `16.9 KB` static shared memory per block, no dynamic shared memory, and only `0.01` waves/SM. Theoretical occupancy is `50%`, achieved occupancy is about `25.91-25.98%`, and active warps/SM are about `8.29-8.31`. +- Throughput: Duration was about `31.55-31.68 us` under NCU replay. Compute and memory throughput were both about `1.48%`, DRAM throughput about `0.29-0.31%`, and L2 throughput about `0.13-0.15%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.36` after the profile. +- Learning: This TopK node is a clear single-block TensorRT-internal tail. It is too small to dominate alone, but it is one of the few remaining graph nodes with obviously poor device-wide utilization. A future engine/export replacement that avoids this internal TopK or folds it into adjacent work could help, but it cannot be patched from the current Python postprocess path because it is inside the serialized TensorRT graph.
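The launch-geometry figures in the Nsight Compute snapshots above are tied together by simple arithmetic, which the sketch below reproduces. It assumes a T4 with `40` SMs (as quoted in this log) and a Turing limit of `32` resident warps per SM. The resident-CTAs-per-SM values passed in are inferred from the reported theoretical occupancy and block size, not read from the reports, and `waves_per_sm` and `occupancy_pct` are illustrative names only.

```python
T4_SM_COUNT = 40        # SM count quoted in this log for the Tesla T4
MAX_WARPS_PER_SM = 32   # assumption: Turing (sm_75) resident-warp limit per SM


def waves_per_sm(grid_size: int, blocks_per_sm: int, sm_count: int = T4_SM_COUNT) -> float:
    """Waves needed to run the grid: grid / (SMs * resident CTAs per SM)."""
    return grid_size / (sm_count * blocks_per_sm)


def occupancy_pct(active_warps_per_sm: float, max_warps: int = MAX_WARPS_PER_SM) -> float:
    """Achieved occupancy as a percentage of the per-SM warp limit."""
    return 100.0 * active_warps_per_sm / max_warps


# TopK node: grid 1, 256-thread blocks, 50% theoretical occupancy -> 2 CTAs/SM (inferred).
topk_waves = waves_per_sm(1, blocks_per_sm=2)
# Myelin transpose tail: grid 761, 256-thread blocks at 100% -> 4 CTAs/SM (inferred).
tran_waves = waves_per_sm(761, blocks_per_sm=4)
# TopK achieved occupancy derived from the reported 8.29 active warps/SM.
topk_occ = occupancy_pct(8.29)

print(f"TopK waves/SM      {topk_waves:.2f}")
print(f"Transpose waves/SM {tran_waves:.2f}")
print(f"TopK occupancy     {topk_occ:.2f}%")
```

Under these assumptions the three values round to the `0.01` waves/SM, `4.76` waves/SM, and `25.91%` figures reported for those kernels, which is a quick way to sanity-check a snapshot before reading too much into it.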
+ +### Rejected: Unsorted TensorRT TopK Rebuild + +- Hypothesis: The underfilled single-block internal TensorRT TopK node maps to ONNX `/transformer/TopK`, which selects `k=100` from `/transformer/ReduceMax_output_0` on axis `1` with `largest=1` and `sorted=1`. Rebuilding a temporary engine with that ONNX TopK changed to `sorted=0` might let TensorRT choose a cheaper tactic while preserving the unordered proposal set. +- Diagnostic: Temporary artifacts only. Patched the cached ONNX into `/tmp/rfdetr_trt_topk_unsorted_rebuild_diag/weights_topk_unsorted.onnx`, built a static `(1, 3, 312, 312)` FP16 detailed-verbosity engine at `/tmp/rfdetr_trt_topk_unsorted_rebuild_diag/engine.plan`, and confirmed that the engine inspector for the accepted plan maps the TopK layer to `__myl_Topk_myl29_17`. Pipeline depth was not varied and depth `3` was not tested. +- Build result: The modified engine built successfully after `107.6s`, comparable to the prior sorted detailed rebuild. +- Correctness diagnostic: On the same deterministic CUDA input, the unsorted rebuild changed raw outputs relative to the sorted local rebuild: max absolute deltas were `0.0658` for boxes, `0.1055` for logits, and `1.2207` for masks. It also still drifted strongly from the accepted packaged engine, like the earlier sorted local rebuild. +- Graph-only timing: The accepted packaged plan averaged `4.101024 ms` (`243.84 fps`) over five 1000-replay batches. The sorted detailed rebuild averaged `4.116528 ms` (`242.92 fps`), while the unsorted TopK rebuild averaged `4.139058 ms` (`241.60 fps`). +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.22s fps=242.75` after the diagnostic. +- Learning: Simply changing the ONNX TopK `sorted` attribute is not a viable export optimization: it changes rebuilt raw outputs and makes graph-only replay about 0.5% slower than the sorted rebuild (`4.139058 ms` vs `4.116528 ms`). The TopK node remains a useful target for a deeper correctness-equivalent export/engine redesign, but this cheap attribute change should not be used. 
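The graph-only timing bullet reports mean replay time and the equivalent FPS side by side. A minimal sketch of that conversion, plus the relative replay-time change against the accepted plan, follows. The helper names `replay_fps` and `slowdown_pct` are hypothetical and only the three timings are taken from this entry.

```python
def replay_fps(mean_replay_ms: float) -> float:
    """Convert a mean graph replay time in milliseconds to replays per second."""
    return 1000.0 / mean_replay_ms


def slowdown_pct(candidate_ms: float, baseline_ms: float) -> float:
    """Relative replay-time increase of a candidate engine over a baseline, in percent."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Mean replay times from the graph-only timing bullet above.
timings_ms = {
    "accepted packaged plan": 4.101024,
    "sorted detailed rebuild": 4.116528,
    "unsorted TopK rebuild": 4.139058,
}

baseline_ms = timings_ms["accepted packaged plan"]
for name, ms in timings_ms.items():
    print(f"{name}: {replay_fps(ms):.2f} fps, "
          f"{slowdown_pct(ms, baseline_ms):+.2f}% replay time vs accepted")
```

Expressing the rebuilds as a percentage of the accepted plan makes it easier to compare diagnostics whose accepted-plan baseline shifts slightly between runs.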
+ +### Rejected: TensorRT Version-Compatible Rebuild With Accepted Tactic Source Mask + +- Hypothesis: The accepted packaged plan reports tactic source mask `8`, while the local TensorRT builder default reports mask `24`. In this TensorRT runtime, mask `8` corresponds to `EDGE_MASK_CONVOLUTIONS` only, and mask `24` corresponds to `EDGE_MASK_CONVOLUTIONS | JIT_CONVOLUTIONS`. Rebuilding the public ONNX with `VERSION_COMPATIBLE`, FP16, optimization level `3`, and tactic source mask `8` might better match the accepted plan and possibly recover correctness or graph replay speed. +- Diagnostic: Temporary engine only; built `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_tactics8_diag/engine.plan` from the cached ONNX with static input shape `(1, 3, 312, 312)`, `BuilderFlag.FP16`, `BuilderFlag.VERSION_COMPATIBLE`, `builder_optimization_level=3`, workspace `4 GiB`, `ProfilingVerbosity.DETAILED`, and `config.set_tactic_sources(8)`. Pipeline depth was not varied and depth `3` was not tested. +- Build result: The engine built successfully after `107.85s`, produced a `187,962,380` byte plan, and matched accepted metadata for `261` layers, `4` aux streams, tactic source mask `8`, and `18,289,152` bytes device memory. +- Correctness diagnostic: Matching the tactic source mask did not fix local rebuild drift. On the same deterministic CUDA input, the tactic-8 rebuild differed from the accepted packaged plan with max absolute deltas of about `0.9500` for boxes, `2.7617` for logits, and `33.3047` for masks. It also changed raw outputs relative to the default-source version-compatible rebuild, with max deltas of about `0.0688` for boxes, `0.0352` for logits, and `0.7969` for masks. +- Graph-only timing: Accepted packaged plan averaged `4.111510 ms` (`243.22 fps`) over five 1000-replay batches. The default-source version-compatible rebuild averaged `4.130025 ms` (`242.13 fps`), while the tactic-8 rebuild averaged `4.155827 ms` (`240.63 fps`). 
+- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.22s fps=242.56` after the diagnostic. +- Learning: The accepted plan's tactic source mask alone is not enough to reproduce its behavior or speed from the public ONNX. Restricting tactic sources to mask `8` makes the local rebuild about 0.6% slower in graph-only replay than the default-source rebuild (`4.155827 ms` vs `4.130025 ms`) and leaves its raw outputs drifted from the accepted plan, so the remaining engine-level path still requires the exact accepted export/build source or a new official correctness-equivalent package. + +### Profile: TensorRT Myelin Reshape/Transpose Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_ReshTranReshTranReshTranAddAddSqrtDivMulCastMulAddSlicReshTranAddConc_*` as a single graph node at about `17 us/replay`. Profiling it can show whether this fused reshape/transpose/layernorm-style tail is another memory-movement Myelin cost or an underfilled scheduling problem. +- Profile: `/tmp/rfdetr_trt_myelin_reshtran_tail_graphnode_basic_ncu_20260523_234529.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_reshtran_tail_graphnode_basic_ncu_20260523_234529_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_reshtran_tail_graphnode_basic_ncu_20260523_234529_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_ReshTranReshTranReshTranAddAddSqrtDivMulCastMulAddSlicReshTranAddConc_0x63650d344d7f66675d2a8e193f7d7647`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.11s fps=38.13`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `1014`, block size `256`, `21` registers/thread, no dynamic or static shared memory, and `6.34` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `90.47-91.46%`, and active warps/SM are about `28.95-29.27`. 
+- Throughput: Duration was about `19.87-20.06 us` under NCU replay. Compute throughput was about `29.77-30.13%`, memory and DRAM throughput about `78.57-78.94%`, and L2 throughput about `27.31-27.58%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.40` after the profile. +- Learning: This Myelin fused reshape/transpose tail is well-filled and mostly DRAM-bandwidth-bound, not a small-grid occupancy problem. It reinforces the split in the remaining graph body: Myelin transform tails consume memory bandwidth, while the larger repeated GEMM/MHA tails are limited by underfilled TensorRT tactic geometry. + +### Profile: TensorRT 32x32 FP16/FP32 TN GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize32x32x64_stage1_warpsize2x2x1_tensor16x8x8_aligna4_alignc4_execute_kernel_trt` at about `10 us/replay` aggregated across two graph nodes. Profiling it can show whether this FP32-accumulating 32x32 tail shares the severe underfilled geometry seen in the FP16 32x32 TN family. +- Profile: `/tmp/rfdetr_trt_gemm32x32_f16f32_tn_a4_graphnode_basic_ncu_20260523_234911.ncu-rep`, details text `/tmp/rfdetr_trt_gemm32x32_f16f32_tn_a4_graphnode_basic_ncu_20260523_234911_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm32x32_f16f32_tn_a4_graphnode_basic_ncu_20260523_234911_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `200` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.24s fps=37.77`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use block size `128`, `62` registers/thread, `8.19 KiB` dynamic shared memory per block, and only `0.03-0.10` waves/SM. 
Sampled grid sizes were `8`, `32`, and `8` CTAs, all below the T4's `40` SMs. Theoretical occupancy is `100%`, but achieved occupancy is only about `13.48-13.70%`, with active warps/SM about `4.31-4.38`. +- Throughput: Duration was about `13.73-16.13 us` under NCU replay. Compute throughput was about `3.94-13.35%`, memory throughput about `4.44-18.56%`, DRAM throughput about `2.45-6.83%`, and L2 throughput about `1.15-7.45%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.65` after the profile. +- Learning: This FP32-accumulating 32x32 GEMM tail is another extreme partial-wave TensorRT tactic. It has low memory pressure and too few CTAs to fill the T4, so it belongs with the repeated GEMM/MHA tactic-geometry bottleneck rather than the well-filled Myelin memory-movement tails. + +### Profile: TensorRT 64x64 FP16/FP32 NN Fused GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize64x64x64_stage1_warpsize2x2x1_tensor16x8x8_fused` as a single graph node at about `16 us/replay`. Profiling it can show whether this fused FP32-accumulating tail is another register/shared-memory limited small-grid TensorRT tactic. +- Profile: `/tmp/rfdetr_trt_gemm64x64_f16f32_nn_fused_graphnode_basic_ncu_20260523_235245.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x64_f16f32_nn_fused_graphnode_basic_ncu_20260523_235245_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm64x64_f16f32_nn_fused_graphnode_basic_ncu_20260523_235245_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.04s fps=38.32`, profiling overhead only and not comparable to normal benchmark FPS. 
+- Findings: The sampled launches use grid size `32`, block size `128`, `138` registers/thread, `16.38 KiB` dynamic shared memory per block, and only `0.27` waves/SM. Theoretical occupancy is `37.5%`, but achieved occupancy is only about `13.11-13.23%`, with active warps/SM about `4.19-4.23`. +- Throughput: Duration was about `19.65-19.74 us` under NCU replay. Compute throughput was about `15.48-15.52%`, memory throughput about `18.56-18.61%`, DRAM throughput about `13.36-13.44%`, and L2 throughput about `12.31-12.33%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.85` after the profile. +- Learning: This fused 64x64 FP16/FP32 GEMM is also an underfilled TensorRT tactic: fewer CTAs than T4 SMs plus high register/shared-memory usage. It adds another data point that the remaining graph-body limiter is broad TensorRT GEMM tactic geometry rather than one isolated kernel or pipeline bubble. + +### Profile: TensorRT Myelin Compare/Select Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_GtrNotNotMoveTranSele_*` as a single graph node at about `14 us/replay`. Profiling it can show whether this fused compare/select transform is a memory-movement Myelin tail or another underfilled scheduling problem. +- Profile: `/tmp/rfdetr_trt_myelin_select_tail_graphnode_basic_ncu_20260523_235628.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_select_tail_graphnode_basic_ncu_20260523_235628_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_select_tail_graphnode_basic_ncu_20260523_235628_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_GtrNotNotMoveTranSele_0x0f229fc5c543c6b367671dfabd631179`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. 
+- Result under NCU overhead: `frames=538 elapsed=14.04s fps=38.31`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `88`, block size `1024`, `32` registers/thread, `16.90 KiB` static shared memory per block, no dynamic shared memory, and `2.20` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `98.88-99.50%`, and active warps/SM are about `31.64-31.84`. +- Throughput: Duration was about `22.82-23.23 us` under NCU replay. Compute and memory throughput were about `29.49-29.64%`, DRAM throughput about `11.73-11.88%`, and L2 throughput about `6.10-6.18%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.01` after the profile. +- Learning: This Myelin compare/select tail is well occupied and does not share the small-grid GEMM pathology. It is a moderate fused TensorRT transform tail; the larger remaining optimization target remains the repeated underfilled GEMM/MHA tactic structure. + +### Profile: TensorRT 32x32 FP16/FP32 TN Aligna2 GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace also includes `sm75_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize32x32x64_stage1_warpsize2x2x1_tensor16x8x8_aligna2_alignc2_execute_kernel_trt`. Profiling the aligna2 variant checks whether it behaves differently from the already-profiled aligna4 32x32 FP16/FP32 TN tail or shares the same underfilled TensorRT tactic shape. +- Profile: `/tmp/rfdetr_trt_gemm32x32_f16f32_tn_a2_graphnode_basic_ncu_20260524_000025.ncu-rep`, details text `/tmp/rfdetr_trt_gemm32x32_f16f32_tn_a2_graphnode_basic_ncu_20260524_000025_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm32x32_f16f32_tn_a2_graphnode_basic_ncu_20260524_000025_raw.csv`. 
The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.13s fps=38.09`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `12`, block size `128`, `78` registers/thread, `8.19 KiB` dynamic shared memory per block, no static shared memory, and only `0.05` waves/SM. Theoretical occupancy is `75%`, but achieved occupancy is only about `13.36%`, with active warps/SM about `4.27`. +- Throughput: Duration averaged `17.46 us` under NCU replay, with observed launches at about `17.44-17.47 us`. Compute throughput averaged `7.30%`, memory throughput `10.41%`, DRAM throughput `3.62%`, and L2 throughput `2.63%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.44` after the profile. +- Learning: The aligna2 32x32 FP16/FP32 TN variant is another partial-wave TensorRT GEMM. It is neither memory-bandwidth limited nor a postprocess/pipeline issue; it reinforces that the remaining graph-body cost comes from many underfilled TensorRT tactics that would need a correctness-equivalent engine/export/tactic change to improve materially. + +### Profile: Current Depth-2 Low-Bubble Nsight Systems Refresh + +- Request: Capture another current Nsight Systems report for timeline analysis with the run constrained to pipeline depth `2`; depth `3` was not tested. 
+- Profile: `/tmp/rfdetr_depth2_lowbubble_refresh_20260524_000348.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_lowbubble_refresh_20260524_000348.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_lowbubble_refresh_20260524_000348_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_lowbubble_refresh_20260524_000348_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_lowbubble_refresh_20260524_000348_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under graph-node profiling overhead: `frames=538 elapsed=2.37s fps=226.77`. +- Graph-node structure: The trace contains `602` TensorRT graph replays with exactly `242` graph activities per replay: `239` kernels and `3` graph memsets. After skipping `64` capture warmup replays plus `100` settling launches, each replay used `6` CUDA streams. +- Graph-node timing after the same skip: envelope duration was p50 `4123.198 us`, p90 `4182.699 us`, p95 `4189.612 us`, mean `4129.826 us`; union GPU-busy time inside the graph envelope was p50 `3910.258 us`, p90 `3969.108 us`, p95 `3972.361 us`, mean `3916.385 us`; internal no-activity idle was p50 `212.814 us`, p90 `217.222 us`, p95 `220.858 us`, mean `213.441 us`. The mean graph busy fraction was `94.83%`, with mean overlap factor `1.0406`. +- Graph-to-graph spacing after the same skip: start-to-start interval was p50 `4165.566 us`, p90 `4225.648 us`, p95 `4231.568 us`, mean `4178.192 us`; graph end-to-next-start gap was p50 `42.656 us`, p90 `43.884 us`, p95 `44.326 us`, mean `48.476 us`. +- Gap decomposition: non-graph GPU work inside the graph-to-graph gap was p50 `37.920 us`, p90 `39.717 us`, p95 `40.166 us`, mean `38.533 us`; true idle was p50 `4.608 us`, p90 `5.343 us`, p95 `5.536 us`, mean `9.943 us`. 
+- Gap occupants: the largest clipped occupants were TensorRT mask Device-to-Device clone (`2433600B`, `13.182 us/gap`), next-frame input Device-to-Device copy (`1168128B`, `13.135 us/gap`), four vectorized elementwise kernels in aggregate (`11.666 us/gap`), boxes Device-to-Device clone (`1600B`, `4.571 us/gap`), logits Device-to-Device clone (`36400B`, `2.103 us/gap`), and `_select_topk_boxes_kernel` (`1.819 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`808.487 us/replay`), 12 Myelin MHA nodes (`586.063 us/replay`), 12 fused FP16/FP32 GEMM nodes (`481.290 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.471 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.547 us/replay`). +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.45` after the profile. +- Learning: The trace is still graph-body bound. The TensorRT CUDA graph of the next frame starts about `43 us` after the previous graph envelope ends, and only about `5 us` p50 of that gap is true idle. The CPU and postprocess path are already overlapped tightly enough that further meaningful FPS gains need to reduce the TensorRT graph body itself or eliminate one of the required D2D/postprocess operations between replays. + +### Profile: TensorRT ResizeVectorized H2x4 Bilinear Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `ResizeVectorizedH2x4BilinearKernel` as one remaining unprofiled graph-body node at about `13 us/replay`. Profiling it can determine whether this TensorRT resize is another underfilled scheduling tail worth replacing with a custom kernel, or a reasonably efficient internal kernel that should be left alone. 
+- Profile: `/tmp/rfdetr_trt_resizevectorized_graphnode_basic_ncu_20260524_000751.ncu-rep`, details text `/tmp/rfdetr_trt_resizevectorized_graphnode_basic_ncu_20260524_000751_details.txt`, raw CSV `/tmp/rfdetr_trt_resizevectorized_graphnode_basic_ncu_20260524_000751_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:ResizeVectorizedH2x4BilinearKernel`, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.16s fps=37.98`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `1521`, block size `128`, `38` registers/thread, no dynamic or static shared memory, and `4.75` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `83.79-84.25%`, and active warps/SM are about `26.81-26.96`. +- Throughput: Duration was about `20.32-20.48 us` under NCU replay. Compute throughput was about `59.78-60.25%`, memory and DRAM throughput about `44.32-44.98%`, L2 throughput about `34.61-34.77%`, and L1/TEX throughput about `63.48-63.93%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.28` after the profile. +- Learning: This TensorRT resize is not another severe partial-wave problem. It has broad grid coverage, high occupancy, and meaningful compute/cache throughput, so patching it with a custom kernel is unlikely to move end-to-end FPS. Keep attention on the repeated GEMM/MHA tactic geometry and the few remaining low-occupancy TensorRT graph nodes. + +### Profile: TensorRT Myelin Add/Transpose Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_AddSqrtDivMulCastMulAddTran_0x644e5dde0cc8be797459a1d1a6b5f114` as another remaining Myelin graph-body tail at about `11 us/replay`. 
Profiling it can check whether there is a low-occupancy transform left that could plausibly be replaced outside TensorRT. +- Profile: `/tmp/rfdetr_trt_myelin_addtran_tail_graphnode_basic_ncu_20260524_000954.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_addtran_tail_graphnode_basic_ncu_20260524_000954_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_addtran_tail_graphnode_basic_ncu_20260524_000954_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact Myelin kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.04s fps=38.33`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `104`, block size `416`, `37` registers/thread, `6.76 KiB` static shared memory per block, no dynamic shared memory, and `1.30` waves/SM. Theoretical occupancy is `81.25%`, achieved occupancy is about `67.86-69.06%`, and active warps/SM are about `21.72-22.10`. +- Throughput: Duration was about `20.35-20.70 us` under NCU replay. Compute and memory throughput were both about `28.22-28.39%`, DRAM throughput about `24.16-24.61%`, L2 throughput about `10.01-10.19%`, and L1/TEX throughput about `46.41-46.69%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.04` after the profile. +- Learning: This Myelin add/transpose tail is not a severe single-block or partial-wave issue. It has moderate occupancy and moderate memory pressure, and it is TensorRT-internal graph-body work, so replacing it from Python would require changing the serialized engine boundary rather than the postprocess pipeline. It is lower priority than the repeated underfilled GEMM/MHA tactic family. 
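The `81.25%` theoretical occupancy for the add/transpose tail follows from its unusual `416`-thread block. The sketch below shows the arithmetic, assuming `32` resident warp slots per SM on Turing and considering only the warp-slot limit. Register, shared-memory, and resident-block limits are deliberately ignored, so the helper is hypothetical and can overestimate occupancy for other kernels.

```python
import math

MAX_WARPS_PER_SM = 32   # assumption: Turing resident-warp limit per SM
WARP_SIZE = 32


def warp_limited_occupancy(block_size: int) -> float:
    """Theoretical occupancy when only warp slots limit resident blocks."""
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    blocks_per_sm = MAX_WARPS_PER_SM // warps_per_block
    return blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM


def active_warps_per_sm(achieved_occupancy: float) -> float:
    """Convert an achieved-occupancy fraction to average active warps per SM."""
    return achieved_occupancy * MAX_WARPS_PER_SM


# 416 threads = 13 warps per block, so only 2 blocks (26 warps) fit in 32 slots.
theoretical = warp_limited_occupancy(416)
low = active_warps_per_sm(0.6786)
high = active_warps_per_sm(0.6906)
print(f"theoretical occupancy: {theoretical:.2%}")
print(f"active warps/SM: {low:.2f}-{high:.2f}")
```

The same conversion reproduces the active warps/SM range reported in the Findings bullet from the achieved-occupancy range alone.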
+ +### Profile: TensorRT Myelin Cast Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_Cast_0x3e286122172daaa85048151e7b4c1ac0` as another unprofiled TensorRT graph-body tail at about `20 us/replay`. Profiling it can determine whether this cast is a low-occupancy scheduling tail or a bandwidth-style Myelin transform. +- Profile: `/tmp/rfdetr_trt_myelin_cast_tail_graphnode_basic_ncu_20260524_001347.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_cast_tail_graphnode_basic_ncu_20260524_001347_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_cast_tail_graphnode_basic_ncu_20260524_001347_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact Myelin kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=14.14s fps=38.05`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `468`, block size `512`, `28` registers/thread, `8.45 KiB` static shared memory per block, no dynamic shared memory, and `5.85` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `91.00-93.77%`, and active warps/SM are about `29.12-30.01`. +- Throughput: Duration was about `34.72-35.36 us` under NCU replay. Compute throughput was about `40.68-41.29%`, memory and DRAM throughput about `61.16-61.63%`, L2 throughput about `23.68-24.11%`, and L1/TEX throughput about `55.71-56.30%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.22s fps=242.79` after the profile. +- Learning: This Myelin cast tail is broad and well occupied, with significant DRAM traffic. It is not a local replacement candidate; avoiding it would require changing the TensorRT graph/export layout so the cast is fused away or no longer needed. 
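The Learning bullets sort each profiled node into roughly three buckets: underfilled launches, bandwidth-oriented transforms, and moderate tails. A rough sketch of that triage is below. The thresholds (`1.0` waves/SM, `30%` achieved occupancy, `60%` DRAM throughput) are assumptions chosen for illustration and were not used to produce this log. The input values are approximate midpoints of the NCU ranges reported for three of the profiled nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    name: str
    waves_per_sm: float
    achieved_occupancy_pct: float
    dram_throughput_pct: float


def triage(node: NodeProfile, *, min_waves: float = 1.0,
           min_occupancy_pct: float = 30.0, dram_bound_pct: float = 60.0) -> str:
    """Bucket a graph node using hypothetical thresholds (assumptions, not NCU rules)."""
    if node.waves_per_sm < min_waves and node.achieved_occupancy_pct < min_occupancy_pct:
        return "underfilled"
    if node.dram_throughput_pct >= dram_bound_pct:
        return "bandwidth-oriented"
    return "moderate"


# Approximate midpoints of the ranges reported in the NCU snapshots in this log.
nodes = [
    NodeProfile("Myelin TopK", 0.01, 25.9, 0.3),
    NodeProfile("Myelin add/transpose", 1.30, 68.5, 24.4),
    NodeProfile("Myelin cast", 5.85, 92.4, 61.4),
]

for node in nodes:
    print(f"{node.name}: {triage(node)}")
```

With these assumed cut-offs the buckets match the qualitative conclusions recorded for the three nodes, but the thresholds would need tuning before being applied to other kernels or GPUs.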
+ +### Profile: TensorRT Smaller MHA Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows a second MHA kernel, `_gemm_mha_v2_0x5b02928230f6ad2b25b3a387bdc36a22`, at about `24 us/replay` aggregated across four graph nodes. Profiling it can check whether the smaller MHA family shares the same partial-wave TensorRT tactic geometry as the dominant MHA kernel. +- Profile: `/tmp/rfdetr_trt_mha_small_graphnode_basic_ncu_20260524_001635.ncu-rep`, details text `/tmp/rfdetr_trt_mha_small_graphnode_basic_ncu_20260524_001635_details.txt`, raw CSV `/tmp/rfdetr_trt_mha_small_graphnode_basic_ncu_20260524_001635_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact MHA kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.16s fps=104.27`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `32`, block size `128`, `96` registers/thread, `18.43 KiB` dynamic shared memory per block, no static shared memory, and only `0.27` waves/SM. Theoretical occupancy is `37.50%`, but achieved occupancy is only about `13.38-13.74%`, with active warps/SM about `4.28-4.40`. +- Throughput: Duration was about `10.91-11.14 us` under NCU replay. Compute throughput was about `11.07-11.35%`, memory throughput about `11.97-12.26%`, DRAM throughput about `6.28-6.42%`, L2 throughput about `5.65-5.77%`, and L1/TEX throughput about `23.93-24.52%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.90` after the profile. +- Learning: The smaller MHA node has the same small-grid/register/shared-memory-limited shape as the dominant MHA and GEMM families. 
It is too small to matter alone, but it strengthens the conclusion that the remaining graph-body bottleneck is TensorRT tactic geometry across many nodes rather than a single removable wrapper or postprocess kernel. + +### Profile: TensorRT 128x128 FP16/FP32 NN Fused GEMM Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_fused` as the third largest graph-body aggregate at about `481 us/replay` across twelve nodes. Profiling the exact kernel can distinguish whether this major FP32-accumulating fused GEMM is limited by partial waves, register pressure, or memory bandwidth. +- Profile: `/tmp/rfdetr_trt_gemm128x128_f16f32_nn_fused_graphnode_basic_ncu_20260524_001949.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x128_f16f32_nn_fused_graphnode_basic_ncu_20260524_001949_details.txt`, raw CSV `/tmp/rfdetr_trt_gemm128x128_f16f32_nn_fused_graphnode_basic_ncu_20260524_001949_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact fused GEMM kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.17s fps=104.07`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `72`, block size `128`, `240` registers/thread, `17.41 KiB` dynamic shared memory per block, no static shared memory, and `0.90` waves/SM. Theoretical occupancy is only `25%`, and achieved occupancy is about `21.61-21.78%`, with active warps/SM about `6.92-6.97`. +- Throughput: Duration was about `82.24-83.17 us` under NCU replay. Compute throughput was about `48.50-48.59%`, memory throughput about `31.89-31.93%`, DRAM throughput about `14.90-15.13%`, L2 throughput about `21.29-21.51%`, and L1/TEX throughput about `63.76-63.87%`. 
+- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.58` after the profile. +- Learning: This major fused GEMM is not primarily a memory-bandwidth tail; it is register-heavy TensorRT tactic work with low occupancy despite a wider grid than the smallest GEMMs. Improving it requires a different correct tactic/engine/export choice across many nodes, not a Python wrapper or postprocess kernel change. + +### Profile: TensorRT Exact Myelin LayerNorm Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_MulAddCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_0x941b85d82830f659989e5c17314ed213` as the largest exact Myelin aggregate at about `146 us/replay` across twenty graph nodes. Profiling this exact node can separate layernorm-style memory traffic from the register-bound GEMM/MHA tactic bottleneck. +- Profile: `/tmp/rfdetr_trt_myelin_layernorm_exact_graphnode_basic_ncu_20260524_002330.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_layernorm_exact_graphnode_basic_ncu_20260524_002330_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_layernorm_exact_graphnode_basic_ncu_20260524_002330_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact Myelin kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.11s fps=105.34`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `170`, block size `128`, `39` registers/thread, `128 B` dynamic shared memory per block, no static shared memory, and `0.53` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `53.51-55.38%`, and active warps/SM are about `17.12-17.72`. +- Throughput: Duration was about `12.26-12.42 us` under NCU replay. 
Compute throughput was about `21.52-22.27%`, memory and DRAM throughput about `38.84-40.49%`, L2 throughput about `20.46-20.65%`, and L1/TEX throughput about `32.57-33.54%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.94` after the profile. +- Learning: This exact Myelin layernorm tail is moderately occupied and DRAM-oriented. It is not the same problem as the register-heavy TensorRT GEMMs; removing it would require graph-level fusion or export/layout changes that reduce TensorRT-internal memory movement. + +### Profile: TensorRT Exact Myelin GELU/ERF Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_DivCastErfCastAddMulMulReshTranAdd_0xc69f0a3834aa94ce6ae359f8c4d44526` as another high-cost Myelin aggregate at about `152 us/replay` across four graph nodes. Profiling the exact kernel can show whether this GELU/ERF transform is bandwidth-bound or another low-occupancy scheduling tail. +- Profile: `/tmp/rfdetr_trt_myelin_erf_exact_graphnode_basic_ncu_20260524_002654.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_erf_exact_graphnode_basic_ncu_20260524_002654_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_erf_exact_graphnode_basic_ncu_20260524_002654_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact Myelin kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.26s fps=102.33`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `761`, block size `256`, `33` registers/thread, no dynamic or static shared memory, and `4.76` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `89.31-89.46%`, and active warps/SM are about `28.58-28.63`. +- Throughput: Duration was about `41.02-41.38 us` under NCU replay. 
Compute throughput was about `61.32-61.58%`, memory and DRAM throughput about `79.77-81.01%`, L2 throughput about `25.93-26.16%`, and L1/TEX throughput about `26.65-26.76%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.63` after the profile. +- Learning: This exact Myelin GELU/ERF tail is broad, well occupied, and DRAM-heavy. It is a graph-level memory-movement/fusion opportunity inside TensorRT, not a local postprocess or pipeline scheduling problem. + +### Profile: TensorRT Exact Myelin Tran-Cast LayerNorm Tail Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows `__myl_TranCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_0xbd699bb3af0cc97bf94b3c245a0c7f14` as another exact Myelin aggregate at about `82 us/replay` across four graph nodes. Profiling it can check whether this transform is another memory-bound TensorRT graph-body tail or a low-occupancy scheduling issue. +- Profile: `/tmp/rfdetr_trt_myelin_trancast_layernorm_exact_graphnode_basic_ncu_20260524_003108.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_trancast_layernorm_exact_graphnode_basic_ncu_20260524_003108_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_trancast_layernorm_exact_graphnode_basic_ncu_20260524_003108_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact Myelin kernel name, skipped `100` matching launches, collected `3` profiled launches with the `basic` set, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.18s fps=103.85`, profiling overhead only and not comparable to normal benchmark FPS. +- Findings: The sampled launches use grid size `1521`, block size `128`, `26` registers/thread, `128 B` dynamic shared memory per block, no static shared memory, and `4.75` waves/SM. Theoretical occupancy is `100%`, achieved occupancy is about `90.53-92.83%`. +- Throughput: Duration was about `26.98-27.23 us` under NCU replay. 
Compute throughput was about `44.19-44.47%`, memory and DRAM throughput about `72.98-74.79%`, L2 throughput about `26.47-26.76%`, and L1/TEX throughput about `69.67-70.01%`. +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.54` after the profile. +- Learning: This transform is wide, highly occupied, and memory-bandwidth oriented. It is not a scheduler bubble in the Python pipeline; reducing it would require TensorRT graph/export fusion or a different correct engine boundary. + +### Profile: Depth-2 Low-Bubble Nsight Systems Refresh, Depth 2 Only + +- Request: Capture a fresh Nsight Systems report for user timeline analysis while keeping the pipeline constrained to depth `2`; depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_003258.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_003258.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_003258_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_003258_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_003258_stats_cuda_api_sum.csv`. +- Result under graph-node profiling overhead: `frames=538 elapsed=2.36s fps=228.05`. +- Graph-node structure: The trace contains `602` TensorRT graph replays with exactly `242` graph activities per replay: `239` kernels and `3` graph memsets. After skipping `64` capture warmup replays plus `100` settling launches, each replay used `6` CUDA streams. +- Graph-node timing after the same skip: envelope duration was p50 `4126.894 us`, p90 `4190.682 us`, p95 `4193.456 us`, mean `4136.538 us`; union GPU-busy time inside the graph envelope was p50 `3913.008 us`, p90 `3974.130 us`, p95 `3977.632 us`, mean `3921.785 us`; internal no-activity idle was p50 `214.430 us`, p90 `217.563 us`, p95 `219.314 us`, mean `214.753 us`. 
The mean graph busy fraction was `94.81%`, with mean overlap factor `1.0406`. +- Graph-to-graph spacing after the same skip: start-to-start interval was p50 `4169.278 us`, p90 `4233.052 us`, p95 `4235.914 us`, mean `4178.963 us`; graph end-to-next-start gap was p50 `42.431 us`, p90 `43.967 us`, p95 `44.351 us`, mean `42.405 us`. +- Gap decomposition: non-graph GPU work inside the graph-to-graph gap was p50 `37.983 us`, p90 `39.551 us`, p95 `40.134 us`, mean `37.949 us`; true idle was p50 `4.416 us`, p90 `5.088 us`, p95 `5.158 us`, mean `4.456 us`. +- Gap occupants: the largest clipped occupants were next-frame input Device-to-Device copy (`1168128B`, `13.160 us/gap`), TensorRT mask Device-to-Device clone (`2433600B`, `13.131 us/gap`), four vectorized elementwise kernels in aggregate (`11.609 us/gap`), boxes Device-to-Device clone (`1600B`, `4.584 us/gap`), `_select_topk_boxes_kernel` (`2.272 us/gap`), and logits Device-to-Device clone (`36400B`, `2.101 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`809.932 us/replay`), 12 Myelin MHA nodes (`587.327 us/replay`), 12 fused FP16/FP32 GEMM nodes (`482.073 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.841 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.882 us/replay`). +- Non-profiled sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.58` after the profile. +- Learning: The latest trace remains graph-body bound. The CUDA graph of the next frame starts about `42 us` after the previous graph envelope ends, and only about `4-5 us` p50 of that gap is true idle; the remaining gap is required D2D handoff and fused postprocess work. Meaningful FPS gains now need to reduce the TensorRT graph body or eliminate required D2D clone/input-copy work without reintroducing correctness drift. 
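
The envelope, union-busy, internal-idle, and gap-decomposition numbers in the refresh above can be reproduced from per-activity start/end timestamps. The log does not show the analysis script that was used, so the following is only a minimal sketch. It assumes the graph activities have already been pulled from the exported SQLite file as `(start_us, end_us)` pairs. The function names are hypothetical, and the overlap factor is assumed to be the sum of activity durations divided by the union busy time.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[float, float]  # (start_us, end_us) of one GPU activity


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def replay_metrics(activities: Iterable[Interval]) -> Dict[str, float]:
    """Summarize one graph replay from its kernel/memset intervals."""
    acts = list(activities)
    if not acts:
        raise ValueError("a replay needs at least one activity")
    envelope = max(e for _, e in acts) - min(s for s, _ in acts)
    union_busy = sum(e - s for s, e in merge_intervals(acts))
    if union_busy <= 0:
        raise ValueError("activities must have positive duration")
    summed = sum(e - s for s, e in acts)
    return {
        "envelope_us": envelope,
        "union_busy_us": union_busy,
        # Time inside the envelope when no stream is running anything.
        "internal_idle_us": envelope - union_busy,
        "busy_fraction": union_busy / envelope,
        # Assumed definition: values above 1.0 mean streams ran concurrently.
        "overlap_factor": summed / union_busy,
    }


def gap_metrics(
    prev_graph_end: float,
    next_graph_start: float,
    non_graph_activities: Iterable[Interval],
) -> Dict[str, float]:
    """Split a graph-to-graph gap into non-graph GPU work and true idle."""
    gap = next_graph_start - prev_graph_end
    # Clip each activity to the gap window and drop those outside it.
    clipped = [
        (max(s, prev_graph_end), min(e, next_graph_start))
        for s, e in non_graph_activities
    ]
    clipped = [(s, e) for s, e in clipped if e > s]
    work = sum(e - s for s, e in merge_intervals(clipped))
    return {"gap_us": gap, "non_graph_work_us": work, "true_idle_us": gap - work}


# Toy data only; these are not values from the trace.
toy_replay = [(0.0, 10.0), (5.0, 12.0), (14.0, 20.0)]
toy_gap_work = [(18.0, 23.0), (22.0, 26.0), (29.0, 35.0)]
print(replay_metrics(toy_replay))
print(gap_metrics(20.0, 30.0, toy_gap_work))
```

With these definitions the internal idle is the envelope minus the union busy time, and the true idle is the gap minus the clipped non-graph work. Both match the relationship between the mean values reported above (`4136.538 - 3921.785 = 214.753 us` and `42.405 - 37.949 = 4.456 us`).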
+ +### Rejected: Single-Lookup CUDA Graph Cache Hit Path + +- Hypothesis: The CUDA graph cache hit path takes the cache lock twice per replay (`cache_key in cache` followed by `cache[cache_key]`). Collapsing this to one lookup that also refreshes LRU order might shave host scheduling work from the remaining `4-5 us` true-idle handoff gap without changing GPU work. +- Change tested: Temporary code only; added `TRTCudaGraphCache.get(...)` and used it in `_execute_trt_engine(...)` so cache hits perform one lock acquisition and one ordered-dict lookup. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared CUDA graph execution against standard non-graph TensorRT execution over all `538` frames after postprocess: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `245.17`, `244.61`, `243.12`, `243.53`, `243.10`, `243.67`, `242.55`, and `242.26` FPS, averaging about `243.50` FPS. This is within the accepted path's normal noise band and does not improve the current checkpoint. +- Learning: Python cache lookup overhead is not a measurable limiter at the current graph-bound point. The true-idle gap is already too small for this cleanup to produce a stable FPS gain, so the hot path was reverted to the existing simpler cache API. + +### Diagnostic: Accepted TensorRT Engine Inspector Dump + +- Hypothesis: The latest nsys traces show the run is graph-body bound. Dumping the accepted TensorRT plan's engine inspector data can map the dominant graph kernels back to layer families before attempting a custom-kernel or rebuild target. +- Artifacts: Engine inspector JSON-style layer list `/tmp/rfdetr_trt_engine_inspector_20260524_003258_json.txt`, one-line layer list `/tmp/rfdetr_trt_engine_inspector_20260524_003258_oneline.txt`, and IO/meta summary `/tmp/rfdetr_trt_engine_inspector_20260524_003258_meta.json`. 
The accepted engine required `runtime.engine_host_code_allowed = True` for deserialization because it is a lean-runtime/version-compatible plan. +- Engine summary: The accepted T4 plan has `261` layers, `4` IO tensors, `4` TensorRT auxiliary streams, and about `18.29 MB` reported device memory. IO tensors are `input` `(1,3,312,312)` float, `dets` `(1,100,4)` float, `labels` `(1,100,91)` float, and mask output `4186` `(1,100,78,78)` float. +- Layer-family counts: The inspector layer list contains `12` fused QKV attention matmuls, `16` `_gemm_mha_v2` layers, `12` attention-output matmuls, `14` Myelin FC layers, `12` MLP `fc2` matmuls, `20` full layernorm-style `__myl_MulAddCastMeanSubMulMeanAddSqrtDivMulCastMulAdd` layers, `4` partial layernorm/mean layers, `12` GELU/SiLU-style Myelin activation layers, `18` convolution layers, one TensorRT TopK layer, and one segmentation-head resize layer. +- Learning: The graph-body bottleneck aligns with repeated transformer-block GEMM/MHA/layernorm families rather than a single postprocess or pipeline gap. The likely next high-impact attempts are engine/export/tactic changes or a much larger custom replacement of repeated transformer subgraphs; small Python-side cache and handoff changes are below measurement noise at the current depth-2 schedule. + +### Diagnostic: Current Official RFDETR Package Metadata + +- Hypothesis: A newer official T4 TensorRT package would be the safest remaining graph-body optimization path because local rebuilds from the public ONNX/Torch packages have repeatedly failed the accepted class/box/mask correctness gate. +- Diagnostic: Re-queried the Roboflow weights provider for `rfdetr-seg-nano`; it still resolves to `coco-dataset-vdnr1/41` and exposes `6` packages. 
+- Result: The current package list is unchanged: L4 TRT FP32 `3e3ddd85586b43e4fac6d319fb2927fd`, ONNX FP32 `5362b72bfb9f01d2e0b8cba2048d932c`, L4 TRT FP16 `89d1f41e2af4f4f3ffcdfb77e774d26a`, Torch FP32 `8b8da2fe824240522a39f3cde41aafae`, T4 TRT FP32 `bbc2cc23adf6f5e71a9241956081da96`, and T4 TRT FP16 `c70f32369a54d61e06ef4e6b56c82524`. +- Learning: There is still no newer official T4 TensorRT artifact to test. The accepted T4 FP16 package remains the only known correctness-equivalent engine source, so further engine-level gains still need an exact accepted export source or a new official package. + +### Diagnostic: TensorRT Output Allocator Probe + +- Hypothesis: `IExecutionContext.set_output_allocator(...)` might expose an output-buffer ownership hook that could eventually remove or restructure the CUDA graph output clones without changing prediction tensors. +- Diagnostic: Attached a Python `trt.IOutputAllocator` to the accepted execution context for outputs `dets`, `labels`, and `4186`, then ran one standard non-graph TensorRT forward on a benchmark frame. Pipeline depth was not varied and depth `3` was not tested. +- Result: TensorRT called `notify_shape(...)` once and `reallocate_output_async(...)` once for each of the three outputs, reporting shapes `(1,100,4)`, `(1,100,91)`, and `(1,100,78,78)`. The probe returned the existing memory pointer, and the standard forward completed successfully. +- Learning: The hook is active for ordinary `execute_async_v3`, but it does not solve the accepted CUDA graph replay lifetime problem by itself. A captured graph still needs fixed output addresses; changing those addresses per replay would require graph update or a different graph-state pool, and prior borrowed/pool variants lost. Do not treat output allocators alone as a clone-removal optimization for the current graph path. + +### Rejected: CUDA Driver API Input D2D Copy + +- Hypothesis: The graph-to-graph gap still contains the next-frame input Device-to-Device copy. 
Calling `cuMemcpyDtoDAsync_v2` directly on the PyTorch stream might reduce PyTorch dispatch overhead for this copy and shrink the remaining handoff gap. +- Change tested: Temporary env-gated path only; `RFDETR_DRIVER_D2D_INPUT_COPY=true` loaded `libcuda.so.1` with `ctypes` and copied the TensorRT graph input buffer with `cuMemcpyDtoDAsync_v2` on the graph replay stream. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared CUDA graph execution against standard non-graph TensorRT execution over all `538` frames after postprocess: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: An initial three-run comparison measured gated `243.81`, `243.62`, and `243.17` FPS versus default `243.31`, `243.11`, and `243.24` FPS. A later alternating four-pair depth-2 run measured gated `244.53`, `243.81`, `243.32`, and `243.56` FPS versus default `242.50`, `243.69`, `243.22`, and `243.83` FPS. Combined, the gated path averaged about `243.69` FPS versus default about `243.27` FPS, roughly a `0.17%` difference inside normal benchmark noise. +- Learning: The direct CUDA driver copy is correct but not a stable performance win, and it adds an undesirable `ctypes`/driver-ABI dependency in the hot path. The experiment was reverted; the accepted PyTorch stream copy remains the simpler and effectively equivalent path. + +### Profile: Depth-2 Low-Bubble Nsight Systems Refresh After Driver-Copy Revert + +- Request: Capture a fresh Nsight Systems report for user timeline analysis after rejecting the direct CUDA driver input-copy experiment. Pipeline depth remained fixed at `2`; depth `3` was not tested. 
+- Profile: `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_011216.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_011216.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_011216_stats_cuda_gpu_kern_sum_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_011216_stats_cuda_gpu_mem_time_sum_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_011216_stats_cuda_api_sum_cuda_api_sum.csv`. +- Result under graph-node profiling overhead: `frames=538 elapsed=2.33s fps=231.24`. A non-profiled sanity check immediately after the profile measured `frames=538 elapsed=2.20s fps=244.46`. +- Graph-node structure: The trace contains `602` TensorRT graph replays. Reconstructing sorted graph-node activity chunks gives exactly `242` graph activities per replay: `239` kernels and `3` graph memsets, using `6` CUDA streams. +- Graph-node timing after skipping `64` capture warmup replays plus `100` settling launches: envelope duration was p50 `4126.397 us`, p90 `4183.783 us`, p95 `4187.587 us`, mean `4132.945 us`; union GPU-busy time inside the graph envelope was p50 `3912.257 us`, p90 `3968.498 us`, p95 `3971.696 us`, mean `3917.802 us`; internal no-activity idle was p50 `214.667 us`, p90 `220.699 us`, p95 `222.041 us`, mean `215.143 us`. The mean graph busy fraction was `94.79%`, with mean overlap factor `1.0407`. +- Graph-to-graph spacing after the same skip: start-to-start interval was p50 `4168.861 us`, p90 `4225.980 us`, p95 `4230.582 us`, mean `4177.701 us`; graph end-to-next-start gap was p50 `42.271 us`, p90 `43.712 us`, p95 `44.396 us`, mean `44.702 us`. +- Gap decomposition: non-graph GPU work inside the graph-to-graph gap was p50 `37.823 us`, p90 `39.616 us`, p95 `40.357 us`, mean `38.536 us`; true idle was p50 `4.352 us`, p90 `5.088 us`, p95 `5.248 us`, mean `6.166 us`. 
+- Gap occupants: the largest clipped occupants were TensorRT mask Device-to-Device clone (`2433600B`, `13.126 us/gap`), next-frame input Device-to-Device copy (`1168128B`, `13.112 us/gap`), vectorized elementwise aggregate (`11.619 us/gap`), boxes Device-to-Device clone (`1600B`, `4.585 us/gap`), logits Device-to-Device clone (`36400B`, `2.100 us/gap`), and `_select_topk_boxes_kernel` (`1.933 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`808.846 us/replay`), 12 Myelin MHA nodes (`586.353 us/replay`), 12 fused FP16/FP32 GEMM nodes (`481.540 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.579 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.520 us/replay`). +- Learning: The fresh depth-2 trace is still graph-body bound. The next TensorRT CUDA graph starts about `42 us` after the previous graph envelope ends, and only about `4-5 us` p50 of that gap is true idle. The remaining gap is dominated by required graph-output clones, the next input copy, and the fused postprocess selector; the run is already within roughly `1%` of the pure CUDA graph forward-pass ceiling. + +### Rejected: Skip RFDETR TRT Postprocess `record_stream` + +- Hypothesis: RFDETR TRT postprocess records each TensorRT output clone on the postprocess stream before launching fused postprocess kernels. Since the tensors stay alive until the current stream waits on the postprocess stream, removing this host-side caching-allocator bookkeeping might reduce the remaining CPU launch overhead without affecting stream ordering. +- Change tested: Temporary code only; removed the `for result_element in model_results: result_element.record_stream(self._post_process_stream)` loop from `RFDetrForInstanceSegmentationTRT.post_process(...)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. 
+- Correctness: Compared CUDA graph execution against standard non-graph TensorRT execution over all `538` frames after postprocess: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `244.43`, `243.11`, `243.22`, `243.18`, and `243.26` FPS, averaging about `243.44` FPS. This is within the accepted path's normal noise band and does not improve the current checkpoint. +- Learning: `record_stream` bookkeeping is not a measurable limiter at the current graph-bound point. Keeping it preserves the conservative cross-stream tensor lifetime pattern used by the model without sacrificing FPS, so the experiment was reverted. + +### Rejected: Avoid Zero-Filling Fused Selector Query Indices + +- Hypothesis: `fused_select_topk_boxes(...)` initializes the `query_indices` output with `torch.zeros(...)`, even though consumed entries are written by `_select_topk_boxes_kernel` before use and all consumers are bounded by the selected-count tensor. Switching this allocation to `torch.empty(...)` might remove a tiny CUDA memset or allocator initialization from the graph-to-graph handoff. +- Change tested: Temporary code only; changed the CUDA `query_indices` allocation in `fused_select_topk_boxes(...)` from `torch.zeros(...)` to `torch.empty(...)`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared CUDA graph execution against standard non-graph TensorRT execution over all `538` frames after postprocess: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `244.03`, `244.51`, `242.45`, `242.99`, and `243.97` FPS, averaging about `243.59` FPS. This is still within the accepted path's normal noise band and does not clearly improve the current checkpoint. 
+- Learning: The query-index zero fill is not a measurable limiter at the current graph-bound point. The allocation was restored to the existing zeroed form because the change has no stable FPS benefit and the explicit initialization is more conservative for future consumers. + +### Rejected: Disable TensorRT Context Profile Emission + +- Hypothesis: The accepted TensorRT execution context reports `enqueue_emits_profile=True` and `nvtx_verbosity=LAYER_NAMES_ONLY` even without an attached profiler. Disabling profile emission and lowering NVTX verbosity on graph and non-graph contexts might reduce TensorRT enqueue/capture overhead or graph replay metadata work. +- Change tested: Temporary code only; set `context.enqueue_emits_profile = False` and attempted to set `context.nvtx_verbosity = trt.ProfilingVerbosity.NONE` for standard and CUDA-graph TensorRT execution contexts. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared CUDA graph execution against standard non-graph TensorRT execution over all `538` frames after postprocess: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command: depth `2` measured `243.82`, `243.81`, `242.68`, `242.93`, and `243.54` FPS, averaging about `243.36` FPS. This is within the accepted path's normal noise band and does not improve the current checkpoint. +- Learning: TensorRT context profile emission and NVTX layer-name verbosity are not measurable FPS limiters for the accepted CUDA graph replay path. The experiment was reverted to keep the generic TRT helper behavior unchanged. + +### Diagnostic: Current TensorRT Graph Replay Ceiling + +- Hypothesis: Recent depth-2 nsys traces show only about `4-5 us` p50 true idle between graph replays. Measuring the replay-only path directly can quantify how much headroom remains outside the TensorRT graph body. 
+- Diagnostic: Loaded one frame from `vehicles_312px.mp4`, preprocessed it once, captured the RFDETR TensorRT CUDA graph, then measured `1000` iterations with CUDA events on the graph stream. Pipeline depth was not varied and depth `3` was not tested. +- Results: graph replay only measured `4.066539 ms` (`245.91 fps`); input copy plus graph replay measured `4.076774 ms` (`245.29 fps`); input copy plus graph replay plus output clones measured `4.105473 ms` (`243.58 fps`); the model `forward(...)` CUDA graph path measured `4.107662 ms` (`243.45 fps`). +- Learning: The accepted workflow's normal `243-245 fps` range is already essentially at the TensorRT forward-path ceiling once required input handoff and output clones are included. Further local FPS gains need to reduce the TensorRT graph body or safely remove/replace output clone ownership, not tune CPU scheduling or Python dispatch. + +### Diagnostic: Official Package Metadata Refresh After Graph-Ceiling Check + +- Hypothesis: Since local ONNX rebuilds and runtime knobs have not produced a correct faster graph body, a newer official T4 TensorRT package would be the most credible remaining graph-body optimization path. +- Diagnostic: Re-queried the Roboflow weights provider through the local `inference_models` path for `rfdetr-seg-nano`. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Result: The provider still resolves `rfdetr-seg-nano` to `coco-dataset-vdnr1/41` and exposes the same `6` packages: L4 TRT FP32 `3e3ddd85586b43e4fac6d319fb2927fd`, ONNX FP32 `5362b72bfb9f01d2e0b8cba2048d932c`, L4 TRT FP16 `89d1f41e2af4f4f3ffcdfb77e774d26a`, Torch FP32 `8b8da2fe824240522a39f3cde41aafae`, T4 TRT FP32 `bbc2cc23adf6f5e71a9241956081da96`, and T4 TRT FP16 `c70f32369a54d61e06ef4e6b56c82524`. +- Sanity check: The accepted depth-2 benchmark command measured `frames=538 elapsed=2.20s fps=244.52` after the metadata refresh. +- Learning: There is still no newer official T4 FP16 package to test. 
The accepted T4 FP16 engine remains the only correctness-equivalent package source, so additional graph-body gains still require a new official artifact or an exact export source for a correctness-preserving rebuild. + +### Diagnostic: TensorRT Graph-Body Stream Imbalance + +- Hypothesis: The graph-node trace's `~215 us` internal no-activity time and `6` TensorRT streams per replay might hide a schedulable parallelism opportunity inside the captured TensorRT graph body. +- Diagnostic: Reused `/tmp/rfdetr_depth2_lowbubble_depth2only_20260524_011216.sqlite`, reconstructed `602` graph replays as sorted chunks of `242` graph activities, and analyzed the `438` steady-state replays after skipping `64` capture warmups plus `100` settling launches. Pipeline depth was not varied and depth `3` was not tested. +- Result: Every replay had exactly `239` kernels, `3` graph memsets, and `6` CUDA streams. The same main stream carried the critical work in all steady-state replays, averaging `3383.099 us` busy time with `157` activities and no start/end slack. The auxiliary streams were much smaller and started late: the largest aux stream averaged `412.518 us` busy time, started around `3449.0 us` into the replay, and ended about `65.855 us` before graph end; the remaining aux streams averaged `114.130`, `79.514`, `66.973`, and `20.897 us` busy time. +- Internal gap shape: There were about `220` tiny no-activity gaps per replay. Individual internal gaps were p50 `0.864 us`, p90 `1.472 us`, p95 `3.488 us`, p99 `3.872 us`, mean `0.977 us`, max `11.296 us`; the max gap per replay was p50 `4.175 us`, p90 `6.399 us`, p95 `7.520 us`, mean `4.679 us`. +- Kernel-family time per replay: broad aggregation showed GEMM/convolution kernels at about `2686.582 us/replay`, Myelin kernels at `757.320 us/replay`, MHA kernels at `607.977 us/replay`, graph memsets at `13.166 us/replay`, and other kernels at `12.084 us/replay`. 
+- Learning: The TensorRT graph body is dominated by one long main-stream chain of GEMM/MHA/Myelin work, with auxiliary streams only contributing a late tail. The internal idle is fragmented into sub-microsecond launch/dependency gaps rather than one large schedulable bubble. More explicit aux-stream or Python scheduling tweaks are unlikely to help; meaningful graph-body gains still need different correct TensorRT tactics/export or replacing repeated transformer-block work. + +### Profile: TensorRT Top GEMM Warp/Scheduler Nsight Compute Snapshot + +- Hypothesis: The latest low-bubble trace shows the run is already paced by the TensorRT CUDA graph, with only about `4-5 us` p50 true idle between graph replays. Profiling the single largest FP16 GEMM with deeper Nsight Compute scheduler sections can distinguish whether the graph-body ceiling is memory bandwidth, scheduler underfill, or a CPU/pipeline handoff artifact. +- Profile: `/tmp/rfdetr_trt_topgemm_warp_sched_ncu_20260524_013956.ncu-rep`, details text `/tmp/rfdetr_trt_topgemm_warp_sched_ncu_20260524_013956_details.txt`, and raw CSV `/tmp/rfdetr_trt_topgemm_warp_sched_ncu_20260524_013956_raw.csv`. The NCU command used graph profiling mode `node`, matched `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt`, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.08s fps=105.91`, profiling overhead only and not comparable to normal benchmark FPS. +- Launch shape: The sampled launch used grid `(6, 3, 3)` for `54` CTAs, block size `128`, `166` registers/thread, `16.38 KiB` dynamic shared memory per block, and `0.45` waves/SM on the `40`-SM T4. 
Theoretical occupancy was `37.50%`, limited by registers, while achieved occupancy was `17.37%` with `5.56` active warps/SM. +- Scheduler and stall findings: The kernel had only `21.86%` cycles with at least one eligible warp, issued `0.22` warp/scheduler/cycle, and had `78.14%` no-eligible cycles. Active warps per scheduler averaged `1.39`, eligible warps per scheduler averaged `0.25`, and warp cycles per issued instruction were `6.33`. Nsight Compute reported the top stall as fixed-latency execution dependency at about `2.4` cycles, roughly `37.2%` of cycles between issued instructions. +- Throughput: Duration was `66.752 us` under NCU replay. Compute throughput was `45.78%`, memory throughput `27.67%`, DRAM throughput `13.89%`, L1/TEX throughput `55.37%`, L2 throughput `20.18%`, and measured memory throughput about `44.13 GB/s`. +- Learning: The dominant TensorRT graph node is not memory-bandwidth limited and not waiting on CPU pipeline work. It is a small-grid, register-heavy TensorRT tactic with too few eligible warps to keep schedulers full. This confirms the depth-2 run is now bottlenecked by CUDA graph forward-pass tactic geometry plus the small required postprocess/handoff kernels; reducing graph-to-graph bubbles further is below useful measurement noise unless a correctness-equivalent engine/tactic/export change shortens the graph body. + +### Rejected: TensorRT Persistent Activation Cache Limit + +- Hypothesis: Since Nsight Compute shows the dominant TensorRT GEMM is scheduler/occupancy limited rather than CPU-bound, enabling TensorRT's `IExecutionContext.persistent_cache_limit` might improve graph-body locality or reduce repeated activation traffic without changing model outputs. +- Change tested: Temporary env-gated hook only; when `INFERENCE_TRT_PERSISTENT_CACHE_LIMIT` was set, `_execute_trt_engine(...)` and `_capture_cuda_graph(...)` assigned that byte value to the TensorRT execution context's `persistent_cache_limit`. 
The benchmark was run at depth `2` only; depth `3` was not tested. +- Result on requested command: default measured `244.45` FPS. Nonzero cache limits were rejected by TensorRT on this Tesla T4 because `cudaDeviceProp.persistingL2CacheMaxSize` is `0`: `1048576` bytes measured `243.42` FPS, `2097152` bytes measured `243.60` FPS, and `4194304` bytes measured `242.71` FPS, all with `IExecutionContext::setPersistentCacheLimit` API usage errors. +- Learning: The accepted T4 environment cannot use TensorRT persistent L2 activation caching. This runtime knob cannot reduce the current graph-body bottleneck, and the temporary hook was reverted. + +### Rejected: CUDA Driver API Output D2D Copies + +- Hypothesis: The accepted CUDA graph replay path clones all three TensorRT graph output buffers with PyTorch `clone()`. Calling `cuMemcpyDtoDAsync_v2` directly for output copies might reduce PyTorch dispatch overhead in the graph-to-graph handoff while preserving cloned-output lifetime isolation. +- Change tested: Temporary env-gated path only; with `RFDETR_DRIVER_D2D_OUTPUT_COPY=1`, the graph cache-hit path allocated `torch.empty_like(...)` outputs and copied each graph-owned TensorRT output into them with `cuMemcpyDtoDAsync_v2` on the graph replay stream. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared default graph-output clones against driver-D2D output copies over all `538` frames from `vehicles_312px.mp4` at the raw TensorRT output level: `bad_raw_outputs=0`, `max_delta=0.0`, and per-output max deltas `[0.0, 0.0, 0.0]`. +- Result on requested command: An alternating depth-2 sweep measured default `244.13` and `242.67` FPS versus driver-D2D `243.92` and `242.64` FPS. The driver path does not improve throughput and adds an undesirable `ctypes`/driver-ABI dependency in the hot path. +- Learning: PyTorch `clone()` is already effectively optimal for these TensorRT output copies in the current schedule. 
The remaining handoff cost is not Python copy dispatch; it is the required D2D traffic and lifetime decoupling around the graph replay. The experiment was reverted. + +### Profile: TensorRT Dominant MHA Warp/Scheduler Nsight Compute Snapshot + +- Hypothesis: The second-largest graph-body family is the repeated TensorRT `_gemm_mha_v2` kernel. Profiling it with the same scheduler sections as the top GEMM can show whether MHA is also limited by TensorRT tactic occupancy and fixed-latency stalls, or whether it exposes a different memory/cache target. +- Profile: `/tmp/rfdetr_trt_mha_warp_sched_ncu_20260524_015605.ncu-rep`, details text `/tmp/rfdetr_trt_mha_warp_sched_ncu_20260524_015605_details.txt`, and raw CSV `/tmp/rfdetr_trt_mha_warp_sched_ncu_20260524_015605_raw.csv`. The NCU command used graph profiling mode `node`, matched `_gemm_mha_v2_0x7daddb359f728ff2e600188f192f4549`, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.18s fps=103.79`, profiling overhead only and not comparable to normal benchmark FPS. +- Launch shape: The sampled MHA launch used grid `(1, 6, 11)` for `66` CTAs, block size `128`, `245` registers/thread, `24.58 KiB` dynamic shared memory per block, and `0.82` waves/SM on the `40`-SM T4. Theoretical occupancy was `25%`, limited by both registers and shared memory, while achieved occupancy was `20.49%` with `6.56` active warps/SM. +- Scheduler and stall findings: The kernel had only `25.84%` cycles with at least one eligible warp, issued `0.26` warp/scheduler/cycle, and had `74.16%` no-eligible cycles. Active warps per scheduler averaged `1.62`, eligible warps per scheduler averaged `0.30`, and warp cycles per issued instruction were `6.29`. 
Nsight Compute reported fixed-latency execution dependency stalls at about `1.9` cycles, roughly `30.8%` of cycles between issued instructions. +- Throughput: Duration was `106.27 us` under NCU replay. Compute throughput was `35.52%`, memory throughput `25.09%`, DRAM throughput `6.49%`, L1/TEX throughput `50.18%`, L2 throughput `13.51%`, and measured memory throughput about `20.33 GB/s`. +- Learning: The dominant MHA node has the same core pathology as the top GEMM: too few CTAs and too few eligible warps to fill T4 schedulers, with low DRAM pressure. The remaining graph-body bottleneck is repeated TensorRT transformer tactic geometry; reducing it likely requires a correctness-equivalent engine/export/tactic change or replacing repeated transformer subgraphs, not more CPU/GPU pipelining around the already tight depth-2 handoff. + +### Profile: TensorRT Fused FP16/FP32 GEMM Warp/Scheduler Nsight Compute Snapshot + +- Hypothesis: The third-largest graph-body family is the repeated fused FP16-input/FP32-accumulating GEMM. Profiling it with scheduler sections checks whether it shares the same underfilled tactic shape as the dominant FP16 GEMM and MHA nodes. +- Profile: `/tmp/rfdetr_trt_fusedgemm_warp_sched_ncu_20260524_015854.ncu-rep`, details text `/tmp/rfdetr_trt_fusedgemm_warp_sched_ncu_20260524_015854_details.txt`, and raw CSV `/tmp/rfdetr_trt_fusedgemm_warp_sched_ncu_20260524_015854_raw.csv`. The NCU command used graph profiling mode `node`, matched `sm75_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_fused`, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.35s fps=100.48`, profiling overhead only and not comparable to normal benchmark FPS. 
+- Launch shape: The sampled fused GEMM launch used grid `(6, 12, 1)` for `72` CTAs, block size `128`, `240` registers/thread, `17.41 KiB` dynamic shared memory per block, and `0.90` waves/SM on the `40`-SM T4. Theoretical occupancy was `25%`, limited by registers, while achieved occupancy was `21.66%` with `6.93` active warps/SM. +- Scheduler and stall findings: The kernel had only `29.14%` cycles with at least one eligible warp, issued `0.29` warp/scheduler/cycle, and had `70.86%` no-eligible cycles. Active warps per scheduler averaged `1.70`, eligible warps per scheduler averaged `0.36`, and warp cycles per issued instruction were `5.85`. Nsight Compute reported fixed-latency execution dependency stalls at about `1.9` cycles, roughly `32.1%` of cycles between issued instructions. +- Throughput: Duration was `82.75 us` under NCU replay. Compute throughput was `48.67%`, memory throughput `32.02%`, DRAM throughput `14.94%`, L1/TEX throughput `64.03%`, L2 throughput `21.39%`, and measured memory throughput about `47.18 GB/s`. +- Learning: The fused GEMM is also not memory-bandwidth limited; it is another partial-wave, register-heavy TensorRT tactic with low scheduler eligibility. The three largest graph-body families now point to the same optimization class: only a correctness-equivalent engine/export/tactic change, or a much larger replacement of repeated transformer GEMM/MHA subgraphs, is likely to move the forward-pass ceiling materially. + +### Rejected: Vertical-Only RFDETR Resize Prototype + +- Hypothesis: The benchmark video frames are `312 x 176` and the TensorRT input is `312 x 312`, so preprocessing preserves width and only changes height. A vertical-only bilinear resize could avoid PIL's generic two-axis resize while preserving the effective RFDETR input contract. +- Diagnostic: Temporary prototype only; compared a half-pixel vertical bilinear NumPy resize against `PIL.Image.resize(..., Image.Resampling.BILINEAR)` on benchmark frames. 
The closest round-half-up variant still differed from PIL by up to `1` uint8 value with `1433` differing channel values on the first frame, so it is not pixel-exact. +- Micro-result: On `256` benchmark frames, `np.asarray(Image.fromarray(frame).resize((312, 312), Image.Resampling.BILINEAR))` measured about `0.839 ms/frame`, while the NumPy vertical-only implementation measured about `3.472 ms/frame`. +- Learning: A naive vertical-only Python/NumPy resize is both slower than PIL's C resize and not exactly pixel-compatible. Given prior OpenCV/tensor resize attempts failed prediction compatibility, this route is not worth patching into the hot path without a compiled PIL-equivalent vertical kernel. + +### Profile: Depth-2 Local Low-Bubble Nsight Systems Refresh + +- Request: Capture a fresh Nsight Systems report for timeline analysis while keeping the workflow at pipeline depth `2`; depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_local_lowbubble_20260524_020847.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_local_lowbubble_20260524_020847.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_local_lowbubble_20260524_020847_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_local_lowbubble_20260524_020847_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_local_lowbubble_20260524_020847_stats_cuda_api_sum.csv`. The capture used `PYTHONPATH=/app/inference_models` and `--pipeline_depth 2`. +- Result under CUDA graph node tracing overhead: `frames=538 elapsed=2.35s fps=228.88`. A normal non-profiled sanity run immediately afterward measured `frames=538 elapsed=2.21s fps=243.90`. +- Graph-node structure: The trace contains `602` TensorRT graph replays with exactly `242` graph activities per replay: `239` kernels and `3` graph memsets. After skipping `64` capture warmup replays plus `100` settling replays, all `438` steady-state replays used `6` CUDA streams. 
+- Graph-node timing after the same skip: envelope duration was p50 `4123.950 us`, p90 `4184.730 us`, p95 `4189.491 us`, mean `4126.647 us`; union GPU-busy time inside the graph envelope was p50 `3911.685 us`, p90 `3970.583 us`, p95 `3974.831 us`, mean `3913.363 us`; internal no-activity idle was p50 `212.734 us`, p90 `215.974 us`, p95 `217.933 us`, mean `213.284 us`. The mean graph busy fraction was `94.83%`. +- Graph-to-graph spacing after the same skip: graph end-to-next-start gap was p50 `42.399 us`, p90 `43.884 us`, p95 `44.326 us`, mean `42.393 us`. Non-graph GPU work inside the gap was p50 `37.856 us`, p90 `39.263 us`, p95 `39.813 us`, mean `37.884 us`; true idle was p50 `4.416 us`, p90 `5.132 us`, p95 `5.280 us`, mean `4.509 us`. +- Gap occupants: the largest clipped occupants were next-frame input Device-to-Device copy (`1168128B`, `13.186 us/gap`), TensorRT mask Device-to-Device clone (`2433600B`, `13.140 us/gap`), four vectorized elementwise kernels in aggregate (`11.616 us/gap`), boxes Device-to-Device clone (`1600B`, `4.516 us/gap`), `_select_topk_boxes_kernel` (`2.167 us/gap`), and logits Device-to-Device clone (`36400B`, `2.097 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`807.967 us/replay`), 12 Myelin MHA nodes (`585.914 us/replay`), 12 fused FP16/FP32 GEMM nodes (`480.681 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.449 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.357 us/replay`). +- Learning: The requested depth-2 schedule is already tightly graph paced. The graph of the next frame starts about `42 us` after the prior graph envelope ends, and only about `4-5 us` p50 is true idle; the rest is required D2D handoff and small fused postprocess work. The remaining bottleneck is still the TensorRT CUDA graph body, not CPU scheduling or pipeline depth. 
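
The timing bullets above are linked by simple arithmetic, which can be checked with a short sketch. The helper `union_busy_us` and the toy intervals are hypothetical illustrations, not the script used for this log; they only show one way "union GPU-busy time" can be computed from overlapping per-stream `(start, end)` activity intervals. The numeric constants are the means reported above. The log does not say whether the `94.83%` busy fraction is a mean of per-replay fractions or a ratio of means; the ratio of means happens to reproduce it to two decimals.

```python
def union_busy_us(intervals):
    """Total time covered by at least one (start, end) interval.

    Overlapping intervals (for example kernels running concurrently on
    different CUDA streams) are merged so shared time is counted once.
    """
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged run.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current merged run: extend it if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


# Hypothetical toy intervals: two overlapping kernels, then a later one.
toy_intervals = [(0.0, 10.0), (5.0, 12.0), (20.0, 25.0)]
toy_busy = union_busy_us(toy_intervals)  # 12.0 + 5.0 = 17.0
toy_envelope = max(e for _, e in toy_intervals) - min(s for s, _ in toy_intervals)
toy_idle = toy_envelope - toy_busy

# Means reported in this log entry (microseconds).
mean_envelope_us = 4126.647
mean_busy_us = 3913.363
mean_gap_us = 42.393
mean_gap_work_us = 37.884

busy_fraction = mean_busy_us / mean_envelope_us
internal_idle_us = mean_envelope_us - mean_busy_us
true_idle_us = mean_gap_us - mean_gap_work_us

print(f"busy fraction  = {busy_fraction * 100:.2f}%")
print(f"internal idle  = {internal_idle_us:.3f} us")
print(f"true gap idle  = {true_idle_us:.3f} us")
```

The same subtraction explains the Learning bullet: of the roughly `42 us` between graph envelopes, all but about `4.5 us` is occupied by the D2D copies and small postprocess kernels listed under gap occupants.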
+ +### Profile: TensorRT H1688 GEMM Warp/Scheduler Nsight Compute Snapshot + +- Hypothesis: The latest low-bubble trace shows five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes at about `193 us/replay`, making this the fifth-largest TensorRT graph-body family. Profiling it with scheduler sections can determine whether it is a memory-bandwidth target, a better custom-kernel/plugin target, or the same TensorRT tactic-occupancy problem as the larger GEMM/MHA nodes. +- Profile: `/tmp/rfdetr_trt_h1688gemm_warp_sched_ncu_20260524_021351.ncu-rep`, details text `/tmp/rfdetr_trt_h1688gemm_warp_sched_ncu_20260524_021351_details.txt`, and raw CSV `/tmp/rfdetr_trt_h1688gemm_warp_sched_ncu_20260524_021351_raw.csv`. The NCU command used graph profiling mode `node`, matched `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1`, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU overhead: `frames=538 elapsed=5.20s fps=103.44`, profiling overhead only and not comparable to normal benchmark FPS. A non-profiled sanity check afterward measured `frames=538 elapsed=2.20s fps=244.09`. +- Launch shape: The sampled launch used grid `(2, 48, 1)` for `96` CTAs, block size `128`, `194` registers/thread, `32.77 KiB` static shared memory per block, no dynamic shared memory, and `1.20` waves/SM on the `40`-SM T4. Theoretical occupancy was `25%`, limited by both registers and shared memory, while achieved occupancy was `22.47%` with `7.19` active warps/SM. +- Scheduler and stall findings: The kernel had only `16.41%` cycles with at least one eligible warp, issued `0.16` warp/scheduler/cycle, and had `83.59%` no-eligible cycles. Active warps per scheduler averaged `1.77`, eligible warps per scheduler averaged `0.21`, and warp cycles per issued instruction were `10.78`. 
Nsight Compute reported fixed-latency execution dependency stalls at about `3.5` cycles, roughly `32.7%` of cycles between issued instructions. +- Throughput: Duration was `78.82 us` under NCU replay. Compute throughput was `45.62%`, memory throughput `31.46%`, DRAM throughput `28.81%`, L1/TEX throughput `62.98%`, L2 throughput `22.59%`, and measured memory throughput about `91.22 GB/s`. +- Learning: This H1688 GEMM is also not a memory-bandwidth limiter and not a CPU/pipeline artifact. It is another register/shared-memory-limited TensorRT tactic with very low eligible-warp rate. The fifth-largest graph family therefore reinforces the same conclusion as the top GEMM, MHA, and fused GEMM: small local postprocess or stream changes cannot materially move the ceiling; the remaining large lever is a correctness-equivalent TensorRT tactic/export/engine change or a broad replacement of repeated transformer graph-body kernels. + +### Diagnostic: Accepted TensorRT Plan Batch Constraint + +- Hypothesis: The TensorRT graph body is dominated by many underfilled GEMM/MHA kernels. If the accepted plan supported batch size greater than `1`, batching adjacent video frames could increase per-launch work and improve T4 occupancy without changing per-frame prediction math. +- Diagnostic: Loaded the accepted local `rfdetr-seg-nano` TRT package through `PYTHONPATH=/app/inference_models` and inspected `model._trt_config`, `engine.num_optimization_profiles`, TensorRT IO tensor shapes, and profile min/opt/max shapes. Pipeline depth was not varied and depth `3` was not tested. +- Result: The accepted package reports `TRTConfig(static_batch_size=1, dynamic_batch_size_min=None, dynamic_batch_size_opt=None, dynamic_batch_size_max=None)`. The engine has one optimization profile. Input `input` is fixed at `(1, 3, 312, 312)`, and profile min/opt/max are all `(1, 3, 312, 312)`. 
Outputs are also fixed at batch `1`: `dets` `(1, 100, 4)`, `labels` `(1, 100, 91)`, and mask output `4186` `(1, 100, 78, 78)`. +- Learning: Runtime frame batching is not available for the accepted correctness-equivalent T4 FP16 plan. Filling the under-occupied transformer graph kernels through batching would require a new static/dynamic-batch engine built from a behavior-equivalent export source; the available public ONNX/Torch sources have already failed the accepted prediction-compatibility gate, so batching is not a safe local workflow/code change for the current benchmark. + +### Diagnostic: CUDA Graph Upload And Package Refresh + +- Hypothesis: If PyTorch exposed CUDA graph upload or graph-instantiation flags, uploading the captured TensorRT graph before steady replay might reduce graph-launch overhead without changing TensorRT tactics, output ownership, or prediction math. +- Diagnostic: Inspected `torch.cuda.CUDAGraph()` in the local PyTorch `2.6.0+cu124` runtime. The public graph object only exposes `debug_dump`, `enable_debug_mode`, `replay`, and `reset`. The internal `torch._utils_internal.upload_graph(graph)` helper exists but its implementation is an empty `pass`. The raw CUDA graph executable handle is not exposed through the Python object. Pipeline depth was not varied and depth `3` was not tested. +- Result: There is no usable local Python hook for `cudaGraphUpload`, `cudaGraphInstantiate` flags, or graph-launch policy tuning in this runtime. The accepted path already avoids kernel-by-kernel launches through PyTorch graph replay, so this cannot be turned into a safe library-code optimization here. +- Package refresh: Re-queried the Roboflow weights provider through `PYTHONPATH=/app/inference_models` for `rfdetr-seg-nano`. 
It still resolves to `coco-dataset-vdnr1/41` and exposes the same six packages: L4 TRT FP32 `3e3ddd85586b43e4fac6d319fb2927fd`, ONNX FP32 `5362b72bfb9f01d2e0b8cba2048d932c`, L4 TRT FP16 `89d1f41e2af4f4f3ffcdfb77e774d26a`, Torch FP32 `8b8da2fe824240522a39f3cde41aafae`, T4 TRT FP32 `bbc2cc23adf6f5e71a9241956081da96`, and T4 TRT FP16 `c70f32369a54d61e06ef4e6b56c82524`. +- Learning: The current graph-launch stack is not exposing another Python-level CUDA graph tuning lever, and there is still no newer official T4 FP16 or dynamic-batch TensorRT package to test. The remaining graph-body optimization path continues to require a behavior-equivalent export/engine source or a deeper TensorRT/plugin replacement of repeated transformer kernels. + +### Diagnostic: CUDA Graph Launch Lead Time + +- Hypothesis: Even though the GPU graph-to-graph gap is already only about `42 us`, the host might still be launching CUDA graphs late enough that reducing Python or CUDA runtime API overhead could remove the last bubble. +- Diagnostic: Reused `/tmp/rfdetr_depth2_local_lowbubble_20260524_020847.sqlite`, paired the `602` `cudaGraphLaunch_v10000` runtime calls with the `602` TensorRT graph-node activity chunks, and analyzed the steady-state launches after skipping `64` capture-warmup replays plus `100` settling replays. Pipeline depth was not varied and depth `3` was not tested. +- Result: `cudaGraphLaunch_v10000` API duration was p50 `437.104 us`, p90 `464.642 us`, mean `443.636 us`. The next graph launch API usually began before the current graph finished on the GPU: current graph GPU end to next graph-launch API start was p50 `-1619.952 us`, p90 `-1449.055 us`, mean `-1585.506 us`. Despite that host-side lead, the GPU graph end to next graph start remained p50 `42.399 us`, p90 `43.903 us`, mean `42.393 us`. +- Learning: The remaining graph-to-graph gap is not caused by late CPU graph launches. 
The CPU is already queuing the next graph roughly `1.6 ms` before the previous graph finishes; the GPU start is gated by stream dependencies and required handoff work, mainly output clones, next-frame input copy, sigmoid/selector, and the fused mask path. Further CPU launch-path tuning is unlikely to improve FPS unless it also removes required GPU work or shortens the TensorRT graph body. + +### Profile: TensorRT Internal TopK Scheduler Nsight Compute Snapshot + +- Hypothesis: The current graph-node trace shows TensorRT's internal Myelin TopK node as a single-block tail. A scheduler/warp-state NCU snapshot can determine whether this is a plausible future custom-plugin/export target or just an unimportant memory tail. +- Profile: `/tmp/rfdetr_trt_myelin_topk_scheduler_ncu_20260524_023209.ncu-rep`, details text `/tmp/rfdetr_trt_myelin_topk_scheduler_ncu_20260524_023209_details.txt`, raw CSV `/tmp/rfdetr_trt_myelin_topk_scheduler_ncu_20260524_023209_raw.csv`. The NCU command used graph profiling mode `node`, matched `regex:__myl_Topk_0x570b8f0fdb8e2c6c3fd1a15f1595090d`, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `89-95` FPS, which is profiling overhead only and not comparable to normal benchmark FPS. A normal non-profiled sanity run afterward measured `frames=538 elapsed=2.22s fps=242.86`. +- Findings: The sampled TopK launch used grid `(1, 1, 1)`, block `256`, `99` registers/thread, `16.90 KiB` static shared memory, and only `0.01` waves/SM on the `40`-SM T4. Theoretical occupancy was `50%`, achieved occupancy was `24.91%`, and achieved active warps were `7.97` per SM only because one block was resident on one SM. 
Scheduler issue rate was `0.34` warp/scheduler/cycle with `33.73%` cycles having at least one eligible warp and `66.27%` no-eligible cycles. Duration was `32.96 us` under NCU replay, with compute throughput `1.37%`, memory throughput `1.37%`, DRAM throughput `0.28%`, and L1/TEX throughput `58.12%`. +- Learning: This TopK node is a clearly underfilled TensorRT-internal kernel, but it is a small single node in the serialized graph and cannot be replaced from the current Python postprocess boundary. It is only worth pursuing as part of a correctness-equivalent export/plugin/engine redesign that can remove or fuse the TensorRT internal TopK; local pipeline, stream, or postprocess tweaks will not access it. + +### Rejected: Overlap TensorRT Output Clones With Next Input Copy + +- Hypothesis: The graph-to-graph gap contains both TensorRT output D2D clones and the next-frame input D2D copy. Cloning graph outputs on a separate copy stream, letting the graph stream copy the next input while previous output clones are in flight, and waiting on the copy stream only before replay overwrites graph-owned outputs might reduce the fixed handoff tail while preserving cloned-output lifetime isolation. +- Change tested: Temporary env-gated code only; with `RFDETR_TRT_OVERLAP_OUTPUT_CLONES_WITH_INPUT_COPY=true`, the CUDA graph cache-hit path created a persistent output-copy stream, copied the next input on the graph stream, waited for previous output clones before graph replay, then cloned current outputs on the copy stream. The cloned tensors carried their producer stream, and RFDETR postprocess waited on that stream instead of the inference stream. Pipeline depth remained fixed at `2`; depth `3` was not tested. +- Correctness: Compared the gated path against the default cloned-output path over all `538` frames after deferred fused postprocess: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. 
+- Result on requested command: Alternating depth-2 runs measured default `243.94`, `244.16`, and `243.50` FPS versus overlap `242.37`, `241.90`, and `242.08` FPS. +- Learning: The extra copy stream and producer-stream handoff cost more than any overlap between output clones and the next input copy. The accepted single graph-stream clone path preserves the better schedule; the temporary code was reverted. + +### Diagnostic: TensorRT Context Runtime Knob Surface + +- Hypothesis: After rejecting stream, clone, graph-upload, aux-stream, persistent-cache, allocator, and rebuild paths, there might still be an exposed TensorRT `IExecutionContext` runtime configuration knob that can affect captured graph replay without changing the serialized engine or predictions. +- Diagnostic: Loaded the accepted local `rfdetr-seg-nano` T4 FP16 plan through `PYTHONPATH=/app/inference_models`, inspected `ICudaEngine` and `IExecutionContext` public attributes, and called `IExecutionContext.get_runtime_config()`. Pipeline depth was not varied and depth `3` was not tested. +- Result: The accepted engine reports `num_aux_streams=4`, `profiling_verbosity=LAYER_NAMES_ONLY`, `tactic_sources=8`, `hardware_compatibility_level=NONE`, `device_memory_size_v2=18289152`, and one optimization profile. The execution context reports `active_optimization_profile=0`, `debug_sync=False`, `enqueue_emits_profile=True`, `nvtx_verbosity=LAYER_NAMES_ONLY`, `persistent_cache_limit=0`, `profiler=None`, `temporary_allocator=None`, and `get_runtime_config()` returns `None`. +- Learning: The remaining mutable context surface is the same set already tested or intentionally avoided: aux streams, profile selection, NVTX/profile emission, persistent cache, temporary allocator, output allocator, external device memory, input-consumed event, debug/profiler hooks, and shape/memory refresh. 
There is no additional Python-level TensorRT runtime-config control available for this accepted plan, so local work should continue to focus on correctness-equivalent engine/export/plugin changes or direct evidence gathering rather than another context property toggle. + +### Profile: TensorRT 128x64 FP16/FP32 GEMM Warp/Scheduler Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize128x64x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` as the fourth-largest TensorRT GEMM family at about `221 us/replay` across twelve graph nodes. A scheduler/warp-state NCU snapshot can confirm whether this family has the same underfilled tactic geometry as the top GEMM/MHA families or a different memory bottleneck. +- Profile: `/tmp/rfdetr_trt_gemm128x64_f16f32_warp_sched_ncu_20260524_025300.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x64_f16f32_warp_sched_ncu_20260524_025300_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm128x64_f16f32_warp_sched_ncu_20260524_025300_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `100-105` FPS, which is profiling overhead only and not comparable to normal benchmark FPS. A normal non-profiled sanity run afterward measured `frames=538 elapsed=2.20s fps=244.33`. +- Findings: The sampled launch used grid `(6, 6, 1)` for `36` CTAs, block `128`, `186` registers/thread, `24.58 KiB` dynamic shared memory, and only `0.45` waves/SM on the `40`-SM T4. Theoretical occupancy was `25%`, achieved occupancy was `12.63%`, and active warps per scheduler averaged `0.99`. 
Scheduler issue rate was `0.16` warp/scheduler/cycle, with only `15.89%` cycles having at least one eligible warp and `84.11%` no-eligible cycles. Duration was `32.90 us` under NCU replay, with compute throughput `30.11%`, memory throughput `26.30%`, DRAM throughput `12.52%`, L1/TEX throughput `52.60%`, and L2 throughput `19.83%`. +- Learning: This fourth-largest GEMM family is another register/shared-memory-heavy, partial-wave TensorRT tactic with very low eligible-warp rate, not a memory-bandwidth or CPU pipeline bottleneck. The evidence now covers the dominant GEMM/MHA families consistently: improving the forward-pass ceiling requires a different correctness-equivalent TensorRT tactic/export/plugin strategy across repeated transformer blocks, not another local postprocess or stream scheduling change. + +### Profile: TensorRT 32x32 FP16 TN GEMM Warp/Scheduler Snapshot + +- Hypothesis: The current graph-node trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize32x32x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` at about `133 us/replay` across sixteen graph nodes. The basic profile showed very small grids; a scheduler/warp-state snapshot can quantify whether this tail is one of the clearest partial-wave TensorRT tactic problems. +- Profile: `/tmp/rfdetr_trt_gemm32x32_f16tn_warp_sched_ncu_20260524_030232.ncu-rep`, details text `/tmp/rfdetr_trt_gemm32x32_f16tn_warp_sched_ncu_20260524_030232_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm32x32_f16tn_warp_sched_ncu_20260524_030232_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `100` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. 
+- Result under NCU application-replay overhead: repeated profiling passes measured about `100-104` FPS, which is profiling overhead only and not comparable to normal benchmark FPS. A normal non-profiled sanity run afterward measured `frames=538 elapsed=2.19s fps=245.78`. +- Findings: The sampled launch used grid `(4, 3, 1)` for only `12` CTAs, block `128`, `63` registers/thread, `8.19 KiB` dynamic shared memory, and `0.04` waves/SM on the `40`-SM T4. Theoretical occupancy was `100%`, but achieved occupancy was only `12.52%` because the grid cannot fill the device. Scheduler issue rate was `0.11` warp/scheduler/cycle, with only `10.66%` cycles having at least one eligible warp and `89.34%` no-eligible cycles. Duration was `12.86 us` under NCU replay, with compute throughput `3.77%`, memory throughput `4.99%`, DRAM throughput `4.09%`, L1/TEX throughput `20.10%`, and L2 throughput `4.29%`. +- Learning: This 32x32 TN GEMM family is an extreme example of the remaining TensorRT graph-body issue: the kernel itself could run at high occupancy if enough CTAs existed, but the serialized tactic launches far fewer blocks than the T4 has SMs. The lost time is tactic geometry inside the accepted engine, not memory bandwidth, CPU launch latency, or postprocess scheduling. + +### Profile: TensorRT Depthwise Convolution Warp/Scheduler Snapshot + +- Hypothesis: The low-bubble trace shows `sm50_xmma_convolution_depthwiseHMMA_FP16NHWCx8_TR3_TS3_STRIDEH1_STRIDEW1_execute_kernel_trt` at about `130 us/replay`, comparable to the 32x32 GEMM family. Profiling it with scheduler counters can distinguish whether this is another underfilled TensorRT tactic or a more normal, device-filling convolution tail. +- Profile: `/tmp/rfdetr_trt_depthwise_conv_warp_sched_ncu_20260524_031734.ncu-rep`, details text `/tmp/rfdetr_trt_depthwise_conv_warp_sched_ncu_20260524_031734_details.txt`, and raw CSV `/tmp/rfdetr_trt_depthwise_conv_warp_sched_ncu_20260524_031734_raw.csv`. 
The NCU command used graph profiling mode `node`, matched `regex:sm50_xmma_convolution_depthwiseHMMA_FP16NHWCx8_TR3_TS3_STRIDEH1_STRIDEW1_execute_kernel_trt`, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `91-93` FPS, which is profiling overhead only and not comparable to normal benchmark FPS. A normal non-profiled sanity run afterward measured `frames=538 elapsed=2.20s fps=244.01`. +- Findings: The sampled launch used grid `(1521, 1, 1)` for `1521` CTAs, block `256`, `61` registers/thread, no static or dynamic shared memory, and `9.51` waves/SM on the `40`-SM T4. Theoretical occupancy was `100%`, achieved occupancy was `86.25%`, active warps per scheduler averaged `6.89`, and eligible warps per scheduler averaged `1.46`. Scheduler issue rate was `0.47` warp/scheduler/cycle, with `46.51%` cycles having at least one eligible warp and `53.49%` no-eligible cycles. Duration was `66.85 us` under NCU replay, with compute throughput `47.98%`, memory throughput `47.98%`, DRAM throughput `37.10%`, L1/TEX throughput `83.09%`, L2 throughput `25.30%`, and measured memory throughput about `118.45 GB/s`. +- Learning: Unlike the small GEMM and TopK tails, this depthwise convolution fills the T4 well and is not a partial-wave tactic problem. It is still part of the TensorRT CUDA graph body, but the most promising remaining targets remain the repeated low-waves GEMM/MHA/TopK tactics or a correctness-equivalent export/engine/plugin replacement rather than another pipeline-depth, CPU, or postprocess scheduling change. 
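
The waves/SM figures quoted in the last three snapshots can be reproduced from grid size, block size, and theoretical occupancy. The sketch below is an approximation of that relationship, not Nsight Compute's implementation: it assumes `32` resident warps per SM for the T4 (compute capability 7.5) and folds the register and shared-memory limits into the reported theoretical occupancy. The function name and the dictionary keys are hypothetical labels.

```python
import math

T4_SM_COUNT = 40           # SM count as reported in this log
T4_MAX_WARPS_PER_SM = 32   # assumption: compute capability 7.5 limit
WARP_SIZE = 32


def waves_per_sm(num_ctas, block_size, theoretical_occupancy):
    """Approximate waves/SM: CTAs divided by the CTAs the device can hold at once."""
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    # Resident warps allowed by the occupancy limit, converted to whole blocks.
    resident_warps = int(theoretical_occupancy * T4_MAX_WARPS_PER_SM)
    resident_blocks_per_sm = resident_warps // warps_per_block
    return num_ctas / (T4_SM_COUNT * resident_blocks_per_sm)


# (CTAs, block size, theoretical occupancy) taken from the snapshots above.
kernels = {
    "gemm_128x64_f16f32": (36, 128, 0.25),
    "gemm_32x32_f16_tn": (12, 128, 1.00),
    "depthwise_conv": (1521, 256, 1.00),
}

waves = {name: waves_per_sm(*shape) for name, shape in kernels.items()}

for name, value in waves.items():
    print(f"{name}: {value:.4f} waves/SM")
```

Under these assumptions the 128x64 GEMM gets `36 / (40 * 2)`, the 32x32 TN GEMM gets `12 / (40 * 8)`, and the depthwise convolution gets `1521 / (40 * 4)`, which agree with the logged `0.45`, `0.04`, and `9.51` after rounding. This makes the contrast in the Learning bullets concrete: the GEMM tactics launch less than one wave, while the depthwise convolution launches more than nine.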
+ +### Diagnostic: Accepted TensorRT Engine Layer Profile + +- Hypothesis: The NCU kernel snapshots show the graph-body ceiling is repeated TensorRT transformer tactics, but kernel names alone do not identify the owning model layers. TensorRT engine inspector and per-layer profiling can map the expensive families back to the accepted engine's layer names and show whether the cost is concentrated enough for a future plugin/export replacement. +- Diagnostic: Loaded the accepted `rfdetr-seg-nano` T4 FP16 plan from `/tmp/cache/models-cache/rfdetr-seg-nano-da818f3b/c70f32369a54d61e06ef4e6b56c82524/engine.plan` with `engine_host_code_allowed=True` for inspection only. Engine inspector output was written to `/tmp/rfdetr_trt_engine_inspector_20260524_0325.json`. TensorRT reported `261` layers, `4` IO tensors, fixed input `(1, 3, 312, 312)`, outputs `dets`, `labels`, and `4186`, `num_aux_streams=4`, `tactic_sources=8`, and `profiling_verbosity=LAYER_NAMES_ONLY`. +- Layer profile: A direct non-graph TensorRT `IProfiler` run over `100` iterations with a fixed CUDA input wrote `/tmp/rfdetr_trt_layer_profile_direct_20260524_0330.json`. This is not the benchmark path and includes direct-enqueue/profiler overhead; it is only for layer attribution. Wall time was `5.577 ms/iter`, and summed TensorRT-reported layer time was `5.072 ms/iter`. +- Attribution: The largest grouped layer costs were backbone MHA (`16` layers, `0.697 ms/iter`), backbone MLP `fc2` matmuls (`12` layers, `0.584 ms/iter`), backbone Myelin `Fc`/MLP `fc1` layers (`13` layers, `0.565 ms/iter`), backbone QKV packed matmuls (`12` layers, `0.461 ms/iter`), segmentation head layers (`12` layers, `0.455 ms/iter`), decoder layers excluding the packed cross-attention value projection (`30` layers, `0.427 ms/iter`), backbone layernorm/fusion layers (`25` layers, `0.280 ms/iter`), and backbone attention output matmuls (`12` layers, `0.270 ms/iter`). 
The top single layer names were repeated `_gemm_mha_v2_myl2_*` MHA layers at about `0.054-0.057 ms/iter` each. +- Result: A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.33`. +- Learning: The accepted engine profile is broad but structurally concentrated in the DINO/backbone transformer stack, especially repeated MHA and MLP matmuls. Because the engine was built with `LAYER_NAMES_ONLY`, it does not expose per-layer tactic detail beyond names, but the layer attribution matches the NCU evidence and makes a small Python-side postprocess or pipeline-depth change unlikely to move the ceiling. Any major next gain needs a correctness-equivalent engine/export/plugin strategy for repeated backbone transformer layers rather than further CPU scheduling work. + +### Profile: Fused Mask Resize Postprocess Kernel + +- Hypothesis: The depth-2 trace still shows small non-graph postprocess work between graph replays, and the workflow uses a fixed deferred mask resize limit of `7`. Measuring the actual detection-count distribution and profiling `_resize_selected_masks_kernel` can show whether this fixed-limit fused postprocess path is a meaningful remaining optimization target. +- Detection-count diagnostic: A depth-2 workflow sink over all `538` benchmark frames measured count histogram `{1: 15, 2: 104, 3: 164, 4: 145, 5: 74, 6: 14, 7: 22}`, with min `1`, max `7`, mean `3.54`, `0` frames above `7`, and `22` frames exactly `7`. This confirms the current fixed limit is sufficient for this benchmark, but also that roughly half of the fixed seven mask rows are unused on an average frame. +- Profile: `/tmp/rfdetr_fused_resize_masks_sched_ncu_20260524_033256.ncu-rep`, details text `/tmp/rfdetr_fused_resize_masks_sched_ncu_20260524_033256_details.txt`, and raw CSV `/tmp/rfdetr_fused_resize_masks_sched_ncu_20260524_033256_raw.csv`. 
The NCU command matched `regex:_resize_selected_masks_kernel`, skipped `350` launches, collected `1` launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `86-91` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.21s fps=243.95`. +- Findings: The sampled launch used grid `(7, 215, 1)` for `1505` CTAs, block `64`, `46` registers/thread, no shared memory, and `2.35` waves/SM on the `40`-SM T4. Theoretical occupancy was `100%`, achieved occupancy was `78.06%`, active warps per scheduler averaged `6.31`, and eligible warps per scheduler averaged `2.21`. Scheduler issue rate was `0.62` warp/scheduler/cycle, with `61.63%` cycles having at least one eligible warp and `38.37%` no-eligible cycles. Duration was `13.34 us` under NCU replay, with compute throughput `48.83%`, memory throughput `37.35%`, DRAM throughput only `2.03%`, L1/TEX throughput `46.22%`, L2 throughput `3.65%`, and measured memory throughput about `6.48 GB/s`. +- Learning: The fixed-limit mask resize kernel does spend work on unused rows because average detections are below `7`, but the current fused Triton kernel is already well-filled and only about `13 us/frame`. The better target remains the multi-millisecond TensorRT CUDA graph body. A dynamic exact-count mask launch would need a CPU-visible count before launching the resize and would likely reintroduce a synchronization bubble larger than the saved GPU work, so the accepted fixed-limit depth-2 path remains the right local tradeoff. + +### Profile: Fused Top-K Selector Postprocess Kernel + +- Hypothesis: The low-bubble trace shows `_select_topk_boxes_kernel` as one of the small non-graph gap occupants. 
Profiling it with scheduler counters can determine whether this custom Triton selector is still a meaningful local optimization target or only a small fixed tail compared with the TensorRT graph body. +- Profile: `/tmp/rfdetr_fused_select_topk_sched_ncu_20260524_034219.ncu-rep`, details text `/tmp/rfdetr_fused_select_topk_sched_ncu_20260524_034219_details.txt`, and raw CSV `/tmp/rfdetr_fused_select_topk_sched_ncu_20260524_034219_raw.csv`. The NCU command matched `regex:_select_topk_boxes_kernel`, skipped `350` launches, collected `1` launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `86-91` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.30`. +- Findings: The sampled launch used grid `(1, 1, 1)`, block `256`, `128` registers/thread, `32 B` dynamic shared memory, and only `0.01` waves/SM on the `40`-SM T4. Theoretical occupancy was `50%`, achieved occupancy was `24.88%`, active warps per scheduler averaged `1.99`, and eligible warps per scheduler averaged `0.43`. Scheduler issue rate was `0.26` warp/scheduler/cycle, with `26.06%` cycles having at least one eligible warp and `73.94%` no-eligible cycles. Duration was `18.37 us` under NCU replay, with compute throughput `0.93%`, memory throughput `0.95%`, DRAM throughput `0.95%`, L1/TEX throughput `19.87%`, L2 throughput `0.68%`, and measured memory throughput about `3.03 GB/s`. +- Learning: The fused selector is intentionally a single-block global top-k reduction and is underfilled, but its absolute duration is tiny beside the `~4.1 ms` CUDA graph forward pass. Prior selector rewrites were either incorrect or slower; this profile confirms why further selector-only tuning has a low ceiling. 
The primary path remains a correctness-equivalent TensorRT engine/tactic/plugin change for repeated transformer graph-body layers, not more local postprocess scheduling. + +### Profile: Preprocess Resize GPU Kernel + +- Hypothesis: After profiling the fused selector and mask resize kernels, the remaining named non-graph GPU occupant from the low-bubble trace is `ResizeVectorizedH2x4BilinearKernel`. A scheduler profile can show whether preprocessing resize has underfilled-kernel headroom or is already an efficient small fixed cost. +- Profile: `/tmp/rfdetr_resize_vectorized_sched_ncu_20260524_035154.ncu-rep`, details text `/tmp/rfdetr_resize_vectorized_sched_ncu_20260524_035154_details.txt`, and raw CSV `/tmp/rfdetr_resize_vectorized_sched_ncu_20260524_035154_raw.csv`. The NCU command matched `regex:ResizeVectorizedH2x4BilinearKernel`, skipped `350` launches, collected `1` launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `86-90` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.23`. +- Findings: The sampled launch used grid `(1521, 1, 1)` for `1521` CTAs, block `128`, `38` registers/thread, no shared memory, and `4.75` waves/SM on the `40`-SM T4. Theoretical occupancy was `100%`, achieved occupancy was `83.99%`, active warps per scheduler averaged `6.69`, and eligible warps per scheduler averaged `1.86`. Scheduler issue rate was `0.64` warp/scheduler/cycle, with `63.86%` cycles having at least one eligible warp and `36.14%` no-eligible cycles. 
Duration was `21.63 us` under NCU replay, with compute throughput `55.61%`, memory throughput `41.38%`, DRAM throughput `41.38%`, L1/TEX throughput `59.07%`, L2 throughput `32.83%`, and measured memory throughput about `133.35 GB/s`. +- Learning: The preprocessing resize kernel is not a partial-wave or CPU-bubble problem. It fills the T4 well and is only a small fixed cost relative to the TensorRT graph forward pass. Earlier preprocessing rewrites failed correctness or were slower; this profile supports leaving the accepted resize path unchanged and keeping effort on a correctness-equivalent TensorRT graph-body improvement. + +### Rejected: Exact-Count Prediction D2H Copy + +- Hypothesis: The accepted RFDETR workflow conversion copies a fixed `7` mask rows plus metadata rows from GPU to reusable pinned host buffers before reading the deferred `valid_count`. Since the measured detection mean is only `3.54`, synchronizing the one-element count first and then copying only the exact number of prediction rows might reduce D2H traffic enough to improve throughput. +- Change tested: Temporary env-gated code only; with `RFDETR_EXACT_COUNT_D2H=1`, `_try_copy_limited_cuda_detection_tensors_to_pinned_numpy(...)` copied `count_tensor` to pinned host, synchronized, read `valid_count`, copied only `valid_count` boxes/confidences/classes/masks, synchronized again, and returned owned NumPy copies. Pipeline depth stayed fixed at `2`; depth `3` was not tested. +- Correctness: Compared default fixed-limit pinned conversion against exact-count conversion over all `538` frames: `bad_counts=0`, `bad_classes=0`, `bad_masks=0`, `bad_boxes_gt5=0`, `max_box_delta=0.0`, `max_conf_delta=0.0`. +- Result on requested command path: In the comparison run, default measured `frames=538 elapsed=2.200s fps=244.55`; exact-count D2H measured `frames=538 elapsed=2.843s fps=189.24`. After reverting the temporary code, a normal depth-2 sanity run measured `frames=538 elapsed=2.20s fps=244.45`. 
+- Learning: The saved D2H bytes are much less important than preserving one synchronized handoff point. Reading the count before prediction copies serializes CPU materialization and destroys overlap, so the accepted fixed-limit pinned copy path remains the right tradeoff for this benchmark. + +### Profile: TensorRT 64x64 TN Split-K GEMM Scheduler Snapshot + +- Hypothesis: The latest low-bubble graph-body coverage scan showed `sm75_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x64x64_stage1_warpsize2x2x1_tensor16x8x8_execute_split_k_kernel_trt` as the largest remaining unprofiled TensorRT graph-body family, at about `20.6 us/replay` across four graph nodes. A scheduler/occupancy NCU snapshot can determine whether it is another small partial-wave tactic tail or a different bottleneck class. +- Profile: `/tmp/rfdetr_trt_gemm64x64_tn_splitk_sched_ncu_20260524_040701.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x64_tn_splitk_sched_ncu_20260524_040701_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm64x64_tn_splitk_sched_ncu_20260524_040701_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `91-92` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.21s fps=243.45`. +- Findings: The sampled launch used grid `(2, 4, 2)` for only `16` CTAs, block `128`, `43` registers/thread, `4.61 KiB` dynamic shared memory per block, and just `0.05` waves/SM on the `40`-SM T4. Theoretical occupancy was `100%`, but achieved occupancy was only `12.50%` because the grid is much smaller than the device. 
Scheduler issue rate was `0.08` warp/scheduler/cycle, with only `8.41%` cycles having at least one eligible warp and `91.59%` no-eligible cycles. Duration was `10.37 us` under NCU replay, with compute throughput `2.59%`, memory throughput `9.91%`, DRAM throughput `9.91%`, L1/TEX throughput `11.08%`, L2 throughput `4.44%`, and measured memory throughput about `31.25 GB/s`. +- Learning: This split-K GEMM is another underfilled TensorRT tactic tail, but its absolute cost is small compared with the `~4.1 ms` CUDA graph forward pass. It is not a CPU-pipeline problem and not a memory-bandwidth limiter. The dominant remaining limit is still the accumulated TensorRT graph-body transformer work, so meaningful gains need a correctness-equivalent engine/tactic/plugin/export change rather than depth changes or more CPU/postprocess scheduling. + +### Profile: Continued Depth-2 Low-Bubble Nsight Systems Timeline + +- Request: Capture a fresh Nsight Systems timeline for analysis while keeping the benchmark constrained to pipeline depth `2`; depth `3` was not tested. +- Profile: `/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.nsys-rep`, exported SQLite `/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656.sqlite`, and CSV summaries `/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656_stats_cuda_gpu_kern_sum.csv`, `/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656_stats_cuda_gpu_mem_time_sum.csv`, and `/tmp/rfdetr_depth2_lowbubble_cont_20260524_041656_stats_cuda_api_sum.csv`. The capture used `PYTHONPATH=/app/inference_models`, CUDA graph node tracing, and `--pipeline_depth 2`. +- Result under graph-node profiling overhead: `frames=538 elapsed=2.33s fps=231.25`. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.05`. +- Graph-node structure: The trace contains `602` TensorRT CUDA graph replays with exactly `242` graph activities per replay: `239` kernels and `3` graph memsets. 
After skipping `64` capture-warmup replays plus `100` settling launches, all `438` steady-state replays used `6` CUDA streams. +- Graph-node timing after the same skip: envelope duration was p50 `4127.439 us`, p90 `4188.435 us`, p95 `4191.618 us`, mean `4127.492 us`; union GPU-busy time inside the graph envelope was p50 `3914.801 us`, p90 `3973.212 us`, p95 `3975.444 us`, mean `3913.531 us`; internal no-activity idle was p50 `213.629 us`, p90 `217.421 us`, p95 `219.496 us`, mean `213.961 us`. The mean graph busy fraction was `94.82%`. +- Graph-to-graph spacing after the same skip: start-to-start interval was p50 `4169.854 us`, p90 `4230.608 us`, p95 `4234.108 us`, mean `4170.715 us`; graph end-to-next-start gap was p50 `42.496 us`, p90 `44.063 us`, p95 `44.415 us`, mean `43.335 us`. Non-graph GPU work in that gap was p50 `38.112 us`, p90 `39.743 us`, p95 `40.511 us`, mean `38.337 us`; true idle was p50 `4.352 us`, p90 `5.088 us`, p95 `5.184 us`, mean `4.998 us`. +- Gap occupants: the largest clipped occupants were TensorRT mask Device-to-Device clone (`2433600B`, `13.453 us/gap`), next-frame input Device-to-Device copy (`1168128B`, `12.992 us/gap`), vectorized elementwise aggregate (`11.412 us/gap`), boxes Device-to-Device clone (`1600B`, `4.559 us/gap`), `_select_topk_boxes_kernel` (`2.237 us/gap`), and logits Device-to-Device clone (`36400B`, `2.096 us/gap`). +- Top graph occupants: aggregated by short kernel name, the largest graph-body costs were the 25 FP16 GEMM nodes (`806.687 us/replay`), 12 Myelin MHA nodes (`584.698 us/replay`), 12 fused FP16/FP32 GEMM nodes (`479.750 us/replay`), 12 smaller FP16/FP32 GEMM nodes (`221.081 us/replay`), and five `trt_turing_h1688gemm_128x128_ldg8_relu_nn_v1` nodes (`193.018 us/replay`). +- Learning: The continued trace still has the requested low-bubble shape: the next CUDA graph starts about `42-43 us` after the prior graph envelope ends, and only about `4-5 us` of that gap is true idle. 
The CPU is not blocking the model forward pass; the remaining FPS ceiling is the TensorRT graph body plus required D2D handoff and fused postprocess work. + +### Profile: TensorRT Implicit-GEMM Convolution Scheduler Snapshot + +- Hypothesis: The current depth-2 graph-body trace shows `sm75_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage1_warpsize2x2x1_g1_tensor16x8x8_t1r3s3_execute_kernel_trt` at about `126-128 us/replay` across six TensorRT graph nodes. It previously had only a basic NCU snapshot, so a scheduler/warp-state profile can determine whether this convolution-lowered GEMM is another small-grid tactic issue or a healthier convolution target. +- Profile: `/tmp/rfdetr_trt_fprop_implicit_gemm_sched_ncu_20260524_042109.ncu-rep`, details text `/tmp/rfdetr_trt_fprop_implicit_gemm_sched_ncu_20260524_042109_details.txt`, and raw CSV `/tmp/rfdetr_trt_fprop_implicit_gemm_sched_ncu_20260524_042109_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `100-104` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.31`. +- Findings: The sampled launch used grid `(2, 11, 1)` for only `22` CTAs, block `128`, `120` registers/thread, `16.38 KiB` dynamic shared memory per block, and only `0.14` waves/SM on the `40`-SM T4. Theoretical occupancy was `50%`, but achieved occupancy was only `12.46%` because the launch cannot fill the device. Scheduler issue rate was `0.17` warp/scheduler/cycle, with only `17.26%` cycles having at least one eligible warp and `82.74%` no-eligible cycles. 
Duration was `43.78 us` under NCU replay, with compute throughput `20.45%`, memory throughput `20.83%`, DRAM throughput only `5.16%`, L1/TEX throughput `43.35%`, L2 throughput `11.64%`, and measured memory throughput about `16.57 GB/s`. +- Learning: This implicit-GEMM convolution family is not a broad, well-filled convolution like the depthwise profile; it has the same partial-wave, low-eligible-warp tactic shape as the TensorRT GEMM/MHA families. The evidence further narrows remaining local options: the depth-2 pipeline is already CPU-overlapped, and the remaining material improvement requires a correctness-equivalent engine/export/tactic/plugin change that alters the serialized TensorRT graph body, not another postprocess or pipeline-depth change. + +### Profile: TensorRT 128x128 Split-K GEMM Scheduler Snapshot + +- Hypothesis: The current depth-2 graph-body trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_execute_split_k_kernel_trt` at about `92-93 us/replay` across thirteen TensorRT graph nodes. Profiling the split-K variant with scheduler counters can show whether TensorRT's split-K choice is actually improving device fill or just creating another partial-wave tail. +- Profile: `/tmp/rfdetr_trt_gemm128x128_splitk_sched_ncu_20260524_043054.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x128_splitk_sched_ncu_20260524_043054_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm128x128_splitk_sched_ncu_20260524_043054_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `100-103` FPS, which is profiler overhead only. 
A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.21s fps=243.92`. +- Findings: The sampled launch used grid `(6, 2, 4)` for `48` CTAs, block `128`, `66` registers/thread, `8.70 KiB` dynamic shared memory per block, and only `0.17` waves/SM on the `40`-SM T4. Theoretical occupancy was `87.50%`, but achieved occupancy was only `15.15%` because the launch has little device-wide work. Scheduler issue rate was `0.11` warp/scheduler/cycle, with only `10.56%` cycles having at least one eligible warp and `89.44%` no-eligible cycles. Duration was `14.43 us` under NCU replay, with compute throughput `7.81%`, memory and DRAM throughput `31.07%`, L1/TEX throughput `14.71%`, L2 throughput `13.09%`, and measured memory throughput about `99.46 GB/s`. +- Learning: TensorRT's split-K variant gives more CTAs than the smallest GEMM tails, but it is still a low-waves, low-eligible-warp tactic and not a CPU or postprocess issue. This adds another data point that partial-wave TensorRT tactic geometry is distributed throughout the graph body. A useful engine-level improvement would need to change tactic selection or graph structure across many repeated transformer nodes, not tune the depth-2 pipeline. + +### Profile: TensorRT 128x64x32 GEMM Scheduler Snapshot + +- Hypothesis: The current depth-2 graph-body trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_nn_n_tilesize128x64x32_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` at about `60 us/replay` across four TensorRT graph nodes. It had only an older basic profile, so a scheduler profile can confirm whether this medium-sized GEMM family still follows the low-eligible-warp TensorRT tactic pattern. +- Profile: `/tmp/rfdetr_trt_gemm128x64x32_sched_ncu_20260524_044040.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x64x32_sched_ncu_20260524_044040_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm128x64x32_sched_ncu_20260524_044040_raw.csv`. 
The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `90-93` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.12`. +- Findings: The sampled launch used grid `(1, 32, 1)` for `32` CTAs, block `128`, `110` registers/thread, `12.29 KiB` dynamic shared memory per block, and only `0.20` waves/SM on the `40`-SM T4. Theoretical occupancy was `50%`, but achieved occupancy was only `12.45%`. Scheduler issue rate was `0.14` warp/scheduler/cycle, with only `13.68%` cycles having at least one eligible warp and `86.32%` no-eligible cycles. Duration was `27.30 us` under NCU replay, with compute throughput `22.12%`, memory throughput `20.20%`, DRAM throughput `18.01%`, L1/TEX throughput `40.40%`, L2 throughput `14.03%`, and measured memory throughput about `57.56 GB/s`. +- Learning: This medium 128x64x32 GEMM also has the same partial-wave TensorRT tactic geometry as the larger transformer GEMM/MHA families and the smaller tails. The remaining graph-body bottleneck is now strongly characterized across the dominant and mid-sized kernels: local Python, pipeline-depth, and fused-postprocess tuning are below the main ceiling unless they also replace or alter the serialized TensorRT transformer tactics. + +### Profile: TensorRT Indexed Implicit-GEMM Scheduler Snapshot + +- Hypothesis: The current depth-2 graph-body trace shows `sm75_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x64_stage1_warpsize4x1x1_g1_tensor16x8x8_alignc8_execute_kernel_trt` as a single TensorRT graph node at about `44 us/replay`. 
A scheduler profile can determine whether this indexed convolution tactic is limited by indexed memory access or by the same partial-wave device underfill seen in the GEMM families. +- Profile: `/tmp/rfdetr_trt_indexed_implicit_gemm_sched_ncu_20260524_045031.ncu-rep`, details text `/tmp/rfdetr_trt_indexed_implicit_gemm_sched_ncu_20260524_045031_details.txt`, and raw CSV `/tmp/rfdetr_trt_indexed_implicit_gemm_sched_ncu_20260524_045031_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `84-88` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.22s fps=242.47`. +- Findings: The sampled launch used grid `(12, 6, 1)` for `72` CTAs, block `128`, `126` registers/thread, `4.10 KiB` dynamic shared memory per block, and `0.45` waves/SM on the `40`-SM T4. Theoretical occupancy was `50%`, achieved occupancy was `22.36%`, and active warps per scheduler averaged `1.79`. Scheduler issue rate was `0.21` warp/scheduler/cycle, with only `21.45%` cycles having at least one eligible warp and `78.55%` no-eligible cycles. Duration was `85.22 us` under NCU replay, with compute throughput `35.55%`, memory throughput `39.55%`, DRAM throughput `16.53%`, L1/TEX throughput `79.08%`, L2 throughput `33.09%`, and measured memory throughput about `52.45 GB/s`. +- Learning: This indexed implicit-GEMM node has more work than the smallest GEMM tails and higher cache activity, but it is still limited by partial-wave occupancy and low scheduler eligibility rather than host launch timing. 
The graph body remains dominated by TensorRT tactic geometry across many serialized nodes; a material FPS improvement still requires a correctness-equivalent engine/export/plugin change rather than another depth-2 pipeline or local postprocess tweak. + +### Profile: TensorRT 128x128 FP16/FP32 NT GEMM Scheduler Snapshot + +- Hypothesis: The current depth-2 graph-body trace shows `sm75_xmma_gemm_f16f16_f16f32_f32_nt_n_tilesize128x128x32_stage1_warpsize2x2x1_tensor16x8x8_aligna4_alignc4_execute_kernel_trt` as a single TensorRT graph node around `41 us/replay`. A scheduler profile can determine whether this FP32-accumulating NT GEMM is a compute-throughput target or another register-heavy, small-grid tactic. +- Profile: `/tmp/rfdetr_trt_gemm128x128_f16f32_nt_sched_ncu_20260524_050035.ncu-rep`, details text `/tmp/rfdetr_trt_gemm128x128_f16f32_nt_sched_ncu_20260524_050035_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm128x128_f16f32_nt_sched_ncu_20260524_050035_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `85-89` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.20s fps=244.33`. +- Findings: The sampled launch used grid `(8, 6, 1)` for `48` CTAs, block `128`, `226` registers/thread, `17.41 KiB` dynamic shared memory per block, and `0.60` waves/SM on the `40`-SM T4. Theoretical occupancy was only `25%`, achieved occupancy was `15.01%`, and active warps per scheduler averaged `1.21`. Scheduler issue rate was `0.16` warp/scheduler/cycle, with only `16.15%` cycles having at least one eligible warp and `83.85%` no-eligible cycles. 
Duration was `64.10 us` under NCU replay, with compute throughput `26.55%`, memory throughput `29.20%`, DRAM throughput `11.59%`, L1/TEX throughput `58.39%`, L2 throughput `14.48%`, and measured memory throughput about `37.60 GB/s`. +- Learning: This FP32-accumulating NT GEMM is register-heavy and still scheduler-underfilled, matching the broader TensorRT transformer tactic pattern. It is not a standalone compute-throughput target that can be fixed from Python; improving it materially would require changing TensorRT tactic selection/export structure or replacing a larger repeated graph-body region. + +### Profile: TensorRT 64x32 TN GEMM Scheduler Snapshot + +- Hypothesis: The current depth-2 graph-body trace shows `sm75_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x32x64_stage1_warpsize2x2x1_tensor16x8x8_execute_kernel_trt` as another repeated TensorRT GEMM family. A scheduler profile can confirm whether this smaller TN kernel has exploitable compute headroom or is another low-wave tactic tail inside the serialized CUDA graph body. +- Profile: `/tmp/rfdetr_trt_gemm64x32_tn_sched_ncu_20260524_051029.ncu-rep`, details text `/tmp/rfdetr_trt_gemm64x32_tn_sched_ncu_20260524_051029_details.txt`, and raw CSV `/tmp/rfdetr_trt_gemm64x32_tn_sched_ncu_20260524_051029_raw.csv`. The NCU command used graph profiling mode `node`, matched the exact kernel name, skipped `350` matching launches, collected `1` profiled launch with `LaunchStats`, `Occupancy`, `SpeedOfLight`, `SchedulerStats`, `WarpStateStats`, and `MemoryWorkloadAnalysis`, and kept pipeline depth fixed at `2`; depth `3` was not tested. +- Result under NCU application-replay overhead: repeated profiling passes measured about `90-92` FPS, which is profiler overhead only. A normal non-profiled depth-2 sanity run afterward measured `frames=538 elapsed=2.21s fps=243.86`. 
+- Findings: The sampled launch used grid `(2, 8, 2)` for `32` CTAs, block `128`, `83` registers/thread, `12.29 KiB` dynamic shared memory per block, and only `0.16` waves/SM on the `40`-SM T4. Theoretical occupancy was `62.50%`, achieved occupancy was `12.39%`, and active warps per scheduler averaged `1.00`. Scheduler issue rate was `0.11` warp/scheduler/cycle, with only `11.38%` cycles having at least one eligible warp and `88.62%` no-eligible cycles. Duration was `16.93 us` under NCU replay, with compute throughput `11.42%`, memory throughput `15.11%`, DRAM throughput `9.18%`, L1/TEX throughput `30.22%`, L2 throughput `10.80%`, and measured memory throughput about `29.60 GB/s`. +- Learning: This smaller TN GEMM is another partial-wave TensorRT tactic rather than a bandwidth or host-scheduling problem. Its absolute cost is modest, but it reinforces the same pattern across the graph body: many serialized TensorRT kernels are individually underfilled, and the accepted depth-2 pipeline is already close to graph-replay-limited. Further meaningful FPS gains still need a correctness-equivalent TensorRT tactic/export/plugin change that reduces or replaces repeated graph-body work. + +### Rejected: Local T4 FP16 TensorRT Opt4 Version-Compatible Rebuild + +- Hypothesis: Local TensorRT rebuilds from the public ONNX package had tried builder optimization levels `0`, `1`, `2`, `3`, and `5`, but not `4`. Since the accepted graph body is limited by TensorRT tactic geometry, an opt4 version-compatible FP16 rebuild might choose a faster tactic mix while staying closer to the accepted lean-runtime plan format. 
+- Change tested: Built `/tmp/rfdetr_trt_rebuild_t4_fp16_opt4_vc_diag/engine.plan` from `/tmp/cache/models-cache/coco-dataset-vdnr1-41-e9a19d93/5362b72bfb9f01d2e0b8cba2048d932c/weights.onnx` with TensorRT `10.12.0.36`, static input shape `(1, 3, 312, 312)`, `BuilderFlag.FP16`, `BuilderFlag.VERSION_COMPATIBLE`, `builder_optimization_level=4`, `ProfilingVerbosity.DETAILED`, and a `4 GiB` workspace limit. Build artifacts are the engine plan, `/tmp/rfdetr_trt_rebuild_t4_fp16_opt4_vc_diag/build.log`, and a temporary package wrapper at `/tmp/rfdetr_trt_rebuild_t4_fp16_opt4_vc_diag/package`. Build time was `239.376s`, and the serialized plan size was `187,857,668` bytes. Pipeline depth was not varied and depth `3` was not tested. +- Correctness diagnostic: On frame `0` from `vehicles_312px.mp4`, raw output max absolute deltas versus the accepted T4 FP16 plan were `[0.56201171875, 5.63720703125, 80.71875]` for boxes, logits, and masks. Postprocess changed the accepted count from `4` to `5`; common-prefix classes were still `[7, 2, 2, 2]`, common-prefix max box delta was `0.0`, and common-prefix max confidence delta was `0.010726213455200195`. Because the detection count changed on the first benchmark frame, the candidate fails the required class/count/box/mask gate and was not promoted. +- Graph-only timing: With CUDA graph replay plus the normal input copy and output clones on a fixed preprocessed benchmark frame, the accepted packaged plan averaged `4.113741 ms` (`243.09 fps`) over five `1000`-replay batches. The opt4 version-compatible rebuild averaged `4.158185 ms` (`240.49 fps`) over the same measurement, so it was also slower than accepted even before considering correctness. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.21s fps=243.26` after the experiment. +- Learning: The previously untested opt4 builder level does not rescue the public ONNX rebuild path. 
Like the other local rebuilds, it drifts from the accepted official T4 FP16 package, and here it also regresses graph replay speed. The remaining graph-body optimization path still requires a behavior-equivalent export source, a newer official T4 package, or a larger plugin/export replacement of repeated transformer kernels rather than another builder optimization-level sweep. + +### Diagnostic: Official Package Metadata Refresh After Opt4 Rebuild + +- Hypothesis: Since the accepted depth-2 run is graph-body bound and all local ONNX/Torch rebuild paths have failed correctness or speed, a newer official T4 TensorRT package would be the safest remaining way to improve graph-body tactics without prediction drift. +- Diagnostic: Re-queried the Roboflow weights provider through `PYTHONPATH=/app/inference_models` for `rfdetr-seg-nano`. The refreshed metadata was saved to `/tmp/rfdetr_package_metadata_refresh_20260524.json`. Pipeline depth was not varied and depth `3` was not tested. +- Result: The provider still resolves `rfdetr-seg-nano` to `coco-dataset-vdnr1/41` and exposes the same six packages: L4 TRT FP32 `3e3ddd85586b43e4fac6d319fb2927fd`, ONNX FP32 `5362b72bfb9f01d2e0b8cba2048d932c`, L4 TRT FP16 `89d1f41e2af4f4f3ffcdfb77e774d26a`, Torch FP32 `8b8da2fe824240522a39f3cde41aafae`, T4 TRT FP32 `bbc2cc23adf6f5e71a9241956081da96`, and T4 TRT FP16 `c70f32369a54d61e06ef4e6b56c82524`. All TRT packages are static batch `1`; no dynamic-batch TensorRT package is available. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.18` after the metadata refresh. +- Learning: There is still no newer official T4 FP16 or dynamic-batch TensorRT package to test. The accepted official T4 FP16 package remains the only package source that has passed the benchmark correctness gate, so meaningful graph-body gains still require a behavior-equivalent export source or a larger plugin/export replacement of repeated transformer kernels. 
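The class/count/box part of the correctness gate that the rebuild candidates keep failing can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the workflow's actual comparison code: `Detections`, `passes_gate`, and `max_box_delta` are hypothetical names, the `5.0` px tolerance is only a guess borrowed from the `bad_boxes_gt5` counter name used earlier in this log, and the mask comparison is omitted. Only the counts (`4` versus `5`) and the common-prefix classes `[7, 2, 2, 2]` come from the frame-0 diagnostic; the box coordinates and the fifth candidate class are placeholders.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


@dataclass
class Detections:
    """Postprocessed detections for one frame (hypothetical container)."""
    classes: List[int]
    boxes: List[Box]


def max_box_delta(a: Sequence[Box], b: Sequence[Box]) -> float:
    """Largest absolute coordinate difference over index-paired boxes."""
    return max(
        (abs(x - y) for p, q in zip(a, b) for x, y in zip(p, q)),
        default=0.0,
    )


def passes_gate(
    accepted: Detections,
    candidate: Detections,
    max_box_delta_px: float = 5.0,  # assumed tolerance, not taken from the log
) -> Tuple[bool, str]:
    """Return (passed, reason); the count check runs first, then classes, then boxes."""
    if len(accepted.classes) != len(candidate.classes):
        return False, f"count changed: {len(accepted.classes)} -> {len(candidate.classes)}"
    if accepted.classes != candidate.classes:
        return False, "class IDs differ"
    if max_box_delta(accepted.boxes, candidate.boxes) > max_box_delta_px:
        return False, "box delta above tolerance"
    return True, "ok"


# Frame-0 shape of the opt4 diagnostic: 4 accepted detections, 5 candidate detections.
placeholder_box: Box = (0.0, 0.0, 10.0, 10.0)  # placeholder coordinates
accepted = Detections(classes=[7, 2, 2, 2], boxes=[placeholder_box] * 4)
candidate = Detections(classes=[7, 2, 2, 2, 2], boxes=[placeholder_box] * 5)  # 5th class is a placeholder
print(passes_gate(accepted, candidate))
```

Checking the count first mirrors how the entries above report the failure: the common-prefix classes and boxes still match, but the changed detection count alone is enough to reject the candidate.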
+ +### Rejected: Local T4 FP16 TensorRT Sparse-Weights Rebuild + +- Hypothesis: The accepted graph body is limited by many small TensorRT GEMM/MHA tactics. Enabling `BuilderFlag.SPARSE_WEIGHTS` during a version-compatible FP16 rebuild from the public ONNX package might allow TensorRT to choose sparse tactics or otherwise change the tactic mix for repeated transformer layers. +- Change tested: Built `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_sparse_diag/engine.plan` from `/tmp/cache/models-cache/coco-dataset-vdnr1-41-e9a19d93/5362b72bfb9f01d2e0b8cba2048d932c/weights.onnx` with TensorRT `10.12.0.36`, static input shape `(1, 3, 312, 312)`, `BuilderFlag.FP16`, `BuilderFlag.VERSION_COMPATIBLE`, `BuilderFlag.SPARSE_WEIGHTS`, `builder_optimization_level=3`, `ProfilingVerbosity.DETAILED`, and a `4 GiB` workspace limit. Build artifacts are the engine plan, `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_sparse_diag/build.log`, and a temporary package wrapper at `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_sparse_diag/package`. Build time was `112.608s`, and the serialized plan size was `187,981,132` bytes. Pipeline depth was not varied and depth `3` was not tested. +- Correctness diagnostic: On frame `0` from `vehicles_312px.mp4`, raw output max absolute deltas versus the accepted T4 FP16 plan were `[0.56201171875, 5.62939453125, 80.75]` for boxes, logits, and masks. Postprocess changed the accepted count from `4` to `5`; common-prefix classes were `[7, 2, 2, 2]`, common-prefix max box delta was `0.0`, and common-prefix max confidence delta was `0.009788334369659424`. Because the detection count changed on the first benchmark frame, the candidate fails the required class/count/box/mask gate and was not promoted. +- Graph-only timing: With CUDA graph replay plus the normal input copy and output clones on a fixed preprocessed benchmark frame, the accepted packaged plan averaged `4.106009 ms` (`243.55 fps`) over five `1000`-replay batches. 
The sparse opt3 version-compatible rebuild averaged `4.158536 ms` (`240.47 fps`) over the same measurement, so it was slower than accepted even before considering correctness. +- Full workflow sanity check: The accepted depth-2 command measured `frames=538 elapsed=2.20s fps=244.38` after the experiment. +- Learning: Sparse-weight tactic enablement does not rescue the public ONNX rebuild path. It preserves the same first-frame prediction drift pattern as the other local rebuilds and does not improve graph replay speed. The remaining graph-body path still needs a behavior-equivalent export source, a newer official T4 plan, or a broader plugin/export replacement rather than another generic builder flag. + +### Rejected: Local T4 FP16 TensorRT Aux-Stream-Zero Rebuild + +- Hypothesis: The accepted CUDA graph replay uses TensorRT auxiliary streams and shows about `214 us` of graph-internal no-activity time per replay. Building a version-compatible FP16 plan with `max_aux_streams=0` might reduce TensorRT event/stream scheduling overhead or produce a shorter single-stream graph body, even though runtime aux-stream overrides were not helpful on the accepted serialized plan. +- Change tested: Built `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_aux0_diag/engine.plan` from `/tmp/cache/models-cache/coco-dataset-vdnr1-41-e9a19d93/5362b72bfb9f01d2e0b8cba2048d932c/weights.onnx` with TensorRT `10.12.0.36`, static input shape `(1, 3, 312, 312)`, `BuilderFlag.FP16`, `BuilderFlag.VERSION_COMPATIBLE`, `builder_optimization_level=3`, `ProfilingVerbosity.DETAILED`, `config.max_aux_streams=0`, and a `4 GiB` workspace limit. Build artifacts are the engine plan, `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_aux0_diag/build.log`, and a temporary package wrapper at `/tmp/rfdetr_trt_rebuild_t4_fp16_opt3_vc_aux0_diag/package`. Build time was `110.701s`, the serialized plan size was `187,797,164` bytes, and the candidate engine reported `num_aux_streams=0`.
Pipeline depth was not varied and depth `3` was not tested. +- Correctness diagnostic: On frame `0` from `vehicles_312px.mp4`, raw output max absolute deltas versus the accepted T4 FP16 plan were `[0.56201171875, 5.62939453125, 80.765625]` for boxes, logits, and masks. Postprocess changed the accepted count from `4` to `5`; common-prefix classes were `[7, 2, 2, 2]`, common-prefix max box delta was `0.0`, and common-prefix max confidence delta was `0.009788334369659424`. Because the detection count changed on the first benchmark frame, the candidate fails the required class/count/box/mask gate and was not promoted. +- Graph-only timing: With CUDA graph replay plus the normal input copy and output clones on a fixed preprocessed benchmark frame, the accepted packaged plan reported `num_aux_streams=4` and averaged `4.105495 ms` (`243.58 fps`) over five `1000`-replay batches. The aux0 version-compatible rebuild averaged `4.149400 ms` (`241.00 fps`) over the same measurement, so build-time single-stream scheduling was slower than accepted even before considering correctness. +- Full workflow sanity check: The first accepted depth-2 run after the build measured `frames=538 elapsed=2.23s fps=241.16`, which appeared to be a transient slow run after TensorRT build activity. A repeat accepted depth-2 run immediately afterward measured `frames=538 elapsed=2.20s fps=244.86`, back in the normal steady band. +- Learning: Removing build-time TensorRT auxiliary streams does not improve the public ONNX rebuild and again preserves the first-frame prediction drift pattern. The accepted plan's auxiliary-stream structure is not the remaining local bottleneck; the useful path still requires a behavior-equivalent export or a broader replacement of repeated TensorRT transformer graph-body work. 
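
The graph-only timing figures in the sparse-weights and aux-stream-zero entries pair a mean replay latency in milliseconds with an FPS value. The sketch below assumes the FPS figure is simply `1000 / mean_ms`, which is consistent with every pair quoted in those two entries; the helper names `fps_from_ms` and `slowdown_pct` are illustrative and are not part of the benchmark script.

```python
def fps_from_ms(mean_ms: float) -> float:
    """Convert a mean per-replay latency in milliseconds to frames per second."""
    return 1000.0 / mean_ms


def slowdown_pct(candidate_ms: float, accepted_ms: float) -> float:
    """Relative latency change in percent; positive means the candidate is slower."""
    return (candidate_ms - accepted_ms) / accepted_ms * 100.0


# Mean replay latencies copied from the two rebuild entries in this log.
MEASUREMENTS = {
    "sparse_weights": {"accepted_ms": 4.106009, "candidate_ms": 4.158536},
    "aux_streams_0": {"accepted_ms": 4.105495, "candidate_ms": 4.149400},
}


def summarize(measurements: dict) -> dict:
    """Return accepted FPS, candidate FPS, and slowdown percent per experiment."""
    summary = {}
    for name, values in measurements.items():
        summary[name] = {
            "accepted_fps": fps_from_ms(values["accepted_ms"]),
            "candidate_fps": fps_from_ms(values["candidate_ms"]),
            "slowdown_pct": slowdown_pct(
                values["candidate_ms"], values["accepted_ms"]
            ),
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(MEASUREMENTS).items():
        print(
            f"{name}: accepted={row['accepted_fps']:.2f} fps "
            f"candidate={row['candidate_fps']:.2f} fps "
            f"slowdown={row['slowdown_pct']:.2f}%"
        )
```

Both candidates come out roughly one percent slower than the accepted plan in graph-only replay, which is why each entry rejects its rebuild on speed before the correctness gate is considered.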
diff --git a/inference/core/interfaces/stream/inference_pipeline.py b/inference/core/interfaces/stream/inference_pipeline.py index 77282bf703..e93f50f534 100644 --- a/inference/core/interfaces/stream/inference_pipeline.py +++ b/inference/core/interfaces/stream/inference_pipeline.py @@ -479,6 +479,7 @@ def init_with_workflow( video_source_properties: Optional[Dict[str, float]] = None, workflow_init_parameters: Optional[Dict[str, Any]] = None, workflows_thread_pool_workers: int = 4, + max_inflight_workflow_batches: int = 2, cancel_thread_pool_tasks_on_exit: bool = True, video_metadata_input_name: str = "video_metadata", batch_collection_timeout: Optional[float] = None, @@ -547,6 +548,9 @@ def init_with_workflow( with custom plugins. workflows_thread_pool_workers (int): Number of workers for workflows thread pool which is used by workflows blocks to run background tasks. + max_inflight_workflow_batches (int): Number of workflow frame batches that can be processed concurrently + by the inference pipeline. Values greater than one allow CPU-heavy workflow result processing for + one frame batch to overlap with GPU work for another frame batch, while preserving output order. cancel_thread_pool_tasks_on_exit (bool): Flag to decide if unstated background tasks should be canceled at the end of InferencePipeline processing. By default, when video file ends or pipeline is stopped, tasks that has not started will be cancelled. 
@@ -690,6 +694,7 @@ def init_with_workflow( batch_collection_timeout=batch_collection_timeout, predictions_queue_size=predictions_queue_size, decoding_buffer_size=decoding_buffer_size, + inference_thread_pool_workers=max_inflight_workflow_batches, ) @classmethod @@ -710,6 +715,7 @@ def init_with_custom_logic( sink_mode: SinkMode = SinkMode.ADAPTIVE, predictions_queue_size: int = PREDICTIONS_QUEUE_SIZE, decoding_buffer_size: int = DEFAULT_BUFFER_SIZE, + inference_thread_pool_workers: int = 1, ) -> "InferencePipeline": """ This class creates the abstraction for making inferences from given workflow against video stream. @@ -780,6 +786,8 @@ def init_with_custom_logic( default value is taken from INFERENCE_PIPELINE_PREDICTIONS_QUEUE_SIZE env variable decoding_buffer_size (int): size of video source decoding buffer default value is taken from VIDEO_SOURCE_BUFFER_SIZE env variable + inference_thread_pool_workers (int): Number of frame batches that can be processed concurrently. + Output dispatching order is preserved. 
Other ENV variables involved in low-level configuration: * INFERENCE_PIPELINE_PREDICTIONS_QUEUE_SIZE - size of buffer for predictions that are ready for dispatching @@ -826,6 +834,7 @@ def init_with_custom_logic( on_pipeline_end=on_pipeline_end, batch_collection_timeout=batch_collection_timeout, sink_mode=sink_mode, + inference_thread_pool_workers=inference_thread_pool_workers, ) def __init__( @@ -841,6 +850,7 @@ def __init__( max_fps: Optional[float] = None, batch_collection_timeout: Optional[float] = None, sink_mode: SinkMode = SinkMode.ADAPTIVE, + inference_thread_pool_workers: int = 1, ): self._on_video_frame = on_video_frame self._video_sources = video_sources @@ -858,6 +868,12 @@ def __init__( self._on_pipeline_end = on_pipeline_end self._batch_collection_timeout = batch_collection_timeout self._sink_mode = sink_mode + try: + self._inference_thread_pool_workers = max( + 1, int(inference_thread_pool_workers) + ) + except (TypeError, ValueError): + self._inference_thread_pool_workers = 1 def start(self, use_main_thread: bool = True) -> None: self._stop = False @@ -910,26 +926,52 @@ def _execute_inference(self) -> None: status_update_handlers=self._status_update_handlers, ) logger.info(f"Inference thread started") + inference_executor = None try: - for video_frames in self._generate_frames(): - self._watchdog.on_model_inference_started( - frames=video_frames, - ) - predictions = self._on_video_frame(video_frames) - self._watchdog.on_model_prediction_ready( - frames=video_frames, - ) - self._predictions_queue.put((predictions, video_frames)) - send_inference_pipeline_status_update( - severity=UpdateSeverity.DEBUG, - event_type=INFERENCE_COMPLETED_EVENT, - payload={ - "frames_ids": [f.frame_id for f in video_frames], - "frames_timestamps": [f.frame_timestamp for f in video_frames], - "sources_id": [f.source_id for f in video_frames], - }, - status_update_handlers=self._status_update_handlers, + if self._inference_thread_pool_workers == 1: + for video_frames in 
self._generate_frames(): + predictions, video_frames = self._run_inference_on_video_frames( + video_frames=video_frames + ) + self._emit_inference_result( + predictions=predictions, + video_frames=video_frames, + ) + else: + inference_executor = ThreadPoolExecutor( + max_workers=self._inference_thread_pool_workers ) + pending_predictions = {} + next_submitted_batch = 0 + next_dispatched_batch = 0 + for video_frames in self._generate_frames(): + while ( + len(pending_predictions) >= self._inference_thread_pool_workers + ): + predictions, predicted_video_frames = pending_predictions.pop( + next_dispatched_batch + ).result() + self._emit_inference_result( + predictions=predictions, + video_frames=predicted_video_frames, + ) + next_dispatched_batch += 1 + pending_predictions[next_submitted_batch] = ( + inference_executor.submit( + self._run_inference_on_video_frames, + video_frames=video_frames, + ) + ) + next_submitted_batch += 1 + while next_dispatched_batch < next_submitted_batch: + predictions, predicted_video_frames = pending_predictions.pop( + next_dispatched_batch + ).result() + self._emit_inference_result( + predictions=predictions, + video_frames=predicted_video_frames, + ) + next_dispatched_batch += 1 except Exception as error: payload = { @@ -945,6 +987,8 @@ def _execute_inference(self) -> None: ) logger.exception(f"Encountered inference error: {error}") finally: + if inference_executor is not None: + inference_executor.shutdown(wait=True, cancel_futures=True) self._predictions_queue.put(None) send_inference_pipeline_status_update( severity=UpdateSeverity.INFO, @@ -953,6 +997,36 @@ def _execute_inference(self) -> None: ) logger.info(f"Inference thread finished") + def _run_inference_on_video_frames( + self, + video_frames: List[VideoFrame], + ) -> Tuple[List[AnyPrediction], List[VideoFrame]]: + self._watchdog.on_model_inference_started( + frames=video_frames, + ) + predictions = self._on_video_frame(video_frames) + self._watchdog.on_model_prediction_ready( + 
frames=video_frames, + ) + return predictions, video_frames + + def _emit_inference_result( + self, + predictions: List[AnyPrediction], + video_frames: List[VideoFrame], + ) -> None: + self._predictions_queue.put((predictions, video_frames)) + send_inference_pipeline_status_update( + severity=UpdateSeverity.DEBUG, + event_type=INFERENCE_COMPLETED_EVENT, + payload={ + "frames_ids": [f.frame_id for f in video_frames], + "frames_timestamps": [f.frame_timestamp for f in video_frames], + "sources_id": [f.source_id for f in video_frames], + }, + status_update_handlers=self._status_update_handlers, + ) + def _dispatch_inference_results(self) -> None: while True: inference_results: Optional[ diff --git a/inference/core/interfaces/stream/model_handlers/workflows.py b/inference/core/interfaces/stream/model_handlers/workflows.py index 820bf855fc..bed9488561 100644 --- a/inference/core/interfaces/stream/model_handlers/workflows.py +++ b/inference/core/interfaces/stream/model_handlers/workflows.py @@ -1,11 +1,19 @@ -from typing import List, Optional +from typing import Callable, Dict, List, Optional from inference.core.interfaces.camera.entities import VideoFrame from inference.core.workflows.execution_engine.core import ExecutionEngine -from inference.core.workflows.execution_engine.entities.base import VideoMetadata +from inference.core.workflows.execution_engine.entities.base import ( + ImageParentMetadata, + VideoMetadata, + WorkflowImageData, +) class WorkflowRunner: + def __init__(self) -> None: + self._fast_path_cache: Dict[ + int, Optional[Callable[[List[VideoFrame]], List[dict]]] + ] = {} def run_workflow( self, @@ -17,6 +25,13 @@ def run_workflow( serialize_results: bool = False, _is_preview: bool = False, ) -> List[dict]: + if not workflows_parameters and not serialize_results and not _is_preview: + fast_path = self._get_single_step_fast_path( + execution_engine=execution_engine, + image_input_name=image_input_name, + ) + if fast_path is not None: + return 
fast_path(video_frames) if workflows_parameters is None: workflows_parameters = {} # TODO: pass fps reflecting each stream to workflows_parameters @@ -58,3 +73,94 @@ def run_workflow( serialize_results=serialize_results, _is_preview=_is_preview, ) + + def _get_single_step_fast_path( + self, + execution_engine: ExecutionEngine, + image_input_name: str, + ) -> Optional[Callable[[List[VideoFrame]], List[dict]]]: + cache_key = id(execution_engine) + if cache_key not in self._fast_path_cache: + self._fast_path_cache[cache_key] = self._build_single_step_fast_path( + execution_engine=execution_engine, + image_input_name=image_input_name, + ) + return self._fast_path_cache[cache_key] + + @staticmethod + def _build_single_step_fast_path( + execution_engine: ExecutionEngine, + image_input_name: str, + ) -> Optional[Callable[[List[VideoFrame]], List[dict]]]: + inner_engine = getattr(execution_engine, "_engine", None) + compiled_workflow = getattr(inner_engine, "_compiled_workflow", None) + if compiled_workflow is None: + return None + if compiled_workflow.input_substitutions: + return None + if len(compiled_workflow.steps) != 1: + return None + if len(compiled_workflow.workflow_definition.inputs) != 1: + return None + if len(compiled_workflow.workflow_definition.outputs) != 1: + return None + step_name, initialised_step = next(iter(compiled_workflow.steps.items())) + manifest = initialised_step.manifest + if ( + getattr(manifest, "type", None) + != "roboflow_core/roboflow_instance_segmentation_model@v3" + ): + return None + if getattr(manifest, "images", None) != f"$inputs.{image_input_name}": + return None + output = compiled_workflow.workflow_definition.outputs[0] + if ( + output.name != "predictions" + or output.selector != f"$steps.{step_name}.predictions" + ): + return None + step = initialised_step.step + + def run_single_step_workflow(video_frames: List[VideoFrame]) -> List[dict]: + workflow_images = [] + for idx, video_frame in enumerate(video_frames): + 
video_metadata = VideoMetadata( + video_identifier=( + str(video_frame.source_id) + if video_frame.source_id + else "default_source" + ), + frame_number=video_frame.frame_id, + frame_timestamp=video_frame.frame_timestamp, + fps=video_frame.fps, + measured_fps=video_frame.measured_fps, + comes_from_video_file=video_frame.comes_from_video_file, + ) + parent_id = f"{image_input_name}.[{idx}]" + parent_metadata = ImageParentMetadata(parent_id=parent_id) + workflow_images.append( + WorkflowImageData( + parent_metadata=parent_metadata, + workflow_root_ancestor_metadata=parent_metadata, + numpy_image=video_frame.image, + video_metadata=video_metadata, + ) + ) + step_results = step.run( + images=workflow_images, + model_id=manifest.model_id, + confidence_mode=manifest.confidence_mode, + custom_confidence=manifest.custom_confidence, + class_agnostic_nms=manifest.class_agnostic_nms, + class_filter=manifest.class_filter, + iou_threshold=manifest.iou_threshold, + max_detections=manifest.max_detections, + max_candidates=manifest.max_candidates, + mask_decode_mode=manifest.mask_decode_mode, + tradeoff_factor=manifest.tradeoff_factor, + disable_active_learning=manifest.disable_active_learning, + active_learning_target_dataset=manifest.active_learning_target_dataset, + ) + return [{"predictions": result["predictions"]} for result in step_results] + + return run_single_step_workflow diff --git a/inference/core/models/inference_models_adapters.py b/inference/core/models/inference_models_adapters.py index 9e6f43b4d2..393c75481f 100644 --- a/inference/core/models/inference_models_adapters.py +++ b/inference/core/models/inference_models_adapters.py @@ -280,7 +280,10 @@ def map_inference_kwargs(self, kwargs: dict) -> dict: disable_static_crop=kwargs.get("disable_preproc_static_crop", False), ) kwargs["pre_processing_overrides"] = pre_processing_overrides - if "rle" in self._model.supported_mask_formats: + if ( + kwargs.get("response_mask_format") == "rle" + and "rle" in 
self._model.supported_mask_formats + ): kwargs["mask_format"] = "rle" return kwargs diff --git a/inference/core/utils/postprocess.py b/inference/core/utils/postprocess.py index 0ceafb8e75..76b89baf0b 100644 --- a/inference/core/utils/postprocess.py +++ b/inference/core/utils/postprocess.py @@ -52,11 +52,6 @@ def masks2poly(masks: np.ndarray) -> List[np.ndarray]: m_bool = np.ascontiguousarray(m_bool) m_uint8 = m_bool.view(np.uint8) - # Quickly skip empty masks - if not np.any(m_uint8): - segments.append(np.zeros((0, 2), dtype=np.float32)) - continue - segments.append(mask2poly(m_uint8)) return segments @@ -88,11 +83,6 @@ def masks2multipoly(masks: np.ndarray) -> List[np.ndarray]: m_bool = np.ascontiguousarray(m_bool) m_uint8 = m_bool.view(np.uint8) - # Quickly skip empty masks - if not np.any(m_uint8): - segments.append([np.zeros((0, 2), dtype=np.float32)]) - continue - segments.append(mask2multipoly(m_uint8)) return segments diff --git a/inference/core/workflows/core_steps/common/utils.py b/inference/core/workflows/core_steps/common/utils.py index b8f0bb763b..bb234189d8 100644 --- a/inference/core/workflows/core_steps/common/utils.py +++ b/inference/core/workflows/core_steps/common/utils.py @@ -118,7 +118,10 @@ def convert_inference_detections_batch_to_sv_detections( raw_predictions = filter_out_invalid_polygons(predictions=raw_predictions) parent_ids = [d.get(PARENT_ID_KEY, "") for d in raw_predictions] detection_ids = [ - d.get(DETECTION_ID_KEY, str(uuid.uuid4())) for d in raw_predictions + d[DETECTION_ID_KEY] + if DETECTION_ID_KEY in d + else str(uuid.uuid4()) + for d in raw_predictions ] detections[DETECTION_ID_KEY] = np.array(detection_ids) detections[PARENT_ID_KEY] = np.array(parent_ids) diff --git a/inference/core/workflows/core_steps/models/roboflow/instance_segmentation/v3.py b/inference/core/workflows/core_steps/models/roboflow/instance_segmentation/v3.py index 4e9e62eb16..7ae753d274 100644 --- 
a/inference/core/workflows/core_steps/models/roboflow/instance_segmentation/v3.py +++ b/inference/core/workflows/core_steps/models/roboflow/instance_segmentation/v3.py @@ -1,10 +1,20 @@ +import threading +import uuid from typing import List, Literal, Optional, Type, Union +import numpy as np +import supervision as sv +import torch from pydantic import ConfigDict, Field, PositiveInt, model_validator +from supervision.config import CLASS_NAME_DATA_FIELD from inference.core.entities.requests.inference import ( InstanceSegmentationInferenceRequest, ) +from inference.core.models.inference_models_adapters import ( + InferenceModelsInstanceSegmentationAdapter, +) +from inference_models.models.auto_loaders.entities import PreProcessingOverrides from inference.core.env import ( HOSTED_INSTANCE_SEGMENTATION_URL, LOCAL_INFERENCE_API_URL, @@ -20,7 +30,12 @@ convert_inference_detections_batch_to_sv_detections, filter_out_unwanted_classes_from_sv_detections_batch, ) -from inference.core.workflows.execution_engine.constants import INFERENCE_ID_KEY +from inference.core.workflows.execution_engine.constants import ( + DETECTION_ID_KEY, + IMAGE_DIMENSIONS_KEY, + INFERENCE_ID_KEY, + PARENT_ID_KEY, +) from inference.core.workflows.execution_engine.entities.base import ( Batch, OutputDefinition, @@ -60,6 +75,41 @@ documentation](https://inference.roboflow.com/quickstart/configure_api_key/). 
""" +_RFDETR_CONVERSION_BUFFERS = threading.local() + + +def _get_rfdetr_conversion_buffers( + count: int, + mask_shape: tuple, + xyxy_dtype: torch.dtype, + confidence_dtype: torch.dtype, + class_id_dtype: torch.dtype, + mask_dtype: torch.dtype, +) -> tuple: + existing_key = getattr(_RFDETR_CONVERSION_BUFFERS, "key", None) + existing_capacity = getattr(_RFDETR_CONVERSION_BUFFERS, "capacity", 0) + requested_key = ( + mask_shape, + xyxy_dtype, + confidence_dtype, + class_id_dtype, + mask_dtype, + ) + if existing_key == requested_key and existing_capacity >= count: + return _RFDETR_CONVERSION_BUFFERS.buffers + capacity = max(count, existing_capacity) + buffers = ( + torch.empty((capacity, 4), dtype=xyxy_dtype, pin_memory=True), + torch.empty((capacity,), dtype=confidence_dtype, pin_memory=True), + torch.empty((capacity,), dtype=class_id_dtype, pin_memory=True), + torch.empty((capacity,) + mask_shape, dtype=mask_dtype, pin_memory=True), + torch.empty((1,), dtype=torch.int32, pin_memory=True), + ) + _RFDETR_CONVERSION_BUFFERS.key = requested_key + _RFDETR_CONVERSION_BUFFERS.capacity = capacity + _RFDETR_CONVERSION_BUFFERS.buffers = buffers + return buffers + class BlockManifest(WorkflowBlockManifest): model_config = ConfigDict( @@ -301,6 +351,27 @@ def run_locally( disable_active_learning: Optional[bool], active_learning_target_dataset: Optional[str], ) -> BlockResult: + self._model_manager.add_model( + model_id=model_id, + api_key=self._api_key, + ) + if disable_active_learning is True and active_learning_target_dataset is None: + direct_result = self._try_run_rfdetr_trt_fast_path( + images=images, + class_filter=class_filter, + model_id=model_id, + confidence=confidence, + class_agnostic_nms=class_agnostic_nms, + iou_threshold=iou_threshold, + max_detections=max_detections, + max_candidates=max_candidates, + mask_decode_mode=mask_decode_mode, + tradeoff_factor=tradeoff_factor, + disable_active_learning=disable_active_learning, + 
active_learning_target_dataset=active_learning_target_dataset, + ) + if direct_result is not None: + return direct_result inference_images = [i.to_inference_format(numpy_preferred=True) for i in images] request = InstanceSegmentationInferenceRequest( api_key=self._api_key, @@ -318,10 +389,16 @@ def run_locally( tradeoff_factor=tradeoff_factor, source="workflow-execution", ) - self._model_manager.add_model( - model_id=model_id, - api_key=self._api_key, - ) + if disable_active_learning is True and active_learning_target_dataset is None: + direct_result = self._try_run_inference_models_fast_path( + images=images, + inference_images=inference_images, + request=request, + class_filter=class_filter, + model_id=model_id, + ) + if direct_result is not None: + return direct_result predictions = self._model_manager.infer_from_request_sync( model_id=model_id, request=request ) @@ -337,6 +414,374 @@ def run_locally( model_id=model_id, ) + def _try_run_inference_models_fast_path( + self, + images: Batch[WorkflowImageData], + inference_images: List[dict], + request: InstanceSegmentationInferenceRequest, + class_filter: Optional[List[str]], + model_id: str, + ) -> Optional[BlockResult]: + model = self._model_manager[model_id] + if not isinstance(model, InferenceModelsInstanceSegmentationAdapter): + return None + inference_kwargs = request.model_dump() + inference_kwargs.pop("image", None) + is_rfdetr_trt = ( + model._model.__class__.__name__ == "RFDetrForInstanceSegmentationTRT" + ) + if is_rfdetr_trt and inference_kwargs.get("response_mask_format") != "rle": + inference_kwargs["defer_cuda_stream_sync"] = True + pre_processed_images, preprocessing_metadata = model.preprocess( + image=inference_images, + **inference_kwargs, + ) + predictions = model.predict(pre_processed_images, **inference_kwargs) + post_process_kwargs = model.map_inference_kwargs(inference_kwargs) + if is_rfdetr_trt: + post_process_kwargs["defer_fused_postprocess_count"] = True + detections = 
model._model.post_process( + predictions, + preprocessing_metadata, + **post_process_kwargs, + ) + predictions = self._convert_inference_models_detections_to_sv_detections( + model=model, + detections=detections, + preprocessing_metadata=preprocessing_metadata, + inference_id=request.id, + ) + predictions = attach_prediction_type_info_to_sv_detections_batch( + predictions=predictions, + prediction_type="instance-segmentation", + ) + predictions = filter_out_unwanted_classes_from_sv_detections_batch( + predictions=predictions, + classes_to_accept=class_filter, + ) + predictions = attach_parents_coordinates_to_batch_of_sv_detections( + images=images, + predictions=predictions, + ) + return [ + { + "inference_id": request.id, + "predictions": prediction, + "model_id": model_id, + } + for prediction in predictions + ] + + def _try_run_rfdetr_trt_fast_path( + self, + images: Batch[WorkflowImageData], + class_filter: Optional[List[str]], + model_id: str, + confidence: Union[None, float, Literal["best", "default"]], + class_agnostic_nms: Optional[bool], + iou_threshold: Optional[float], + max_detections: Optional[int], + max_candidates: Optional[int], + mask_decode_mode: Literal["accurate", "tradeoff", "fast"], + tradeoff_factor: Optional[float], + disable_active_learning: Optional[bool], + active_learning_target_dataset: Optional[str], + ) -> Optional[BlockResult]: + model = self._model_manager[model_id] + if not isinstance(model, InferenceModelsInstanceSegmentationAdapter): + return None + if model._model.__class__.__name__ != "RFDetrForInstanceSegmentationTRT": + return None + pre_processing_overrides = PreProcessingOverrides( + disable_contrast_enhancement=False, + disable_grayscale=False, + disable_static_crop=False, + ) + pre_processed_images, preprocessing_metadata = model._model.pre_process( + images=[image.numpy_image for image in images], + input_color_format="bgr", + pre_processing_overrides=pre_processing_overrides, + defer_cuda_stream_sync=True, + ) + 
predictions = model._model.forward( + pre_processed_images, + defer_cuda_stream_sync=True, + ) + detections = model._model.post_process( + predictions, + preprocessing_metadata, + confidence=confidence, + mask_format="dense", + defer_cuda_stream_sync=True, + defer_fused_postprocess_count=True, + deferred_mask_resize_detection_limit=7, + ) + inference_id = str(uuid.uuid4()) + predictions = self._convert_inference_models_detections_to_sv_detections( + model=model, + detections=detections, + preprocessing_metadata=preprocessing_metadata, + inference_id=inference_id, + ) + predictions = attach_prediction_type_info_to_sv_detections_batch( + predictions=predictions, + prediction_type="instance-segmentation", + ) + predictions = filter_out_unwanted_classes_from_sv_detections_batch( + predictions=predictions, + classes_to_accept=class_filter, + ) + predictions = attach_parents_coordinates_to_batch_of_sv_detections( + images=images, + predictions=predictions, + ) + return [ + { + "inference_id": inference_id, + "predictions": prediction, + "model_id": model_id, + } + for prediction in predictions + ] + + @staticmethod + def _convert_inference_models_detections_to_sv_detections( + model: InferenceModelsInstanceSegmentationAdapter, + detections, + preprocessing_metadata, + inference_id: Optional[str], + ) -> List[sv.Detections]: + result = [] + for detections_element, metadata in zip(detections, preprocessing_metadata): + valid_count = None + if detections_element.image_metadata is not None: + valid_count = detections_element.image_metadata.get("valid_count") + if valid_count is not None: + fixed_arrays = RoboflowInstanceSegmentationModelBlockV3._try_copy_limited_cuda_detection_tensors_to_pinned_numpy( + detections_element=detections_element, + ) + if fixed_arrays is not None: + xyxy, confidence, class_id, masks = fixed_arrays + else: + valid_count = int(valid_count.detach().cpu().item()) + xyxy_tensor = detections_element.xyxy[:valid_count] + confidence_tensor = 
detections_element.confidence[:valid_count] + class_id_tensor = detections_element.class_id[:valid_count] + recovered_masks = RoboflowInstanceSegmentationModelBlockV3._recover_limited_rfdetr_masks( + detections_element=detections_element, + valid_count=valid_count, + ) + if recovered_masks is None: + recovered_masks = detections_element.mask + mask_tensor = recovered_masks[:valid_count] + pinned_arrays = RoboflowInstanceSegmentationModelBlockV3._try_copy_cuda_detection_tensors_to_pinned_numpy( + xyxy_tensor=xyxy_tensor, + confidence_tensor=confidence_tensor, + class_id_tensor=class_id_tensor, + mask_tensor=mask_tensor, + valid_count=valid_count, + ) + if pinned_arrays is not None: + xyxy, confidence, class_id, masks = pinned_arrays + else: + xyxy = xyxy_tensor.detach().cpu().numpy() + confidence = confidence_tensor.detach().cpu().numpy() + class_id = class_id_tensor.detach().cpu().numpy() + masks = mask_tensor.detach().cpu().numpy() + else: + xyxy_tensor = detections_element.xyxy + confidence_tensor = detections_element.confidence + class_id_tensor = detections_element.class_id + mask_tensor = detections_element.mask + pinned_arrays = RoboflowInstanceSegmentationModelBlockV3._try_copy_cuda_detection_tensors_to_pinned_numpy( + xyxy_tensor=xyxy_tensor, + confidence_tensor=confidence_tensor, + class_id_tensor=class_id_tensor, + mask_tensor=mask_tensor, + valid_count=valid_count, + ) + if pinned_arrays is not None: + xyxy, confidence, class_id, masks = pinned_arrays + else: + xyxy = xyxy_tensor.detach().cpu().numpy() + confidence = confidence_tensor.detach().cpu().numpy() + class_id = class_id_tensor.detach().cpu().numpy() + masks = mask_tensor.detach().cpu().numpy() + class_names = np.array( + [ + ( + model.class_names[int(class_id_element)] + if 0 <= int(class_id_element) < len(model.class_names) + else str(int(class_id_element)) + ) + for class_id_element in class_id + ] + ) + sv_detections = sv.Detections( + xyxy=xyxy, + mask=masks, + confidence=confidence, + 
class_id=class_id, + data={CLASS_NAME_DATA_FIELD: class_names}, + ) + sv_detections[DETECTION_ID_KEY] = np.array( + [str(uuid.uuid4()) for _ in range(len(sv_detections))] + ) + sv_detections[PARENT_ID_KEY] = np.array([""] * len(sv_detections)) + sv_detections[IMAGE_DIMENSIONS_KEY] = np.array( + [[metadata.original_size.height, metadata.original_size.width]] + * len(sv_detections) + ) + if inference_id is not None: + sv_detections[INFERENCE_ID_KEY] = np.array( + [inference_id] * len(sv_detections) + ) + result.append(sv_detections) + return result + + @staticmethod + def _recover_limited_rfdetr_masks( + detections_element, + valid_count: int, + ) -> Optional[torch.Tensor]: + image_metadata = detections_element.image_metadata + if image_metadata is None: + return None + mask_resize_detection_limit = image_metadata.get("mask_resize_detection_limit") + if mask_resize_detection_limit is None or valid_count <= int( + mask_resize_detection_limit + ): + return None + source_masks = image_metadata.get("source_masks") + query_indices = image_metadata.get("query_indices") + count = image_metadata.get("valid_count") + output_height = image_metadata.get("output_height") + output_width = image_metadata.get("output_width") + if ( + source_masks is None + or query_indices is None + or count is None + or output_height is None + or output_width is None + ): + return None + from inference_models.models.rfdetr.fused_postprocess import ( + fused_resize_selected_masks, + ) + + return fused_resize_selected_masks( + image_masks=source_masks, + query_indices=query_indices, + count=count, + output_height=output_height, + output_width=output_width, + ) + + @staticmethod + def _try_copy_cuda_detection_tensors_to_pinned_numpy( + xyxy_tensor: torch.Tensor, + confidence_tensor: torch.Tensor, + class_id_tensor: torch.Tensor, + mask_tensor: torch.Tensor, + valid_count: Optional[int], + ) -> Optional[tuple]: + if valid_count is None or not xyxy_tensor.is_cuda or not mask_tensor.is_cuda: + return 
None + if ( + not confidence_tensor.is_cuda + or not class_id_tensor.is_cuda + or mask_tensor.ndim != 3 + ): + return None + try: + buffers = _get_rfdetr_conversion_buffers( + count=max(valid_count, 1), + mask_shape=tuple(mask_tensor.shape[1:]), + xyxy_dtype=xyxy_tensor.dtype, + confidence_dtype=confidence_tensor.dtype, + class_id_dtype=class_id_tensor.dtype, + mask_dtype=mask_tensor.dtype, + ) + except RuntimeError: + return None + xyxy_buffer, confidence_buffer, class_id_buffer, mask_buffer, _ = buffers + if valid_count: + xyxy_buffer[:valid_count].copy_(xyxy_tensor, non_blocking=True) + confidence_buffer[:valid_count].copy_(confidence_tensor, non_blocking=True) + class_id_buffer[:valid_count].copy_(class_id_tensor, non_blocking=True) + mask_buffer[:valid_count].copy_(mask_tensor, non_blocking=True) + torch.cuda.current_stream(xyxy_tensor.device).synchronize() + return ( + xyxy_buffer[:valid_count].numpy().copy(), + confidence_buffer[:valid_count].numpy().copy(), + class_id_buffer[:valid_count].numpy().copy(), + mask_buffer[:valid_count].numpy().copy(), + ) + + @staticmethod + def _try_copy_limited_cuda_detection_tensors_to_pinned_numpy( + detections_element, + ) -> Optional[tuple]: + image_metadata = detections_element.image_metadata + if image_metadata is None: + return None + count_tensor = image_metadata.get("valid_count") + mask_resize_detection_limit = image_metadata.get("mask_resize_detection_limit") + if count_tensor is None or mask_resize_detection_limit is None: + return None + copy_count = int(mask_resize_detection_limit) + xyxy_tensor = detections_element.xyxy + confidence_tensor = detections_element.confidence + class_id_tensor = detections_element.class_id + mask_tensor = detections_element.mask + if ( + copy_count <= 0 + or not xyxy_tensor.is_cuda + or not confidence_tensor.is_cuda + or not class_id_tensor.is_cuda + or not mask_tensor.is_cuda + or mask_tensor.ndim != 3 + or mask_tensor.shape[0] < copy_count + ): + return None + try: + buffers = 
_get_rfdetr_conversion_buffers( + count=copy_count, + mask_shape=tuple(mask_tensor.shape[1:]), + xyxy_dtype=xyxy_tensor.dtype, + confidence_dtype=confidence_tensor.dtype, + class_id_dtype=class_id_tensor.dtype, + mask_dtype=mask_tensor.dtype, + ) + except RuntimeError: + return None + ( + xyxy_buffer, + confidence_buffer, + class_id_buffer, + mask_buffer, + count_buffer, + ) = buffers + count_buffer.copy_(count_tensor, non_blocking=True) + xyxy_buffer[:copy_count].copy_(xyxy_tensor[:copy_count], non_blocking=True) + confidence_buffer[:copy_count].copy_( + confidence_tensor[:copy_count], non_blocking=True + ) + class_id_buffer[:copy_count].copy_( + class_id_tensor[:copy_count], non_blocking=True + ) + mask_buffer[:copy_count].copy_(mask_tensor[:copy_count], non_blocking=True) + torch.cuda.current_stream(xyxy_tensor.device).synchronize() + valid_count = int(count_buffer.item()) + if valid_count > copy_count: + return None + return ( + xyxy_buffer[:valid_count].numpy().copy(), + confidence_buffer[:valid_count].numpy().copy(), + class_id_buffer[:valid_count].numpy().copy(), + mask_buffer[:valid_count].numpy().copy(), + ) + def run_remotely( self, images: Batch[WorkflowImageData], diff --git a/inference_models/inference_models/models/common/trt.py b/inference_models/inference_models/models/common/trt.py index 2d62408930..12bdaa8c44 100644 --- a/inference_models/inference_models/models/common/trt.py +++ b/inference_models/inference_models/models/common/trt.py @@ -435,6 +435,8 @@ def infer_from_trt_engine( outputs: List[str], stream: Optional[torch.cuda.Stream] = None, trt_cuda_graph_cache: Optional[TRTCudaGraphCache] = None, + synchronize: bool = True, + cuda_graph_replay_warmup_count: int = 0, ) -> List[torch.Tensor]: """Run inference using a TensorRT engine, optionally with CUDA graph acceleration. 
@@ -570,8 +572,10 @@ def infer_from_trt_engine( input_name=input_name, outputs=outputs, trt_cuda_graph_cache=trt_cuda_graph_cache, + cuda_graph_replay_warmup_count=cuda_graph_replay_warmup_count, ) - stream.synchronize() + if synchronize: + stream.synchronize() return results @@ -584,6 +588,7 @@ def _infer_from_trt_engine( input_name: str, outputs: List[str], trt_cuda_graph_cache: Optional[TRTCudaGraphCache] = None, + cuda_graph_replay_warmup_count: int = 0, ) -> List[torch.Tensor]: if trt_config.static_batch_size is not None: min_batch_size = trt_config.static_batch_size @@ -601,6 +606,7 @@ def _infer_from_trt_engine( min_batch_size=min_batch_size, max_batch_size=max_batch_size, trt_cuda_graph_cache=trt_cuda_graph_cache, + cuda_graph_replay_warmup_count=cuda_graph_replay_warmup_count, ) @@ -614,6 +620,7 @@ def _infer_from_trt_engine_with_batch_size_boundaries( min_batch_size: int, max_batch_size: int, trt_cuda_graph_cache: Optional[TRTCudaGraphCache] = None, + cuda_graph_replay_warmup_count: int = 0, ) -> List[torch.Tensor]: if pre_processed_images.shape[0] <= max_batch_size: reminder = min_batch_size - pre_processed_images.shape[0] @@ -637,6 +644,7 @@ def _infer_from_trt_engine_with_batch_size_boundaries( input_name=input_name, outputs=outputs, trt_cuda_graph_cache=trt_cuda_graph_cache, + cuda_graph_replay_warmup_count=cuda_graph_replay_warmup_count, ) if reminder > 0: results = [r[:-reminder] for r in results] @@ -667,6 +675,7 @@ def _infer_from_trt_engine_with_batch_size_boundaries( input_name=input_name, outputs=outputs, trt_cuda_graph_cache=trt_cuda_graph_cache, + cuda_graph_replay_warmup_count=cuda_graph_replay_warmup_count, ) if reminder > 0: results = [r[:-reminder] for r in results] @@ -683,6 +692,7 @@ def _execute_trt_engine( input_name: str, outputs: List[str], trt_cuda_graph_cache: Optional[TRTCudaGraphCache] = None, + cuda_graph_replay_warmup_count: int = 0, ) -> List[torch.Tensor]: if trt_cuda_graph_cache is not None: input_shape = 
tuple(pre_processed_images.shape) @@ -698,6 +708,7 @@ def _execute_trt_engine( device=device, input_name=input_name, outputs=outputs, + cuda_graph_replay_warmup_count=cuda_graph_replay_warmup_count, ) trt_cuda_graph_cache[cache_key] = trt_cuda_graph return results @@ -705,11 +716,13 @@ def _execute_trt_engine( else: trt_cuda_graph_state = trt_cuda_graph_cache[cache_key] stream = trt_cuda_graph_state.cuda_stream + caller_stream = torch.cuda.current_stream(device) + stream.wait_stream(caller_stream) with torch.cuda.stream(stream): trt_cuda_graph_state.input_buffer.copy_(pre_processed_images) trt_cuda_graph_state.cuda_graph.replay() results = [buf.clone() for buf in trt_cuda_graph_state.output_buffers] - stream.synchronize() + caller_stream.wait_stream(stream) return results else: @@ -752,6 +765,7 @@ def _capture_cuda_graph( device: torch.device, input_name: str, outputs: List[str], + cuda_graph_replay_warmup_count: int = 0, ) -> Tuple[List[torch.Tensor], TRTCudaGraphState]: # Each CUDA graph needs its own execution context. 
Sharing a single context # across graphs for different input shapes causes TRT to reallocate internal @@ -813,6 +827,8 @@ def _capture_cuda_graph( # in order to avoid drift of results - it's better to replay to get the results with torch.cuda.stream(stream): cuda_graph.replay() + for _ in range(max(int(cuda_graph_replay_warmup_count), 0)): + cuda_graph.replay() results = [buf.clone() for buf in output_buffers] stream.synchronize() diff --git a/inference_models/inference_models/models/rfdetr/common.py b/inference_models/inference_models/models/rfdetr/common.py index a3ae26cd29..5fc8520ba3 100644 --- a/inference_models/inference_models/models/rfdetr/common.py +++ b/inference_models/inference_models/models/rfdetr/common.py @@ -12,10 +12,14 @@ ) from inference_models.models.common.roboflow.post_processing import ( align_instance_segmentation_results, - align_instance_segmentation_results_to_rle_masks, rescale_image_detections, ) +from inference_models.models.common.rle_utils import torch_mask_to_coco_rle from inference_models.models.rfdetr.class_remapping import ClassesReMapping +from inference_models.models.rfdetr.fused_postprocess import ( + fused_resize_selected_masks, + fused_select_topk_boxes, +) from inference_models.models.rfdetr.post_processor import select_topk_predictions from inference_models.utils.file_system import read_json @@ -86,11 +90,8 @@ def post_process_object_detection_results( predicted_confidence = predicted_confidence[confidence_mask] top_classes = top_classes[confidence_mask] selected_boxes = image_bboxes[confidence_mask] - predicted_confidence, sorted_indices = torch.sort( - predicted_confidence, descending=True - ) - top_classes = top_classes[sorted_indices] - selected_boxes = selected_boxes[sorted_indices] + # select_topk_predictions returns scores sorted descending; the boolean + # filters above preserve that order. 
cxcy = selected_boxes[:, :2] wh = selected_boxes[:, 2:] xy_min = cxcy - 0.5 * wh @@ -131,6 +132,8 @@ def post_process_instance_segmentation_results( threshold: Union[float, torch.Tensor], num_classes: int, classes_re_mapping: Optional[ClassesReMapping], + defer_fused_postprocess_count: bool = False, + deferred_mask_resize_detection_limit: Optional[int] = None, ) -> List[InstanceDetections]: logits_sigmoid = torch.nn.functional.sigmoid(logits) results = [] @@ -140,6 +143,19 @@ def post_process_instance_segmentation_results( for image_bboxes, image_logits, image_masks, image_meta in zip( bboxes, logits_sigmoid, masks, pre_processing_meta ): + fused_result = _try_fused_instance_segmentation_post_process( + image_bboxes=image_bboxes, + image_logits=image_logits, + image_masks=image_masks, + image_meta=image_meta, + threshold=threshold, + classes_re_mapping=classes_re_mapping, + defer_count=defer_fused_postprocess_count, + deferred_mask_resize_detection_limit=deferred_mask_resize_detection_limit, + ) + if fused_result is not None: + results.append(fused_result) + continue confidence, top_classes, image_bboxes, query_indices = select_topk_predictions( logits_sigmoid=image_logits, bboxes_cxcywh=image_bboxes, @@ -169,10 +185,8 @@ def post_process_instance_segmentation_results( top_classes = top_classes[confidence_mask] selected_boxes = image_bboxes[confidence_mask] selected_masks = image_masks[confidence_mask] - confidence, sorted_indices = torch.sort(confidence, descending=True) - top_classes = top_classes[sorted_indices] - selected_boxes = selected_boxes[sorted_indices] - selected_masks = selected_masks[sorted_indices] + # select_topk_predictions returns scores sorted descending; the boolean + # filters above preserve that order. 
cxcy = selected_boxes[:, :2] wh = selected_boxes[:, 2:] xy_min = cxcy - 0.5 * wh @@ -218,6 +232,104 @@ def post_process_instance_segmentation_results( return results +def _try_fused_instance_segmentation_post_process( + image_bboxes: torch.Tensor, + image_logits: torch.Tensor, + image_masks: torch.Tensor, + image_meta: PreProcessingMetadata, + threshold: Union[float, torch.Tensor], + classes_re_mapping: Optional[ClassesReMapping], + defer_count: bool = False, + deferred_mask_resize_detection_limit: Optional[int] = None, +) -> Optional[InstanceDetections]: + if isinstance(threshold, torch.Tensor) or classes_re_mapping is None: + return None + if ( + image_meta.pad_left != 0 + or image_meta.pad_top != 0 + or image_meta.pad_right != 0 + or image_meta.pad_bottom != 0 + or image_meta.static_crop_offset.offset_x != 0 + or image_meta.static_crop_offset.offset_y != 0 + or image_meta.nonsquare_intermediate_size is not None + or image_meta.original_size.width != image_meta.size_after_pre_processing.width + or image_meta.original_size.height + != image_meta.size_after_pre_processing.height + ): + return None + fused = fused_select_topk_boxes( + image_bboxes=image_bboxes, + image_logits=image_logits, + threshold=threshold, + inference_width=image_meta.inference_size.width, + inference_height=image_meta.inference_size.height, + scale_width=image_meta.scale_width, + scale_height=image_meta.scale_height, + original_width=image_meta.original_size.width, + original_height=image_meta.original_size.height, + class_mapping=classes_re_mapping.class_mapping, + return_cpu_count=not defer_count, + ) + if fused is None: + return None + confidence, top_classes, selected_boxes, query_indices, selected_count = fused + if defer_count: + aligned_masks = fused_resize_selected_masks( + image_masks=image_masks, + query_indices=query_indices, + count=selected_count, + output_height=image_meta.original_size.height, + output_width=image_meta.original_size.width, + 
detection_limit=deferred_mask_resize_detection_limit, + ) + if aligned_masks is None: + return None + image_metadata = {"valid_count": selected_count} + if deferred_mask_resize_detection_limit is not None: + image_metadata.update( + { + "mask_resize_detection_limit": min( + int(deferred_mask_resize_detection_limit), + aligned_masks.shape[0], + ), + "source_masks": image_masks, + "query_indices": query_indices, + "output_height": image_meta.original_size.height, + "output_width": image_meta.original_size.width, + } + ) + return InstanceDetections( + xyxy=selected_boxes, + confidence=confidence, + class_id=top_classes.int(), + mask=aligned_masks, + image_metadata=image_metadata, + ) + if selected_count == 0: + aligned_masks = torch.empty( + size=(0, image_meta.original_size.height, image_meta.original_size.width), + dtype=torch.bool, + device=image_bboxes.device, + ) + else: + selected_masks = image_masks[query_indices] + aligned_masks = ( + functional.resize( + selected_masks, + [image_meta.original_size.height, image_meta.original_size.width], + interpolation=functional.InterpolationMode.BILINEAR, + ) + .gt_(0.0) + .to(dtype=torch.bool) + ) + return InstanceDetections( + xyxy=selected_boxes.round().int(), + confidence=confidence, + class_id=top_classes.int(), + mask=aligned_masks, + ) + + def post_process_instance_segmentation_results_to_rle_masks( bboxes: torch.Tensor, logits: torch.Tensor, @@ -264,10 +376,8 @@ def post_process_instance_segmentation_results_to_rle_masks( top_classes = top_classes[confidence_mask] selected_boxes = image_bboxes[confidence_mask] selected_masks = image_masks[confidence_mask] - confidence, sorted_indices = torch.sort(confidence, descending=True) - top_classes = top_classes[sorted_indices] - selected_boxes = selected_boxes[sorted_indices] - selected_masks = selected_masks[sorted_indices] + # select_topk_predictions returns scores sorted descending; the boolean + # filters above preserve that order. 
cxcy = selected_boxes[:, :2] wh = selected_boxes[:, 2:] xy_min = cxcy - 0.5 * wh @@ -292,8 +402,7 @@ def post_process_instance_segmentation_results_to_rle_masks( image_meta.pad_bottom, ) selected_boxes_xyxy = selected_boxes_xyxy_pct * denorm_size_whwh - aligned_boxes, rle_masks = [], [] - for bbox, mask in align_instance_segmentation_results_to_rle_masks( + aligned_boxes, aligned_masks = align_instance_segmentation_results( image_bboxes=selected_boxes_xyxy, masks=selected_masks, padding=padding, @@ -303,9 +412,8 @@ def post_process_instance_segmentation_results_to_rle_masks( size_after_pre_processing=image_meta.size_after_pre_processing, inference_size=denorm_size, static_crop_offset=image_meta.static_crop_offset, - ): - aligned_boxes.append(bbox) - rle_masks.append(mask) + ) + rle_masks = [torch_mask_to_coco_rle(mask) for mask in aligned_masks] instances_masks = InstancesRLEMasks.from_coco_rle_masks( image_size=( image_meta.original_size.height, @@ -313,11 +421,10 @@ def post_process_instance_segmentation_results_to_rle_masks( ), masks=rle_masks, ) - if len(aligned_boxes) > 0: - aligned_boxes_tensor = torch.stack(aligned_boxes, dim=0) + if aligned_boxes.shape[0] > 0: final_results.append( InstanceDetections( - xyxy=aligned_boxes_tensor.round().int(), + xyxy=aligned_boxes.round().int(), confidence=confidence, class_id=top_classes.int(), mask=instances_masks, diff --git a/inference_models/inference_models/models/rfdetr/fused_postprocess.py b/inference_models/inference_models/models/rfdetr/fused_postprocess.py new file mode 100644 index 0000000000..d068a18206 --- /dev/null +++ b/inference_models/inference_models/models/rfdetr/fused_postprocess.py @@ -0,0 +1,252 @@ +from typing import Optional, Tuple + +import torch + +try: + import triton + import triton.language as tl +except ImportError: + triton = None + tl = None + + +MAX_RFDETR_DETECTIONS = 100 + + +if triton is not None: + + @triton.jit + def _select_topk_boxes_kernel( + logits_ptr, + bboxes_ptr, + 
class_mapping_ptr, + scores_out_ptr, + classes_out_ptr, + boxes_out_ptr, + queries_out_ptr, + count_out_ptr, + threshold: tl.constexpr, + inference_width: tl.constexpr, + inference_height: tl.constexpr, + scale_width: tl.constexpr, + scale_height: tl.constexpr, + original_width: tl.constexpr, + original_height: tl.constexpr, + num_queries: tl.constexpr, + num_logits_classes: tl.constexpr, + has_class_mapping: tl.constexpr, + block_size: tl.constexpr, + ): + offsets = tl.arange(0, block_size) + valid_offsets = offsets < (num_queries * num_logits_classes) + scores = tl.load(logits_ptr + offsets, mask=valid_offsets, other=-float("inf")) + scores = tl.where(valid_offsets, scores, -float("inf")) + selected_count = tl.full((), 0, tl.int32) + iteration = tl.full((), 0, tl.int32) + top_score = threshold + 1.0 + + while (iteration < 100) & (top_score > threshold): + top_score = tl.max(scores, axis=0) + is_top = scores == top_score + top_offset = tl.min(tl.where(is_top, offsets, block_size), axis=0) + query_index = top_offset // num_logits_classes + raw_class_id = top_offset - query_index * num_logits_classes + + if has_class_mapping: + class_id = tl.load(class_mapping_ptr + raw_class_id) + keep = (class_id >= 0) & (top_score > threshold) + else: + class_id = raw_class_id + keep = (raw_class_id < (num_logits_classes - 1)) & ( + top_score > threshold + ) + + out_index = selected_count + cx = tl.load(bboxes_ptr + query_index * 4 + 0) + cy = tl.load(bboxes_ptr + query_index * 4 + 1) + w = tl.load(bboxes_ptr + query_index * 4 + 2) + h = tl.load(bboxes_ptr + query_index * 4 + 3) + x1 = (cx - 0.5 * w) * inference_width / scale_width + y1 = (cy - 0.5 * h) * inference_height / scale_height + x2 = (cx + 0.5 * w) * inference_width / scale_width + y2 = (cy + 0.5 * h) * inference_height / scale_height + x1 = tl.minimum(tl.maximum(x1, 0.0), original_width) + y1 = tl.minimum(tl.maximum(y1, 0.0), original_height) + x2 = tl.minimum(tl.maximum(x2, 0.0), original_width) + y2 = 
tl.minimum(tl.maximum(y2, 0.0), original_height) + + tl.store(scores_out_ptr + out_index, top_score, mask=keep) + tl.store(classes_out_ptr + out_index, class_id, mask=keep) + tl.store(queries_out_ptr + out_index, query_index, mask=keep) + tl.store(boxes_out_ptr + out_index * 4 + 0, x1, mask=keep) + tl.store(boxes_out_ptr + out_index * 4 + 1, y1, mask=keep) + tl.store(boxes_out_ptr + out_index * 4 + 2, x2, mask=keep) + tl.store(boxes_out_ptr + out_index * 4 + 3, y2, mask=keep) + selected_count += keep.to(tl.int32) + scores = tl.where(offsets == top_offset, -float("inf"), scores) + iteration += 1 + + tl.store(count_out_ptr, selected_count) + + @triton.jit + def _resize_selected_masks_kernel( + masks_ptr, + query_indices_ptr, + count_ptr, + output_ptr, + in_height: tl.constexpr, + in_width: tl.constexpr, + out_height: tl.constexpr, + out_width: tl.constexpr, + block_size: tl.constexpr, + ): + det_index = tl.program_id(0) + pixel_block = tl.program_id(1) + count = tl.load(count_ptr) + offsets = pixel_block * block_size + tl.arange(0, block_size) + total_pixels = out_height * out_width + valid = (det_index < count) & (offsets < total_pixels) + query_index = tl.load(query_indices_ptr + det_index, mask=det_index < count) + out_y = offsets // out_width + out_x = offsets - out_y * out_width + + in_y = (out_y.to(tl.float32) + 0.5) * in_height / out_height - 0.5 + in_x = (out_x.to(tl.float32) + 0.5) * in_width / out_width - 0.5 + y0f = tl.floor(in_y) + x0f = tl.floor(in_x) + y0 = y0f.to(tl.int32) + x0 = x0f.to(tl.int32) + y1 = y0 + 1 + x1 = x0 + 1 + wy = in_y - y0f + wx = in_x - x0f + y0 = tl.minimum(tl.maximum(y0, 0), in_height - 1) + y1 = tl.minimum(tl.maximum(y1, 0), in_height - 1) + x0 = tl.minimum(tl.maximum(x0, 0), in_width - 1) + x1 = tl.minimum(tl.maximum(x1, 0), in_width - 1) + + mask_base = masks_ptr + query_index * in_height * in_width + v00 = tl.load(mask_base + y0 * in_width + x0, mask=valid, other=0.0) + v01 = tl.load(mask_base + y0 * in_width + x1, mask=valid, 
other=0.0) + v10 = tl.load(mask_base + y1 * in_width + x0, mask=valid, other=0.0) + v11 = tl.load(mask_base + y1 * in_width + x1, mask=valid, other=0.0) + top = v00 * (1.0 - wx) + v01 * wx + bottom = v10 * (1.0 - wx) + v11 * wx + resized = top * (1.0 - wy) + bottom * wy + tl.store( + output_ptr + det_index * total_pixels + offsets, + resized > 0.0, + mask=valid, + ) + + +def fused_select_topk_boxes( + image_bboxes: torch.Tensor, + image_logits: torch.Tensor, + threshold: float, + inference_width: int, + inference_height: int, + scale_width: float, + scale_height: float, + original_width: int, + original_height: int, + class_mapping: Optional[torch.Tensor], + return_cpu_count: bool = True, +) -> Optional[ + Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor] +]: + if triton is None or not image_logits.is_cuda or image_logits.ndim != 2: + return None + num_queries, num_logits_classes = image_logits.shape + if num_queries != MAX_RFDETR_DETECTIONS: + return None + block_size = triton.next_power_of_2(num_queries * num_logits_classes) + scores = torch.empty((MAX_RFDETR_DETECTIONS,), device=image_logits.device) + classes = torch.empty( + (MAX_RFDETR_DETECTIONS,), dtype=torch.int32, device=image_logits.device + ) + boxes = torch.empty( + (MAX_RFDETR_DETECTIONS, 4), + dtype=image_bboxes.dtype, + device=image_bboxes.device, + ) + query_indices = torch.zeros( + (MAX_RFDETR_DETECTIONS,), dtype=torch.int32, device=image_logits.device + ) + count = torch.empty((1,), dtype=torch.int32, device=image_logits.device) + if class_mapping is None: + class_mapping = torch.empty((1,), dtype=torch.int32, device=image_logits.device) + has_class_mapping = False + else: + has_class_mapping = True + _select_topk_boxes_kernel[(1,)]( + image_logits, + image_bboxes, + class_mapping, + scores, + classes, + boxes, + query_indices, + count, + float(threshold), + int(inference_width), + int(inference_height), + float(scale_width), + float(scale_height), + int(original_width), 
+ int(original_height), + num_queries, + num_logits_classes, + has_class_mapping, + block_size, + num_warps=8, + ) + if not return_cpu_count: + return scores, classes, boxes, query_indices.to(dtype=torch.long), count + selected_count = int(count.cpu().item()) + return ( + scores[:selected_count], + classes[:selected_count], + boxes[:selected_count], + query_indices[:selected_count].to(dtype=torch.long), + selected_count, + ) + + +def fused_resize_selected_masks( + image_masks: torch.Tensor, + query_indices: torch.Tensor, + count: torch.Tensor, + output_height: int, + output_width: int, + detection_limit: Optional[int] = None, +) -> Optional[torch.Tensor]: + if triton is None or not image_masks.is_cuda: + return None + output = torch.empty( + (MAX_RFDETR_DETECTIONS, output_height, output_width), + dtype=torch.bool, + device=image_masks.device, + ) + _, input_height, input_width = image_masks.shape + block_size = 256 + if detection_limit is None: + detection_limit = MAX_RFDETR_DETECTIONS + detection_limit = min(max(int(detection_limit), 1), MAX_RFDETR_DETECTIONS) + grid = ( + detection_limit, + triton.cdiv(output_height * output_width, block_size), + ) + _resize_selected_masks_kernel[grid]( + image_masks, + query_indices, + count, + output, + input_height, + input_width, + output_height, + output_width, + block_size, + num_warps=2, + ) + return output diff --git a/inference_models/inference_models/models/rfdetr/pre_processing.py b/inference_models/inference_models/models/rfdetr/pre_processing.py index 3e2ceaad09..c8ff7f529b 100644 --- a/inference_models/inference_models/models/rfdetr/pre_processing.py +++ b/inference_models/inference_models/models/rfdetr/pre_processing.py @@ -11,6 +11,7 @@ tensor F.resize → F.normalize """ +import threading from typing import List, Optional, Tuple, Union import numpy as np @@ -40,6 +41,9 @@ pre_process_numpy_image, ) +_PINNED_BUFFER_STORAGE = threading.local() +_NORMALIZATION_STORAGE = threading.local() + def 
pre_process_network_input( images: Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]], @@ -117,15 +121,24 @@ def pre_process_network_input( target_size=target_size, input_color_mode=input_color_mode, pre_processing_overrides=pre_processing_overrides, + use_pinned_output=( + target_device.type == "cuda" and len(image_list) == 1 + ), ) else: raise TypeError( f"Unsupported image input type for RFDETR pre-processing: {type(img)}" ) - tensors.append(tensor.to(device=target_device)) + non_blocking = target_device.type == "cuda" and tensor.is_pinned() + tensors.append(tensor.to(device=target_device, non_blocking=non_blocking)) + if non_blocking: + _record_pinned_buffer_copy(target_device=target_device) metadata.append(meta) - batch = torch.stack(tensors).contiguous() + if len(tensors) == 1: + batch = tensors[0].unsqueeze(0).contiguous() + else: + batch = torch.stack(tensors).contiguous() return batch, metadata @@ -136,6 +149,7 @@ def _pre_process_numpy( target_size: ImageDimensions, input_color_mode: Optional[ColorMode], pre_processing_overrides: Optional[PreProcessingOverrides], + use_pinned_output: bool, ) -> Tuple[torch.Tensor, PreProcessingMetadata]: """numpy / uint8-tensor branch: PIL chain matching training source-of-truth. @@ -145,6 +159,7 @@ def _pre_process_numpy( `training_input_size` (matching training's SquareResize). Otherwise we stretch directly in a single PIL F.resize step. 
""" + swap_tensor_channels = False if _needs_two_step_resize(network_input): intermediate_image, meta = _dataset_version_resize_uint8( image=image, @@ -171,7 +186,9 @@ def _pre_process_numpy( width=image.shape[1], height=image.shape[0] ) if input_color_mode != network_input.color_mode: - image = image[:, :, ::-1] + swap_tensor_channels = True + else: + swap_tensor_channels = False pil = Image.fromarray(np.ascontiguousarray(image)) meta = _build_metadata( original_size=original_size, @@ -180,12 +197,102 @@ def _pre_process_numpy( static_crop_offset=static_crop_offset, ) - resized = TF.resize(pil, (target_size.height, target_size.width), antialias=True) + resized = pil.resize( + (target_size.width, target_size.height), + resample=Image.Resampling.BILINEAR, + ) + tensor = _pil_image_to_normalized_tensor( + image=resized, + network_input=network_input, + swap_tensor_channels=swap_tensor_channels, + use_pinned_output=use_pinned_output, + ) + if tensor is not None: + return tensor, meta tensor = TF.to_tensor(resized) + if swap_tensor_channels and tensor.shape[0] == 3: + tensor = tensor[[2, 1, 0], :, :] tensor = _apply_normalization(tensor, network_input) return tensor, meta +def _pil_image_to_normalized_tensor( + image: Image.Image, + network_input: NetworkInputDefinition, + swap_tensor_channels: bool, + use_pinned_output: bool = False, +) -> Optional[torch.Tensor]: + if not network_input.normalization or network_input.input_channels != 3: + return None + image_array = np.asarray(image) + if image_array.ndim != 3 or image_array.shape[2] != 3: + return None + multiplier, bias = _get_normalization_constants(network_input.normalization) + channel_order = (2, 1, 0) if swap_tensor_channels else (0, 1, 2) + shape = (3, image_array.shape[0], image_array.shape[1]) + if use_pinned_output: + normalized_tensor = _get_pinned_normalized_buffer(shape=shape) + normalized = normalized_tensor.numpy() + else: + normalized_tensor = None + normalized = np.empty(shape, dtype=np.float32) + 
for output_channel, input_channel in enumerate(channel_order): + channel = normalized[output_channel] + np.multiply( + image_array[:, :, input_channel], + multiplier[output_channel], + out=channel, + casting="unsafe", + ) + channel += bias[output_channel] + if normalized_tensor is not None: + return normalized_tensor + return torch.from_numpy(normalized) + + +def _get_normalization_constants( + normalization: Tuple[List[float], List[float]], +) -> Tuple[np.ndarray, np.ndarray]: + key = (tuple(normalization[0]), tuple(normalization[1])) + cached_key = getattr(_NORMALIZATION_STORAGE, "key", None) + if cached_key != key: + _NORMALIZATION_STORAGE.key = key + mean = np.asarray(normalization[0], dtype=np.float32) + std = np.asarray(normalization[1], dtype=np.float32) + _NORMALIZATION_STORAGE.multiplier = np.float32(1.0 / 255.0) / std + _NORMALIZATION_STORAGE.bias = -mean / std + return ( + _NORMALIZATION_STORAGE.multiplier, + _NORMALIZATION_STORAGE.bias, + ) + + +def _get_pinned_normalized_buffer( + shape: Tuple[int, int, int], +) -> torch.Tensor: + copy_event = getattr(_PINNED_BUFFER_STORAGE, "copy_event", None) + if copy_event is not None: + copy_event.synchronize() + buffer = getattr(_PINNED_BUFFER_STORAGE, "normalized_buffer", None) + if ( + buffer is None + or tuple(buffer.shape) != shape + or buffer.dtype != torch.float32 + or not buffer.is_pinned() + ): + buffer = torch.empty(shape, dtype=torch.float32, pin_memory=True) + _PINNED_BUFFER_STORAGE.normalized_buffer = buffer + return buffer + + +def _record_pinned_buffer_copy(target_device: torch.device) -> None: + copy_event = getattr(_PINNED_BUFFER_STORAGE, "copy_event", None) + if copy_event is None: + copy_event = torch.cuda.Event() + _PINNED_BUFFER_STORAGE.copy_event = copy_event + copy_event.record(torch.cuda.current_stream(target_device)) + + def _needs_two_step_resize(network_input: NetworkInputDefinition) -> bool: """True when the dataset-version resize_mode is non-stretch — the resize Roboflow's exporter 
applied at version creation needs to be replayed at diff --git a/inference_models/inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py b/inference_models/inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py index e8196f8eae..194eccc1d2 100644 --- a/inference_models/inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py +++ b/inference_models/inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py @@ -175,6 +175,10 @@ def from_pretrained( default_cuda_graph_cache_size=default_trt_cuda_graph_cache_size, cuda_graph_cache=trt_cuda_graph_cache, ) + if trt_cuda_graph_cache is None: + trt_cuda_graph_cache = TRTCudaGraphCache( + capacity=default_trt_cuda_graph_cache_size + ) return cls( engine=engine, input_name=inputs[0], @@ -247,7 +251,10 @@ def pre_process( image_size_wh=image_size, pre_processing_overrides=pre_processing_overrides, ) - self._pre_process_stream.synchronize() + if not kwargs.get("defer_cuda_stream_sync", False): + torch.cuda.current_stream(self._device).wait_stream( + self._pre_process_stream + ) return pre_processed_images, pre_processing_meta def forward( @@ -259,6 +266,7 @@ def forward( cache = self._trt_cuda_graph_cache if not disable_cuda_graphs else None with self._lock: with use_cuda_context(context=self._cuda_context): + self._inference_stream.wait_stream(self._pre_process_stream) detections, labels, masks = infer_from_trt_engine( pre_processed_images=pre_processed_images, trt_config=self._trt_config, @@ -269,7 +277,13 @@ def forward( outputs=self._output_names, stream=self._inference_stream, trt_cuda_graph_cache=cache, + synchronize=False, + cuda_graph_replay_warmup_count=64, ) + if not kwargs.get("defer_cuda_stream_sync", False): + torch.cuda.current_stream(self._device).wait_stream( + self._inference_stream + ) return detections, labels, masks def post_process( @@ -296,6 +310,7 @@ def post_process( default_confidence=INFERENCE_MODELS_RFDETR_DEFAULT_CONFIDENCE, ) with 
torch.cuda.stream(self._post_process_stream): + self._post_process_stream.wait_stream(self._inference_stream) for result_element in model_results: result_element.record_stream(self._post_process_stream) bboxes, logits, masks = model_results @@ -308,6 +323,12 @@ def post_process( threshold=confidence_filter.get_threshold(self.class_names), num_classes=len(self.class_names), classes_re_mapping=self._classes_re_mapping, + defer_fused_postprocess_count=kwargs.get( + "defer_fused_postprocess_count", False + ), + deferred_mask_resize_detection_limit=kwargs.get( + "deferred_mask_resize_detection_limit" + ), ) else: results = post_process_instance_segmentation_results_to_rle_masks( @@ -319,7 +340,7 @@ def post_process( num_classes=len(self.class_names), classes_re_mapping=self._classes_re_mapping, ) - self._post_process_stream.synchronize() + torch.cuda.current_stream(self._device).wait_stream(self._post_process_stream) return results @property