jkerfsx approved these changes on Mar 4, 2026
```rust
        std::process::exit(1);
    },

/// Build input tensors using per-input shape profiles at the given phase (min/opt/max).
fn build_inputs(engine: &UniquePtr<Engine>, phase: &str) -> Vec<InputTensor> {
```
nit: I'd maybe rename `phase` to `profile_name` or `shape_mode`?
```rust
/// Represents the batch dimensions supported by a TensorRT engine.
///
/// Deprecated: Use `get_input_shape_profiles()` for per-input profiles.
#[derive(Debug, Clone)]
```

```rust
let input_infos = engine.get_input_dims();

profiles.iter().zip(input_infos.iter()).map(|(profile, info)| {
    let shape = match phase {
```
We should make this `match` over an enum instead of over strings.
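A minimal sketch of the reviewer's suggestion, replacing the string-typed `phase` argument with an enum so the compiler enforces exhaustiveness. The names `ShapePhase` and `InputShapeProfile` (and its fields) are illustrative assumptions, not the actual libinfer API:

```rust
/// Illustrative replacement for the stringly-typed `phase` parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShapePhase {
    Min,
    Opt,
    Max,
}

/// Illustrative per-input profile holding min/opt/max shapes.
struct InputShapeProfile {
    min: Vec<usize>,
    opt: Vec<usize>,
    max: Vec<usize>,
}

impl InputShapeProfile {
    /// Exhaustive match over the enum: a typo'd variant is a compile error,
    /// whereas a typo'd string would only fail (or be silently skipped) at runtime.
    fn shape(&self, phase: ShapePhase) -> &[usize] {
        match phase {
            ShapePhase::Min => &self.min,
            ShapePhase::Opt => &self.opt,
            ShapePhase::Max => &self.max,
        }
    }
}

fn main() {
    let profile = InputShapeProfile {
        min: vec![1, 3, 320, 320],
        opt: vec![4, 3, 640, 640],
        max: vec![8, 3, 640, 640],
    };
    println!("max shape = {:?}", profile.shape(ShapePhase::Max));
}
```

Call sites would then pass `ShapePhase::Max` instead of `"max"`, and adding a fourth phase later forces every `match` to be updated.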
```rust
name: input_info.name.clone(),
data: input_data,
dtype: input_info.dtype.clone(),
```

```rust
let dtype_size: usize = match info.dtype {
```
I should probably know this since I wrote it, but do we have an FP16 type?
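For context on the FP16 question: TensorRT does expose a 2-byte half-precision type (`kHALF` in the C++ API), so a dtype-size match would carry a 2-byte arm for it. A hedged sketch follows; the `DataType` enum and its variant names here are illustrative, not the crate's actual FFI type:

```rust
/// Illustrative dtype enum; the real binding's variants may differ.
#[derive(Debug, Clone, Copy)]
enum DataType {
    Float32,
    Float16, // half precision: 2 bytes per element (TensorRT's kHALF)
    Int32,
    Int8,
}

/// Bytes per element for each dtype. An exhaustive match means adding a new
/// variant later is a compile error until its size is specified here.
fn dtype_size(dtype: DataType) -> usize {
    match dtype {
        DataType::Float32 | DataType::Int32 => 4,
        DataType::Float16 => 2,
        DataType::Int8 => 1,
    }
}

fn main() {
    println!("FP16 element size = {} bytes", dtype_size(DataType::Float16));
}
```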
```rust
info!(" iterations : {}", num_runs);
info!(" avg : {:.3}ms", avg_ms);
info!(" p50 : {:.3}ms", p50_ms);
info!(" p99 : {:.3}ms", p99_ms);
```
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request on Mar 10, 2026:
…puts + safety/perf improvements

Rewrites engine.h, engine.cpp, and lib.rs to merge upstream PR saronic-technologies#31's per-input heterogeneous dynamic shape support with our CUDA graph, managed memory (IO_COHERENCE), and Direct I/O optimizations.

Key changes:

- Per-input shape profiles (InputShapeProfile) from TRT optimization profiles, replacing the global batch-dim model. Each input's dynamic dimension is resolved independently via precomputed staticByteCount (one integer division, zero heap allocations).
- Output buffer sizes computed via TRT shape propagation (set inputs to max shapes, query output shapes) instead of manual mOutputLengths.
- Pre-computed mInputIndices/mOutputIndices vectors for O(1) hot-path lookups (was O(n) linear scan per tensor).
- Shape caching via mLastInputShapes: skips redundant setInputShape() calls when shapes are unchanged between frames.
- Safety: getDataTypeSize() default case throws instead of UB, bounds checking on write_input_buffer/read_output_buffer, std::call_once for thread-safe logger initialization, documented Send safety on ffi::Engine.
- Removed unsafe raw pointer APIs (get_input_buffer_ptr, get_output_buffer_ptr) in favor of safe write/read buffer API.
- Backward-compatible get_batch_dims() via get_input_shape_profiles().

Benchmark (Jetson AGX Orin 64GB, YOLOv8n FP16 × 3, 2048 iters):

- Stock: 29308 µs, 34.2 Hz
- CUDA Graph: 26536 µs, 37.3 Hz (1.10×)
- Managed: 28640 µs, 34.8 Hz
- Managed+Graph: 25930 µs, 38.7 Hz (1.13×)
- Direct I/O: 25098 µs, 40.1 Hz (1.17×, 4210 µs saved/frame)
- Tail latency: p99 30186→25399 µs (15.9% tighter)
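The shape-caching bullet above can be sketched standalone: remember the last shape set per tensor and skip the backend call when it is unchanged between frames. This is a hedged illustration of the idea only; `ShapeCache` and the `set_input_shape` callback are hypothetical stand-ins for the commit's `mLastInputShapes` and TensorRT's `setInputShape()`:

```rust
use std::collections::HashMap;

/// Hypothetical cache of the last shape applied per input tensor.
struct ShapeCache {
    last: HashMap<String, Vec<i64>>,
}

impl ShapeCache {
    fn new() -> Self {
        Self { last: HashMap::new() }
    }

    /// Applies `shape` for `name` only if it differs from the cached value.
    /// Returns true when the backend call was actually made.
    fn set_if_changed(
        &mut self,
        name: &str,
        shape: &[i64],
        set_input_shape: &mut dyn FnMut(&str, &[i64]),
    ) -> bool {
        if self.last.get(name).map(|s| s.as_slice()) == Some(shape) {
            return false; // shape unchanged since the last frame: skip the call
        }
        set_input_shape(name, shape);
        self.last.insert(name.to_string(), shape.to_vec());
        true
    }
}

fn main() {
    let mut calls = 0;
    let mut cache = ShapeCache::new();
    // First frame sets the shape; an identical second frame skips the call.
    cache.set_if_changed("images", &[1, 3, 640, 640], &mut |_n, _s| calls += 1);
    cache.set_if_changed("images", &[1, 3, 640, 640], &mut |_n, _s| calls += 1);
    println!("backend calls made: {}", calls);
}
```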
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request on Mar 10, 2026:
… and latest numbers
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request on Mar 10, 2026:
- bench_pr31_comparison.rs: 32768-iter benchmark matching PR saronic-technologies#31's methodology. Tests stock infer(), CUDA Graph, Managed (zerocopy), and Direct I/O modes. Uses get_input_shape_profiles() + build_inputs() identical to PR saronic-technologies#31.
- PR31_COMPARISON.md: full results on 4 engines (YOLOv8n 640/320, TCN FP16/FP32). Key finding: CUDA Graph beats PR saronic-technologies#31's stock YOLOv8n by 5.6% (1.664 vs 1.763ms p50). CUDA Graph delivers 1.86x on small models, 1.12x on medium, 1.03x on large. Direct I/O: 1.66x on small models, tightens p99 tail by 22-26%.
Sorry, ignore the commits that referenced this pull request. Working on adding CUDA graphs and IO_COHERENCE to libinfer. I will create a PR once that is finished.
Adds per-input heterogeneous dynamic shape support: each input tensor now has its own min/opt/max shape profile from TensorRT, replacing the single global batch size. Input buffers are allocated per their own max shape; output buffers are sized via TensorRT shape propagation after setting all inputs to max.

`infer()` resolves each input's dynamic dimension independently using precomputed metadata (one integer division per dynamic input, zero heap allocations). Also removed `get_output_len()` and `mOutputLengths`; output sizes are now queried dynamically from `mContext->getTensorShape()` after input shapes are set.

Questions:

- `get_batch_dims()`?

Benchmarks:
- FP16 engine (3 tensors, 1 dynamic dim on [1])
- FP32 engine (3 tensors, 1 dynamic dim on [1])
- YOLOv8n (static, single input)
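The "one integer division per dynamic input" resolution described above can be sketched as follows. Assuming the bytes-per-unit of the dynamic dimension are precomputed at engine load (the commit calls this `staticByteCount`), the dynamic extent falls out of the payload length with a single division. All names here are illustrative, not the actual libinfer types:

```rust
/// Illustrative precomputed metadata for one dynamic input.
struct DynamicInputMeta {
    dynamic_axis: usize,      // which dimension the engine marks as -1
    static_byte_count: usize, // bytes per unit of the dynamic dimension (precomputed)
    dims: Vec<i64>,           // engine dims, with -1 at dynamic_axis
}

/// Resolves the full shape from the payload size: one integer division,
/// no heap allocation beyond the returned dims, and a divisibility check
/// so a malformed payload is rejected rather than silently truncated.
fn resolve_shape(meta: &DynamicInputMeta, payload_bytes: usize) -> Option<Vec<i64>> {
    if payload_bytes % meta.static_byte_count != 0 {
        return None; // payload is not a whole multiple of one unit
    }
    let dyn_extent = (payload_bytes / meta.static_byte_count) as i64;
    let mut dims = meta.dims.clone();
    dims[meta.dynamic_axis] = dyn_extent;
    Some(dims)
}

fn main() {
    // Example: FP16 input with dims [-1, 3, 640, 640], 2 bytes per element.
    let meta = DynamicInputMeta {
        dynamic_axis: 0,
        static_byte_count: 3 * 640 * 640 * 2,
        dims: vec![-1, 3, 640, 640],
    };
    let payload = 4 * 3 * 640 * 640 * 2; // a batch-4 payload
    println!("resolved dims = {:?}", resolve_shape(&meta, payload));
}
```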