jkerfsx approved these changes on Mar 4, 2026
```rust
        std::process::exit(1);
    },

/// Build input tensors using per-input shape profiles at the given phase (min/opt/max).
fn build_inputs(engine: &UniquePtr<Engine>, phase: &str) -> Vec<InputTensor> {
```
nit: I'd maybe rename `phase` to `profile_name` or `shape_mode`?
```rust
/// Represents the batch dimensions supported by a TensorRT engine.
///
/// Deprecated: Use `get_input_shape_profiles()` for per-input profiles.
#[derive(Debug, Clone)]
```

```rust
let input_infos = engine.get_input_dims();

profiles.iter().zip(input_infos.iter()).map(|(profile, info)| {
    let shape = match phase {
```
We should make this `match` over an enum instead of over strings.
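A minimal sketch of the reviewer's suggestion, replacing the string-typed `phase` argument with an enum so the compiler enforces exhaustiveness. The names `ShapePhase` and `InputShapeProfile` (and its fields) are illustrative assumptions, not the actual libinfer API:

```rust
/// Illustrative replacement for the stringly-typed `phase` parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShapePhase {
    Min,
    Opt,
    Max,
}

/// Illustrative per-input profile holding min/opt/max shapes.
struct InputShapeProfile {
    min: Vec<usize>,
    opt: Vec<usize>,
    max: Vec<usize>,
}

impl InputShapeProfile {
    /// Exhaustive match over the enum: a typo'd variant is a compile error,
    /// whereas a typo'd string would only fail (or be silently skipped) at runtime.
    fn shape(&self, phase: ShapePhase) -> &[usize] {
        match phase {
            ShapePhase::Min => &self.min,
            ShapePhase::Opt => &self.opt,
            ShapePhase::Max => &self.max,
        }
    }
}

fn main() {
    let profile = InputShapeProfile {
        min: vec![1, 3, 320, 320],
        opt: vec![4, 3, 640, 640],
        max: vec![8, 3, 640, 640],
    };
    println!("max shape = {:?}", profile.shape(ShapePhase::Max));
}
```

Call sites would then pass `ShapePhase::Max` instead of `"max"`, and adding a fourth phase later forces every `match` to be updated.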
```rust
name: input_info.name.clone(),
data: input_data,
dtype: input_info.dtype.clone(),
```

```rust
let dtype_size: usize = match info.dtype {
```
I should probably know this since I wrote it, but do we have an FP16 type?
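For context on the FP16 question: TensorRT does expose a 2-byte half-precision type (`kHALF` in the C++ API), so a dtype-size match would carry a 2-byte arm for it. A hedged sketch follows; the `DataType` enum and its variant names here are illustrative, not the crate's actual FFI type:

```rust
/// Illustrative dtype enum; the real binding's variants may differ.
#[derive(Debug, Clone, Copy)]
enum DataType {
    Float32,
    Float16, // half precision: 2 bytes per element (TensorRT's kHALF)
    Int32,
    Int8,
}

/// Bytes per element for each dtype. An exhaustive match means adding a new
/// variant later is a compile error until its size is specified here.
fn dtype_size(dtype: DataType) -> usize {
    match dtype {
        DataType::Float32 | DataType::Int32 => 4,
        DataType::Float16 => 2,
        DataType::Int8 => 1,
    }
}

fn main() {
    println!("FP16 element size = {} bytes", dtype_size(DataType::Float16));
}
```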
```rust
info!(" iterations : {}", num_runs);
info!(" avg : {:.3}ms", avg_ms);
info!(" p50 : {:.3}ms", p50_ms);
info!(" p99 : {:.3}ms", p99_ms);
```
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request on Mar 10, 2026:
…puts + safety/perf improvements

Rewrites engine.h, engine.cpp, and lib.rs to merge upstream PR saronic-technologies#31's per-input heterogeneous dynamic shape support with our CUDA graph, managed memory (IO_COHERENCE), and Direct I/O optimizations.

Key changes:

- Per-input shape profiles (InputShapeProfile) from TRT optimization profiles, replacing the global batch-dim model. Each input's dynamic dimension is resolved independently via precomputed staticByteCount (one integer division, zero heap allocations).
- Output buffer sizes computed via TRT shape propagation (set inputs to max shapes, query output shapes) instead of manual mOutputLengths.
- Pre-computed mInputIndices/mOutputIndices vectors for O(1) hot-path lookups (was O(n) linear scan per tensor).
- Shape caching via mLastInputShapes: skips redundant setInputShape() calls when shapes are unchanged between frames.
- Safety: getDataTypeSize() default case throws instead of UB, bounds checking on write_input_buffer/read_output_buffer, std::call_once for thread-safe logger initialization, documented Send safety on ffi::Engine.
- Removed unsafe raw pointer APIs (get_input_buffer_ptr, get_output_buffer_ptr) in favor of safe write/read buffer API.
- Backward-compatible get_batch_dims() via get_input_shape_profiles().

Benchmark (Jetson AGX Orin 64GB, YOLOv8n FP16 × 3, 2048 iters):

- Stock: 29308 µs, 34.2 Hz
- CUDA Graph: 26536 µs, 37.3 Hz (1.10×)
- Managed: 28640 µs, 34.8 Hz
- Managed+Graph: 25930 µs, 38.7 Hz (1.13×)
- Direct I/O: 25098 µs, 40.1 Hz (1.17×, 4210 µs saved/frame)
- Tail latency: p99 30186→25399 µs (15.9% tighter)
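The shape-caching bullet above can be sketched standalone: remember the last shape set per tensor and skip the backend call when it is unchanged between frames. This is a hedged illustration of the idea only; `ShapeCache` and the `set_input_shape` callback are hypothetical stand-ins for the commit's `mLastInputShapes` and TensorRT's `setInputShape()`:

```rust
use std::collections::HashMap;

/// Hypothetical cache of the last shape applied per input tensor.
struct ShapeCache {
    last: HashMap<String, Vec<i64>>,
}

impl ShapeCache {
    fn new() -> Self {
        Self { last: HashMap::new() }
    }

    /// Applies `shape` for `name` only if it differs from the cached value.
    /// Returns true when the backend call was actually made.
    fn set_if_changed(
        &mut self,
        name: &str,
        shape: &[i64],
        set_input_shape: &mut dyn FnMut(&str, &[i64]),
    ) -> bool {
        if self.last.get(name).map(|s| s.as_slice()) == Some(shape) {
            return false; // shape unchanged since the last frame: skip the call
        }
        set_input_shape(name, shape);
        self.last.insert(name.to_string(), shape.to_vec());
        true
    }
}

fn main() {
    let mut calls = 0;
    let mut cache = ShapeCache::new();
    // First frame sets the shape; an identical second frame skips the call.
    cache.set_if_changed("images", &[1, 3, 640, 640], &mut |_n, _s| calls += 1);
    cache.set_if_changed("images", &[1, 3, 640, 640], &mut |_n, _s| calls += 1);
    println!("backend calls made: {}", calls);
}
```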
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request on Mar 10, 2026:
… and latest numbers
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request on Mar 10, 2026:
- bench_pr31_comparison.rs: 32768-iter benchmark matching PR saronic-technologies#31's methodology. Tests stock infer(), CUDA Graph, Managed (zerocopy), and Direct I/O modes. Uses get_input_shape_profiles() + build_inputs() identical to PR saronic-technologies#31.
- PR31_COMPARISON.md: full results on 4 engines (YOLOv8n 640/320, TCN FP16/FP32). Key finding: CUDA Graph beats PR saronic-technologies#31's stock YOLOv8n by 5.6% (1.664 vs 1.763ms p50). CUDA Graph delivers 1.86x on small models, 1.12x on medium, 1.03x on large. Direct I/O: 1.66x on small models, tightens p99 tail by 22-26%.
Sorry, ignore the commits that referenced this pull request. Working on adding CUDA graphs and IO_COHERENCE to libinfer. I will create a PR once that is finished.
Adds per-input heterogeneous dynamic shape support: each input tensor now has its own min/opt/max shape profile from TensorRT, replacing the single global batch size. Input buffers are allocated per their own max shape; output buffers are sized via TensorRT shape propagation after setting all inputs to max.

`infer()` resolves each input's dynamic dimension independently using precomputed metadata (one integer division per dynamic input, zero heap allocations). Also removed `get_output_len()` and `mOutputLengths`; output sizes are now queried dynamically from `mContext->getTensorShape()` after input shapes are set.

Questions:

- `get_batch_dims()`?

Benchmarks:
- FP16 engine (3 tensors, 1 dynamic dim on [1])
- FP32 engine (3 tensors, 1 dynamic dim on [1])
- YOLOv8n (static, single input)
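The "one integer division per dynamic input" resolution described above can be sketched as follows. Assuming the bytes-per-unit of the dynamic dimension are precomputed at engine load (the commit calls this `staticByteCount`), the dynamic extent falls out of the payload length with a single division. All names here are illustrative, not the actual libinfer types:

```rust
/// Illustrative precomputed metadata for one dynamic input.
struct DynamicInputMeta {
    dynamic_axis: usize,      // which dimension the engine marks as -1
    static_byte_count: usize, // bytes per unit of the dynamic dimension (precomputed)
    dims: Vec<i64>,           // engine dims, with -1 at dynamic_axis
}

/// Resolves the full shape from the payload size: one integer division,
/// no heap allocation beyond the returned dims, and a divisibility check
/// so a malformed payload is rejected rather than silently truncated.
fn resolve_shape(meta: &DynamicInputMeta, payload_bytes: usize) -> Option<Vec<i64>> {
    if payload_bytes % meta.static_byte_count != 0 {
        return None; // payload is not a whole multiple of one unit
    }
    let dyn_extent = (payload_bytes / meta.static_byte_count) as i64;
    let mut dims = meta.dims.clone();
    dims[meta.dynamic_axis] = dyn_extent;
    Some(dims)
}

fn main() {
    // Example: FP16 input with dims [-1, 3, 640, 640], 2 bytes per element.
    let meta = DynamicInputMeta {
        dynamic_axis: 0,
        static_byte_count: 3 * 640 * 640 * 2,
        dims: vec![-1, 3, 640, 640],
    };
    let payload = 4 * 3 * 640 * 640 * 2; // a batch-4 payload
    println!("resolved dims = {:?}", resolve_shape(&meta, payload));
}
```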