
feat: Heterogeneously dynamic inputs #31

Open
mwinters-saronic wants to merge 4 commits into main from mpw/heterogeneously_dynamic_inputs

Conversation

@mwinters-saronic
Contributor

Adds per-input heterogeneous dynamic shape support: each input tensor now has its own min/opt/max shape profile from TensorRT, replacing the single global batch size. Input buffers are allocated according to each input's own max shape; output buffers are sized via TensorRT shape propagation after setting all inputs to their max shapes. infer() resolves each input's dynamic dimension independently using precomputed metadata (one integer division per dynamic input, zero heap allocations).

Also removed get_output_len() and mOutputLengths; output sizes are now queried dynamically from mContext->getTensorShape() after the input shapes are set.

Questions:

  • remove get_batch_dims()?

Benchmarks:

FP16 engine (3 tensors, 1 dynamic dim on [1])

| dynD | Avg | p50 | p99 | Throughput |
| --- | --- | --- | --- | --- |
| 1 | 5.57ms | 5.57ms | 5.65ms | 179.6 infer/s |
| 16 | 11.47ms | 11.48ms | 11.76ms | 87.2 infer/s |
| 64 | 38.96ms | 39.07ms | 39.67ms | 25.7 infer/s |

FP32 engine (3 tensors, 1 dynamic dim on [1])

| dynD | Avg | p50 | p99 | Throughput |
| --- | --- | --- | --- | --- |
| 1 | 20.00ms | 20.00ms | 20.18ms | 50.0 infer/s |
| 16 | 37.37ms | 37.23ms | 39.86ms | 26.8 infer/s |
| 64 | 127.2ms | | | ~7.9 infer/s |

YOLOv8n (static, single input)

| Metric | Value |
| --- | --- |
| Avg | 1.750ms |
| p50 | 1.763ms |
| p99 | 1.781ms |
| Min | 1.642ms |
| Max | 1.787ms |
| Throughput | 571.5 infer/s |

@jkerfsx left a comment
Looks good to me.

std::process::exit(1);
},
/// Build input tensors using per-input shape profiles at the given phase (min/opt/max).
fn build_inputs(engine: &UniquePtr<Engine>, phase: &str) -> Vec<InputTensor> {
nit: I'd maybe rename phase to profile_name or shape_mode?

/// Represents the batch dimensions supported by a TensorRT engine.
///
/// Deprecated: Use `get_input_shape_profiles()` for per-input profiles.
#[derive(Debug, Clone)]
nit: #[deprecated]

let input_infos = engine.get_input_dims();

profiles.iter().zip(input_infos.iter()).map(|(profile, info)| {
let shape = match phase {
We should make this a match over an enum instead of over strings.

name: input_info.name.clone(),
data: input_data,
dtype: input_info.dtype.clone(),
let dtype_size: usize = match info.dtype {
I should probably know this since I wrote it, but do we have an FP16 type?

info!(" iterations : {}", num_runs);
info!(" avg : {:.3}ms", avg_ms);
info!(" p50 : {:.3}ms", p50_ms);
info!(" p99 : {:.3}ms", p99_ms);
Nice!

GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request Mar 10, 2026
…puts + safety/perf improvements

Rewrites engine.h, engine.cpp, and lib.rs to merge upstream PR saronic-technologies#31's
per-input heterogeneous dynamic shape support with our CUDA graph,
managed memory (IO_COHERENCE), and Direct I/O optimizations.

Key changes:
- Per-input shape profiles (InputShapeProfile) from TRT optimization
  profiles, replacing the global batch-dim model. Each input's dynamic
  dimension is resolved independently via precomputed staticByteCount
  (one integer division, zero heap allocations).
- Output buffer sizes computed via TRT shape propagation (set inputs to
  max shapes, query output shapes) instead of manual mOutputLengths.
- Pre-computed mInputIndices/mOutputIndices vectors for O(1) hot-path
  lookups (was O(n) linear scan per tensor).
- Shape caching via mLastInputShapes: skips redundant setInputShape()
  calls when shapes are unchanged between frames.
- Safety: getDataTypeSize() default case throws instead of UB, bounds
  checking on write_input_buffer/read_output_buffer, std::call_once
  for thread-safe logger initialization, documented Send safety on
  ffi::Engine.
- Removed unsafe raw pointer APIs (get_input_buffer_ptr,
  get_output_buffer_ptr) in favor of safe write/read buffer API.
- Backward-compatible get_batch_dims() via get_input_shape_profiles().

Benchmark (Jetson AGX Orin 64GB, YOLOv8n FP16 × 3, 2048 iters):
  Stock:         29308 µs  34.2 Hz
  CUDA Graph:    26536 µs  37.3 Hz  (1.10×)
  Managed:       28640 µs  34.8 Hz
  Managed+Graph: 25930 µs  38.7 Hz  (1.13×)
  Direct I/O:    25098 µs  40.1 Hz  (1.17×, 4210 µs saved/frame)
  Tail latency:  p99 30186→25399 µs (15.9% tighter)
GiveThanksAlways added a commit to GiveThanksAlways/libinfer that referenced this pull request Mar 10, 2026
- bench_pr31_comparison.rs: 32768-iter benchmark matching PR saronic-technologies#31's methodology
  Tests stock infer(), CUDA Graph, Managed (zerocopy), Direct I/O modes
  Uses get_input_shape_profiles() + build_inputs() identical to PR saronic-technologies#31

- PR31_COMPARISON.md: full results on 4 engines (YOLOv8n 640/320, TCN FP16/FP32)
  Key finding: CUDA Graph beats PR saronic-technologies#31's stock YOLOv8n by 5.6% (1.664 vs 1.763ms p50)
  CUDA Graph delivers 1.86x on small models, 1.12x on medium, 1.03x on large
  Direct I/O: 1.66x on small models, tightens p99 tail by 22-26%
@GiveThanksAlways

Sorry, ignore the commits that referenced this pull request. I'm working on adding CUDA graphs and IO_COHERENCE to libinfer and will open a PR once that's finished.

