Add cuda_buffer_backend and torch_buffer_backend for rosidl::Buffer#1
Conversation
ahcorde
left a comment
I also detected some linter failures,
and test_cuda_image_cpu_fallback_fastrtps_launch is not passing,
with this error:
6: FAIL: test_cpu_fallback_paths (cuda_buffer_backend.TestCudaImageCpuFallbackFastRTPS.test_cpu_fallback_paths)
6: Test all CPU fallback paths and normal IPC simultaneously over FastRTPS.
6: ----------------------------------------------------------------------
6: Traceback (most recent call last):
6: File "/tmp/ws/src/rosidl_buffer_backends/cuda_buffer_backend/cuda_buffer_backend/test/test_cuda_image_cpu_fallback_fastrtps_launch.py", line 203, in test_cpu_fallback_paths
6: self.assertTrue(
6: AssertionError: False is not true : Cross-device fallback validation failed (expected backend="cpu")
Thanks for the review and feedback! I've fixed the linter errors.
ahcorde
left a comment
I'm getting some errors when compiling the code:

- I have to set the gcc/g++ version for torch_buffer:

set(CMAKE_C_COMPILER gcc-12)
set(CMAKE_CXX_COMPILER g++-12)

- I'm getting this link error:
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaGraphAddDependencies_v2@libcudart.so.12'
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaStreamGetCaptureInfo_v3@libcudart.so.12'
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaStreamUpdateCaptureDependencies_v2@libcudart.so.12'
Thanks for testing! Both issues are fixed. For the g++-12 pin: that was the CUDA toolkit's host_config.h rejecting a newer host GCC (CUDA 11.8 allows up to GCC 11; CUDA 12.0–12.3 allows up to GCC 12).
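As an aside, a narrower alternative to pinning the global C/C++ compilers is to pin only nvcc's host compiler. This is an illustrative sketch, not code from this PR: `CMAKE_CUDA_HOST_COMPILER` is a standard CMake variable, but the package name and placement here are assumptions, and the variable must be set before the CUDA language is enabled.

```cmake
# Illustrative sketch: point only nvcc's host compilation at g++-12,
# leaving the rest of the package on the default toolchain.
# Must be set before CUDA is enabled (i.e. before project()/enable_language).
set(CMAKE_CUDA_HOST_COMPILER g++-12)
project(torch_buffer LANGUAGES CXX CUDA)
```

This avoids rebuilding non-CUDA translation units with a different compiler than the rest of the workspace.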
hidmic
left a comment
Wow, this is a ton of code. I mostly skimmed over it. It's way past my human context window 😅 It's going to be hard to provide a proper review.
hidmic
left a comment
Second pass. I have a somewhat broad question: how resilient is the CUDA backend to crashes? IIUC the CUDA VMM (and presumably the device driver?) will take care of releasing blocks as soon as all handles are closed. Would an abnormal process exit still result in handles closing? And beyond the CUDA VMM, I see we are opening named shared memory segments and Unix sockets, and we do proper clean-up, but what if processes crash?
I ask because it is unfortunately not uncommon for processes in a complex ROS stack to die or get killed. Would that litter the OS? Could that result in system resource leakage? And if it does, what are the mitigation steps?
Other than that, really nice code!
Great point, thanks for raising it! From a correctness perspective, if the publisher process crashes after a subscriber has imported the VMM block, the existing imported mapping should remain valid until the subscriber process releases its imported handle/mapping. From a cleanup perspective, CUDA and the OS should release process-owned CUDA handles and mappings when each process exits, including abnormal exits. The remaining weak points are the named shared memory segments and the Unix socket path: if the publisher crashes before normal teardown, those names may be left behind. We will address the second point in a follow-up PR, together with corresponding tests.
@hidmic @ahcorde Thanks again for the thorough reviews and all the helpful feedback. I’ve addressed the active comments and will merge the current PRs so we can start the package release setup and verify the release procedure/permissions. Please still feel free to leave additional comments after merge; I’ll track and address any follow-up feedback in new PRs.
Description
This pull request adds CUDA buffer backend implementations for `rosidl::Buffer`, enabling zero-copy GPU memory sharing between ROS 2 publishers and subscribers. It also adds a header-only conversion layer between `at::Tensor` and `tensor_msgs/ExperimentalTensor`, using the registered `rosidl::Buffer` backend for the underlying tensor storage.

CUDA buffer backend: enables fully asynchronous, zero-copy GPU data transport; data can stay on the GPU across ROS nodes.

- `allocate_msg` allocates from a CUDA Virtual Memory Management (VMM) based IPC memory pool; each block carries a pre-exported POSIX FD for zero-overhead IPC reuse.
- `from_output_buffer(...)` returns a `WriteHandle` for publisher/output paths, while `from_input_buffer(...)` returns a `ReadHandle` for subscriber/input paths.
- On transmit, the plugin checks locality via a shared-memory endpoint registry: for same-host, same-GPU peers it sends the block's FD over a Unix socket, plus an IPC event handle for cross-process GPU sync; otherwise it falls back to CPU serialization.
- On receive, the block is imported and mapped (cached per source block), with a shared-memory refcount and UID validation to prevent stale reuse.
- A background recycler thread handles event synchronization and block reclamation off the callback thread.

Torch conversions: adds a header-only conversion layer between `tensor_msgs/ExperimentalTensor` and `at::Tensor`.

- `ExperimentalTensor` carries DLPack-aligned tensor metadata (dtype, shape, strides, byte_offset) and stores bytes in a normal `uint8[] data` field, which maps to `rosidl::Buffer<uint8_t>`. Storage and transport are delegated to whichever buffer backend is registered for that field, such as `cuda_buffer_backend` or the CPU fallback.
- `allocate_tensor_msg(shape, dtype)` pre-sizes the message buffer and selects the accelerated backend when available.
- `from_output_tensor_msg(msg)` gives publishers a writable `at::Tensor` view over `msg.data`; `from_input_tensor_msg(msg, clone=true)` gives subscribers an independent tensor by default, with `clone=false` available for zero-copy read-only views.
- `to_tensor_msg(msg, tensor)` copies an existing `at::Tensor` into pre-allocated message storage and updates the DLPack metadata.

This pull request consists of the following key components:

- `cuda_buffer`: core CUDA-backed `rosidl::Buffer` storage library. Provides the VMM-backed CUDA IPC memory pool, CUDA event-based `ReadHandle`/`WriteHandle` synchronization, host endpoint locality discovery, IPC import/cache utilities, and the user-facing `allocate_buffer`, `from_input_buffer`, `from_output_buffer`, and `to_buffer` APIs.
- `cuda_buffer_backend`: pluginlib backend for CUDA-backed `rosidl::Buffer` transport. Handles endpoint discovery, `CudaBufferDescriptor` serialization, IPC handle transfer, imported block lifecycle, and CPU fallback when CUDA IPC is unavailable.
- `cuda_buffer_backend_msgs`: ROS 2 message definition for `CudaBufferDescriptor`.
- `tensor_msgs`: internal experimental DLPack-aligned tensor message package. Currently exposes `ExperimentalTensor`.
- `torch_conversions`: header-only LibTorch conversion helpers for `tensor_msgs/ExperimentalTensor`. Provides tensor allocation, input/output tensor views, tensor-to-message copy helpers, and DLPack bridge internals on top of `rosidl::Buffer` storage.
- `libtorch_vendor`: vendor package for bringing in the LibTorch C++ distribution used by `torch_conversions`.

Is this user-facing behavior change?
No.
Did you use Generative AI?
Yes. Claude (claude-4.6-opus) via Cursor was used to assist with creating an initial prototype version of the changes contained in this PR.
Additional Information
This PR is part of the broader ROS 2 native buffer feature introduced in this post.