Add cuda_buffer_backend and torch_buffer_backend for rosidl::Buffer#1
Conversation
ahcorde
left a comment
I also detected some linter failures,
and test_cuda_image_cpu_fallback_fastrtps_launch is not passing,
with this error:
6: FAIL: test_cpu_fallback_paths (cuda_buffer_backend.TestCudaImageCpuFallbackFastRTPS.test_cpu_fallback_paths)
6: Test all CPU fallback paths and normal IPC simultaneously over FastRTPS.
6: ----------------------------------------------------------------------
6: Traceback (most recent call last):
6: File "/tmp/ws/src/rosidl_buffer_backends/cuda_buffer_backend/cuda_buffer_backend/test/test_cuda_image_cpu_fallback_fastrtps_launch.py", line 203, in test_cpu_fallback_paths
6: self.assertTrue(
6: AssertionError: False is not true : Cross-device fallback validation failed (expected backend="cpu")
Thanks for the review and feedback! I've fixed the linter errors.
ahcorde
left a comment
I'm getting some errors when compiling the code:

- I have to set the gcc/g++ version for torch_buffer:

set(CMAKE_C_COMPILER gcc-12)
set(CMAKE_CXX_COMPILER g++-12)

- I'm getting this link error:
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaGraphAddDependencies_v2@libcudart.so.12'
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaStreamGetCaptureInfo_v3@libcudart.so.12'
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaStreamUpdateCaptureDependencies_v2@libcudart.so.12'
Thanks for testing! Both issues are fixed. For the g++-12 pin: that was the CUDA toolkit's host_config.h rejecting a newer host GCC (CUDA 11.8 allows up to GCC 11; CUDA 12.0–12.3 allows up to GCC 12).
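As an aside, a narrower alternative to pinning the global C/C++ compilers is to pin only nvcc's host compiler. This is an illustrative sketch, not code from this PR: `CMAKE_CUDA_HOST_COMPILER` is a standard CMake variable, but the package name and placement here are assumptions, and the variable must be set before the CUDA language is enabled.

```cmake
# Illustrative sketch: point only nvcc's host compilation at g++-12,
# leaving the rest of the package on the default toolchain.
# Must be set before CUDA is enabled (i.e. before project()/enable_language).
set(CMAKE_CUDA_HOST_COMPILER g++-12)
project(torch_buffer LANGUAGES CXX CUDA)
```

This avoids rebuilding non-CUDA translation units with a different compiler than the rest of the workspace.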
hidmic
left a comment
Wow, this is a ton of code. I mostly skimmed over it. It's way past my human context window 😅 It's going to be hard to provide a proper review.
hidmic
left a comment
Second pass. I have a somewhat broad question: how resilient is the CUDA backend to crashes? IIUC the CUDA VMM (and presumably the device driver?) will take care of releasing blocks as soon as all handles are closed. Would an abnormal process exit still result in handles closing? And beyond the CUDA VMM, I see we are opening named shared memory segments and Unix sockets, and we do proper clean-up, but what if processes crash?
I ask because it is unfortunately not uncommon for processes in a complex ROS stack to die or get killed. Would that litter the OS? Could that result in system resource leakage? And if it does, what are the mitigation steps?
Other than that, really nice code!
Great point, thanks for raising it! From a correctness perspective, if the publisher process crashes after a subscriber has imported the VMM block, the existing imported mapping should remain valid until the subscriber process releases its imported handle/mapping. From a cleanup perspective, CUDA and the OS should release process-owned CUDA handles and mappings when each process exits, including abnormal exits. The remaining weak points are the named shared memory segments and the Unix socket path: if the publisher crashes before normal teardown, those names may be left behind. We will address the second point in a follow-up PR, together with corresponding tests.
@hidmic @ahcorde Thanks again for the thorough reviews and all the helpful feedback. I’ve addressed the active comments and will merge the current PRs so we can start the package release setup and verify the release procedure/permissions. Please still feel free to leave additional comments after merge; I’ll track and address any follow-up feedback in new PRs.
Description
This pull request adds CUDA buffer backend implementations for `rosidl::Buffer`, enabling zero-copy GPU memory sharing between ROS 2 publishers and subscribers. It also adds a header-only conversion layer between `at::Tensor` and `tensor_msgs/ExperimentalTensor`, using the registered `rosidl::Buffer` backend for the underlying tensor storage.

CUDA buffer backend: enables fully asynchronous, zero-copy GPU data transport; data can stay on the GPU across ROS nodes.

- `allocate_msg` allocates from a CUDA Virtual Memory Management (VMM) based IPC memory pool; each block carries a pre-exported POSIX FD for zero-overhead IPC reuse.
- `from_output_buffer(...)` returns a `WriteHandle` for publisher/output paths, while `from_input_buffer(...)` returns a `ReadHandle` for subscriber/input paths.
- On transmit, the plugin checks locality via a shared-memory endpoint registry: for same-host, same-GPU peers it sends the block's FD over a Unix socket, plus an IPC event handle for cross-process GPU sync; otherwise it falls back to CPU serialization.
- On receive, the block is imported and mapped (cached per source block), with a shared-memory refcount and UID validation to prevent stale reuse.
- A background recycler thread handles event synchronization and block reclamation off the callback thread.

Torch conversions: adds a header-only conversion layer between `tensor_msgs/ExperimentalTensor` and `at::Tensor`.

- `ExperimentalTensor` carries DLPack-aligned tensor metadata (dtype, shape, strides, byte_offset) and stores bytes in a normal `uint8[] data` field, which maps to `rosidl::Buffer<uint8_t>`. Storage and transport are delegated to whichever buffer backend is registered for that field, such as `cuda_buffer_backend` or the CPU fallback.
- `allocate_tensor_msg(shape, dtype)` pre-sizes the message buffer and selects the accelerated backend when available.
- `from_output_tensor_msg(msg)` gives publishers a writable `at::Tensor` view over `msg.data`; `from_input_tensor_msg(msg, clone=true)` gives subscribers an independent tensor by default, with `clone=false` available for zero-copy read-only views.
- `to_tensor_msg(msg, tensor)` copies an existing `at::Tensor` into pre-allocated message storage and updates the DLPack metadata.

This pull request consists of the following key components:

- `cuda_buffer`: core CUDA-backed `rosidl::Buffer` storage library. Provides the VMM-backed CUDA IPC memory pool, CUDA event-based `ReadHandle`/`WriteHandle` synchronization, host endpoint locality discovery, IPC import/cache utilities, and the user-facing `allocate_buffer`, `from_input_buffer`, `from_output_buffer`, and `to_buffer` APIs.
- `cuda_buffer_backend`: pluginlib backend for CUDA-backed `rosidl::Buffer` transport. Handles endpoint discovery, `CudaBufferDescriptor` serialization, IPC handle transfer, imported block lifecycle, and CPU fallback when CUDA IPC is unavailable.
- `cuda_buffer_backend_msgs`: ROS 2 message definition for `CudaBufferDescriptor`.
- `tensor_msgs`: internal experimental DLPack-aligned tensor message package. Currently exposes `ExperimentalTensor`.
- `torch_conversions`: header-only LibTorch conversion helpers for `tensor_msgs/ExperimentalTensor`. Provides tensor allocation, input/output tensor views, tensor-to-message copy helpers, and DLPack bridge internals on top of `rosidl::Buffer` storage.
- `libtorch_vendor`: vendor package for bringing in the LibTorch C++ distribution used by `torch_conversions`.

Is this user-facing behavior change?
No.
Did you use Generative AI?
Yes. Claude (claude-4.6-opus) via Cursor was used to assist with creating an initial prototype version of the changes contained in this PR.
Additional Information
This PR is part of the broader ROS 2 native buffer feature introduced in this post.