[Feature-request] Add --no-persistenced to 98-nvidia.sh hook to avoid permission errors in containerized environments #265
Description
Environment
- Slurm: Deployed on Kubernetes via the Slinky Project
- Slurm image: `ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04` (includes Enroot + Pyxis)
- nvidia-container-toolkit: v1.19.0 (bundled in the Slinky image)
Problem
I encountered the same nvidia-persistenced/socket: operation not permitted error as described in SlinkyProject/slurm-operator#99 (NVIDIA Hook Failure). While investigating a fix, I found that the issue can be resolved at the enroot hook level.
In 98-nvidia.sh, the current `cli_args` initialization is:

```shell
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
```

When `nvidia-container-cli configure` runs inside a Kubernetes pod, it attempts to bind-mount /var/run/nvidia-persistenced/socket into the container rootfs. In containerized (non-bare-metal) environments, this mount operation is denied by filesystem/permission restrictions, causing GPU workloads submitted via `srun --container-image=...` to fail.
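Before touching the hook, it may help to confirm from inside the affected pod that the socket really is absent. A minimal check, assuming the socket path reported in the mount error:

```shell
# Minimal check (socket path taken from the mount error in this report;
# run inside the slurmd pod). A missing socket means the bind mount
# attempted by nvidia-container-cli cannot succeed.
SOCKET=/run/nvidia-persistenced/socket
if [ -S "$SOCKET" ]; then
    echo "persistenced socket present: $SOCKET"
else
    echo "persistenced socket absent: $SOCKET"
fi
```

In a Slinky NodeSet pod this prints the "absent" branch, since nvidia-persistenced runs (if at all) on the host, not in the pod.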
A comment on slurm-operator#99 suggested that this was fixed in nvidia-container-toolkit#1593 (v1.18.2+). However, I have confirmed that the nvidia-container-toolkit version in my Slinky image is already v1.19.0, and the error still persists:
```console
root@gpu-gpu-1:/tmp# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.19.0
commit: ec7b4e2fa2caecad6d89be4a26029b831fe7503a
```

This indicates that upgrading nvidia-container-toolkit alone does not fully resolve the issue in containerized Kubernetes environments. The 98-nvidia.sh hook still unconditionally allows nvidia-container-cli to attempt mounting the persistenced socket, which fails due to permission restrictions in the pod.
Fix
Adding --no-persistenced to the default cli_args array in 98-nvidia.sh resolves the issue:
```shell
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
```

Before — original 98-nvidia.sh, error on socket mount:
```console
root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi
nvidia-container-cli: mount error: mount operation failed: /run/enroot/overlay/run/nvidia-persistenced/socket: operation not permitted
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
```

After — patched 98-nvidia.sh with --no-persistenced, GPUs accessible:
```console
root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.9.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Fri Mar 27 04:49:59 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
...
```

Rationale
- In containerized environments (e.g., Slinky NodeSet pods), nvidia-persistenced is not running and its socket does not exist, so attempting to mount it will always fail.
- Even on bare metal where nvidia-persistenced is running, its driver-persistence effect operates at the host kernel level. Enroot containers access GPUs through the driver directly and do not need the persistenced socket mounted to function correctly.
- This is consistent with --no-cgroups already being included in cli_args: both skip functionality that is managed externally by the container runtime or host environment.
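If skipping persistenced unconditionally is a concern for bare-metal deployments, an alternative would be to add the flag only when the host socket is missing. This is a sketch, not the patch proposed above; the socket path is an assumption based on the error message in this report:

```shell
# Sketch of a conditional variant for 98-nvidia.sh: keep current behavior
# on hosts where nvidia-persistenced is actually running, and skip the
# socket mount only where it is guaranteed to fail.
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
if [ ! -S /run/nvidia-persistenced/socket ]; then
    # No persistenced socket on this node/pod: the bind mount cannot
    # succeed, so tell nvidia-container-cli not to attempt it.
    cli_args+=("--no-persistenced")
fi
printf '%s\n' "${cli_args[@]}"
```

Either way, the unconditional --no-persistenced shown in the Fix section is the simpler change and matches how --no-cgroups is already handled.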