[Feature-request] Add --no-persistenced to 98-nvidia.sh hook to avoid permission errors in containerized environments #265

@sharlynxy

Description

Environment

  • Slurm: Deployed on Kubernetes via Slinky Project
  • Slurm Image: ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04 (includes Enroot + Pyxis)
  • nvidia-container-toolkit: v1.19.0 (bundled in the Slinky image)

Problem

I encountered the same nvidia-persistenced/socket: operation not permitted error as described in SlinkyProject/slurm-operator#99 (NVIDIA Hook Failure). While investigating a fix, I found that the issue can be resolved at the enroot hook level.

In 98-nvidia.sh, the current cli_args initialization is:

cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")

When nvidia-container-cli configure runs inside a Kubernetes pod, it attempts to bind-mount /var/run/nvidia-persistenced/socket into the container rootfs. In containerized (non-baremetal) environments, this mount operation is denied due to filesystem/permission restrictions, causing GPU workloads submitted via srun --container-image=... to fail.

A comment on slurm-operator#99 suggested that this was fixed in nvidia-container-toolkit#1593 (v1.18.2+). However, I have confirmed that the nvidia-container-toolkit version in my Slinky image is already v1.19.0, and the error still persists:

root@gpu-gpu-1:/tmp# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.19.0
commit: ec7b4e2fa2caecad6d89be4a26029b831fe7503a

This indicates that upgrading nvidia-container-toolkit alone does not fully resolve the issue in containerized Kubernetes environments. The 98-nvidia.sh hook still unconditionally allows nvidia-container-cli to attempt mounting the persistenced socket, which fails due to permission restrictions in the pod.

Fix

Adding --no-persistenced to the default cli_args array in 98-nvidia.sh resolves the issue:

cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
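For existing deployments, the same change can be applied in place without editing the file by hand. The following one-liner is a sketch: the hook path /etc/enroot/hooks.d/98-nvidia.sh is taken from the transcripts in this report, and sed keeps a .bak backup of the original file.

```shell
# Sketch: insert --no-persistenced after --no-cgroups in the hook's cli_args.
# Hook path assumed from this report; a .bak backup of the original is kept.
sed -i.bak 's/"--no-cgroups"/"--no-cgroups" "--no-persistenced"/' \
    /etc/enroot/hooks.d/98-nvidia.sh
```

This only rewrites the literal "--no-cgroups" token, so re-running it on an already patched file would duplicate the flag; check the result with grep cli_args before restarting workloads.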

Before — original 98-nvidia.sh, error on socket mount:

root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")

root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi
nvidia-container-cli: mount error: mount operation failed: /run/enroot/overlay/run/nvidia-persistenced/socket: operation not permitted
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

After — patched 98-nvidia.sh with --no-persistenced, GPUs accessible:

root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")

root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.9.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Fri Mar 27 04:49:59 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

......

Rationale

  • In containerized environments (e.g., Slinky NodeSet pods), nvidia-persistenced is not running and its socket does not exist — attempting to mount it will always fail.
  • Even on bare metal where nvidia-persistenced is running, driver persistence takes effect at the host kernel level. Enroot containers access GPUs through the driver directly and do not need the persistenced socket mounted to function correctly.
  • This mirrors the existing --no-cgroups entry in cli_args: both skip functionality that is managed externally by the container runtime or host environment.
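If maintainers would rather not disable persistenced unconditionally, a conditional variant is also possible: append the flag only when the host socket is absent. This is a hypothetical alternative, not the patch proposed above; the socket path follows the error messages in this report.

```shell
# Hypothetical alternative for 98-nvidia.sh: keep persistenced handling on
# hosts where the socket exists, pass --no-persistenced everywhere else
# (e.g. inside Kubernetes pods, where the socket is never present).
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
if [ ! -S /var/run/nvidia-persistenced/socket ]; then
    cli_args+=("--no-persistenced")
fi
```

On a Slinky NodeSet pod the socket test fails, so the flag is added and the mount error is avoided; on a bare-metal host running nvidia-persistenced, behavior is unchanged.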
