[Feature-request] Add --no-persistenced to 98-nvidia.sh hook to avoid permission errors in containerized environments #265
Description
Environment
- Slurm: Deployed on Kubernetes via the Slinky Project
- Slurm image: `ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04` (includes Enroot + Pyxis)
- nvidia-container-toolkit: v1.19.0 (bundled in the Slinky image)
Problem
I encountered the same nvidia-persistenced/socket: operation not permitted error as described in SlinkyProject/slurm-operator#99 (NVIDIA Hook Failure). While investigating a fix, I found that the issue can be resolved at the enroot hook level.
In 98-nvidia.sh, the current `cli_args` initialization is:

```shell
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
```

When `nvidia-container-cli configure` runs inside a Kubernetes pod, it attempts to bind-mount /var/run/nvidia-persistenced/socket into the container rootfs. In containerized (non-bare-metal) environments, this mount operation is denied by filesystem/permission restrictions, causing GPU workloads submitted via `srun --container-image=...` to fail.
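Before touching the hook, it may help to confirm from inside the affected pod that the socket really is absent. A minimal check, assuming the socket path reported in the mount error:

```shell
# Minimal check (socket path taken from the mount error in this report;
# run inside the slurmd pod). A missing socket means the bind mount
# attempted by nvidia-container-cli cannot succeed.
SOCKET=/run/nvidia-persistenced/socket
if [ -S "$SOCKET" ]; then
    echo "persistenced socket present: $SOCKET"
else
    echo "persistenced socket absent: $SOCKET"
fi
```

In a Slinky NodeSet pod this prints the "absent" branch, since nvidia-persistenced runs (if at all) on the host, not in the pod.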
A comment on slurm-operator#99 suggested that this was fixed in nvidia-container-toolkit#1593 (v1.18.2+). However, I have confirmed that the nvidia-container-toolkit version in my Slinky image is already v1.19.0, and the error still persists:
```console
root@gpu-gpu-1:/tmp# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.19.0
commit: ec7b4e2fa2caecad6d89be4a26029b831fe7503a
```

This indicates that upgrading nvidia-container-toolkit alone does not fully resolve the issue in containerized Kubernetes environments. The 98-nvidia.sh hook still unconditionally allows nvidia-container-cli to attempt mounting the persistenced socket, which fails due to permission restrictions in the pod.
Fix
Adding --no-persistenced to the default cli_args array in 98-nvidia.sh resolves the issue:
```shell
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
```

Before — original 98-nvidia.sh, error on socket mount:
```console
root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi
nvidia-container-cli: mount error: mount operation failed: /run/enroot/overlay/run/nvidia-persistenced/socket: operation not permitted
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
```

After — patched 98-nvidia.sh with --no-persistenced, GPUs accessible:
```console
root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.9.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Fri Mar 27 04:49:59 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
...
```

Rationale
- In containerized environments (e.g., Slinky NodeSet pods), nvidia-persistenced is not running and its socket does not exist, so attempting to mount it will always fail.
- Even on bare metal where nvidia-persistenced is running, its driver-persistence effect operates at the host kernel level. Enroot containers access GPUs through the driver directly and do not need the persistenced socket mounted to function correctly.
- This is consistent with --no-cgroups already being included in cli_args: both skip functionality that is managed externally by the container runtime or host environment.
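If skipping persistenced unconditionally is a concern for bare-metal deployments, an alternative would be to add the flag only when the host socket is missing. This is a sketch, not the patch proposed above; the socket path is an assumption based on the error message in this report:

```shell
# Sketch of a conditional variant for 98-nvidia.sh: keep current behavior
# on hosts where nvidia-persistenced is actually running, and skip the
# socket mount only where it is guaranteed to fail.
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
if [ ! -S /run/nvidia-persistenced/socket ]; then
    # No persistenced socket on this node/pod: the bind mount cannot
    # succeed, so tell nvidia-container-cli not to attempt it.
    cli_args+=("--no-persistenced")
fi
printf '%s\n' "${cli_args[@]}"
```

Either way, the unconditional --no-persistenced shown in the Fix section is the simpler change and matches how --no-cgroups is already handled.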