skip invoking toolkit.Install when NRI Plugin is enabled#1754

Open
tariq1890 wants to merge 1 commit into main from nri-skip-toolkit-install

tariq1890 (Contributor) commented Apr 7, 2026

This PR fixes the "nvidia-operator-validator pod stuck indefinitely in Terminating" issue observed in clusters that use cri-o with crun.

Problem

By default, the toolkit reads the absolute paths of the low-level runtime binaries from the containerd/cri-o runtime config TOML file and prepends them to the runtimes = ["runc", "crun"] list in the toolkit's own config TOML file. This is how the toolkit binds the nvidia-container-runtime wrapper script to the correct low-level runtime binary.

In NRI Plugin mode, the toolkit no longer reads the containerd/cri-o runtime config TOML file, so it has no way to point the nvidia-container-runtime wrapper script at the correct low-level runtime binary. As a result, pods created before NRI Plugin enablement, when nvidia-container-runtime was bound to the low-level runtime binary sourced from the container runtime config TOML, cannot be deleted afterwards: the wrapper script now binds to runc instead, taken from the default list runtimes = ["runc", "crun"]. This mismatch in what nvidia-container-runtime resolves to before and after NRI Plugin enablement causes the container runtime's KillContainer calls to fail, leaving the pod stuck in Terminating permanently.
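To make the mismatch concrete, here is a simplified model of the lookup, not the toolkit's actual code. It assumes the wrapper picks the first entry in the runtimes list that is available on the host; candidateRuntimes, resolveRuntime, and the /usr/bin/crun path are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// defaultRuntimes mirrors the default list quoted in the PR description.
var defaultRuntimes = []string{"runc", "crun"}

// candidateRuntimes prepends the runtime paths sourced from the container
// runtime config to the default list. Hypothetical helper for illustration.
func candidateRuntimes(fromRuntimeConfig []string) []string {
	out := make([]string, 0, len(fromRuntimeConfig)+len(defaultRuntimes))
	out = append(out, fromRuntimeConfig...)
	return append(out, defaultRuntimes...)
}

// resolveRuntime returns the first candidate that the available function
// reports as present, or an empty string if none is. This "first match wins"
// rule is an assumption of the sketch.
func resolveRuntime(candidates []string, available func(string) bool) string {
	for _, c := range candidates {
		if available(c) {
			return c
		}
	}
	return ""
}

func main() {
	// Hypothetical host: runc and crun are both installed, cri-o uses crun.
	installed := map[string]bool{"runc": true, "crun": true, "/usr/bin/crun": true}
	available := func(name string) bool { return installed[name] }

	// Before NRI Plugin enablement: the path from the runtime config comes first.
	before := resolveRuntime(candidateRuntimes([]string{"/usr/bin/crun"}), available)
	// After NRI Plugin enablement: nothing is sourced, only the defaults remain.
	after := resolveRuntime(candidateRuntimes(nil), available)

	fmt.Println("before NRI Plugin:", before)
	fmt.Println("after NRI Plugin:", after)
}
```

In this model a container started through crun is later handed to runc for teardown, which corresponds to the failing KillContainer calls described above.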

Solution

Currently, on every run the toolkit clears the files in the toolkit root /usr/local/nvidia/toolkit and then installs the toolkit artifacts. With the NRI Plugin, nvidia-container-runtime is no longer needed, so this PR skips toolkit.Install and retains the files installed by the toolkit deployment prior to NRI Plugin enablement. This unblocks termination of the pods created before NRI Plugin enablement, because nvidia-container-runtime still resolves to the low-level runtime binary it was originally configured with.
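A minimal sketch of the guard follows. It does not reproduce the diff in cmd/nvidia-ctk-installer/main.go; the options struct, the installer interface, countingInstaller, and run are hypothetical stand-ins that only illustrate skipping the install step when the NRI Plugin is enabled.

```go
package main

import (
	"errors"
	"fmt"
)

// options holds hypothetical installer settings; the field names are
// illustrative and not taken from the real installer.
type options struct {
	nriPluginEnabled bool
	toolkitRoot      string
}

// installer abstracts the step that clears the toolkit root and reinstalls
// the toolkit artifacts.
type installer interface {
	Install(root string) error
}

// countingInstaller is a stand-in that records how often Install ran.
type countingInstaller struct{ calls int }

func (c *countingInstaller) Install(root string) error {
	if root == "" {
		return errors.New("toolkit root must not be empty")
	}
	c.calls++
	return nil
}

// run skips installation in NRI Plugin mode so that files from an earlier
// deployment stay in place; otherwise it installs as before.
func run(o options, t installer) error {
	if o.nriPluginEnabled {
		return nil
	}
	if err := t.Install(o.toolkitRoot); err != nil {
		return fmt.Errorf("unable to install toolkit: %w", err)
	}
	return nil
}

func main() {
	inst := &countingInstaller{}
	root := "/usr/local/nvidia/toolkit"

	_ = run(options{nriPluginEnabled: true, toolkitRoot: root}, inst)
	fmt.Println("installs with NRI Plugin enabled:", inst.calls)

	_ = run(options{nriPluginEnabled: false, toolkitRoot: root}, inst)
	fmt.Println("installs with NRI Plugin disabled:", inst.calls)
}
```

Returning early, rather than installing and then restoring files, keeps the existing toolkit root untouched, which is the property the fix depends on.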

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
@coveralls

Coverage Report for CI Build 24059267912

Coverage increased (+0.003%) to 43.403%

Details

  • Coverage increased (+0.003%) from the base build.
  • Patch coverage: 2 uncovered changes across 1 file (3 of 5 lines covered, 60.0%).
  • No coverage regressions found.

Uncovered Changes

File                              Changed  Covered  %
cmd/nvidia-ctk-installer/main.go  5        3        60.0%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Relevant Lines: 14826
Covered Lines: 6435
Line Coverage: 43.4%
Coverage Strength: 0.48 hits per line

💛 - Coveralls

tariq1890 requested a review from cdesiniotis on April 7, 2026 01:36