Instead of running a script on node startup, it may be better to bake the setup into a Docker image, for quicker startup and robust versioning.
Here is a basic draft. I can open a PR for this if it makes sense.
Dockerfile
FROM kindest/node:v1.31.4
RUN apt-get update && \
    apt-get install -y gpg && \
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \
    apt-get update && \
    apt-get install -y nvidia-container-toolkit && \
    nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && \
    nvidia-ctk runtime configure --runtime=containerd --set-as-default --cdi.enabled
COPY entrypoint /entrypoint
ENTRYPOINT [ "/entrypoint", "/sbin/init" ]
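For anyone who wants to try the draft, building the image and pointing kind at it could look roughly like the sketch below. The image tag `kind-node-nvidia:v1.31.4`, the cluster name `gpu-test`, and the function name are hypothetical placeholders. It also assumes the `entrypoint` script sits next to the Dockerfile and is executable. Whatever cluster config the current setup relies on to expose GPUs to the node is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical image tag and cluster name; adjust to taste.
IMAGE="kind-node-nvidia:v1.31.4"
CLUSTER="gpu-test"

# Build the custom node image from the current directory, then ask
# kind to use it for the cluster's nodes instead of the stock image.
build_and_create() {
  docker build -t "$IMAGE" . &&
    kind create cluster --name "$CLUSTER" --image "$IMAGE"
}
```

Calling `build_and_create` from the directory that holds the Dockerfile runs both steps. Nothing runs until the function is invoked.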
entrypoint
#!/usr/bin/env bash

# Unmount the masked /proc/driver/nvidia to allow
# dynamically generated MIG devices to be discovered
umount -R /proc/driver/nvidia

# Make it so that calls into nvidia-smi / libnvidia-ml.so do not
# attempt to recreate nvidia device nodes or reset their permissions if
# tampered with.
# Absolute paths are used so the script does not depend on the working directory.
cp /proc/driver/nvidia/params /root/gpu-params
sed -i 's/^ModifyDeviceFiles: 1$/ModifyDeviceFiles: 0/' /root/gpu-params
mount --bind /root/gpu-params /proc/driver/nvidia/params

# Hand off to the stock kind entrypoint with the original arguments
exec /usr/local/bin/entrypoint "$@"
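To show what the `sed` step in the entrypoint changes, here is a small sketch that applies the same expression to a scratch file. The sample content is an illustrative stand-in for `/proc/driver/nvidia/params`, not a dump from a real driver, and the variable name `params_sample` is made up for this example. It assumes GNU `sed` for the `-i` flag.

```shell
#!/usr/bin/env bash
# Scratch file standing in for a copy of /proc/driver/nvidia/params
# (illustrative content only).
params_sample=$(mktemp)
printf 'ModifyDeviceFiles: 1\nDeviceFileMode: 438\n' > "$params_sample"

# Same expression as in the entrypoint: flip the flag from 1 to 0 and
# leave every other line untouched.
sed -i 's/^ModifyDeviceFiles: 1$/ModifyDeviceFiles: 0/' "$params_sample"

cat "$params_sample"
```

In the entrypoint, the edited copy is then bind-mounted over the original, since the file under `/proc` cannot be edited in place.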