I am able to follow all the steps in the readme and get the expected responses, until the very last pod workload step of:
cat << EOF | kubectl --context=kind-${KIND_CLUSTER_NAME} apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 2
EOF
I get an error in the container (which never starts, so I cannot exec -it into it to investigate):
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.
I have the nvidia-device-plugin pods working correctly. Every step of the setup has worked so far until this very last one. Am I missing something?
More info:
I used this command to spin up the nvidia device plugin pods:
helm upgrade -i \
    --kube-context=kind-${KIND_CLUSTER_NAME} \
    --namespace nvidia \
    --set gfd.enabled=true \
    --set runtimeClassName=nvidia \
    --set deviceListStrategy=volume-mounts \
    --set deviceDiscoveryStrategy=nvml \
    --create-namespace \
    nvidia-device-plugin nvdp/nvidia-device-plugin
Output:
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get pod -n nvidia
NAME                                                              READY   STATUS    RESTARTS        AGE
nvidia-device-plugin-gpu-feature-discovery-r6t7s                  1/1     Running   0               5m44s
nvidia-device-plugin-lsqvc                                        1/1     Running   0               5m44s
nvidia-device-plugin-node-feature-discovery-master-77b96ddqcxlw   1/1     Running   0               5m44s
nvidia-device-plugin-node-feature-discovery-worker-5n8nc          1/1     Running   1 (5m23s ago)   5m44s
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get nodes -o json | jq -r '.items[] | select(.metadata.name | test("-worker[0-9]*$")) | {name: .metadata.name, "nvidia.com/gpu": .status.allocatable["nvidia.com/gpu"]}'
{
  "name": "nvkind-vknmz-worker",
  "nvidia.com/gpu": "1"
}
Cluster GPUs: cluster was created using ./nvkind cluster create
$ ./nvkind cluster print-gpus
[
    {
        "node": "nvkind-vknmz-worker",
        "gpus": [
            {
                "Index": "0",
                "Name": "NVIDIA A100 80GB PCIe",
                "UUID": "GPU-a80141e3-fabe-57be-6d71-7c1b39e79553"
            }
        ]
    }
]
Error:
$ kubectl describe pod gpu-test
Name:             gpu-test
Namespace:        default
Priority:         0
Service Account:  default
Node:             nvkind-vknmz-worker
Start Time:       Fri, 28 Feb 2025 14:26:13 +0000
Labels:           <none>
Annotations:      <none>
Status:           Running
Containers:
  ctr:
    Container ID:  containerd://1f6e8d3be9c54fdcb182f57320e925bcb1d64e408ea04b17740022cac04d87c0
    Image:         ubuntu:22.04
    Image ID:      docker.io/library/ubuntu@sha256:ed1544e454989078f5dec1bfdabd8c5cc9c48e0705d07b678ab6ae3fb61952d2
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
      -L
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Fri, 28 Feb 2025 14:32:02 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stvml (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-stvml:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  8m6s                   default-scheduler  Successfully assigned default/gpu-test to nvkind-vknmz-worker
  Normal   Pulling    8m5s                   kubelet            Pulling image "ubuntu:22.04"
  Normal   Pulled     8m3s                   kubelet            Successfully pulled image "ubuntu:22.04" in 1.715s (1.715s including waiting). Image size: 29545350 bytes.
  Warning  Failed     5m7s (x6 over 8m3s)    kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
  Warning  BackOff    2m54s (x25 over 8m1s)  kubelet            Back-off restarting failed container ctr in pod gpu-test_default(94b2cc75-8ef2-4df6-98d9-2f4015a0dd3e)
  Normal   Created    2m17s (x7 over 8m3s)   kubelet            Created container: ctr
  Normal   Pulled     2m17s (x6 over 8m2s)   kubelet            Container image "ubuntu:22.04" already present on machine
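Since the container dies before anything can exec into it, one possible way to investigate is a variant of the same pod that runs sleep instead of nvidia-smi, so the container stays up and can be inspected. This is only a sketch under assumptions: the pod name gpu-debug is made up for illustration, the GPU limit is set to 1 to match the single allocatable GPU reported above, and the /var/run/nvidia-container-devices path is where I would expect the volume-mounts strategy to place its device markers (not confirmed on this cluster).

```shell
# Free the only allocatable GPU; the crash-looping pod may still hold it.
kubectl --context=kind-${KIND_CLUSTER_NAME} delete pod gpu-test --ignore-not-found

# Hypothetical debug pod: same image and GPU request, but it sleeps so it
# stays running even when nvidia-smi has not been injected.
cat << EOF | kubectl --context=kind-${KIND_CLUSTER_NAME} apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Wait until the container is up, then look at what was (not) injected.
kubectl --context=kind-${KIND_CLUSTER_NAME} wait --for=condition=Ready pod/gpu-debug --timeout=120s
kubectl --context=kind-${KIND_CLUSTER_NAME} exec gpu-debug -- sh -c '
  # Device nodes and the assumed volume-mounts marker directory
  ls -l /dev/nvidia* /var/run/nvidia-container-devices 2>&1
  # Whether the driver binary is visible on PATH
  command -v nvidia-smi || echo "nvidia-smi not on PATH"
'
```

If the device nodes and nvidia-smi are both missing inside gpu-debug, that would suggest the NVIDIA runtime is not being applied to this pod at all, rather than a problem specific to the nvidia-smi command.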