# Troubleshooting — GPU
## GPU not detected in the VM

Cause: the `gpus` field is not configured in the manifest, or PCI passthrough did not properly attach the GPU to the VM.

Solution:

1. Verify that the `gpus` field is present in your manifest:

   ```yaml
   # vm-gpu.yaml
   spec:
     gpus:
       - name: "nvidia.com/AD102GL_L40S"
   ```

2. Inside the VM, check PCI detection:

   ```bash
   lspci | grep -i nvidia
   ```

3. Verify that the NVIDIA kernel module is loaded:

   ```bash
   lsmod | grep nvidia
   ```

   If there is no output, the drivers are not installed. See the next section.
## Missing NVIDIA drivers

Cause: NVIDIA drivers are not installed in the VM, or the kernel headers do not match the running kernel version.

Solution:

1. Install prerequisites and drivers, via cloud-init or manually:

   ```bash
   sudo apt-get update
   sudo apt-get install -y linux-headers-$(uname -r)
   sudo apt-get install -y nvidia-driver-550 nvidia-utils-550
   ```

2. Reboot the VM after installation:

   ```bash
   sudo reboot
   ```

3. Verify the installation:

   ```bash
   nvidia-smi
   ```
To automate installation, use the `cloudInit` field in your VM manifest so that drivers are installed at first boot.
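For example, the same packages can be installed at first boot. A minimal sketch, assuming the `cloudInit` field accepts a standard `#cloud-config` document (the exact wrapping depends on your platform's VM API):

```yaml
# vm-gpu.yaml — sketch; the cloudInit wrapping is an assumption,
# but the #cloud-config payload is standard cloud-init syntax.
spec:
  cloudInit: |
    #cloud-config
    package_update: true
    packages:
      - linux-headers-generic   # generic headers; adjust if the VM runs a non-generic kernel
      - nvidia-driver-550
      - nvidia-utils-550
    power_state:
      mode: reboot              # reboot once cloud-init finishes installing
```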
## GPU pod in Pending state

Cause: no node has an available GPU, the cluster's GPU configuration is missing, or the GPU Operator is not active.

Solution:

1. Check pod events:

   ```bash
   kubectl describe pod <pod-name>
   ```

   Look for the `Insufficient nvidia.com/gpu` message.

2. Verify that GPU nodes exist and have allocatable GPUs:

   ```bash
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"]}'
   ```

3. Check the GPU nodeGroup configuration in the cluster manifest:

   ```yaml
   # cluster.yaml
   spec:
     nodeGroups:
       gpu-workers:
         minReplicas: 1
         maxReplicas: 4
         instanceType: "u1.2xlarge"
         gpus:
           - name: "nvidia.com/AD102GL_L40S"
   ```

4. Make sure the GPU Operator addon is enabled:

   ```yaml
   # cluster.yaml
   spec:
     addons:
       gpuOperator:
         enabled: true
   ```
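Note that the scheduler only places a pod on a GPU node if the pod actually requests the `nvidia.com/gpu` resource. A minimal sketch of such a request (pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo   # illustrative name
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1   # resource exposed by the NVIDIA device plugin
```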
## `nvidia-smi` fails in a pod

Cause: the GPU Operator is not enabled on the cluster, which prevents automatic driver installation and device plugin availability.

Solution:

1. Enable the GPU Operator addon on the cluster:

   ```yaml
   # cluster.yaml
   spec:
     addons:
       gpuOperator:
         enabled: true
   ```

2. Apply the change:

   ```bash
   kubectl apply -f cluster.yaml
   ```

3. Verify that the GPU Operator pods are running:

   ```bash
   kubectl get pods -n gpu-operator
   ```

4. Once the GPU Operator is operational, recreate your GPU pod.
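As a quick end-to-end check, the recreated pod can simply run `nvidia-smi` once. A sketch, with the pod name and image tag as illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check   # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If `kubectl logs nvidia-smi-check` prints the driver and GPU table, the drivers, device plugin, and container runtime are all wired up correctly.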
## GPU Operator not working

Cause: the addon is not enabled, the operator pods are in an error state, or the nodes do not have physical GPU hardware.

Solution:

1. Verify that the addon is enabled in the cluster manifest:

   ```yaml
   # cluster.yaml
   spec:
     addons:
       gpuOperator:
         enabled: true
   ```

2. Check the GPU Operator pod status:

   ```bash
   kubectl get pods -n gpu-operator
   kubectl describe pod -n gpu-operator <pod-name>
   ```

3. Verify that nodes have GPU hardware:

   ```bash
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.capacity["nvidia.com/gpu"]}'
   ```

4. If pods are in `CrashLoopBackOff`, check the logs:

   ```bash
   kubectl logs -n gpu-operator <pod-name>
   ```