# Troubleshooting — GPU
## GPU not detected in the VM

Cause: the `gpus` field is not configured in the manifest, or PCI passthrough did not properly attach the GPU to the VM.

Solution:

1. Verify that the `gpus` field is present in your manifest:

   ```yaml
   # vm-gpu.yaml
   spec:
     gpus:
       - name: "nvidia.com/AD102GL_L40S"
   ```

2. Inside the VM, check PCI detection:

   ```bash
   lspci | grep -i nvidia
   ```

3. Verify that the NVIDIA kernel module is loaded:

   ```bash
   lsmod | grep nvidia
   ```

   If there is no output, the drivers are not installed. See the next section.
## Missing NVIDIA drivers

Cause: NVIDIA drivers are not installed in the VM, or the kernel headers do not match the running kernel version.

Solution:

1. Install prerequisites and drivers, via cloud-init or manually:

   ```bash
   sudo apt-get update
   sudo apt-get install -y linux-headers-$(uname -r)
   sudo apt-get install -y nvidia-driver-550 nvidia-utils-550
   ```

2. Reboot the VM after installation:

   ```bash
   sudo reboot
   ```

3. Verify the installation:

   ```bash
   nvidia-smi
   ```
To automate installation, use the `cloudInit` field in your VM manifest so that drivers are installed at first boot.
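For example, the same packages can be installed at first boot. A minimal sketch, assuming the `cloudInit` field accepts a standard `#cloud-config` document (the exact wrapping depends on your platform's VM API):

```yaml
# vm-gpu.yaml — sketch; the cloudInit wrapping is an assumption,
# but the #cloud-config payload is standard cloud-init syntax.
spec:
  cloudInit: |
    #cloud-config
    package_update: true
    packages:
      - linux-headers-generic   # generic headers; adjust if the VM runs a non-generic kernel
      - nvidia-driver-550
      - nvidia-utils-550
    power_state:
      mode: reboot              # reboot once cloud-init finishes installing
```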
## GPU pod in Pending state

Cause: no node has an available GPU, the cluster's GPU configuration is missing, or the GPU Operator is not active.

Solution:

1. Check pod events:

   ```bash
   kubectl describe pod <pod-name>
   ```

   Look for the `Insufficient nvidia.com/gpu` message.

2. Verify that GPU nodes exist and have allocatable GPUs:

   ```bash
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"]}'
   ```

3. Check the GPU nodeGroup configuration in the cluster manifest:

   ```yaml
   # cluster.yaml
   spec:
     nodeGroups:
       gpu-workers:
         minReplicas: 1
         maxReplicas: 4
         instanceType: "u1.2xlarge"
         gpus:
           - name: "nvidia.com/AD102GL_L40S"
   ```

4. Make sure the GPU Operator addon is enabled:

   ```yaml
   # cluster.yaml
   spec:
     addons:
       gpuOperator:
         enabled: true
   ```
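Note that the scheduler only places a pod on a GPU node if the pod actually requests the `nvidia.com/gpu` resource. A minimal sketch of such a request (pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo   # illustrative name
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1   # resource exposed by the NVIDIA device plugin
```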
## `nvidia-smi` fails in a pod

Cause: the GPU Operator is not enabled on the cluster, which prevents automatic driver installation and device plugin availability.

Solution:

1. Enable the GPU Operator addon on the cluster:

   ```yaml
   # cluster.yaml
   spec:
     addons:
       gpuOperator:
         enabled: true
   ```

2. Apply the change:

   ```bash
   kubectl apply -f cluster.yaml
   ```

3. Verify that the GPU Operator pods are running:

   ```bash
   kubectl get pods -n gpu-operator
   ```

4. Once the GPU Operator is operational, recreate your GPU pod.
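As a quick end-to-end check, the recreated pod can simply run `nvidia-smi` once. A sketch, with the pod name and image tag as illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check   # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If `kubectl logs nvidia-smi-check` prints the driver and GPU table, the drivers, device plugin, and container runtime are all wired up correctly.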
## GPU Operator not working

Cause: the addon is not enabled, the operator pods are in an error state, or the nodes do not have physical GPU hardware.

Solution:

1. Verify that the addon is enabled in the cluster manifest:

   ```yaml
   # cluster.yaml
   spec:
     addons:
       gpuOperator:
         enabled: true
   ```

2. Check the GPU Operator pod status:

   ```bash
   kubectl get pods -n gpu-operator
   kubectl describe pod -n gpu-operator <pod-name>
   ```

3. Verify that nodes have GPU hardware:

   ```bash
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.capacity["nvidia.com/gpu"]}'
   ```

4. If pods are in `CrashLoopBackOff`, check the logs:

   ```bash
   kubectl logs -n gpu-operator <pod-name>
   ```