NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of NVIDIA GPU resources in Kubernetes clusters. It simplifies the deployment and management of NVIDIA drivers, the NVIDIA Container Toolkit, and the NVIDIA device plugin, enabling seamless GPU acceleration for containerized workloads.

Overview

The GPU Operator streamlines the deployment and management of the following components:

- NVIDIA GPU driver installation
- NVIDIA Container Toolkit deployment
- Device plugin configuration
- GPU feature discovery
- Time-slicing capabilities for resource sharing

Deploy Via ArgoCD

The following Application deploys the GPU Operator with tolerations for all taints enforced by Juno, allowing GPUs to be passed through transparently to workloads that Juno manages.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  sources:
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v24.9.0
      helm:
        values: |-
          daemonsets:
            tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
              - key: CriticalAddonsOnly
                operator: Exists
              - key: juno-innovations.com/workstation
                operator: Exists
                effect: NoSchedule
              - key: juno-innovations.com/headless
                operator: Exists
                effect: NoSchedule
          node-feature-discovery:
            worker:
              tolerations:
                - key: "node-role.kubernetes.io/master"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                - key: "node-role.kubernetes.io/control-plane"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                - key: nvidia.com/gpu
                  operator: Exists
                  effect: NoSchedule
                - key: CriticalAddonsOnly
                  operator: Exists
                - key: juno-innovations.com/workstation
                  operator: Exists
                  effect: NoSchedule
                - key: juno-innovations.com/headless
                  operator: Exists
                  effect: NoSchedule
        releaseName: gpu-operator
        parameters:
          # The NVIDIA driver and container toolkit are preinstalled on EKS
          # GPU nodes; set toolkit.enabled and driver.enabled to "false" there
          - name: "toolkit.version"
            value: "v1.17.0-ubi8"
          - name: "toolkit.enabled"
            value: "true"
          - name: "driver.enabled"
            value: "true"
          # Time slicing configuration
          - name: "devicePlugin.config.name"
            value: "time-slicing-config"
          - name: "devicePlugin.config.default"
            value: "any"
          - name: "driver.useOpenKernelModules"
            value: "true"
          - name: "nfd.enabled"
            value: "true"
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true
    syncOptions:
      - CreateNamespace=true
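
Save the manifest and apply it with kubectl; ArgoCD then installs and syncs the chart. The file name here is illustrative.

kubectl apply -f gpu-operator-app.yaml
kubectl get application gpu-operator -n argocd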

Configuring Time Slicing

Time slicing allows multiple containers to share a single GPU, increasing hardware utilization. The devicePlugin.config parameters above configure the GPU Operator to look for a ConfigMap named time-slicing-config in the gpu-operator namespace.

Because devicePlugin.config.default is set to any, slicing is applied to every NVIDIA GPU that is detected. To change this behavior, push your own ConfigMap to the gpu-operator namespace in the following format:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

The above config slices every detected GPU into 4 replicas, so the scheduler sees four nvidia.com/gpu resources per physical device. You can learn more about this configuration from the Official NVIDIA Documentation.
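
Workloads consume a slice by requesting nvidia.com/gpu as usual; with 4 replicas per GPU, up to four such pods can share one physical device. A minimal sketch follows (the pod name and image tag are assumptions; any CUDA-capable image works):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one time-sliced replica, not a whole GPU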

Verification

To verify the GPU Operator deployment:

kubectl get pods -n gpu-operator

You should see pods for the driver daemonset, container toolkit, device plugin, GPU feature discovery, node feature discovery, and the operator itself, all in a Running or Completed state.

To check if GPUs are properly recognized:

kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu" != null)'
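
With the 4-replica time-slicing config above and one physical GPU per node, the capacity map should report nvidia.com/gpu as 4. Illustrative, trimmed output (other values will vary):

{
  "cpu": "16",
  "memory": "65734248Ki",
  "nvidia.com/gpu": "4",
  "pods": "110"
}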

Troubleshooting

If you encounter issues with the GPU Operator:

  1. Check the operator logs:

    kubectl logs -n gpu-operator -l app=gpu-operator
    

  2. Verify the driver installation:

    kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
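
  3. Inspect recent events in the namespace for scheduling or image pull failures (a generic Kubernetes check, not specific to the GPU Operator):

    kubectl get events -n gpu-operator --sort-by=.lastTimestamp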
    

For further assistance, contact Juno Support.