NVIDIA GPU Operator¶
The NVIDIA GPU Operator automates the management of NVIDIA GPU resources in Kubernetes clusters. It simplifies the deployment and management of NVIDIA drivers, the NVIDIA Container Toolkit, and the NVIDIA device plugin, enabling seamless GPU acceleration for containerized workloads.
Overview¶
The GPU Operator streamlines the following components:

- NVIDIA GPU driver installation
- NVIDIA Container Toolkit deployment
- Device plugin configuration
- GPU feature discovery
- Time-slicing capabilities for resource sharing
Deploy Via ArgoCD¶
The following Application deploys the GPU Operator and tolerates all taints enforced by Juno, which allows GPUs to be passed through transparently to workloads managed by Juno.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  sources:
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v24.9.0
      helm:
        values: |-
          daemonsets:
            tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
              - key: CriticalAddonsOnly
                operator: Exists
              - key: juno-innovations.com/workstation
                operator: Exists
                effect: NoSchedule
              - key: juno-innovations.com/headless
                operator: Exists
                effect: NoSchedule
          node-feature-discovery:
            worker:
              tolerations:
                - key: "node-role.kubernetes.io/master"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                - key: "node-role.kubernetes.io/control-plane"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                - key: nvidia.com/gpu
                  operator: Exists
                  effect: NoSchedule
                - key: CriticalAddonsOnly
                  operator: Exists
                - key: juno-innovations.com/workstation
                  operator: Exists
                  effect: NoSchedule
                - key: juno-innovations.com/headless
                  operator: Exists
                  effect: NoSchedule
        releaseName: gpu-operator
        parameters:
          # Not needed for EKS nodes since it is preinstalled
          - name: "toolkit.version"
            value: "v1.17.0-ubi8"
          # Not needed for EKS nodes since it is preinstalled
          - name: "toolkit.enabled"
            value: "true"
          # Not needed for EKS nodes since it is preinstalled
          - name: "driver.enabled"
            value: "true"
          # Time-slicing configuration
          - name: "devicePlugin.config.name"
            value: "time-slicing-config"
          - name: "devicePlugin.config.default"
            value: "any"
          - name: "driver.useOpenKernelModules"
            value: "true"
          - name: "nfd.enabled"
            value: "true"
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true
    syncOptions:
      - CreateNamespace=true
```
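Once the manifest is in place, hand it to the cluster that runs ArgoCD. As a minimal sketch, assuming it is saved locally as `gpu-operator-app.yaml` (an illustrative filename):

```shell
# Create the ArgoCD Application; ArgoCD then installs the Helm chart automatically
kubectl apply -f gpu-operator-app.yaml

# Optional: inspect sync status if the argocd CLI is installed and logged in
argocd app get gpu-operator
```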
Configuring Time Slicing¶
Time slicing allows multiple containers to share a single GPU, increasing hardware utilization efficiency. This setup will configure the GPU Operator to look for the `time-slicing-config` ConfigMap in the `gpu-operator` namespace.

By default, it will apply slicing to `any` NVIDIA GPU that is detected. To modify this, you can push your own ConfigMap to the `gpu-operator` namespace with the following format:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```
The above config will slice each detected GPU into 4 replicas. You can learn more about this configuration from the Official NVIDIA Documentation.
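As a quick sanity check (assuming a node with a single physical GPU), the node should now advertise 4 `nvidia.com/gpu` resources. Replace `<node-name>` with one of your GPU nodes:

```shell
# With replicas: 4, expect nvidia.com/gpu: 4 under Capacity and Allocatable
kubectl describe node <node-name> | grep nvidia.com/gpu
```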
Verification¶
To verify the GPU Operator deployment:
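For example, list the pods in the `gpu-operator` namespace:

```shell
# List all GPU Operator components (driver, toolkit, device plugin, validators, ...)
kubectl get pods -n gpu-operator
```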
You should see several pods running for the driver, toolkit, device plugin, and other components.
To check if GPUs are properly recognized:
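One straightforward check is to confirm that the nodes advertise the `nvidia.com/gpu` resource:

```shell
# GPU nodes should list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep nvidia.com/gpu
```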
Troubleshooting¶
If you encounter issues with the GPU Operator:

- Check the operator logs (see the first command below).
- Verify the driver installation (see the second command below).
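The exact resource names depend on the chart's defaults; assuming the release name used above and the chart's default `app=nvidia-driver-daemonset` pod label, commands along these lines are a reasonable starting point:

```shell
# 1. Check the operator logs (assumes the operator Deployment is named "gpu-operator")
kubectl logs -n gpu-operator deployment/gpu-operator

# 2. Verify the driver installation via the driver daemonset pods
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=50
```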
For further assistance, contact Juno Support.