Rocky Linux 9.5 Setup Guide

Overview

Rocky Linux is an excellent choice for base nodes in your infrastructure due to its stability, long-term support, and enterprise compatibility.

Base Node Configuration

System Update

Begin with a system update to ensure all packages are current:

sudo dnf update -y

Docker Installation

While Docker is not strictly required for Kubernetes, installing it pulls in containerd and the other packages Kubernetes needs, and the installation process configures the container runtime automatically.

# Add Docker repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo

# Install Docker packages
sudo dnf -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Enable and start Docker service
sudo systemctl --now enable docker
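
To confirm that Docker and containerd came up correctly before moving on, you can optionally check the services and run a throwaway test container:

# Both services should report "active"
systemctl is-active docker containerd

# Optional: run a small test container
sudo docker run --rm hello-world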

Additional Utilities

Install these additional utilities for networking, storage, and system management:

sudo dnf install nfs-utils iscsi-initiator-utils curl rsync -y
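
Depending on your storage stack, you may also want to enable the iSCSI initiator daemon that ships with iscsi-initiator-utils. This is optional and only needed if your cluster uses iSCSI-backed storage (for example, Longhorn):

# Enable the iSCSI initiator daemon used by iSCSI-backed storage
sudo systemctl enable --now iscsid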

GPU Configuration

Follow these steps to prepare your system for GPU-enabled workstations and rendering nodes.

Update System for GPU Support

Ensure your system has the latest updates before proceeding with GPU setup:

sudo dnf update -y

Verify GPU Hardware Detection

Confirm that your system recognizes the GPU hardware:

ls -la /dev/ | grep dri

Video Device Mounted

If this passes, you should see the following output:

drwxr-xr-x  3 root root          80 Jan 30 14:03 dri

No Output

If you do not see this output, the kernel is not recognizing the GPU. This is usually because the kernel is missing the correct firmware or needs to be updated. Refer to the Rocky Linux Kernel Documentation for more information.

NVIDIA Driver Installation

To install the NVIDIA drivers, run the following commands. You can also reference the Rocky Nvidia Wiki directly.

1. Disable Nouveau Driver

First, disable the open-source Nouveau driver that may interfere with NVIDIA drivers:

sudo grubby --args="nouveau.modeset=0 rd.driver.blacklist=nouveau" --update-kernel=ALL
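
To verify that the arguments were applied (a quick optional check, assuming the standard grubby/GRUB setup), inspect the default boot entry; after the next reboot, the Nouveau module should no longer be loaded:

# The args line should include nouveau.modeset=0 and rd.driver.blacklist=nouveau
sudo grubby --info=DEFAULT | grep args

# After rebooting, this should return no output
lsmod | grep nouveau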

2. Install Required Dependencies

# Add EPEL repository
sudo dnf install epel-release -y

# Install development tools package group
sudo dnf groupinstall "Development Tools" -y

# Install kernel development packages
sudo dnf install kernel-devel -y
sudo dnf install dkms -y

3. Add NVIDIA Repository and Install Additional Dependencies

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
sudo dnf install kernel-headers kernel-devel tar bzip2 make automake gcc gcc-c++ pciutils elfutils-libelf-devel libglvnd-opengl libglvnd-glx libglvnd-devel acpid pkgconf dkms -y
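
Before installing the driver, it's worth confirming that the installed kernel-devel package matches the running kernel; a mismatch is a common cause of DKMS build failures (if the versions differ, update and reboot into the newest kernel first):

# The running kernel version...
uname -r

# ...should appear among the installed kernel-devel packages
rpm -q kernel-devel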

4. Install NVIDIA Driver

Check available driver versions:

dnf module list nvidia-driver

Choose one of these installation options:

Option 1: Install the latest driver (recommended)

sudo dnf module install nvidia-driver:latest-dkms -y

Option 2: Install a specific driver version

sudo dnf module install nvidia-driver:<version>-dkms -y

5. Complete Driver Installation

After installing the driver, register the kernel modules and reboot:

sudo dkms autoinstall
sudo reboot

6. Verify Driver Installation

After the system restarts, verify that the drivers are installed correctly:

nvidia-smi

Drivers Installed

If this passes, you should see output similar to the following, with your own GPU model:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080        On  | 00000000:00:10.0 Off |                  N/A |
|  0%   33C    P8              13W / 215W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Confirm that the GPU devices are properly mounted:

ls -la /dev/dri/

Video Device Mounted Properly With NVIDIA Drivers

If this passes, you should see output similar to the following, listing your card(s) and a render device:

total 0
drwxr-xr-x  3 root root        120 Jan 30 18:12 .
drwxr-xr-x 18 root root       3.4K Jan 30 18:13 ..
drwxr-xr-x  2 root root        100 Jan 30 18:12 by-path
crw-rw----  1 root video  226,   0 Jan 30 18:12 card0
crw-rw----  1 root video  226,   1 Jan 30 18:12 card1
crw-rw----  1 root render 226, 128 Jan 30 18:12 renderD128
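
If nvidia-smi fails or the render device is missing, checking the DKMS build state and the loaded modules is an optional way to narrow down the problem:

# The NVIDIA module should show as installed for the running kernel
dkms status

# The nvidia kernel modules should be loaded
lsmod | grep nvidia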

NVIDIA Container Toolkit Installation

Now we need to configure the NVIDIA Container Toolkit. This allows us to run GPU-enabled containers, and the runtime is then exposed to Kubernetes via the CRI.

You can also reference the Rocky Nvidia Wiki directly.

1. Add NVIDIA Container Toolkit Repository

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
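
You can optionally confirm that the repository was added before installing the toolkit:

# The NVIDIA container toolkit repository should appear in the list
dnf repolist | grep -i nvidia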

2. Install NVIDIA Container Toolkit

sudo dnf update
sudo dnf install -y nvidia-container-toolkit

3. Configure Runtime Environments

For containerd (required for Kubernetes):

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

If Docker is installed:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
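
To confirm that the runtimes were registered (assuming the default configuration paths), you can inspect the generated settings:

# containerd: the nvidia runtime should appear in the generated config
grep -n "nvidia" /etc/containerd/config.toml

# Docker: the runtimes list should include "nvidia"
sudo docker info | grep -i runtimes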

4. Verify Container Toolkit Installation (Optional)

This step is not required, but running a test container is a good way to confirm that the NVIDIA Container Toolkit is installed correctly:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Working Toolkit

If you see the output of nvidia-smi, the toolkit is installed correctly.

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
de44b265507a: Pull complete
Digest: sha256:80dd3c3b9c6cecb9f1667e9290b3bc61b78c2678c02cbdae5f0fea92cc6734ab
Status: Downloaded newer image for ubuntu:latest
Thu Jan 30 23:44:20 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080        On  | 00000000:00:10.0 Off |                  N/A |
|  0%   30C    P8              12W / 215W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Firewall Configuration

No Route to Host

When deploying certain Kubernetes distributions, you may see the following error in the container logs:

Error: No route to host

In many cases, this is caused by the firewall blocking the connection. You can disable the firewall to check whether that is the issue:

sudo systemctl stop firewalld
sudo systemctl disable firewalld

It is recommended that firewalld either be disabled or properly configured with appropriate allow rules for your Kubernetes and container networking. Detailed firewall configuration is beyond the scope of this documentation, as each network environment is different.
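
If you prefer to keep firewalld enabled, the sketch below opens a few ports commonly used by Kubernetes distributions (API server, kubelet, and a VXLAN overlay); the exact port list is an example only and depends on your distribution and CNI:

# Example only: adjust the port list to your Kubernetes distribution and CNI
sudo firewall-cmd --permanent --add-port=6443/tcp    # Kubernetes API server
sudo firewall-cmd --permanent --add-port=10250/tcp   # kubelet API
sudo firewall-cmd --permanent --add-port=8472/udp    # VXLAN overlay (e.g. flannel)
sudo firewall-cmd --reload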

SELinux Configuration

If disabling the firewall doesn't resolve connectivity issues, you may also need to set SELinux to permissive mode:

# Set SELinux to permissive mode
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

This changes SELinux from enforcing mode to permissive mode, which logs policy violations but doesn't enforce them. The second command makes this change permanent across reboots. Note that this may have security implications depending on your environment.
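
To confirm the change took effect, getenforce should now report Permissive (and should continue to do so after a reboot, given the edit to /etc/selinux/config):

# Should print "Permissive"
getenforce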

Next Steps

After completing this setup, proceed to Orion cluster deployment by following our On-Prem Installation Guide.

For further assistance, contact Juno Support.