
Using NVIDIA GPUs with Kubernetes After Dockershim Removal and Switching to Containerd

Kubernetes relies on the Container Runtime Interface (CRI) to communicate with container runtimes. Docker never implemented CRI natively, so Kubernetes historically included a built-in dockershim component to bridge the gap. With dockershim removed as of Kubernetes 1.24, users are migrating to CRI-compliant runtimes such as containerd. While plenty of documentation covers GPU setup with Docker in Kubernetes, resources for containerd-based clusters are less common. This walkthrough focuses on that transition while ensuring Pods can access NVIDIA hardware.

All steps assume a working Kubernetes cluster with containerd already installed, and skip OS-specific prerequisites beyond minimal examples. The core goal remains the same regardless of runtime: make host GPUs visible and usable inside containers.

Step 1: Install NVIDIA Host Drivers

NVIDIA kernel drivers are required on every GPU-enabled worker node. The official .run installer works across most Linux distributions and simplifies cleanup, though it requires compilation tools and kernel headers matching the running kernel.

For Debian/Ubuntu-based systems:

# Install required build dependencies
apt update && apt install -y gcc make linux-headers-$(uname -r)
# Download driver script (adjust version based on GPU model)
wget https://us.download.nvidia.com/tesla/470.239.06/NVIDIA-Linux-x86_64-470.239.06.run
# Make executable and run in silent mode
chmod +x NVIDIA-Linux-x86_64-470.239.06.run
./NVIDIA-Linux-x86_64-470.239.06.run --silent
# Verify installation
nvidia-smi

A successful verification shows GPU details, driver version, and no errors.
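
For scripted checks (for example, in node provisioning automation), nvidia-smi's query flags give a compact answer without parsing the full table. A minimal sketch using standard nvidia-smi options:

# Print GPU name and driver version in CSV form; a non-zero exit code
# means the driver (or NVML) is not responding on this node
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader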

Step 2: Set Up NVIDIA Container Toolkit

nvidia-container-runtime (now part of NVIDIA Container Toolkit) modifies OCI runtimes to inject GPU devices, CUDA libraries, and environment variables into containers when requested.

Configure Package Repository

# Import GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add repository for Debian/Ubuntu
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install Toolkit

apt update && apt install -y nvidia-container-toolkit

Step 3: Integrate Toolkit with containerd

containerd uses a TOML configuration file. First, generate a default config if none exists, then update runtime settings to use NVIDIA’s modified OCI runtime as the default for CRI workloads.

# Create config directory
mkdir -p /etc/containerd
# Generate default config
containerd config default | tee /etc/containerd/config.toml

Edit /etc/containerd/config.toml to modify these sections:

  1. Set default_runtime_name to nvidia in the CRI containerd plugin
  2. Add or update the nvidia runtime entry
  3. Use io.containerd.runc.v2 for runtime type
...
[plugins."io.containerd.grpc.v1.cri"]
  ...
  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "overlayfs"
    default_runtime_name = "nvidia"
    no_pivot = false
    ...
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        runtime_type = "io.containerd.runc.v2"
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
          BinaryName = "nvidia-container-runtime"
...
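
As an alternative to hand-editing the TOML, newer NVIDIA Container Toolkit releases ship an nvidia-ctk helper that can write an equivalent runtime entry for you. Flag support varies by toolkit version, so check nvidia-ctk runtime configure --help on your node before relying on it:

# Add the nvidia runtime to /etc/containerd/config.toml and make it the default handler
nvidia-ctk runtime configure --runtime=containerd --set-as-default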

Restart containerd to apply changes:

systemctl restart containerd && systemctl status containerd
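
If crictl is installed on the node, a quick way to confirm the CRI plugin picked up the change is to grep its runtime status output for the new handler; this is only a rough check, but the nvidia runtime entry should appear:

# The dumped CRI configuration should now reference the nvidia runtime
crictl info | grep -i nvidia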

Step 4: Deploy NVIDIA Device Plugin

Kubernetes uses Device Plugins to advertise specialized hardware to the scheduler. The official NVIDIA plugin registers GPUs as nvidia.com/gpu resources.

# Deploy a recent stable version
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

Check plugin DaemonSet status:

kubectl get pod -n kube-system -l name=nvidia-device-plugin-ds

Successful logs from a plugin pod show NVML initialization, GRPC server startup, and registration with the local kubelet:

kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
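
Once the plugin has registered, each GPU node should advertise the new resource in its Capacity and Allocatable fields. A quick check, where <gpu-node-name> is a placeholder for one of your worker nodes:

# The node should report nvidia.com/gpu under Capacity and Allocatable
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"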

Step 5: Validate GPU Functionality

First test with containerd’s native CLI (ctr) to confirm runtime integration works outside Kubernetes:

# Pull CUDA base image
ctr image pull docker.io/nvidia/cuda:11.8.0-base-ubuntu22.04
# Run test with GPU 0
ctr run --rm -t --gpus 0 docker.io/nvidia/cuda:11.8.0-base-ubuntu22.04 gpu-test nvidia-smi
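
The --gpus flag relies on containerd's built-in GPU helper rather than the nvidia runtime configured above. To exercise the runtime wrapper itself (closer to the path Kubernetes takes through the default handler), a variant like the following can be used; it assumes the toolkit's default behavior of injecting GPUs when NVIDIA_VISIBLE_DEVICES is set:

# Run the same image through the nvidia-container-runtime wrapper
ctr run --rm -t --runc-binary=nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:11.8.0-base-ubuntu22.04 gpu-test-runtime nvidia-smi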

Next, test within a Kubernetes Pod using a simple CUDA vector addition workload:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-vector-demo
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vector-calc
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.8.0-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["/bin/sh"]
    args: ["-c", "/usr/local/bin/vectorAdd"]

Apply the manifest and monitor status:

kubectl apply -f gpu-vector-demo.yaml
kubectl get pod gpu-vector-demo

Once the pod completes, check logs for a successful test message:

kubectl logs gpu-vector-demo

Kubernetes itself schedules GPUs only as whole devices; sharing a single GPU across multiple containers requires additional tooling, such as the device plugin's time-slicing mode or MIG partitioning on supported hardware.
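
As a rough illustration of that time-slicing option, recent releases of the NVIDIA device plugin accept a sharing configuration along the lines below. The schema and the way the file is supplied (the plugin's config-file option or its Helm chart) vary by release, and the replicas value of 4 is an arbitrary example, so consult the k8s-device-plugin documentation before using it:

# Sketch only: advertise each physical GPU as 4 schedulable nvidia.com/gpu replicas
cat <<'EOF' > dp-time-slicing.yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF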
