When provisioning a Tanzu Kubernetes Grid Cluster (TKC) using vSphere with Tanzu, you can easily request an NVIDIA GPU resource as part of the deployment, which can be provided either by NVIDIA vGPU or by PCIe passthrough using Dynamic DirectPath IO.
vGPU is great for those with a capable NVIDIA GPU, especially if the GPU will not be utilized at 100% and you can share its resources amongst several VMs. However, if you do not have a GPU that supports vGPU, you can still provide your TKC workloads with a GPU resource using passthrough.
While playing with the Lenovo P3 Ultra, I unfortunately came to learn that the NVIDIA RTX A5500 Laptop GPU was NOT the same as an NVIDIA RTX A5500 🙁
Not ideal, but I guess NVIDIA did not want to add this additional device to their test matrix, and hence their ESXi graphics drivers would not detect the GPU as vGPU capable. I knew that I could still use the NVIDIA GPU via passthrough; I just needed to get the NVIDIA drivers installed onto the TKC worker nodes.
That was much easier said than done. All the documentation I could find on both the VMware and NVIDIA websites had detailed instructions for vGPU configuration, but there was little to no documentation on how to use an NVIDIA GPU in passthrough mode with vSphere with Tanzu. I came across a number of different NVIDIA solutions for Kubernetes, but it was not very clear which of them would be interoperable with vSphere with Tanzu, and I eventually figured it out with some help pointing me in the right direction.
It was actually super easy, once you knew the exact steps! 😅
Pre-Req:
- A TKC that has already been provisioned with an NVIDIA GPU using Dynamic DirectPath IO (see the sketch below for an example)
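For reference, here is a minimal sketch of what such a TKC definition might look like. The cluster name, namespace, VM class name (gpu-vmclass), storage class and TKr name below are placeholders and will differ in your environment; the VM class is assumed to be one you have already created in vCenter with the NVIDIA GPU attached as a Dynamic DirectPath IO (PCIe passthrough) device.

apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: tkc-gpu
  namespace: gpu-demo-ns
spec:
  topology:
    controlPlane:
      replicas: 1
      vmClass: best-effort-medium
      storageClass: tanzu-storage-policy
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
    nodePools:
    - name: gpu-nodepool
      replicas: 1
      # VM class created in vCenter with the NVIDIA GPU added via Dynamic DirectPath IO
      vmClass: gpu-vmclass
      storageClass: tanzu-storage-policy
      # Ubuntu-based TKr (placeholder name; list what is available with kubectl get tkr)
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable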
The NVIDIA GPU Operator is the easiest way to get the NVIDIA driver deployed for any k8s-based deployment where you will consume an NVIDIA GPU, including a TKC using vSphere with Tanzu. I initially tried the NVIDIA device plugin for Kubernetes, but that would have required changes to the Ubuntu TKr images, which I was really hoping would not be needed. Below are the three easy steps to get the required NVIDIA drivers running on the TKC worker node!
Step 1 - You will need Helm installed on your local system. After logging into your TKC cluster, run the following command to deploy the NVIDIA GPU Operator. In the example below, I am deploying to a k8s namespace called gpu-demo, which I had pre-created earlier.
helm install --wait --generate-name --set operator.defaultRuntime=containerd --namespace gpu-demo nvidia/gpu-operator
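Note: the command above assumes the NVIDIA Helm repository has already been added under the name nvidia. If it has not, you can add it first with the standard NVIDIA GPU Operator repository commands:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update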
Step 2 - Ensure that all pods from the NVIDIA GPU Operator are up and running by using the following command:
kubectl -n gpu-demo get pods
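It can take a few minutes for the GPU Operator's driver container to build and load the NVIDIA driver on the worker node. Once the pods have settled, you can also confirm that the GPU is now advertised as an allocatable resource on the GPU worker node; a quick (if slightly blunt) way to check is:

kubectl describe nodes | grep -i "nvidia.com/gpu"

You should see an nvidia.com/gpu entry with a count of 1 under both Capacity and Allocatable for the worker node that has the passthrough GPU.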
Step 3 - Finally, we can confirm that the NVIDIA GPU resource is consumable within the TKC worker node by deploying a simple demo app that performs some CUDA operations.
Create a YAML file (gpu-demo.yaml) that contains the following:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: gpu-demo
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
Then deploy the demo application by running the following:
kubectl -n gpu-demo apply -f gpu-demo.yaml
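The pod should run to completion fairly quickly; if you like, you can watch its status until it shows Completed with:

kubectl -n gpu-demo get pod cuda-vectoradd -w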
If everything was set up correctly, the CUDA vector add sample should run successfully on the GPU, which we can confirm by looking at the cuda-vectoradd pod logs with the following command:
kubectl -n gpu-demo logs cuda-vectoradd
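The vectorAdd CUDA sample prints a Test PASSED message when it completes successfully, so that is the line to look for in the output. Once you are done, the demo pod can be cleaned up with:

kubectl -n gpu-demo delete -f gpu-demo.yaml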