
MS-A2 VCF 9.0 Lab: Deploying Model Endpoint with DirectPath I/O using VMware for Private AI Services (PAIS)

10.07.2025 by William Lam

In this final blog post, we will deploy several AI model endpoints (downloaded from Hugging Face), configure our private data source, which can be a shared location (Google Drive, Confluence, Microsoft SharePoint, or an S3-compatible endpoint) or local files, and then consume them using an AI Agent built with VMware Private AI Services (PAIS).

As mentioned in the very first blog post of this mini-series, my goal was to get hands-on experience with PAIS without needing an NVIDIA GPU capable of vGPU, which would also require an NVIDIA AI Enterprise (NVAIE) license.

Luckily, we can use an NVIDIA GPU via DirectPath I/O, thanks to the backend plumbing the PAIS Engineering team has built and shared with me 😊

For my proof of concept, I am using an ASUS NUC 14 Performance, which has an NVIDIA GeForce RTX 4070 mobile GPU (8GB VRAM). The ASUS NUC 14 runs alongside my Minisforum MS-A2 setup and is only used to deploy the completions model endpoint. The ASUS NUC 14 is purely for prototyping and experimentation, to demonstrate that anyone can play with PAIS in their lab environment. I plan to use a more powerful NVIDIA GPU setup, which I will share more details about at a later point for those interested.

References:

  • Running Completion or Embedding Models by Using Model Endpoints
  • Adding Context to Model Responses by Using Knowledge Bases
  • Deploy an Agent for a Generative AI Application

Requirements:

  • VCF Automation (VCFA) Organization configured with Namespace
  • VMware Private AI Services (PAIS) enabled
  • Data Services Manager (DSM) configured with VCFA
  • Authentik IdP configured with OIDC Public Client Application
  • Harbor instance configured for AI model store 
  • PAIS instance deployed

Step 1 - Enable DirectPath I/O on the ESXi host that contains the physical NVIDIA GPU that will be used to deploy the PAIS Completions Model Endpoint.

Step 2 - Create a new VM Class in the VCFA Namespace that includes the desired vCPU, memory (including reservation) and the DirectPath I/O device that was configured in Step 1. In my example, I have created a VM Class called pais-gpu with 6 vCPU, 32GB of reserved memory (a reservation is required for DirectPath I/O) and the NVIDIA RTX 4070 from my ASUS NUC 14 Performance.


Step 3 - We can deploy a PAIS Model Endpoint using either the PAIS Service UI or the VCFA Kubernetes context. The Kubernetes flow is slightly more involved because we need to retrieve the auto-generated secret for TLS trust, whereas the UI simply provides a drop-down to select from. I will go through the Kubernetes experience for those looking for a non-UI method, but the overall workflow is the same.


Before the model can be deployed, we need to retrieve the auto-generated secret that contains the Harbor TLS trust for our model store. Log in to your VCFA Namespace and then run the following command, which looks for the secret whose name contains "reg-creds"; make a note of the name, which we will need in the next step.

# Create new context to login to VCFA Namespace
vcf context create legal --endpoint auto01.vcf.lab --api-token $VCF_CLI_VCFA_API_TOKEN --insecure-skip-tls-verify --type cci
vcf context use legal:oversight-kny28:oversight

# Refresh token to login to VCFA Namespace
vcf context refresh legal:oversight-kny28:oversight

kubectl get paisconfiguration default -o json | jq -r '.status.children[] | select(.kind=="Secret" and (.name | contains("reg-creds"))) | .name'
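If you want to see what that jq filter is doing, here is a self-contained sketch that runs it against a hypothetical paisconfiguration status (the secret name pais-abc123-reg-creds and the surrounding JSON are made up for illustration):

```shell
# Hypothetical sample of a paisconfiguration status; only the shape matters here.
cat > /tmp/pais-sample.json <<'EOF'
{
  "status": {
    "children": [
      { "kind": "ConfigMap", "name": "pais-abc123-settings" },
      { "kind": "Secret",    "name": "pais-abc123-reg-creds" }
    ]
  }
}
EOF

# Same filter as above: keep only Secret children whose name contains "reg-creds".
jq -r '.status.children[] | select(.kind=="Secret" and (.name | contains("reg-creds"))) | .name' /tmp/pais-sample.json
```

This prints just the matching secret name (pais-abc123-reg-creds in the sample), filtering out any non-Secret children.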

Step 4 - Update the completitions-gpu-modelendpoint.yaml example to match your environment:

  • ociRef - The model that you would like to use from your model store
  • pullSecrets.name - The TLS trust secret from Step 3
  • routingName - Routing name for accessing your model via API (all lower case)
  • virtualMachineClassName - Name of your VM Class from Step 2
  • storageClassName - Name of the Storage Class you wish to use
  • namespace - The VCFA namespace
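To make the fields concrete, here is a hypothetical sketch of how they might sit in the manifest. The apiVersion and exact field nesting below are placeholders for illustration only; the example file is the authoritative layout, and only the values called out above should need editing:

```yaml
# Illustrative sketch only -- apiVersion and field nesting are placeholders,
# not the real schema; edit the values in the actual example file.
apiVersion: ...                              # as provided in the PAIS example file
kind: ModelEndpoint
metadata:
  name: completions-gpu
  namespace: legal                           # your VCFA namespace
spec:
  ociRef: harbor.vcf.lab/models/llama:latest # hypothetical model reference from your model store
  pullSecrets:
    - name: pais-abc123-reg-creds            # TLS trust secret from Step 3
  routingName: completions                   # all lower case
  virtualMachineClassName: pais-gpu          # VM Class from Step 2
  storageClassName: vsan-default             # hypothetical Storage Class name
```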

In addition to the completions model endpoint, we will also need an embeddings model endpoint; refer to the embedddings-cpu-modelendpoint.yaml example, which does not require a GPU and can run on CPU alone. You will need to update the same fields as for the completions model.

Once you have saved your changes for both files, you can deploy them using the following commands:

kubectl apply -f embedddings-cpu-modelendpoint.yaml
kubectl apply -f completitions-gpu-modelendpoint.yaml

The deployment can take several minutes, as a new VKS Worker Node will be deployed for each model endpoint; for the completions model, the workflow will also install the desired NVIDIA GPU guest driver.

You can monitor the high-level progress using the following command:

kubectl get modelendpoint
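If you would rather block until everything is up, the polling logic can be sketched as below. Note that the READY column name and true value are assumptions about the modelendpoint output in your environment, and the kubectl shell function is a local mock so the loop can be demonstrated without a cluster; delete the mock to run against the real CLI:

```shell
# Local mock of kubectl so the loop logic is runnable anywhere; remove this
# function to poll a real cluster.
kubectl() {
  printf 'NAME         READY\n'
  printf 'completions  true\n'
  printf 'embeddings   true\n'
}

# Loop until every modelendpoint row (skipping the header) reports true.
until kubectl get modelendpoint | awk 'NR > 1 { if ($2 != "true") exit 1 }'; do
  echo "waiting for model endpoints..."
  sleep 10
done
echo "all model endpoints ready"
```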


Step 5 - For more details on what is actually happening inside the VKS Node that attaches the DirectPath I/O device, we need to look at the GPU Operator running within that VKS Cluster. To do so, we first need to retrieve the kubeconfig credentials.

You can easily do this by identifying the PAIS instance ID, an example of which is highlighted in the vSphere UI under the specific VCFA Namespace. In my example, it is pais-f0f2bc6f-cacd-4047-84aa-ac310f35c3a0


The kubeconfig secret name is simply this ID with -kubeconfig appended, so we can run the following command and store the kubeconfig in a file that we will use to log in to the VKS Cluster:

kubectl get secret pais-f0f2bc6f-cacd-4047-84aa-ac310f35c3a0-kubeconfig -o jsonpath='{.data.value}' | base64 -d > pais-vks-cluster
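The base64 -d step is needed because Kubernetes stores secret values base64-encoded. Here is a self-contained illustration of the same extract-and-decode pattern using jq on a hypothetical secret body (in the real command, kubectl's jsonpath does the extraction):

```shell
# Build a fake secret payload: a two-line stand-in for the kubeconfig,
# base64-encoded the way Kubernetes stores it under .data.value.
KC_B64=$(printf 'apiVersion: v1\nkind: Config\n' | base64 | tr -d '\n')
printf '{"data":{"value":"%s"}}\n' "$KC_B64" > /tmp/fake-secret.json

# Extract and decode, mirroring `kubectl get secret ... -o jsonpath='{.data.value}' | base64 -d`.
jq -r '.data.value' /tmp/fake-secret.json | base64 -d > /tmp/pais-vks-cluster
cat /tmp/pais-vks-cluster
```

The decoded file contains the original two lines, which is exactly what the real command writes out as a usable kubeconfig.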

Step 6 - We can now connect to the VKS Cluster by passing in the kubeconfig retrieved in the previous step, looking at the gpu-operator namespace to ensure all pods are up and running.

kubectl --kubeconfig pais-vks-cluster -n gpu-operator get pods


While troubleshooting my setup, I found the nvidia-driver-daemonset pod logs to be useful, which you can view by running the following command (replace the pod ID):

kubectl --kubeconfig pais-vks-cluster -n gpu-operator logs nvidia-driver-daemonset-zrn2j

Here is a snippet of the logs where the gpu-operator automatically installs the required NVIDIA GPU guest drivers, whether they come from NVIDIA GPU Cloud (NGC) or the free open source drivers, as we had defined in our deployment override.

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 6 CPUs online; setting concurrency level to 6.
Unable to locate any tools for listing initramfs contents.
Unable to scan initramfs: no tool found
Installing NVIDIA driver version 570.148.08.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.15.0-140-generic/build'

Kernel output path: '/lib/modules/5.15.0-140-generic/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules:

  [##############################] 100%
Kernel module compilation complete.
Kernel messages:
[  108.107841] Timeout policy base is empty
[  108.133432] device genev_sys_6081 entered promiscuous mode
[  108.141182] device antrea-gw0 entered promiscuous mode
[  109.896119] device gpu-oper-ea735a entered promiscuous mode
[  109.896180] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  109.896202] IPv6: ADDRCONF(NETDEV_CHANGE): gpu-oper-ea735a: link becomes ready
[  118.138944] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  118.138980] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-d-facff4: link becomes ready
[  118.139790] device nvidia-d-facff4 entered promiscuous mode
[  118.166903] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-c-f68f9e: link becomes ready
[  118.168925] device nvidia-c-f68f9e entered promiscuous mode
[  163.166304] device nvidia-c-f68f9e left promiscuous mode
[  165.014514] device nvidia-c-d2336c entered promiscuous mode
[  165.014707] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  165.014727] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-c-d2336c: link becomes ready
[  409.007568] nvidia: loading out-of-tree module taints kernel.
[  409.010631] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  409.014728] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

[  409.015912] nvidia 0000:02:01.0: enabling device (0000 -> 0003)
[  409.016855] nvidia 0000:02:01.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[  409.060420] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  570.148.08  Release Build  (dvs-builder@U22-I3-AE18-23-3)  Wed May 21 07:03:28 UTC 2025
[  409.086333] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  570.148.08  Release Build  (dvs-builder@U22-I3-AE18-23-3)  Wed May 21 06:55:37 UTC 2025
[  409.090796] nvidia-modeset: Unloading
[  409.180918] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (570.148.08):: Installing

  [                              ]   0%
Unable to determine whether NVIDIA kernel modules are present in the initramfs. Existing NVIDIA kernel modules in the initramfs, if any, may interfere with the newly installed driver.

  [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:: Checking

  [##############################] 100%
Post-install sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 570.148.08) is now complete.

Parsing kernel module parameters...
Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal


Once both model endpoints are available, we are ready to proceed to the next step in creating our AI Agent!

Step 7 - You have a couple of options for bringing in your private data: either data sources from a shared location or uploading a local file. For simplicity, we will use the latter option by creating a new knowledge base and selecting the embeddings model that we deployed in Step 4.


Step 8 - We can now create our first AI Agent by linking it to our knowledge base and then selecting the completions model that we deployed in Step 4.


Step 9 - Lastly, we can try out the AI Agent that we just built with a couple of clicks, backed by your own private data and using an AI model that you have approved within your organization!


This is pretty darn cool if you ask me!? PAIS provides the "easy" button for enterprise organizations to quickly build and experiment with AI Agents using their own private data in a seamless way. I definitely look forward to hearing what our users do with PAIS, both in lab settings and in production deployments.

Categories // Private AI Services, VMware Cloud Foundation Tags // VCF 9.0
