Support for Single Root I/O Virtualization (SR-IOV) was first introduced back in 2012 with the release of vSphere 5.1 and enables a physical PCIe device to be shared amongst a number of Virtual Machines. The networking industry was the first to take advantage of the SR-IOV technology, using it to help reduce latencies and improve overall CPU efficiency for network-intensive vSphere-based workloads.
Since SR-IOV is an extension of the PCIe specification, it can also benefit non-networking devices. In 2016, AMD introduced their MxGPU technology, which added SR-IOV capabilities to their GPUs and was then used to power VMware Horizon workloads, but this functionality was only available during the vSphere 6.0 and 6.5 releases.
GPU sharing these days is synonymous with one vendor, NVIDIA. In 2015, VMware and NVIDIA teamed up to accelerate Enterprise desktop workloads through the integration of NVIDIA's vGPU (formerly GRID) technology with the release of both VMware Horizon View and vSphere 6.0.
NVIDIA continues to dominate the GPU market in 2023; however, another vendor has re-entered the market with an interesting solution that is enabled by the latest vSphere 8.0 Update 2 release ...
vSphere 8.0 Update 2 adds support for graphics and AI/ML workloads on Intel ATS-M
What is really exciting about the new Intel Datacenter Flex GPUs is that they are enabled in vSphere using SR-IOV, which means these are the first Enterprise Intel GPUs that can be shared across Virtual Machines, similar to NVIDIA vGPU!
A new ESXi Intel Datacenter Flex SR-IOV driver (7.x and 8.x) is required to take advantage of the two new Intel Datacenter Flex GPUs (Flex GPU 140 and Flex GPU 170), and this gave me an idea! 💡
While the enablement of these Intel SR-IOV GPU devices is designed for their datacenter GPUs, I was curious whether this new ESXi SR-IOV driver might also work with their integrated consumer GPUs (iGPUs), as they actually support the SR-IOV capability and are commonly found in the popular Intel NUC and other consumer Intel platforms.
Note: Interestingly, the new Intel Arc discrete GPUs do NOT actually support SR-IOV, which really surprised me given that the Intel iGPUs do.
Installing Intel SR-IOV ESXi Driver
For my setup, I am using an Intel NUC 13 Pro running the latest ESXi 8.0 Update 2 release. Download either the ESXi 7.x or 8.x idcgpu and idcgputools offline bundles, upload them to your ESXi host, and then run the following ESXCLI commands to install:
esxcli software component apply -d /vmfs/volumes/datastore1/Intel-idcgpu_220.127.116.116-1OEM.800.1.0.20613240_22435138.zip
esxcli software component apply -d /vmfs/volumes/datastore1/Intel-idcgputools_18.104.22.1686-1OEM.800.1.0.20613240_22439847.zip
A new ESXCLI namespace, intdcgpu, will now be available, and you can attempt to list detected devices by running the following command:
esxcli intdcgpu devices list
Configuring SR-IOV for Intel iGPU
You can enable SR-IOV using the ESXi Embedded Host Client by navigating to Manage->Hardware->PCI Devices, selecting the iGPU, clicking on Configure SR-IOV, and specifying the desired number of Virtual Functions (VFs).
However, I found that no matter how I configured SR-IOV for the iGPU, it kept showing the reboot message in the UI and, more importantly, SR-IOV was not actually enabled when checking with lspci, as shown in the screenshot above.
There might be an issue with enabling SR-IOV for these iGPUs, but luckily there is another way using the command-line. You will need to identify the iGPU's PCI device segment, bus and slot, which you can find using the lspci command, as also shown in the screenshot above.
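For reference, the vsish node path used in the commands below is built directly from those lspci fields (seg:bus:slot.func). Here is a minimal sketch of the mapping; the helper name vsish_path is my own, and the 0000:00:02.0 address is simply where the iGPU sits on my NUC:

```shell
#!/bin/sh
# Hypothetical helper: convert an lspci address (seg:bus:slot.func,
# e.g. 0000:00:02.0) into the vsish PCI node path used below.
vsish_path() {
  addr="$1"
  seg=$(printf '%d' "0x$(echo "$addr" | cut -d: -f1)")   # hex segment -> decimal
  bus=$(printf '%d' "0x$(echo "$addr" | cut -d: -f2)")   # hex bus -> decimal
  slotfunc=$(echo "$addr" | cut -d: -f3)                 # e.g. 02.0
  slot=${slotfunc%.*}                                    # slot, zero-padded as lspci shows it
  func=${slotfunc#*.}                                    # function number
  echo "/hardware/pci/seg/$seg/bus/$bus/slot/$slot/func/$func"
}

vsish_path 0000:00:02.0   # -> /hardware/pci/seg/0/bus/0/slot/02/func/0
```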
To list the number of supported VFs, we can run the following command:
vsish -e get /hardware/pci/seg/0/bus/0/slot/02/func/0/maxNumVFs
For my iGPU, it supports 7, so let's now configure the maximum by running the following command:
vsish -e set /hardware/pci/seg/0/bus/0/slot/02/func/0/enableVFs 7
We can then confirm the configured VFs by running the following command:
vsish -e get /hardware/pci/seg/0/bus/0/slot/02/func/0/currentNumVfs
Lastly, we need to restart the VMkernel device manager by running the following command:
kill -SIGHUP $(pidof vmkdevmgr)
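The steps above can be put together as a short script to run on the ESXi host. This is just a sketch based on my own setup: the PCI path assumes the iGPU at seg 0 / bus 0 / slot 02 / func 0 as on my NUC, so verify it with lspci and adjust for your hardware:

```shell
#!/bin/sh
# Enable the maximum supported number of VFs on the iGPU, then signal
# the VMkernel device manager to re-scan devices.
# Path assumption: iGPU at seg 0 / bus 0 / slot 02 / func 0 -- check with lspci.
GPU=/hardware/pci/seg/0/bus/0/slot/02/func/0

MAX=$(vsish -e get "$GPU/maxNumVFs")
vsish -e set "$GPU/enableVFs" "$MAX"
echo "Configured $(vsish -e get "$GPU/currentNumVfs") of $MAX VFs"

kill -SIGHUP "$(pidof vmkdevmgr)"
```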
If we now run the lspci command, we should now see both the physical iGPU device along with our configured VFs as shown in the screenshot below.
While the command-line properly shows the VFs, the ESXi Embedded Host Client still does not show the correct info. The only way I was able to fix this was by re-configuring SR-IOV in the UI, making sure to specify the same number of VFs as before, after which the UI finally shows the correct info.
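On a related persistence note, vsish settings are runtime-only, so I would not expect the VF configuration to survive a reboot. A common ESXi trick (which I have not verified for this particular setting) is to re-apply it at boot via /etc/rc.local.d/local.sh; the path and VF count below are from my setup and would need adjusting:

```shell
# Appended to /etc/rc.local.d/local.sh (runs on each boot) -- a sketch,
# assuming the iGPU at slot 02 and 7 VFs as on my NUC; adjust for your host.
vsish -e set /hardware/pci/seg/0/bus/0/slot/02/func/0/enableVFs 7
kill -SIGHUP "$(pidof vmkdevmgr)"
```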
Putting aside the persistency issue, I was hoping that I could now re-run the "esxcli intdcgpu devices list" command and everything would work!
In speaking with VMware Engineering, I came to learn that the Intel SR-IOV ESXi driver is currently locked to just the Flex GPU 140 and 170 devices. This means that other SR-IOV-capable Intel GPUs will not be detected by the Intel SR-IOV ESXi driver 😔
IMHO, this is a huge missed opportunity for Intel, especially given the massive interest in experimenting with AI/ML workloads among end users and developers. While I can understand that their initial focus is the datacenter, Intel certainly could have relaxed the driver's requirements so as not to exclude their consumer GPUs. In fact, AMD recently announced that they would provide an SDK to port NVIDIA CUDA applications to run on BOTH AMD Datacenter and Consumer GPUs, which I think is a brilliant move on their part!
Lastly, Enterprise datacenters are not the only place for running GPU-intensive workloads. At the Edge, we (VMware) continue to see non-traditional Enterprise hardware becoming the de facto standard for many organizations, and being able to leverage SR-IOV-capable GPUs at the Edge would certainly broaden Intel's impact as they compete with others in this space.