If you have an AMD Ryzen processor and are planning to use the NVMe Tiering feature with either VMware vSphere Foundation (VVF) or VMware Cloud Foundation (VCF) 9.0, you will need to apply the following workaround for your VMs to boot properly.
Note: This workaround is only required on AMD Ryzen (Consumer) CPUs with NVMe Tiering enabled and does not affect AMD EPYC, Intel Xeon or Intel Core (Consumer) CPUs with or without NVMe Tiering.
On an AMD Ryzen CPU that has NVMe Tiering enabled, when powering on a VM, you might notice that the operating system does not fully boot and the VM console may become unresponsive. After spending some time debugging with Engineering, it looks like there are some issues with specific AMD Ryzen CPU instructions that are causing the VM to behave this way when NVMe Tiering is enabled.
While a fix has already been implemented, it did not make the VVF/VCF 9.0 release as the issue was identified late in our release cycle. Furthermore, AMD Ryzen processors are not officially on the Broadcom Compatibility Guide (BCG), so the fix will come in a future update of VCF 9.
Fortunately, Engineering was able to provide me with a viable workaround for users in this scenario, which is to add the following VM Advanced Setting to all VMs running on an AMD Ryzen system that has NVMe Tiering enabled:
monitor_control.disable_apichv = "TRUE"
For individual VMs, applying this configuration after the initial deployment is not an issue, but if you have automated workflows such as deploying VVF or VCF, you may run into further issues because you need to power off the VM to apply the workaround.
A more scalable way to roll out this change is to add the configuration to /etc/vmware/config, as shown in the command below, which applies it globally to all VMs on an ESXi host. That is exactly what we want, without needing to touch individual VMs or worry about when they are powered on. This is a trick I recalled from almost a decade ago for rolling out other VM Advanced Settings, so I am glad we have it to make this more scalable.
echo 'monitor_control.disable_apichv = "TRUE"' >> /etc/vmware/config
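Because `>>` appends unconditionally, re-running the command (for example from a host provisioning script) would leave duplicate lines in the file. Below is a minimal sketch of an idempotent variant. The function name `add_apichv_workaround` and its optional path argument are hypothetical and exist only for illustration; the setting name and the default file path are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: append the workaround only if the key is not already set.
# The optional first argument overrides the config path (handy for a dry run
# against a scratch file); it defaults to the path used in the article.
add_apichv_workaround() {
    config_file="${1:-/etc/vmware/config}"

    # Look for the key at the start of a line, tolerating spaces around "=".
    if grep -q '^[[:space:]]*monitor_control\.disable_apichv[[:space:]]*=' "$config_file" 2>/dev/null; then
        echo "already present in $config_file"
    else
        # Same line as the one-liner above, appended exactly once.
        echo 'monitor_control.disable_apichv = "TRUE"' >> "$config_file"
        echo "added to $config_file"
    fi
}
```

Calling `add_apichv_workaround` with no argument targets /etc/vmware/config, while passing a path lets you try it against a copy of the file first.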
Unlike applying the change directly to a VM, the global method takes effect immediately: the next time you power on a VM, it will no longer be affected by this issue. A permanent fix will arrive in a future patch/update of VVF/VCF 9.0.
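Before powering VMs back on, it can be useful to confirm that the line actually landed in the global config. The sketch below only inspects the file and does not query the running host. The function name `check_apichv_workaround` and its optional path argument are hypothetical, and treating the value case-insensitively is an assumption made for convenience.

```shell
#!/bin/sh
# Hypothetical helper: report whether the workaround is set in the config file.
# Returns 0 only when the last matching entry has the value TRUE.
check_apichv_workaround() {
    config_file="${1:-/etc/vmware/config}"

    # Extract the quoted value of the key; keep the last entry if it appears
    # more than once, and upper-case it so "true" and "TRUE" compare equal.
    value=$(sed -n 's/^[[:space:]]*monitor_control\.disable_apichv[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' \
        "$config_file" 2>/dev/null | tail -n 1 | tr '[:lower:]' '[:upper:]')

    case "$value" in
        TRUE) echo "workaround present"; return 0 ;;
        "")   echo "workaround missing"; return 1 ;;
        *)    echo "workaround set to $value"; return 1 ;;
    esac
}
```

As with any file-level check, this only tells you what the config says; whether a given VM boots cleanly is still confirmed by powering it on.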