I had been troubleshooting a stubborn CPU utilization issue with a workload that, over time, would also overrun the CPU on my physical ESX host. The assumption was that the workload was causing the issue, but after several rounds of collecting various ESX performance statistics, nothing conclusively pointed to the workload as the culprit.
An interesting observation from some of the ESXi VMkernel Engineering team was that my VMkernel log contained a large number of entropy errors:
NRandomHwrng: 246: Out of entropy, refreshing
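To get a rough sense of how often this message appears on your own host, a small helper like the one below can count the matching lines. This is only a sketch: the function name count_entropy_errors is made up for illustration, and the default path /var/log/vmkernel.log is an assumption about where your host keeps its VMkernel log, so pass a different file if yours lives elsewhere.

```shell
# Count "Out of entropy" messages in a VMkernel log file.
# count_entropy_errors is a hypothetical helper name.
# The default path is an assumption; pass another file as the first argument.
count_entropy_errors() {
    log_file="${1:-/var/log/vmkernel.log}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c "Out of entropy" "$log_file"
}
```

Calling count_entropy_errors with no arguments checks the default log, while count_entropy_errors /path/to/copy.log checks a copy you have pulled off the host.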
The engineering team suspected that these entropy issues could actually be the root cause of the issues I had observed, especially as they had seen something similar in another case where entropy requests failed.
To confirm this theory, Engineering created an RDSEED (CPU instruction for random number generation) speed test and ran it across some AMD Zen 3 and Zen 4 systems that the team had access to. The result was that the Zen 4 CPU was much slower (50x) at servicing entropy requests and also had a higher failure rate than the Zen 3 processors.
Note: Similar slowness has also been observed on Zen 5 CPUs, so the issue is not limited to just Zen 4 processors.
This certainly explains the issue, as I was using a Minisforum MS-A2 with an AMD Ryzen 9 7945HX for my VMware Cloud Foundation (VCF) environment, which is a Zen 4 based system!
Fortunately, ESX supports multiple entropy sources. While RDSEED is the default when your physical processor supports it, we can fall back on other options to work around this issue.
SSH to your ESX host or call ESXCLI remotely (via PowerCLI, as an example) and update the entropy sources to use interrupts, which is the recommendation from Engineering:
esxcli system settings kernel set -s entropySources -v 1
You will need to reboot the ESX host for the change to go into effect.
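To check whether the new value is actually in effect, you can list the setting and compare its configured and runtime values. The helper below is a minimal sketch: check_entropy_setting is a made-up name, and it assumes the output of esxcli system settings kernel list -o entropySources is laid out in the columns Name, Type, Configured, Runtime, Default, Description. Verify that against your own host before relying on it.

```shell
# Reads the output of
#   esxcli system settings kernel list -o entropySources
# on stdin and reports whether the configured value is already active.
# check_entropy_setting is a hypothetical helper name.
# Assumed column order: Name Type Configured Runtime Default Description
check_entropy_setting() {
    awk '$1 == "entropySources" {
        if ($3 == $4)
            print "active: " $4
        else
            print "reboot pending: configured=" $3 " runtime=" $4
    }'
}
```

On the host, this would be used as esxcli system settings kernel list -o entropySources | check_entropy_setting. If the configured and runtime values differ, the host has not yet been rebooted since the change.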
Note: This also applies to Nested ESX VMs running on an AMD Zen 4/5 system.
Not only does this resolve the high CPU utilization that I had observed, which was especially noticeable when running more VMs configured with a larger number of vCPUs (12-24), but in my testing it has also brought down the overall CPU utilization of the ESX host.
Whether you are an MS-A2 owner or running other AMD-based Ryzen systems, this is something you will definitely want to configure to make sure your host is operating as efficiently as possible!
Were you experiencing any host crashes with this issue? I experience some random crashes on my Gmktec K8 and find these log entries too, but no matter what I try, I cannot resolve them.
It didn’t get that bad; host CPU was at 100% and starved hostd, but depending on load, I wouldn’t be surprised. If the fix doesn’t help, it’s possible it’s a HW issue.
Yeah unfortunately the entropy log entries are probably a red herring for me as I disabled entropy but I still receive intermittent crashes. Tried different RAM, and NVMe but still issues! No PSOD either, just straight reboot.