We had a few users report that the VMware Event Broker Appliance (VEBA) was not receiving any events from the connected vCenter Server. This was really puzzling for the team to debug because the users could clearly see events both in the vSphere UI and through vSphere Automation clients like PowerCLI.
After a bit of debugging with a few of our users (huge thanks to Michael Gasch for driving this), we discovered that in certain environments, the generated sequence number used for the vCenter Event ID had overflowed, causing the value to become negative. To further complicate the debugging, there are actually two ways of fetching vCenter Server events using the vSphere API. The first is to just look at the LatestPage property, which returns the most recent events and does not care about the event ID. The second is to use CreateCollectorForEvents(), which is more of an event stream and does care about the event ID being non-negative. You can probably guess which vSphere API VEBA was using. We chose it not only because of our check-pointing feature, but also because LatestPage could lose events from a client request point of view in chatty environments.
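To make the overflow concrete, here is a minimal Python sketch of how a counter turns negative once it passes the largest value a signed integer can hold. It assumes the event ID is stored as a signed 32-bit integer, which is an assumption for illustration (the width is not stated above), and wrap_int32 is a hypothetical helper, not part of any vSphere SDK.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def wrap_int32(value: int) -> int:
    """Wrap a Python int into the signed 32-bit two's complement range.

    Assumption: the event ID behaves like a signed 32-bit counter.
    """
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set represent negative numbers.
    return value - 0x100000000 if value >= 0x80000000 else value


last_good_id = INT32_MAX
next_id = wrap_int32(last_good_id + 1)  # one more event overflows the counter

print(last_good_id)  # 2147483647
print(next_id)       # -2147483648
```

Once the counter has wrapped, every subsequent ID stays negative for a very long time, which would explain why an API that expects non-negative event IDs stops returning events.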
With the actual root cause identified, we were happy that this was not an issue with VEBA, but it did highlight a potential vSphere issue for certain environments where events may appear to be missing. We have also reported the issue to VMware Engineering and improvements are being worked on; the chances of running into this scenario are believed to be low. While most customers may not run into this problem, it is certainly not easy to diagnose, especially if you are not using VEBA. We recently had another user who ran into this exact problem and we were able to quickly point them to VMware Support for remediation.
I started to think, how could a user quickly identify whether they are having this problem, especially if they are not using VEBA? I decided to reproduce the issue locally and here is how you can check if your vCenter Server is affected.
Open a browser to https://[VC_FQDN]/mob/?moid=EventManager&doPath=latestEvent and log in with an administrative account. Look for either chainId or key; if the value is negative, then you are affected by this issue. You can also refresh the page, which will pull in other events, and they should also have negative values for those properties. If you are affected by the issue, please open a VMware Support request and reference PR #2906239.
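If you would rather script the comparison than eyeball it, the check itself is simple. The sketch below is only an illustration: negative_event_ids is a hypothetical helper, and the sample values are made up to stand in for the chainId and key you would copy from the latestEvent page.

```python
from typing import List, Mapping


def negative_event_ids(event: Mapping[str, int]) -> List[str]:
    """Return the names of the ID properties that are negative.

    `event` is assumed to hold the `key` and `chainId` values
    copied from the latestEvent page in the MOB.
    """
    return [name for name in ("key", "chainId") if event.get(name, 0) < 0]


# Made-up sample values, not taken from a real vCenter Server.
sample_event = {"key": -2147480000, "chainId": -2147480000}

affected = negative_event_ids(sample_event)
if affected:
    print("Affected, negative properties:", ", ".join(affected))
else:
    print("Not affected, event IDs are non-negative")
```

An empty list means both properties are non-negative for that event; any entry in the list means the vCenter Server is showing the symptom described above.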
_M_P says
Thank you, mr. Lam, for another very interesting post!
BTW, you'd fix
https://[VC_FQDN]mob/?moid=EventManager&doPath=latestEvent
to
https://[VC_FQDN]/mob/?moid=EventManager&doPath=latestEvent
William Lam says
Thanks for the catch! It's been updated.