Physical hardware failure is inevitable, this is true whether it is running in your on-premises datacenter or in the Cloud like VMware Cloud on AWS. Although vSphere HA will automatically restart all affected VMs after detecting a host failure, there is usually additional activities that must be performed by a customer such as notifying all impacted application owners and even creating an incident ticket for hardware replacement.
With VMware Cloud on AWS, the hardware replacement is done automatically for you but the downstream activity of notifying application owners to verify the application is functional is still managed by the customer. There are many ways in how customers can manage such incidents and one solution that I am a huge advocate of is taking advantage of the powerful vCenter Server Events, which has over 1700+ events, not to mention any of the 2nd/3rd party events.
When an ESXi host fails, the com.vmware.vc.HA.DasHostFailedEvent event will be generated which contains all the relavent information related to the host failure including the specific hostname/IP, when the incident occurred and details about the vSphere Cluster and Datacenter is also provided. This information is visible using the vSphere UI but it can also be programmatically retrieved using the vSphere API, which is how the vSphere UI renders this information.
Note: Everything described in this blog post including the VEBA example is applicable to any environment that contains vCenter Server and is not limited to just VMware Cloud on AWS.
Here an is an example of what this looks like when using PowerCLI's Get-VIEvent:
EventTypeId : com.vmware.vc.HA.DasHostFailedEvent
Severity :
Message :
Arguments :
ObjectId : host-47
ObjectType : HostSystem
ObjectName : 10.2.32.5
Fault :
Key : 3340
ChainId : 3340
CreatedTime : 7/9/2020 8:59:42 AM
UserName :
Datacenter : VMware.Vim.DatacenterEventArgument
ComputeResource : VMware.Vim.ComputeResourceEventArgument
Host : VMware.Vim.HostEventArgument
Vm :
Ds :
Net :
Dvs :
FullFormattedMessage : vSphere HA detected a possible host failure of host 10.2.32.5 in cluster Cluster-1 in datacenter SDDC-Datacenter
ChangeTag :
With this information, customers can certainly query vCenter when a host failure occurs and then perform any additional tasks and automation. An alternative approach is to leverage an event-driven model where you can automatically trigger a notification and/or automation task when this event occurs by leveraging the VMware Event Broker Appliance (VEBA) solution.
This is actually quite timely as we just had a customer who created and contributed back a VEBA function that nicely demonstrates how they have used VEBA to automate notifications when a host failure occurs. In addition, this customer took this a step further and instead of just focusing on the host failure event, they also added logic to identify all impacted VMs and create an awesome email report that would automatically be sent to all affected parties!
Not only is this really cool but it is also nostalgic for me as this was the first task I had out of college to create a VM impact assessment report when an ESXi host failed and I REALLY wish VEBA had existed back then!
For more details on using this VEBA function, please check out this blog post by Robert Guske, fellow VMware colleague who helped support this customer in creating this function. This is a perfect example of the power of VEBA and how easy it is to get started!
Thanks for the comment!