There have been quite a few questions lately about vCenter Server Alarms for VSAN; one in particular that I have noticed concerns individual disk failures in VSAN. Outside of the generic default datastore alarms, there seem to be only two VSAN-specific alarms:
I figured there must be other useful alarms that we could create, especially after showing how you can create a vCenter Server Alarm to monitor the VSAN component count threshold based on a particular VSAN VOB. I took a look around and found the following VSAN-specific VOBs, which could be useful for creating additional vCenter Alarms.
|VSAN clustering services have been enabled.
|VSAN device has come online.
|VSAN clustering services have now been enabled.
|VSAN now has at least one active network configuration.
|A previously reported vmknic now has a valid IP.
|VSAN Node: Near node component count limit.
|VSAN device is under permanent error.
|Failed to create a new disk group.
|Failed to add disk to disk group.
|VSAN device has gone offline.
|VSAN clustering services have been disabled.
|VSAN device Memory/SSD congestion has changed.
|A vmknic added to VSAN network configuration doesn't have valid IP. Network is not ready.
|VSAN doesn't haven any redundancy in its network configuration.
|VSAN is operating on reduced network redundancy.
|VSAN doesn't have any networking configuration for use.
|A vmknic added to VSAN network configuration doesn't have valid IP. It will not be in use.
Looking at the list above, the following two VOBs seem like they would be useful for alerting on a disk failure:
Disclaimer: There is no guarantee that a disk error or failure will automatically trigger these VOBs, because a disk can fail in unpredictable ways, especially if the failure is intermittent.
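As a rough illustration of how one might shortlist disk-failure candidates from the VOB list above, here is a minimal sketch that filters the descriptions by keyword. The descriptions are copied from the list; the function name and the keyword heuristic ("device" together with "error" or "offline") are my own assumptions for illustration, not anything VSAN or vCenter provides.

```python
# Descriptions copied verbatim from the VOB list above.
VOB_DESCRIPTIONS = [
    "VSAN clustering services have been enabled.",
    "VSAN device has come online.",
    "VSAN clustering services have now been enabled.",
    "VSAN now has at least one active network configuration.",
    "A previously reported vmknic now has a valid IP.",
    "VSAN Node: Near node component count limit.",
    "VSAN device is under permanent error.",
    "Failed to create a new disk group.",
    "Failed to add disk to disk group.",
    "VSAN device has gone offline.",
    "VSAN clustering services have been disabled.",
    "VSAN device Memory/SSD congestion has changed.",
    "A vmknic added to VSAN network configuration doesn't have valid IP. Network is not ready.",
    "VSAN doesn't haven any redundancy in its network configuration.",
    "VSAN is operating on reduced network redundancy.",
    "VSAN doesn't have any networking configuration for use.",
    "A vmknic added to VSAN network configuration doesn't have valid IP. It will not be in use.",
]


def shortlist_disk_failure_vobs(descriptions):
    """Keep descriptions that mention a device plus an error or offline state.

    Hypothetical heuristic: this is plain keyword matching, nothing more.
    """
    failure_words = ("error", "offline")
    shortlisted = []
    for text in descriptions:
        lowered = text.lower()
        if "device" in lowered and any(word in lowered for word in failure_words):
            shortlisted.append(text)
    return shortlisted


if __name__ == "__main__":
    for text in shortlist_disk_failure_vobs(VOB_DESCRIPTIONS):
        print(text)
```

A keyword filter like this is only a starting point; whether a given VOB actually fires for a particular failure still has to be verified in a test environment.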
Even though we cannot simulate a disk error on a physical disk, we can still do some magic using a Nested VSAN environment. The worst-case scenario that you could run into is that one of the disks goes completely offline. We can simulate similar behavior in a Nested ESXi environment by removing one of the virtual disks from the Virtual Machine (not deleting it).
To demonstrate this scenario, here are the steps to create a vCenter Alarm for the following two VOBs:
Step 1 - Create a new vCenter Alarm and give it a name. Set Monitor to “Hosts” and Monitor for to “Specific event occurring …”:
Step 4 - There are two ways to trigger the alarm. You can either create a new Virtual Machine, which will try to write to the Nested ESXi VM from which you removed the virtual disk, or you can rescan the storage adapter on that Nested ESXi VM. In my environment, I happened to have a VM running on an NFS datastore, so I performed a Storage vMotion of that VM onto my VSAN Datastore using the default FTT=1 policy on a three-node VSAN Cluster. This immediately triggered the alarm, as seen in the screenshots below:
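As an aside on why the Storage vMotion works so well as a trigger: a mirrored VSAN object that tolerates n failures needs at least 2n+1 hosts, so with FTT=1 on a three-node cluster every host ends up holding a component (two replicas plus a witness). The small sketch below just does that arithmetic. The function names are hypothetical, the witness count is the minimum for the simple case (VSAN may place more), and it assumes each nested host contributes a single capacity disk, so any component placed on the host has to land on the disk that was removed.

```python
def mirroring_layout(ftt):
    """Minimum layout for a mirrored object tolerating `ftt` failures.

    Assumption: simple mirroring with the minimum number of witnesses.
    """
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    return {
        "replicas": ftt + 1,       # full copies of the data
        "witnesses": ftt,          # tie-breakers (minimum, simple case)
        "min_hosts": 2 * ftt + 1,  # each component sits on a different host
    }


def hosts_left_untouched(cluster_size, ftt):
    """Hosts that would hold no component of the object, in the minimum layout."""
    layout = mirroring_layout(ftt)
    if cluster_size < layout["min_hosts"]:
        raise ValueError("cluster is too small for this FTT value")
    return cluster_size - layout["min_hosts"]


if __name__ == "__main__":
    print(mirroring_layout(1))
    print(hosts_left_untouched(3, 1))
```

With FTT=1 and three hosts, no host is left untouched, which is consistent with the alarm firing as soon as the VM's objects were placed. On a larger cluster the placement might skip the affected host, in which case rescanning the storage adapter is the more predictable trigger.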