As some of you may have heard, there is a known issue with NFS-based datastores (including VSA NFS datastores) after upgrading to vSphere 5.5 Update 1. The issue causes NFS datastores to disconnect and go into an APD (All Paths Down) state. VMware is aware of the problem, and you can follow KB 2076392 for the latest updates.
While going through my Twitter stream this morning, I noticed an interesting question from fellow Blogger and friend Jase McCarty who asked the following:
I was quite surprised to hear that there were no vCenter Alarms being triggered for this issue. I decided to take a look at the KB to better understand the symptoms and see if there was anything I could do to help. From what I can tell, the only way to identify this particular problem is by looking at the logs, and the KB includes an example of what you would see.
Once I took a look at the logs, I knew there were at least two ways to get alerts. One option would be to leverage vCenter Log Insight and create a query based on the particular string, but not every customer is using Log Insight and it does require a bit of setup. The second, more obvious option for me would be to key off of the VMkernel VOBs that are being generated, which I have written about in the past for detecting duplicate IP addresses for ESXi and the VSAN component threshold count.
Here are the steps to create the vCenter Alarm:
Step 1 - Create a new vCenter Alarm and give it a name. Select "Hosts" for "Monitor" and "Specific event occurring ..." for "Monitor for".
Step 2 - For the Trigger, add the following VOB entries (just copy/paste them in):
- esx.problem.storage.apd.start
- esx.problem.vmfs.nfs.server.disconnect
- esx.problem.storage.apd.timeout
Note: The alarm will activate if ANY of the VOBs are seen, since it is an OR statement. It would have been nice to be able to group these together to generate the alarm.
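If you also want to spot-check a host outside of vCenter, the same three VOB IDs can be searched for in log text. The snippet below is only a rough sketch in Python: the function name `find_apd_vobs` and the sample lines are made up for illustration, and it assumes the VOB ID shows up verbatim in each log line, which you should confirm against your own logs before relying on it.

```python
# VOB IDs used as triggers in the alarm described above.
APD_VOBS = (
    "esx.problem.storage.apd.start",
    "esx.problem.vmfs.nfs.server.disconnect",
    "esx.problem.storage.apd.timeout",
)


def find_apd_vobs(lines):
    """Return (line_number, vob_id, line) for every line that mentions a VOB.

    Like the alarm, this is an OR match: any one of the IDs counts as a hit.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for vob in APD_VOBS:
            if vob in line:
                hits.append((number, vob, line.rstrip("\n")))
                break  # one hit per line is enough
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from a real host.
    sample = [
        "an unrelated log line",
        "[esx.problem.storage.apd.start] Device or filesystem has entered the All Paths Down state.",
    ]
    for number, vob, _line in find_apd_vobs(sample):
        print(f"line {number}: {vob}")
```

To run it against a real file, you could pass an open file handle instead of the sample list, since the function only needs an iterable of lines.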
Once the alarm has been created, you will at least have a way to get notified if you are potentially affected by this problem. I would still highly recommend you subscribe to KB 2076392 for all the latest updates.
vroomblog says
Thanks for the alarm. Is there a way to have the same for FC storage?
William Lam says
Take a look at this article for other vSphere VOBs including generic storage ones that you can use http://www.virtuallyghetto.com/2014/04/other-handy-vsphere-vobs-for-creating-vcenter-alarms.html
Admin says
Is there a difference in how the alarm triggers are reported in the fat client vs. the Web Client?
I have the screenshots, not sure if I can attach to the comment.
Steve H says
Do these VOB alerts work on vmware 5.0?
William Lam says
They should, but you can always confirm by checking whether these VOBs have been defined in 5.0
Jim Millard says
The alarms are nice, but I've noticed two things about them: 1) they never go from red to green after being tripped, and 2) there's no information about the datastore that tripped the alarm.
Yes, the instructions above indicate that there are limitations in the way the alarm trigger works (the "or vs and" factor), but it's sort of weird to see these alarms tripped after upgrading to 5.5U2 _and_ removing NFS stores from the cluster...
William Lam says
Jim,
1) I forget offhand whether you can create an alarm that sends an alert but does not stay red. In most cases, admins would want to see it and then ACK it, otherwise you would never know that an alarm had fired unless you were watching at the time.
2) You're right, this is an area we could improve in. I would guess that if you were using the API, you could pull more information about the object that tripped the alarm. I thought this was possible within the Events view when an alarm tripped, but I haven't tested it myself.
stacycarter says
William,
Is there any way to have the vCenter alert include the NFS datastore name in the body of the email, rather than just the UUID?
William Lam says
Yes, you'll need to identify which alarm environment variable contains that info. Some more details: https://pubs.vmware.com/vsphere-60/index.jsp?topic=%2Fcom.vmware.vsphere.monitoring.doc%2FGUID-AB74502C-5F01-478D-AF66-672AB5B8065C.html and https://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.bsa.doc_40/vc_admin_guide/working_with_alarms/r_alarm_environment_variables.html
What I normally do is just print out all environment variables as part of the trigger and then identify which variable I need for a given alarm. You may also want to check out this recent Reddit thread, which could be helpful: https://www.reddit.com/r/vmware/comments/4b6lpq/change_the_summary_line_of_email_sent_by_vcenter/
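A minimal sketch of that "print everything" approach, written in Python purely for illustration. It assumes the script is run as the alarm's command action so that the VMWARE_ALARM* variables are present in its environment, and that a Python interpreter is available where the action runs. The function names are hypothetical.

```python
import os


def alarm_variables(environ=None):
    """Return every VMWARE_ALARM* variable, sorted by name.

    Empty values are kept on purpose, so you can see which ones
    were not populated for the event that fired.
    """
    if environ is None:
        environ = os.environ
    return {
        name: environ[name]
        for name in sorted(environ)
        if name.startswith("VMWARE_ALARM")
    }


def format_variables(variables):
    """Render the variables as 'NAME = value' lines."""
    return "\n".join(f"{name} = {value}" for name, value in variables.items())


if __name__ == "__main__":
    print(format_variables(alarm_variables()))
```

Redirecting the output to a file from the alarm action makes it easier to compare what different events populate.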
stacycarter says
Hey William - We tried printing out all environmental variables as part of the trigger, however 4 of the variables were blank for this alarm, including the one I expected to contain the datastore name.
VMWARE_ALARM_EVENT_VM
VMWARE_ALARM_EVENT_NETWORK
VMWARE_ALARM_EVENT_DATASTORE
VMWARE_ALARM_EVENT_DVS
The rest of the environmental variables we printed out did have info, but did not contain the datastore name :(
William Lam says
Not all VMWARE_ALARM* variables will always be populated; it depends on the event that was triggered. In this particular case, I suspect the "datastore" which the alarm triggered off of is stored in another variable ...
Would you mind sharing the other VMWARE_ALARM* properties that were returned?
stacycarter says
Sure. Here is what we got (redacted):
VMWARE_ALARM_NAME = [name we gave alarm]
VMWARE_ALARM_ID = [alarm id]
VMWARE_ALARM_TARGET_NAME = [host fqdn]
VMWARE_ALARM_TARGET_ID = [host id]
VMWARE_ALARM_OLDSTATUS = Gray
VMWARE_ALARM_NEWSTATUS = Red
VMWARE_ALARM_TRIGGERINGSUMMARY = Event: All paths are down
Summary: Device or filesystem with identifier [***********] has entered the All Paths Down state.
Date: [date alarm triggered]
Host: [host fqdn]
Resource pool: [cluster name]
Data center: [datacenter name]
Arguments:
eventTypeId = esx.problem.storage.apd.start
objectId = [host id]
objectName = [host fqdn]
1 = [datastore identifier]
VMWARE_ALARM_DECLARINGSUMMARY = ([Event alarm expression: All paths are down; Status = Red] OR [Event alarm expression: All Paths Down timed out, I/Os will be fast failed; Status = Red] OR [Event alarm expression: Lost connection to NFS server; Status = Red])
VMWARE_ALARM_ALARMVALUE = Event details
VMWARE_ALARM_EVENTDESCRIPTION = Device or filesystem with identifier [***********] has entered the All Paths Down state.
VMWARE_ALARM_EVENT_USERNAME =
VMWARE_ALARM_EVENT_DATACENTER = [datacenter name]
VMWARE_ALARM_EVENT_COMPUTERESOURCE = [cluster name]
VMWARE_ALARM_EVENT_HOST = [host fqdn]
VMWARE_ALARM_EVENT_VM =
VMWARE_ALARM_EVENT_NETWORK =
VMWARE_ALARM_EVENT_DATASTORE =
VMWARE_ALARM_EVENT_DVS =
stacycarter says
Hi William - Just checking in to see if you were able to figure out which variable the datastore name is stored in? Did the additional info below help at all? Thanks!
William Lam says
It looks like you may have to construct the Datastore Name from "1 = [datastore identifier]", as it's not included as part of the alarm.
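One way to approach this, sketched in Python under a few assumptions: the "1 = ..." argument appears on its own line inside VMWARE_ALARM_TRIGGERINGSUMMARY (as in the output shared above), and you maintain your own mapping from identifier to datastore name. The function names and the `known` mapping are hypothetical; how you build that mapping for your environment is left open.

```python
import os
import re

# Matches the positional argument line "1 = <identifier>" in the summary.
_ARG_ONE = re.compile(r"^[ \t]*1[ \t]*=[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def datastore_identifier(summary):
    """Return the value of the '1 = ...' argument, or None if it is absent."""
    match = _ARG_ONE.search(summary)
    return match.group(1) if match else None


def datastore_name(summary, names_by_identifier):
    """Translate the identifier into a name using a caller-supplied mapping.

    Falls back to the raw identifier when the mapping has no entry.
    """
    identifier = datastore_identifier(summary)
    if identifier is None:
        return None
    return names_by_identifier.get(identifier, identifier)


if __name__ == "__main__":
    summary = os.environ.get("VMWARE_ALARM_TRIGGERINGSUMMARY", "")
    # Hypothetical mapping; replace with identifiers and names from your environment.
    known = {"example-identifier": "example-nfs-datastore"}
    print(datastore_name(summary, known))
```

The resolved name could then be placed in the email body by whatever script the alarm action runs.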