Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts? What information do you look for when you prioritize?
Some alerts, like vSAN disk balance, don't work well for all vSAN deployments; for some clusters it's always going to be on. Alerts like "haven't tested HCL against online test" aren't a priority; they're more informational. I have to read the alert and look at what's really affected.
Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart from acknowledging and snoozing, what are the other actions you would like to take?
I acknowledge alerts all the time based on high usage, because they don't self-clear. Also, some things you fix don't self-clear, so I'm always acknowledging alerts to see if they are currently valid. I would expect that if I snooze an alert, it takes another recurrence to bring it back up, and that the alarm isn't re-raised just because it was snoozed. I would like option buttons for "disable this alert for this host", or for this cluster if you are viewing at the cluster object level. The next screens showing the line graph are an awesome next step to see occurrence rates. Maybe also a link to "open this in Log Insight" or some other tool.
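The snooze expectation in this answer can be sketched in a few lines. This is only one reading of the comment, not the product's behavior: the class `SnoozableAlert`, its fields, and the use of plain numeric timestamps are all hypothetical. The point it illustrates is that the end of the snooze window does not bring the alert back by itself; only a new occurrence after the window does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnoozableAlert:
    """Hypothetical alert model; names and fields are illustrative only."""
    name: str
    visible: bool = True
    snoozed_until: Optional[float] = None  # timestamp in seconds, None if not snoozed

    def snooze(self, now: float, duration: float) -> None:
        # Hide the alert and remember when the snooze window ends.
        self.visible = False
        self.snoozed_until = now + duration

    def record_occurrence(self, now: float) -> None:
        # A new occurrence re-raises the alert only if it is not
        # inside an active snooze window.
        if self.snoozed_until is None or now >= self.snoozed_until:
            self.snoozed_until = None
            self.visible = True


alert = SnoozableAlert("CPU usage high")
alert.snooze(now=0.0, duration=60.0)

alert.record_occurrence(now=30.0)   # inside the window: stays hidden
hidden_during_snooze = not alert.visible

# Time passes beyond 60s with no new occurrence: nothing re-raises the alert.
hidden_after_expiry = not alert.visible

alert.record_occurrence(now=120.0)  # new occurrence after the window: back up
print(hidden_during_snooze, hidden_after_expiry, alert.visible)
```

The design choice worth noting is that there is no timer callback at all: visibility only changes in response to a snooze action or a fresh occurrence.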
Q3: For this particular issue, what would be your next step if you saw a production cluster exceeding its resource threshold?
Sort by VMs, look at the heat map, and see what's causing it. Then look at that object to see how frequently it causes the issue. If it's everything, then I assume some sort of scan or hotfix reboot/deployment, and I assume high usage is expected. If they are new VMs and they don't have a long history line, then I would look at whether there's another cluster for that VM. At worst, I raise a ticket to the app owner to discuss utilization.
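The triage order described in this answer could be written down roughly as follows. It is a simplified sketch, not a rule set from any tool: the `VmUsage` type, the function `next_step`, and every threshold (80% CPU as "hot", 7 days as "new", 80% of VMs as "everything") are assumptions chosen for illustration, and the frequency check on the individual object is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmUsage:
    """Hypothetical per-VM snapshot; fields are illustrative only."""
    name: str
    cpu_pct: float      # current CPU usage in percent
    history_days: int   # how many days of usage history exist for this VM


def next_step(vms: List[VmUsage],
              hot_pct: float = 80.0,
              new_vm_days: int = 7,
              widespread_fraction: float = 0.8) -> str:
    # Sort by usage so the heaviest consumers come first.
    ranked = sorted(vms, key=lambda v: v.cpu_pct, reverse=True)
    hot = [v for v in ranked if v.cpu_pct >= hot_pct]

    if not hot:
        return "nothing stands out"
    # "If it's everything": assume a scan or hotfix rollout, usage is expected.
    if len(hot) >= widespread_fraction * len(ranked):
        return "expected high usage"
    # Only new VMs without a long history line: consider another cluster.
    if all(v.history_days < new_vm_days for v in hot):
        return "look for another cluster"
    # Otherwise, at worst, talk to the application owner.
    return "raise ticket to app owner"


quiet = [VmUsage(f"vm{i}", 20.0, 200) for i in range(4)]
print(next_step(quiet + [VmUsage("new-vm", 95.0, 2)]))
print(next_step(quiet + [VmUsage("old-vm", 95.0, 300)]))
print(next_step([VmUsage(f"vm{i}", 90.0, 200) for i in range(5)]))
```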
Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this view help you to address this scenario?
Yes, yes, and yes. I like the view with the line graph, and then the bullets of when it occurred and for how long; I like that display. The top part, where it has CPU capacity for the vCenter-wide view, does nothing for me. With many clusters, there could be one with 10% free and one with 90% free, so a single percentage across clusters doesn't help me at all.
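The point about the vCenter-wide number can be made concrete with the respondent's own figures. The cluster names and the equal 100 GHz capacities below are made up for the arithmetic; only the 10% and 90% free values come from the answer.

```python
# Hypothetical clusters; capacities are assumed equal for simplicity.
clusters = {
    "cluster-a": {"capacity_ghz": 100.0, "free_ghz": 10.0},
    "cluster-b": {"capacity_ghz": 100.0, "free_ghz": 90.0},
}


def free_pct(capacity: float, free: float) -> float:
    # Free capacity as a percentage of total capacity.
    return 100.0 * free / capacity


# Per-cluster view: shows that cluster-a is nearly full.
per_cluster = {
    name: free_pct(c["capacity_ghz"], c["free_ghz"])
    for name, c in clusters.items()
}

# Aggregate view: total free over total capacity across all clusters.
total_capacity = sum(c["capacity_ghz"] for c in clusters.values())
total_free = sum(c["free_ghz"] for c in clusters.values())
aggregate = free_pct(total_capacity, total_free)

print(per_cluster)  # {'cluster-a': 10.0, 'cluster-b': 90.0}
print(aggregate)    # 50.0
```

The aggregate of 50% free looks comfortable even though one cluster has only 10% headroom left, which is why the per-cluster breakdown is the more useful view here.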
Q5: What would you do next from here?
Narrow down to the cluster view and see if it's all hosts or one host. Go to the VM view and ask the same: is it many or just one? Then decide on relocation or just wait it out.