I am creating 'server availability' dashboards for our business, but I need them to scale: creating a new "button" widget per server is very inefficient. The list widget is nice in this respect, as it lets me create a large server list or use a USM group to fill in the hosts.
The issue is that the list widget (or list view) does not allow me to specify an alarm filter for each host, so I essentially need to rely on some QoS metric to reflect server availability. Which QoS metric could help me identify server availability?
E.g. QOS_NET_CONNECT_PACKETLOSS? It tells me there is inconsistency, but it does not confirm "server down".
net_connect Metrics - CA Unified Infrastructure Management Probes - CA Technologies Documentation
This may help you in this regard.
System Uptime can be monitored through either cdm, net_connect, ntperf, or snmpget.
Computer uptime, as reported by the cdm probe, will be shown as cumulative, in minutes, since the robot was started. If you are looking for availability, you might try using the net_connect probe to ping a device and then base your SLA on the maximum response time at which you still consider the node 'available' rather than unavailable.
Uptime is the number of seconds the computer has been running since reboot. It is not that useful for an SLA because it is a counter that continues to rise until a reboot occurs.
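To illustrate why a raw uptime counter is awkward for an SLA: the only event it really exposes is a reset. Below is a minimal sketch, assuming you have exported a series of uptime samples (the values and the function name are hypothetical, not part of any probe), that counts reboots by looking for a drop between consecutive samples.

```python
from typing import Sequence


def count_reboots(uptime_samples: Sequence[float]) -> int:
    """Count how many times the uptime counter dropped between samples.

    A drop means the counter was reset, i.e. the machine rebooted
    somewhere between the two samples.
    """
    reboots = 0
    for previous, current in zip(uptime_samples, uptime_samples[1:]):
        if current < previous:
            reboots += 1
    return reboots


# Hypothetical uptime samples in seconds, taken at a fixed interval.
samples = [3600, 3900, 4200, 120, 420, 720, 60]
print(count_reboots(samples))
```

Note that this says nothing about how long the machine was down between samples, which is why the ping-based approach is usually preferred for availability.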
net_connect (most common approach)
You can use net_connect to ping the node every x minutes. Then use the SLA/SLO engine to define a good value for each sample, such as zero or greater, which shows that the computer was responding. If the computer does not respond, the net_connect probe will send a NULL, which the SLA engine counts as a 'fail'. The SLA engine then calculates the number of good samples versus the number of failures and reports the percentage of the SLA period in which the computer was 'up'.
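The calculation described above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the SLA engine's actual implementation; the function name, the `max_ok_ms` threshold, and the sample values are assumptions, with `None` standing in for a NULL sample.

```python
from typing import Optional, Sequence


def availability_percent(
    samples: Sequence[Optional[float]],
    max_ok_ms: Optional[float] = None,
) -> float:
    """Percentage of samples that count as 'good'.

    A sample is good when it is not None (NULL), is zero or greater,
    and, if max_ok_ms is given, does not exceed that response time.
    """
    if not samples:
        raise ValueError("no samples in the SLA period")
    good = 0
    for value in samples:
        if value is None or value < 0:
            continue  # NULL or invalid sample counts as a failure
        if max_ok_ms is not None and value > max_ok_ms:
            continue  # responded, but too slowly to count as available
        good += 1
    return 100.0 * good / len(samples)


# Hypothetical ping response times in ms; None is a missed ping.
response_ms = [12.0, 15.5, None, 11.2, None, 13.0, 14.1, 12.7]
print(f"{availability_percent(response_ms):.1f}% available")
```

Passing `max_ok_ms` corresponds to basing the SLA on the maximum response time you still consider 'available'.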
ntperf counter: \System\System Up Time
In Windows, System Up Time is the elapsed time (in seconds) that the computer has been running since it was last started. This counter displays the difference between the start time and the current time.
For snmpget, you can do a web search for the uptime OID, browse to it in the MIB browser, and add it to your snmpget configuration.
MIB variable: sysUpTime
Create a new QoS definition for uptime and use the collected value as desired.
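One thing to keep in mind when using the collected value: sysUpTime is an SNMP TimeTicks value, i.e. hundredths of a second, not seconds. A small sketch of the conversion (the function name and sample value are only for illustration):

```python
from datetime import timedelta


def timeticks_to_timedelta(ticks: int) -> timedelta:
    """Convert SNMP TimeTicks (hundredths of a second) to a timedelta."""
    # One tick is 10 ms; using milliseconds keeps the arithmetic exact.
    return timedelta(milliseconds=ticks * 10)


# Hypothetical raw sysUpTime value as returned by an SNMP get.
raw_ticks = 8_640_000
print(timeticks_to_timedelta(raw_ticks))
```

TimeTicks is a 32-bit counter, so it wraps around after roughly 497 days; a sudden drop in the value is therefore not always a reboot.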
In UMP, I would probably recommend a dashboard view or maybe even a List View depending on your needs.
In USM, the uptime stats from cdm (QOS_COMPUTER_UPTIME), when collected, are displayed under the given node -> Metrics tab -> System -> Host.
Your input is much appreciated. Certainly things to consider.
Right now, though, there are unfortunate limits to applying your suggestions in the dashboards/list views.
For example, one host may have many "targets", and there is no granular way to select one specific target across a group of host servers.