We also tried to determine whether an agent is disconnected. For a Java agent, as long as the agent is reporting, its connection status is reported to one of our collectors.
After trying the various alert settings, we found that the following settings on a summary alert work:
metric grouping: a wildcard for the specific agent's connection status, so it is picked up no matter which collector the agent is reporting to.
agent expression:
(.*)\|Custom Metric Process \(Virtual\)\|Custom Metric Agent \(Virtual\) \((.*)@5001\)
metric expression:
Agents\|<agent host>\|AIXAgent\|PerfMonAgent:ConnectionStatus
Comparison Operator: Greater than
Trigger Alert: Whenever Severity Changes
Combination: all
Danger threshold: 1
Danger Periods over: 80
Observed Periods: 80 (20 minutes)
Caution threshold: 1
Periods over threshold: 42
Observed periods: 42 (10 minutes 30 seconds)
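
As a rough sanity check on the settings above, here is a small Python sketch. It tests the agent expression against a few made-up agent names and converts the period counts into wall-clock time. The sample names are hypothetical, the 15-second period length is an assumption (it is what 80 periods = 20 minutes implies), and Python's `re` module only stands in for whatever regex engine the monitoring tool uses.

```python
import re

# Agent expression copied from the alert settings above.
AGENT_EXPRESSION = (
    r"(.*)\|Custom Metric Process \(Virtual\)"
    r"\|Custom Metric Agent \(Virtual\) \((.*)@5001\)"
)

# Assumption: one period is 15 seconds (80 periods = 20 minutes).
PERIOD_SECONDS = 15


def matches_agent(agent_name: str) -> bool:
    """True if the whole agent name matches the agent expression."""
    return re.fullmatch(AGENT_EXPRESSION, agent_name) is not None


def window(periods: int) -> str:
    """Convert a number of periods into a minutes:seconds string."""
    minutes, seconds = divmod(periods * PERIOD_SECONDS, 60)
    return f"{minutes}:{seconds:02d}"


# Hypothetical agent names, for illustration only.
samples = [
    "Custom Metric Host (Virtual)|Custom Metric Process (Virtual)"
    "|Custom Metric Agent (Virtual) (collector01@5001)",
    "Custom Metric Host (Virtual)|Custom Metric Process (Virtual)"
    "|Custom Metric Agent (Virtual) (collector02@5001)",
    "apphost01|WebSphere|MyJavaAgent",
]

for name in samples:
    print(matches_agent(name), name)

print("danger window: ", window(80))   # 20:00
print("caution window:", window(42))   # 10:30
```

The first two sample names match regardless of the collector name in the last parentheses, which is the point of the wildcard; the third does not match because it is not a virtual agent path.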
Then we have a summary alert that bundles all of the agent connection statuses into categories (Java, Windows, AIX, etc.), with an action defined on each category summary alert.
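
To picture the roll-up, here is a minimal Python sketch. It is not how the product evaluates summary alerts: the host names, the category mapping, and the "worst state wins" rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed severity order, lowest to highest.
SEVERITY = {"normal": 0, "caution": 1, "danger": 2}


def summarize(alert_states, category_of):
    """Group per-agent alert states by category and keep the worst state."""
    worst = defaultdict(lambda: "normal")
    for alert, state in alert_states.items():
        category = category_of[alert]
        if SEVERITY[state] > SEVERITY[worst[category]]:
            worst[category] = state
    return dict(worst)


# Hypothetical per-agent connection status alerts.
states = {
    "javahost01": "normal",
    "javahost02": "danger",
    "winhost01": "caution",
    "aixhost01": "normal",
}
categories = {
    "javahost01": "java",
    "javahost02": "java",
    "winhost01": "windows",
    "aixhost01": "aix",
}

for category, state in summarize(states, categories).items():
    print(f"{category}: {state}")
```

An action would then hang off each category result rather than off every individual agent alert, which keeps the number of notifications manageable.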
This has been working pretty well. The only drawback is a long-duration ADS in which the agent does not return after a reboot and stays down beyond the 20-minute window.
To work around this, we built a dashboard that displays all of the summary alerts, and we review it after system reboots to ensure all alerts are reporting and not grayed out.
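
If you wanted to script that post-reboot review instead of eyeballing a dashboard, the check could look something like this Python sketch. The alert names and states are hypothetical, and `None` is just my stand-in for an alert that has no data and would show as gray.

```python
def not_reporting(alert_states):
    """Return the names of alerts with no state, i.e. the gray ones."""
    return sorted(
        name for name, state in alert_states.items() if state is None
    )


# Hypothetical summary alert states after a reboot; None means no data.
after_reboot = {
    "java agents": "normal",
    "windows agents": None,
    "aix agents": "danger",
}

for name in not_reporting(after_reboot):
    print(f"follow up: {name} is not reporting")
```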
Hope this helps,
Billy