Does anyone have a solution to how to detect and alert on an agent that is no longer reporting within APM?
Created two alerts on agent connection status
Test 1 ConnectionStatus 15 sec resolution, Greater Than, Whenever Severity Changes, combinations all, threshold 1, periods 8/8 with caution set to 1, 4/4.
Test 2 Connection Status 15 sec resolution, Not Equal To, Whenever Severity Changes, combinations all, threshold 1, periods 8/8 with caution set to 1, 4/4.
ADS - Cron, 90/14/14/4/8/?/2016 so 2:14 pm on 8/4 for 90 minutes
1. Set unmount to 10 minutes
2. Restarted the enterprise managers
3. Stopped the epagent to ensure that an alert would be generated
4. Started the agent - agent changed collectors from collector 1 to collector 2
5. Set ADS for 90 minutes
6. Waited till ADS was active
7. Stopped the epagent on the target server
8. Waited till ADS ended
9. The agent jumped between the two collectors while I was preparing the test. The first collector unmounted the agent, but after more than 12 hours the second collector still has not. On the second collector the agent is grayed out with no metrics, including the agent connection status.
I did not receive an alert from either of the agent connection status alerts.
This was my best guess to have APM be aware of an agent being down/stopped/unreachable after an ADS.
We have 577 agents and 734.8k metrics with a MOM and seven collectors. We do not see any performance issues.
We have tried to build JavaScript calculators to scan the agent connection status, but with so many agents the JavaScript calculator stops and does not report on all of the agents.
APM version 10.0
Agent environment performance agent version 10.0.0.12
I made this into a new thread and into a question
I would consider using it because of the problem you described (load balancing).
The cluster calc will read the status metric from each connected agent per EM and report them as new metrics to the MOM. Set up your alerts based on these new metrics to ensure you're not getting false positives.
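To illustrate the idea only, here is a rough sketch of that kind of roll-up. It is not the shipped cluster calculator: `clusterwideStatus` and the shape of the `reports` array are made up for this example, and the status codes (1 = connected, 3 = disconnected, 0 = no information) are an assumption. The point is that an agent counts as connected if any collector sees it, so an agent hopping between collectors does not raise a false alert.

```javascript
// Hypothetical sketch, not the shipped cluster calculator.
// Assumed status codes: 1 = connected, 3 = disconnected, 0 = no information.
function clusterwideStatus(reports) {
  // reports: [{ collector: "...", agent: "...", status: <number> }]
  var byAgent = {};
  reports.forEach(function (r) {
    if (!byAgent[r.agent]) {
      byAgent[r.agent] = [];
    }
    byAgent[r.agent].push(r.status);
  });

  var result = {};
  Object.keys(byAgent).forEach(function (agent) {
    var statuses = byAgent[agent];
    if (statuses.indexOf(1) !== -1) {
      result[agent] = 1; // connected on at least one collector
    } else if (statuses.indexOf(3) !== -1) {
      result[agent] = 3; // no collector sees it connected, one sees it down
    } else {
      result[agent] = 0; // no collector has any information
    }
  });
  return result;
}

// agentA moved from collector1 to collector2, agentB is really down.
var clusterStatus = clusterwideStatus([
  { collector: "collector1", agent: "agentA", status: 3 },
  { collector: "collector2", agent: "agentA", status: 1 },
  { collector: "collector1", agent: "agentB", status: 3 },
  { collector: "collector2", agent: "agentC", status: 0 }
]);
console.log(JSON.stringify(clusterStatus));
```

An alert on the rolled-up value would then fire for agentB but not for agentA, even though collector1 still reports agentA as disconnected.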
Thank you Hiko.
I believe this should be what Hiko was referring to:
Hope that helps.
Thank you Aryne.
I will download this again. I last downloaded it a few years ago, so it has more than likely changed.
After installation, I'll test it to see whether it solves the problem of an agent being shut down, failing, or stopped during an ADS, with no alert sent after the ADS that the agent is down.
Since Hiko has provided a solution that should resolve this issue, I am marking as answered. Please report the result of your findings on completion.
After copying ConnectionStatus.js to my test MOM, I found a log statement that produced a log line per agent. I edited the file and commented out line 38:
//log.info("zzz thisUid :" + thisUid);
I let this run for a few minutes and then stopped an agent. The "Clusterwide Agent Connection Status" reported back the expected value of 3. After a while the value went to 0, which is where I would expect it to go if the agent is unmounted.
I restarted the MOM while the epagent was stopped and the agent no longer had a Clusterwide Agent Connection Status.
After a while, I mounted the agent and waited to see if the script would pick up on the re-mounted agent. Did not see the clusterwide status.
Started the agent again and refreshed the workstation by closing the metric tree and reopening it.
I'm going to stop the agent again and then wait for the agent unmount to ensure that this script will continue to report zero till the MOM is restarted. In this test I did not wait long enough after the status went to zero to be sure that it was the unmount.
Because a MOM restart clears the list of all of the unmounted agents, I'm going to look into writing out the agent list each cycle and reading that file to initialize the list on startup.
Will post an update after my testing and then will work on the .js modifications to capture the agent unmount condition.
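As a rough sketch of that idea (not a change to the real ConnectionStatus.js): keep a map of every agent ever seen, report 0 for any agent that has no data in the current cycle, and serialize the map so it can be restored after a restart. The names `updateKnownAgents`, `saveState` and `loadState` are hypothetical, and the actual file read/write is left out because I have not confirmed what file access a calculator script has; the state is just a JSON string here.

```javascript
// Hypothetical sketch: remember every agent ever seen so that an agent with
// no current data (unmounted, or lost on a MOM restart) is still reported.
function updateKnownAgents(known, currentStatuses) {
  // known:           { agentName: lastReportedStatus }
  // currentStatuses: { agentName: status } for agents with data this cycle
  var next = {};
  Object.keys(known).forEach(function (agent) {
    next[agent] = 0; // default for agents that are missing this cycle
  });
  Object.keys(currentStatuses).forEach(function (agent) {
    next[agent] = currentStatuses[agent]; // live value wins
  });
  return next;
}

// State is kept as a JSON string; writing it to a file is not shown.
function saveState(known) {
  return JSON.stringify(known);
}

function loadState(text) {
  return text ? JSON.parse(text) : {};
}

// Cycle 1: both agents report. Then simulate a restart via save/load.
var state = updateKnownAgents(loadState(""), { agentA: 1, agentB: 1 });
var saved = saveState(state);

// Cycle 2 after the restart: agentB has been unmounted and has no data.
state = updateKnownAgents(loadState(saved), { agentA: 1 });
console.log(JSON.stringify(state));
```

With 500+ agents the map stays small, but writing it out every 15-second cycle may still be more I/O than is wanted, so saving only when the agent list changes could be worth considering.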
The EMs' agent unmount is set to 10 minutes (40 cycles) to try to get more test cycles in per day.
About 4 cycles (1 minute) after the 10-minute unmount, the agent's clusterwide agent connection status went to "no current data".
So, if there was an ADS that was longer than the unmount duration, and the agent went down at the start of the ADS, when the ADS ended, we would get no alert since there is no data.
On the MOM there are about a dozen or so agents, quite unlike our non-production and production environments, which have more than 500 agents each. I'm not sure the JavaScript can keep up as it stands, let alone after the persistent-storage-to-file modification that would let it keep reporting on an unmounted agent after an ADS.
APM is not designed to keep metrics alive for agents that no longer report data. This is to enhance performance and data collection.
Just keep in mind how this will impact the cluster performance over time and as you add more agents.
Harhiko is correct to point out that this use case does not really fit
what APM is intended to do - you do not really need to check for
disconnected agents every 15 seconds.
The challenge is that while CA-APM can alert when an agent disconnects,
if the agent does not return within 60 minutes, then that agent is removed
from the Explorer tree - until the hour or day that it returns.
What you need is an external service that notices the agent has
disconnected (em logs this), checks if the agent has reconnected and
then notifies (possibly) if it did not. So you need a data structure
(list of disconnected agents), a cmd script to check if those agents
(in the list) have come back, and a cron entry to execute the service
every 15-60 minutes - or maybe just once per day since no one seems to
have noticed they have gone missing for more than an hour anyway!
This way, you get a list of all disconnected agents, when they were last
seen and you can then decide what to do about it. The query load should
be insignificant (assuming dozens, not thousands). Basically a
persistent store of disconnected agents with periodic ping() to see if
they came back.
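A minimal sketch of the bookkeeping part of such a service, under stated assumptions: `checkDisconnected` is a hypothetical name, the `disconnected` map would be loaded from and saved to whatever persistent store is used, and how the list of currently connected agents is obtained (EM logs, a query, etc.) is deliberately left out.

```javascript
// Hypothetical sketch of one periodic run of the external check.
function checkDisconnected(disconnected, connectedNow, nowMs, notifyAfterMs) {
  // disconnected:  { agentName: disconnectedAtMs } (the persistent list)
  // connectedNow:  array of agent names that are connected at this run
  // notifyAfterMs: how long an agent may be missing before someone is told
  var stillMissing = {};
  var toNotify = [];
  Object.keys(disconnected).forEach(function (agent) {
    if (connectedNow.indexOf(agent) !== -1) {
      return; // the agent came back, drop it from the list
    }
    stillMissing[agent] = disconnected[agent];
    if (nowMs - disconnected[agent] >= notifyAfterMs) {
      toNotify.push(agent);
    }
  });
  return { stillMissing: stillMissing, toNotify: toNotify };
}

var MINUTE = 60 * 1000;
var run = checkDisconnected(
  { agentA: 0, agentB: 0, agentC: 50 * MINUTE }, // when each one dropped
  ["agentA"],                                    // agentA has reconnected
  60 * MINUTE,                                   // "now"
  30 * MINUTE                                    // notify after 30 minutes
);
console.log(JSON.stringify(run));
```

In this run agentA is removed from the list, agentB is reported because it has been gone for an hour, and agentC stays on the list without a notification because it has only been gone for 10 minutes.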
Thank you Mike and Hiko.
If you could only see what we have done to APM through the EPAgent it would make you shake your head.
I fully agree that what I'm attempting to do is way outside the intended scope of Application performance monitoring and I go as far as calling it infrastructure monitoring. But my end users don't want me to use CA UIM/HP SiteScope to do the typical ping the host, ping a port, check a process, run a URL/SOAP, check the state of an OS service/daemon, to check to see if a process is running.
With the epagent connection status, this would show that we can get a response from a process on the hosting system. In context of the application agent (Java) the connection status would show that the JVM is responding and not dead, locked up or a wire between the collector and the agent isn't broken.
I looked into using/parsing the MOM logs but did not see any agent disconnect or unmount messages, and I have the logs turned up to verbose. Then there is the matter of building the disconnection list with the agents load-balancing between collectors and the original collector un-mounting the agent after 60 minutes.
I have tried to sell the idea that application performance would suffer, that we could alert on performance, errors, and stalls, and that root cause discovery would then show there isn't a host/JVM or enough instances for a function/service. That discussion didn't go far, since team ownership of the application, and thus the performance of the application, might as well be on a different planet.
So the only fault cases when using an "all" APM alert on the collector's agent connection status are the un-mount cases, with and without ADS. In the last five years we have had only two events attributed to JVMs whose application agents were not started, were not added to the startup scripts, or stopped reporting, and were then un-mounted without anyone being alerted.
Using the ConnectionStatus.js or defining an Alert with "Not Equal To" and combination "all" will provide alerting on the basic agent connection status going into error state (3) but as soon as the agent is un-mounted, then the agent connection status would be like it never existed, (no data).
So my next test would be using the Alert with combination "all" and extending the un-mount time from 60 minutes to longer than our longest ADS then find a way for the alert to send out a message after the ADS that the agent, epagent or application agent, did not return after the ADS. This should continue to gather "3" till the ADS would allow the alert to send out an email alert that the agent is not reporting after the ADS.
Extending the un-mount time on the enterprise managers did not take effect: I set the time to 120 minutes and the agent was unmounted after about 30 minutes.
The ADS started at 14:35 with a duration of 60 minutes. The unmount was configured for 120 minutes and all EMs were restarted before this test. The agent was stopped around 14:37 and was unmounted at 9:00 but the graph above shows the agent reporting from 14:32 till 15:08
With the ADS ending at 15:35 and the connection status no longer reporting after 15:08, no alert notification setting would cause an alert to be sent out, since from 15:08 on, about 30 minutes before the ADS was set to expire, there was no data.
From these behaviors, and given that using the agent connection status as an infrastructure check isn't in the nature of what APM was intended for or how APM functions, we are shifting our focus to cover this monitoring hole with one of our infrastructure monitoring tools (CA UIM or HP SiteScope).
Thanks Billy. From what you reported so far, this sounds promising.
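The timeline arithmetic, spelled out as a small sketch. The times are the ones from this test; `toMinutes` is just a helper for the example, and the rule "an alert can only fire if the metric still has data when the ADS ends" is my working assumption about the behavior, not something taken from the product documentation.

```javascript
// Convert "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  var parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

var adsStart = toMinutes("14:35");
var adsEnd = adsStart + 60;            // ADS duration was 60 minutes -> 15:35
var agentStopped = toMinutes("14:37");
var lastData = toMinutes("15:08");     // last connection status data point

var minutesOfDataAfterStop = lastData - agentStopped; // 31
var gapBeforeAdsEnd = adsEnd - lastData;              // 27

// Assumption: the alert needs a data point at or after the end of the ADS.
var alertPossible = lastData >= adsEnd;               // false

console.log(minutesOfDataAfterStop, gapBeforeAdsEnd, alertPossible);
```

So the status metric kept reporting for roughly 31 minutes after the agent stopped and went silent 27 minutes before the ADS ended, which is why neither alert had anything to evaluate.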