APM Version : 10.5.2.92
Total Agents: 1580
Number of Collectors: 9
On the APM Status Console, we have an active clamp on one of our collectors:
The number of agents on each of the nine collectors varies between 118 and 250. Typically once a month, APM is restarted to pick up OS patches. During startup the agents jump between collectors, then level out after a few hours of load balancing. The problem is that during this process a collector may have had over 400 agents connect and disconnect before the agents are load balanced.
1. What is the behavior of a collector that has a historical.agent.limit clamp (400)?
2. Depending on the resulting behavior of a collector, is this really an error level clamp/limit?
3. Again, depending on the resulting behavior, is there a way to address this issue without resorting to assigning agents to a specific collector or collectors, in effect artificially dividing the agents into groups of fewer than 400 per collector?
4. What is the APM cluster impact when one or more of the collectors are reporting the disconnected.historical.agent.limit? In a nutshell, when do I, as the APM admin, need to take preventative action?
The following reference might be helpful. The section "introscope.enterprisemanager.disconnected.historical.agent.limit" describes why this clamp occurs and how the EM responds, and has some suggestions on how to address it.
apm-events-thresholds-config.xml - CA Application Performance Management - 10.5 - CA Technologies Documentation
Just some suggestions to check:
"If there is no historical agent that the Enterprise Manager can automatically unmount, this means that CA APM users mounted manually all the disconnected historical agents. The Enterprise Manager never tries to unmount a disconnected historical agent that a CA APM user mounted manually."
Would you have mounted any agents manually?
Would you see such a message?
My suggestion would be to try to use introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector=always to prevent the clamp issue during the startup.
I have covered the most common issues and recommendations regarding clustering in this KB
Introscope Enterprise Manager Troubleshooting and - CA Knowledge
See point # 15 that covers loadbalancing
So I read through the doc and decided to go look for these disconnected but still mounted agents. I would typically go to the custom metric host to locate any agent that is grayed out.
To my surprise, no agents are grayed out. I then clicked on the Agents folder and searched for "ConnectionStatus", and all agents have a value of 1.
I would expect to see a hundred or so, mounted but disconnected (greyed out) agents under the Agents folder but I don't. The collector has around 177 total agents but the APM status console is still reporting the active clamp on the
I couldn't find anywhere on the custom metric host where there was a metric that I could use to gauge against the APM console.
Looking at the clamp line, the clamp occurred on July 22. Our agents unmount after 24 hours of being disconnected.
So it looks like a ghost message that is stuck in the APM Status Console, since I am not able to see any sign that the collector associated with the active clamp is actually in that condition.
Anyone know how to kick the APM console so it will check again and clear the message?
Thank you Francis.
There are only two people with administration rights, which include mounting and unmounting agents. Neither of us has mounted or unmounted any agents.
We haven't seen any messages on unmounting historic agents to make room.
Thank you again,
I hope this helps,
Thank you Sergio.
I will be going through the KB line for line against our new 10.5.2 cluster.
On the load balancing, I can see where setting staywithhistoricalcollector to always would help, but currently the collector that has the warning about the historic agent limit does not appear to have any disconnected mounted agents. So it appears that the APM Status Console thinks there is a clamp on historic agents when there does not appear to be one.
With the setting at always, how does that affect failures? If one of the collectors were to fail or be unable to accept agents, would the agents move to a different collector until the failed collector returned to service?
The message was from July 22, and I can reason it was due to a cluster restart. I would expect that after 24 hours, the disconnected agents would unmount due to the introscope.enterprisemanager.autoUnmountDelayInMinutes=1440, and I'm guessing that the historic.agent.limit shouldn't count unmounted agents.
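The reasoning above can be checked with a little date arithmetic. The following is a minimal sketch, not the Enterprise Manager's actual logic: it assumes the clamp counts agents that have disconnected but not yet passed the auto-unmount delay, and that the clamp becomes active once that count reaches the limit. The restart year and time of day, the number of disconnecting agents, and all function names are hypothetical.

```python
from datetime import datetime, timedelta

# Values quoted in the thread.
AUTO_UNMOUNT_DELAY_MINUTES = 1440   # introscope.enterprisemanager.autoUnmountDelayInMinutes
HISTORICAL_AGENT_LIMIT = 400        # out-of-the-box clamp threshold


def auto_unmount_time(disconnected_at, delay_minutes=AUTO_UNMOUNT_DELAY_MINUTES):
    """Time at which a disconnected agent would be auto-unmounted."""
    return disconnected_at + timedelta(minutes=delay_minutes)


def mounted_disconnected_count(disconnect_times, now,
                               delay_minutes=AUTO_UNMOUNT_DELAY_MINUTES):
    """Count agents that have disconnected but are not yet past the unmount delay."""
    return sum(
        1 for t in disconnect_times
        if t <= now < auto_unmount_time(t, delay_minutes)
    )


def clamp_expected(disconnect_times, now, limit=HISTORICAL_AGENT_LIMIT):
    """Assumption: the clamp is active once the count reaches the limit."""
    return mounted_disconnected_count(disconnect_times, now) >= limit


# Hypothetical restart: the year and time of day are made up for illustration.
restart = datetime(2018, 7, 22, 2, 0)
# Hypothetical churn: 450 agents drop off this collector within about 45 minutes.
disconnects = [restart + timedelta(seconds=6 * i) for i in range(450)]

one_hour_later = restart + timedelta(hours=1)
two_days_later = restart + timedelta(days=2)
print(mounted_disconnected_count(disconnects, one_hour_later))  # 450
print(clamp_expected(disconnects, one_hour_later))              # True
print(clamp_expected(disconnects, two_days_later))              # False
```

Under these assumptions the clamp should no longer apply two days after the restart, which is why a message still showing weeks later looks stale.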
On a cluster restart, I would expect to see quite a few APM Status Console messages, which should clear once the cluster has become balanced, plus 24 hours to unmount the disconnected agents.
Sergio also just updated this older KB covering that property: Tip for loadbalancing configuration when upgrading - CA Knowledge
Sergio can confirm but I believe introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector=always means that:
- as long as the Collector is up, the agent will wait for a connection to it even if it is overloaded.
- if the Collector is down the agent will be redirected to another Collector
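
The two rules above can be written as a small decision function. This is only a sketch of the behavior as described in this post, not the MOM's real load-balancing code: the function name, the collector names, and the "take the first collector that is up" fallback are all hypothetical placeholders (the real MOM would presumably pick a fallback based on load).

```python
from typing import Iterable, Optional


def pick_collector(historical: str, up_collectors: Iterable[str]) -> Optional[str]:
    """Model of staywithhistoricalcollector=always as described in the thread.

    historical:    the Collector the agent last reported to.
    up_collectors: Collectors currently running (overloaded or not).
    """
    up = list(up_collectors)
    if historical in up:
        # Rule 1: the historical Collector is up, so stay with it,
        # even if it is overloaded.
        return historical
    # Rule 2: the historical Collector is down, so redirect elsewhere.
    # Placeholder choice: the first Collector that is up, or None if none are.
    return up[0] if up else None


print(pick_collector("collector3", ["collector1", "collector2", "collector3"]))
print(pick_collector("collector3", ["collector1", "collector2"]))
```

The first call keeps the agent on collector3; the second redirects it because collector3 is not in the list of running Collectors.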
Hope that helps
Thank you Lynn.
Now I'm a bit confused about how this setting might help in my case, since at least once a month all collectors are stopped, which would trigger the second clause. Then, while the collectors are starting, each agent will more than likely connect to whichever collector it tries from the collector list it held before the collectors/MOM were shut down (MOM first, so that the collector list is not updated).
Could it be that this specific APM Status Console message/alert has its alert trigger set to "Whenever Severity Increases" rather than "Whenever Severity Changes", and so is not clearing the active clamp? I do not see any metrics showing that the collector has more than a few hundred active agents, and I don't see any metrics on the custom metric host that might be a historic agent count. Then again, I could be missing the metric driving the active clamp alert.
SergioMorales for any more input he may have.
I am thinking that even though agents might initially get to their preferred Collector during the startup process, the suggested setting could still help: while the MOM tries to balance the metric load across the Collectors during the dynamic startup period, it would keep agents from subsequently being moved around between Collectors.
Regarding the alert not clearing from the APM Status Console, perhaps you can create a support case so we can research in more detail why the alert is not clearing.
I have opened support case 01171999
It is always great to see a good conversation taking place. I combined Tom's, Sergio's and Francis's responses into one response.
Support Case: 01171999
APM Status Console - reporting "disconnected historic agent limit" - active clamp
We found that there was a reporting/refresh issue with the APM Status Console: when agents stopped reporting and there were more than the out-of-the-box limit of 400, the active clamp alert would appear and would not clear even after the agents had unmounted.
1. We deployed a revised "/product/enterprisemanager/plugins/com.wily.introscope.em_10.5.2.jar"
2. In the file apm-events-thresholds-config.xml, set the "introscope.enterprisemanager.disconnected.historical.agent.limit" threshold value="1"
3. Shut down an EPAgent
4. Manually unmounted the agent
5. After a short bit, the APM Status Console cleared the active clamp notice.
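
Before and after changing the threshold in step 2, it can be handy to confirm what value the file actually holds. The sketch below reads it with Python's standard XML parser. The XML layout (a clamp element with an id attribute containing a threshold element with a value attribute, no XML namespace) is an assumption based on the `threshold value="1"` wording above, and the sample document and function name are hypothetical; check them against the real apm-events-thresholds-config.xml before relying on this.

```python
import xml.etree.ElementTree as ET

CLAMP_ID = "introscope.enterprisemanager.disconnected.historical.agent.limit"

# Hypothetical stand-in for the relevant part of apm-events-thresholds-config.xml.
# The element names and nesting are assumed, not copied from a real installation.
SAMPLE_XML = """<apmEvents>
  <clamps>
    <clamp id="introscope.enterprisemanager.disconnected.historical.agent.limit">
      <threshold value="400"/>
    </clamp>
  </clamps>
</apmEvents>
"""


def read_clamp_threshold(xml_text, clamp_id=CLAMP_ID):
    """Return the integer threshold for clamp_id, or None if it is not found."""
    root = ET.fromstring(xml_text)
    for clamp in root.iter("clamp"):
        if clamp.get("id") != clamp_id:
            continue
        threshold = clamp.find("threshold")
        if threshold is not None and threshold.get("value") is not None:
            return int(threshold.get("value"))
    return None


print(read_clamp_threshold(SAMPLE_XML))  # 400
```

To run it against a real file, read the file's contents into a string and pass that in place of SAMPLE_XML.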
Hope this helps,
Thank-you Billy Cole for letting the Community know