Use case: an agent generates far too many metrics (e.g. unique SQL statements). What are the possible ways to stop the agent from flooding the cluster?
1. We do not want to reduce the clamp on the collector(s) and impact all agents.
2. There can be many agents on the same remote host, so filtering on IP address is not practical.
3. The cluster has 2 MoMs and 10 collectors.
4. Managers are v9.7.1.
5. Agents are v9.7, v9.5, and v8.2.
The v9.7 agent is now cluster aware and can thus jump from one collector to another without checking with the MoM.
+ What are the different possibilities to stop the agent "now" from sending its metrics?
<the person who has access does not have the skill to analyse the root cause>
+ Same question, but what are the ways to set the MoM / Collectors to refuse the agent metrics?
For Florian: will the ACC manage that in the v9.8 release?
One possible option is the agent-side metric clamp. It requires a JVM restart, though...
# Agent Metric Clamp Configuration
# The following setting configures the Agent to approximately clamp the number of metrics sent to the EM
# If the number of metrics passes this metric clamp value, then no new metrics will be created. Old metrics will still report values.
# The value must be equal to or larger than 1000 to take effect. A lower value will be rejected.
# The default value is 50000.
# You must restart the managed application before changes to this property take effect.
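For reference, the profile line itself would look like the following (property name quoted from memory of IntroscopeAgent.profile; double-check the exact spelling against your agent version):

```
introscope.agent.metricClamp=50000
```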
Did you try a simple shut-off on the Agent or the problematic metric tree node? (Right-click the Agent or a metric node in the Investigator and select Shut off.) This is even preserved across EM and Agent restarts. That's the quick fix.
The long term fix would consist of:
1) Disable SQL metrics entirely for this Agent. A bit brute force, and you lose SQL metrics, but it solves the problem once and for all. You can still keep them in Transaction Traces, though.
2) Use the SQL normalizer to aggregate the problematic SQL queries. (Usually the culprit is a single SQL query that shows up in unique variants, like "SELECT FROM TEMP001", "SELECT FROM TEMP002", etc., or comments automatically generated by Hibernate. The SQL normalizer works well for this kind of problem.)
3) Clamp the problematic Agent. If your Agent is generating, say, 50,000 unique metrics and 45,000 of them are SQL, you can probably clamp your Agent at 10,000, for example.
4) To "reduce" the load, you could disable some of the fancier SQL detail metrics: only keep Average Response Time and Responses Per Interval, for example, and remove Stalls, Concurrent Invocations, and Errors. This doesn't solve the metric leak problem, but it attenuates it.
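For option 2, here is a sketch of the regex-based normalizer configuration in the agent profile. The property names below are the RegexSqlNormalizer settings as I recall them from the Java Agent profile, and the TEMP pattern is only a hypothetical example for the "SELECT FROM TEMP001 / TEMP002" case, so verify everything against the Java Agent Implementation Guide for your version:

```
introscope.agent.sqlagent.normalizer.extension=RegexSqlNormalizer
introscope.agent.sqlagent.normalizer.regex.matchFallThrough=false
introscope.agent.sqlagent.normalizer.regex.keys=key1
introscope.agent.sqlagent.normalizer.regex.key1.pattern=TEMP[0-9]+
introscope.agent.sqlagent.normalizer.regex.key1.replaceAll=true
introscope.agent.sqlagent.normalizer.regex.key1.replaceFormat=TEMPnnn
introscope.agent.sqlagent.normalizer.regex.key1.caseSensitive=false
```

With something like this in place, the TEMP001/TEMP002 variants all report under one TEMPnnn metric instead of one metric per table name.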
In general, I think our OOTB clamp settings are too high. I believe they're set at 50,000 metrics; I would advise you to reduce that to at least 20,000, possibly 10,000, at the EM level. If you look at your average number of metrics per Agent, I'm pretty sure it's much smaller than 3,000 anyhow.
Thank you all.
- When you "Shut Off" the agent or the metric tree, what happens?
Is the MoM sending a message to all the collectors telling them not to store the related metrics, or is the collector telling the agent not to send the metrics?
- Also, why don't we see these "Shut Off" actions listed in the Enterprise Manager Map?
- About the clamps: which one, or both?
introscope.enterprisemanager.agent.metrics.limit -> 20000
introscope.enterprisemanager.metrics.live.limit -> 20000
I'm trying to get a definitive answer on your first question.
About the clamps, the one you want to change is introscope.enterprisemanager.agent.metrics.limit
This one is per Agent. The other one is per EM, so you clearly don't want to touch this one.
Got an answer to your first question.
"It won’t apply to all collectors. It will find the collector of the agent or metric that is “shut off”, then MOM sends the “shut off” request to that collector only."
Thank you very much.
Ouch!! That explains what I see: when I look at the historical data, I saw the metric disappear and then reappear for maybe 55 minutes while collector 1 was down and the agent had reconnected to the next collector. Can we call this a bug?
Also, I do see the downed collector in the "Enterprise Manager Map" picture, but nothing in the "Important Events" pane. Is there some configuration missing in my v9.7.1 managers?
Shouldn't we also see these "Shut Off" events listed in that Map?
Yes, you can open a support ticket about it.
I've hit a similar issue with an application building dynamic SQL statements that are not normalized, so each statement string generates a metadata write. The metadata write is for the metric label, i.e. the SQL statement, and at the peak the collectors buckled under the metadata write load even with the agent clamp at 20,000 metrics.
There is a SQL normalizer setting with which you can group the statements into a single metric, such as all the selects, deletes, updates, and counts. How broad to make each metric depends on the application's need for and use of the SQL, and is more of a performance question.
CA APM .NET Agent Implementation Guide
CA APM Java Agent Implementation Guide
For our issue with the number of dynamic SQL statements, and given what we would lose if we shut down the agent, the development staff adjusted their SQL generation to use prepared statements, limited the end users' ability to search across every single field, and ordered their WHERE clauses so there are now only about a hundred statements instead of thousands.
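To illustrate why the prepared-statement change collapses the metric count, here is a small Python sketch (the table and query are hypothetical, not the actual application code): the concatenated style produces one distinct SQL string per value, so one agent metric each, while the parameterized style produces a single string no matter how many values are bound.

```python
import sqlite3

def distinct_statements(ids):
    """Return the set of SQL strings an APM agent would see for each style."""
    # String concatenation: every distinct id yields a distinct statement string.
    concatenated = {f"SELECT status FROM orders WHERE id = {i}" for i in ids}
    # Parameterized: one statement string regardless of the bound value.
    parameterized = {"SELECT status FROM orders WHERE id = ?" for _ in ids}
    return concatenated, parameterized

# Sanity check: the parameterized form is real, runnable SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.execute("SELECT status FROM orders WHERE id = ?", (1,))
```

With 1,000 ids, the concatenated set holds 1,000 unique strings (1,000 metrics), while the parameterized set holds exactly one.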
Hope this helps,
I keep thinking about your post. Unfortunately, I do not get much support from the teams concerned to correct the SQL.
So I was thinking that one day I should study the Python script Mike Sydor wrote to generate the KPIs for the analytics server, and apply the same approach to read the SQL strings for an agent and generate the normalized data.
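As a starting point for that script idea, here is a minimal sketch (my own regexes, not Mike Sydor's code) that folds unique SQL strings into normalized metric names and counts how many raw statements land in each bucket:

```python
import re
from collections import Counter

def normalize_sql(sql: str) -> str:
    """Collapse literal variants so unique SQL strings fold into one key."""
    s = sql.strip()
    s = re.sub(r"/\*.*?\*/", "", s, flags=re.DOTALL)            # strip /* generated */ comments
    s = re.sub(r"'[^']*'", "?", s)                              # string literals -> ?
    s = re.sub(r"\b\d+\b", "?", s)                              # standalone numbers -> ?
    s = re.sub(r"\b(TEMP)\d+\b", r"\g<1>nnn", s, flags=re.I)    # TEMP001, TEMP002 -> TEMPnnn
    return re.sub(r"\s+", " ", s).strip().upper()

def metric_counts(statements):
    """Count how many raw statements fold into each normalized metric name."""
    return Counter(normalize_sql(s) for s in statements)
```

Running a day's worth of agent SQL metric names through `metric_counts` would show which handful of normalized buckets the thousands of unique statements actually belong to.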
You can deny agents using loadbalancing.xml in /config/loadbalancing.xml. I hope it will solve your problem.
You can find more details in chapter 7 of the Admin and Configuration Guide.
Exclude an Agent from a Particular Collector or Set of Collectors
The following example shows how to exclude an agent named MyAgent from a Collector named MyHost.
<agent-collector name="Exclude MyAgent from a Collector Example">
  <agent-specifier>.*\|.*\|MyAgent</agent-specifier>
  <exclude>
    <collector host="MyHost" port="5001"/>
  </exclude>
</agent-collector>
The following example shows how to exclude multiple agents from multiple Collectors.
<agent-collector name="Exclude Two Agents Assigned from Two Collectors Example">
  <agent-specifier>.*\|.*\|AgentOne</agent-specifier>
  <agent-specifier>.*\|.*\|AgentTwo</agent-specifier>
  <exclude>
    <collector host="MyHost" port="5001"/>
    <collector host="MyOtherHost" port="5001"/>
  </exclude>
</agent-collector>
Thank you. I very much like the idea of shutting off part of the agent metric tree (i.e. the SQL detail) as opposed to the complete agent.
However, for good measure, and certainly needed in case of emergency, I tried your idea (with v9.7.1). So I added an <include> followed by an <exclude> to loadbalancing.xml. As per the notes in that file, I also edited the MoM config with introscope.apm.agentcontrol.agent.allowed=false
I restarted the MoM just to make sure all these changes were active, and everything appears to work.
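For anyone repeating this emergency procedure, the two pieces fit together roughly like this (a sketch only; GoodAgent and the specifier regex are placeholders, and the exact agent-specifier syntax is documented in the comments of loadbalancing.xml itself):

```
# IntroscopeEnterpriseManager.properties on the MoM:
introscope.apm.agentcontrol.agent.allowed=false
```

```
<agent-collector name="Allow only known agents">
  <agent-specifier>.*\|.*\|GoodAgent.*</agent-specifier>
  <include>
    <collector host="MyHost" port="5001"/>
  </include>
</agent-collector>
```

With allowed=false, any agent that matches no <include> block is denied.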
Actually, I had not thought of this: the 'allowed' collectors list is sent by the MoM to the related agents themselves, and the resulting denied-agents list appears in the APM Status Console ~ pretty cool.
The challenge for the non-initiated will be to vi the file without making a mistake. Hence, I am back to preferring the 'shut off', which should be made cluster-wide and also reported in this APM Status Console (Case #70317). It should be easy to fix.