I have run into a scenario where the response times of my application increase after instrumenting with APM. I have tried disabling deep tracing and deep instrumentation, but that didn't help. Do you have any idea what else I can do in this case? P.S.: I am not collecting many SQL metrics or JMX counters.
Each of my agents is collecting more than 13,000 metrics; could that be the reason? I have been told that each agent can support up to 5,000 metrics without causing overhead. Thanks.
The number of metrics alone does not indicate overhead. A more precise measure is the 'responses per interval' metric for the agent that is having problems. A large number of metrics can result from instrumentation that goes too low in the call stack. You generally want to avoid instrumenting components that are invoked more than 5,000 times per interval (15 seconds), and applications that have little or no business logic are poor candidates for deep instrumentation.
So sort the 'responses per interval' metrics for the agent in question from greatest to lowest; this will give you a quick assessment of what might be the problem.
Thanks Mike, that's helpful.
The problem in my case is that we have since uninstalled the agent, so I mounted the agent back again and looked at its historical data.
There are quite a few resources whose metrics are collected close to 26k times over 2 minutes (since this is historical data, I cannot use live mode or select a specific 15-second time range). If I split those 26k metrics (collected over two minutes) into 15-second intervals, 26k / 8 = 3,250 metrics per 15 seconds, which is still under 5,000 metrics per 15 seconds.
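For what it's worth, the division above checks out; a quick shell sketch using the figures from this thread:

```shell
# 26,000 data points over a 2-minute historical window.
# 2 minutes = 120 s = 8 intervals of 15 s.
total=26000
intervals=$((120 / 15))              # 8
per_interval=$((total / intervals))  # 26000 / 8
echo "$per_interval per 15 s"        # 3250, under the 5000 guideline
```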
In this case, can I conclude that metric explosion is not the reason for the overhead?
It is possible that specific tracers are causing the overhead for your application because of the methods the application is calling.
Can you quantify the original performance overhead with the agent in percentage terms?
Are you using full or typical (pbl) instrumentation?
Thanks for the response Lynn.
We are using Typical (PBL) instrumentation, and yes, response times double after instrumentation, which is why I say the agent overhead is 100%.
That is a significant overhead!
Did you get any relative CPU usage measurements as well?
In this scenario you will probably need to set up a test environment with the agent and start disabling individual tracer options to measure the impact of each change. spyderjacks would have more suggestions on that, or you can create a Support case and we can review the agent logs, application thread dumps, etc. in more detail to try to determine which tracers are causing the overhead.
One example of a tracer that can sometimes cause performance impact is Socket Tracing. Once you have a test agent environment up, you can start by commenting that out in the toggles-typical.pbd file.
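A sketch of what that edit might look like; PBD toggle files use TurnOn directives, but the exact directive names and file path vary by agent version, so check your own toggles-typical.pbd rather than copying this verbatim:

```
# <agent_home>/core/config/toggles-typical.pbd (path is illustrative)
#
# Tracer groups are enabled by TurnOn directives. Commenting out the
# socket tracing toggle disables that tracer group on the next restart:
#
#TurnOn: SocketTracing
```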
Thanks Lynn, I tried doing this but no luck.
As a next step, I am evaluating each and every PBD flag and calibrating its impact.
Converted to a discussion for Manoj to report back, and for continued input from other community members.
Might help if we knew more about the environment.
Which version of the CA APM agent?
Which Application Servers and versions?
It might also help to look at the agent's AutoProbe log and introscopeagent.log and see if you find any errors or atypical behavior. In the AutoProbe log, check whether the agent is reinstrumenting often by searching for a single PBD file and seeing how many times the agent loaded it.
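A quick way to count those reloads is to grep the AutoProbe log for the PBD name. The sketch below fabricates a small sample log so it is self-contained; in practice point LOG at your real AutoProbe log (the path and message text here are illustrative, not the agent's exact wording):

```shell
# Sample AutoProbe-style log for illustration; in practice set
# LOG to the agent's real AutoProbe log file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Processing directives file: toggles-typical.pbd
Processing directives file: required.pbd
Processing directives file: toggles-typical.pbd
EOF

# A count well above 1 for the same PBD suggests the agent keeps
# reinstrumenting, which is itself a source of overhead.
loads=$(grep -c 'toggles-typical.pbd' "$LOG")
echo "toggles-typical.pbd loaded $loads times"
```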
In the IntroscopeAgent log, look for errors in the communication to the collector.
In the application, look to see if there are a high number of error traces and what is causing the error traces.
We had two issues, one with version 10.0 and another with 10.5.2.
In 10.0, with WebLogic running Oracle JVM 1.8, we hit two issues: one with the WebLogic Diagnostics Framework (bytecode instrumentation) and one with the Oracle JVM 1.8 JavaScript engine (Nashorn).
In 10.5.2 we found that SOA Performance Management (SPM) was enabled by default in the base agent, and that SPM was not compatible with WebSphere 7.0.
For the JVM, how do the CPU and memory (garbage collection) look? If there isn't enough CPU, the agent will multiply the impact and drive much more CPU usage than typical. If there is not enough memory, or too much, garbage collection will run more often or will have to compact more of the fragmented heap; both drive CPU.
How is the network looking on the JVM (agent) host side?
What is running within the JVM? Web Applications, Services (EJB), Services (SOAP)?
If you are using the typical PBL, that is "typically" all the general elements without any of the really deep stack tracers.
If you are running version 10.5.2+, try commenting the SPM PBD out of the IntroscopeAgent.profile.
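For illustration, the SPM directives are pulled in through the agent's directives list in the profile; the exact property value differs per install, so edit your own line rather than copying this one:

```
# IntroscopeAgent.profile -- before (illustrative value):
introscope.autoprobe.directivesFile=websphere-typical.pbl,hotdeploy,spm.pbd

# After: drop spm.pbd from the comma-separated list, then restart the JVM:
introscope.autoprobe.directivesFile=websphere-typical.pbl,hotdeploy
```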
Hope this helps,
Thanks for listing the possible root causes. We have uninstalled the agents from the PROD environment because of the overhead they were creating. We are working on instrumenting the agents in the non-PROD environment; once we are done with that, I will get back to you with what I see in the agent logs. Thank you.
FYI, we are using the 10.3 JBoss (Java) agent and 10.5.2 for the MOM, with JBoss 6.4.4 on 64-bit Linux, JDK 1.8.0_92.