Now and then, unexpected overhead ‘appears’ and it’s up to you to hunt it down and eliminate it.
This happens because folks are deploying agents and not really taking care of the APM configuration. Almost everybody has the appropriate process – the pre-production review – yet somehow, there is nothing about APM in that review. Typically, the review only covers patch levels, jar versions, and other considerations apparently left over from the medieval period or the Jurassic age. For this ‘new’ agent technology, apparently, there is no reason to check anything about the configuration. Does it have good visibility? Is it compatible with the other technologies that make up the application? Does it have excessive overhead? Does it generate excessive metrics? Does it have any value at all?
No. Nothing to see here. It seems to work on application X that we deployed last quarter. It should work for application N – which also has a bunch of new frameworks that we have never used before. No worries – ship it. Really?
And then overhead “appears”. Wow – never saw that coming! Really? Really???
Finding Potential Overhead
So how do you find this overhead? Open the APM workstation, navigate to the agent and look at the data. The answer is in there. As soon as the agent is running, the answer is in there.
The "rule" is this: anything more than 5,000 invocations per 15-second interval will contribute to excessive overhead. Exactly how much? That's hard to say. But without fail, this guideline has saved the bacon of many an APM practitioner over the years. I discovered this property when trying to use APM against a SPECmark benchmark. Really bad idea... but very interesting results. We'll save that story for another day.
So here is how to assess the potential overhead:
- Navigate to the agent of interest
- Select the “Search” tab
- Select an appropriate historical range
  - Some period where the agent was running
  - BEST to have simulated load or performance test
  - BETTER to have a User Acceptance Test
  - GOOD enough if you wait until production
    - Seriously! You don’t even test your code before production?
- Search on keyword “Responses”
- Click on Value column to sort, largest to smallest
- Set the resolution to “15 seconds”
So the result is pretty obvious: 1,000,000 >>>> 5,000 – and this is the first source of the reported overhead. Note that the next value is also fairly large, so let’s check that one as well:
This component passes the test (1,800 < 5,000) and so ends our search – pretty easy!
However, you might be concerned that the 3rd component, which also has a large number of invocations, might be a potential source. OK – so let’s look at that one as well:
“Crikey!! 180,000 >>> 5,000 so this is a problem too?”
Well, not so much. This is limited to the startup of the application. It would be a problem if this app were restarted frequently, but since this is a batch application, we can safely discount this potential overhead. You could, of course, continue moving through the list. I find the <down arrow> the quickest way to navigate. But as we already sorted the list, largest to smallest, it is unlikely we are going to find any other overhead candidates for this application.
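If you would rather script the triage than eyeball it, the logic above fits in a few lines of Python. Treat this as a rough sketch and not an APM API: the component names, the per-interval sample counts, and the one-minute startup window are all made up for illustration. Only the 5,000-per-15-seconds guideline and the three peak values (1,000,000, 1,800 and 180,000) come from the walkthrough.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Guideline from the text: more than 5,000 invocations per 15-second interval
INVOCATION_LIMIT = 5_000


@dataclass
class Component:
    name: str          # hypothetical component name, not a real metric path
    counts: List[int]  # invocations per 15-second interval, in time order


def classify(component: Component, startup_intervals: int = 4) -> str:
    """Return 'ok', 'startup-only' or 'overhead' for one component.

    startup_intervals is an assumption: the first 4 intervals (one minute)
    are treated as application startup.
    """
    over = [i for i, c in enumerate(component.counts) if c > INVOCATION_LIMIT]
    if not over:
        return "ok"
    if max(over) < startup_intervals:
        return "startup-only"
    return "overhead"


def triage(components: List[Component]) -> List[Tuple[str, int, str]]:
    """Sort by peak count, largest first, and classify each component."""
    ranked = sorted(components, key=lambda c: max(c.counts), reverse=True)
    return [(c.name, max(c.counts), classify(c)) for c in ranked]


# Illustrative samples only; just the peaks match the numbers in the text.
components = [
    Component("entity_bean", [900_000, 1_000_000, 950_000, 1_000_000, 980_000, 990_000]),
    Component("second_component", [1_200, 1_800, 1_500, 1_700, 1_600, 1_400]),
    Component("startup_loader", [180_000, 120_000, 0, 0, 0, 0]),
]

for name, peak, verdict in triage(components):
    print(f"{name:18} peak={peak:>9,} {verdict}")
```

The "startup-only" verdict is the scripted version of the judgment call above: a spike that never recurs after startup is discounted for a batch job, but you would want to treat it as real overhead for an app that restarts often.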
Nuking the Potential Overhead
Now to remove the offending component we have to have a little bit more knowledge about what we are tracing and where it is configured. In this case, the component is an EJB-Entity Bean. This is a Java component that takes care of a persistent data structure - not at all surprising, considering we are looking at an in-memory database application. But you don’t really need to know what it is – you just need to keep track of the name of the component.
Finding where it is configured means determining which of these locations defines the component: the basic tracing configuration, an extension module, or a custom file. Usually the component name is enough to identify the starting point. If you are in doubt as to which file holds the tracing definition, there is no shame in using grep() to get a quick list of files that have a potential definition.
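If grep is not handy (hello, Windows), the same hunt can be sketched in Python. This is only an illustration: the function name, the search term "EntityBean" and the directory in the usage lines are assumptions, so substitute your own agent's config directory and the component name you wrote down earlier.

```python
from pathlib import Path


def find_definitions(config_dir, needle):
    """Return (file name, line number, line) for every line mentioning needle.

    The match is case-insensitive and every regular file under config_dir
    is scanned, so no assumption is made about file extensions.
    """
    hits = []
    root = Path(config_dir).expanduser()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle.lower() in line.lower():
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Hypothetical directory and search term; adjust both for your agent.
    for name, number, line in find_definitions("~/wily/core/config", "EntityBean"):
        print(f"{name}:{number}: {line}")
```

The output is the short list of candidate files and lines, which is all you need to pick the starting point for the edit.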
In this case, you will need to navigate to the ~/wily/core/config directory to begin your hunt. For me, with that 11-year head start with APM ;-) I’ll just go to where the EJB tracing lives, in toggles-typical.pbd, and switch off the entity bean tracing as follows:
The RED comment is the change, which we make in two places.
"Why two?"
Because I really don’t want to figure out what version of the EJB standard the application is following. I want to get back to production ASAP!
"What are all the other RED underlines?"
Stupid Notepad() trying to spell check a directives token…. You’d think software written by a software company, run by a software engineer, would know better. It is still quicker than using eclipse()….
Anyway, save the edits and get somebody to recycle the application, and await the next run.
Earning That Cold One
The next step, and most important, is validation. You need to confirm that the configuration tuning was successful (no typos) and that it alleviates the overhead. If there is a typo, then you will have broken the instrumentation and you will not get any instrumentation metrics. Rude. Crude. But all on you!
In this case, the client was already tracking the total run time for the batch, so we needed to wait a weekend to get the results. They only run this batch on the weekend. Here is what they reported:
The astute might notice that the ‘no wily’ run took 12 hrs and 48 min, while the ‘tuned config’ run took 14 hrs and 45 min. But you need to look at the record volume: 955k ‘tuned’ vs. 878k ‘no wily’. I think that is close enough, and since we took almost 7 hrs off the previous run time (about a 32% decrease), that is not shabby at all.
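To put a number on "close enough", you can normalize each run by its record volume. A quick sketch, assuming the rounded figures quoted above (878k and 955k records) and assuming the two weekend runs are otherwise comparable; the function name is just for illustration:

```python
def records_per_hour(records, hours, minutes):
    """Throughput of a batch run in records per hour."""
    return records / (hours + minutes / 60)


no_agent = records_per_hour(878_000, 12, 48)  # 'no wily' run
tuned = records_per_hour(955_000, 14, 45)     # 'tuned config' run

# Fraction of throughput lost relative to running with no agent at all
residual = 1 - tuned / no_agent

print(f"no agent: {no_agent:,.0f} records/hour")
print(f"tuned:    {tuned:,.0f} records/hour")
print(f"residual overhead: {residual:.1%}")
```

On these rounded numbers the tuned agent runs at roughly 64.7k records per hour against roughly 68.6k with no agent, a gap of about 5.6%. The longer wall-clock time is mostly the extra records, not the agent.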
For the Professional
Whenever you drop a change like this into an agent configuration, you need to make it obvious that there is something different with this agent. The easiest way to do this is to rename the wily directory to something like ‘wily_mybatch’ which suggests that there is something custom going on in there. This change, of course, needs to be propagated through the WAS admin console (when WAS is used) so that all of the agent properties will make sense. Today we need to keep track of these changes manually so that an upgrade does not clobber our tuning. Let’s hope the future brings a more automated capability.
Even if you do not suspect an overhead problem, please run this simple, powerful test. Don’t get bit by the reality that not every app will respond appropriately to instrumentation. Don’t trust, don’t assume – always verify. It takes 5 minutes and will help you avoid tons of frustration.
Send a screenshot of your findings, along with a copy/paste of the metric names (screen resolution is tough), and anybody on the alias can help you knock high overhead back to the history books. You manage what you measure.
"What else can I do to make sure my monitoring configuration is ready for production?"