So let's say it's a Friday afternoon, out in the land of IT. Surprisingly, nothing is going wrong. So you connect to a stream of Nicki Minaj's "Anaconda", rummage through the remains of your box lunch... and decide to go take a look at the ABA console, to see if there is anything fishy going on.
And then you see it.... "O M G !!! Look-it that anomaly!!!"
But the incident queue is quiet. Email is flowing. There is nothing in your work queue. You prairie-dog up out of your cube and everybody seems quiet and cozy. Why is this happening???
Why does the default ABA configuration FAIL?
The default configuration fails because it generates excessive and inaccurate anomalies, and it does that because the wrong types of metrics are being sent to ABA.
OK... so how do you know that?
Here is an example of an anomaly that is otherwise uncorrelated with an actual production performance problem:
You can see that we have a high score (93), component count of (122) beginning with “WTG” and deviation of (100) units. Looks like trouble!
If we go to the APM Workstation Investigator (currently necessary because it can show the “Min, Max and Count”), we can easily explain why this is happening.
Our search scope, on the left, is the entire app instance/cluster, so we catch all of the “WTG” transactions. We are searching on “WTG” and find, to no surprise, all of the “WTG” metrics. These are the Front-End components that correspond to the CEM transactions that the default ABA Configuration is looking for. We are looking at a (7) day historical range, including the date of the anomaly and basically, there is nothing going on here! There are no response times going crazy.
Really - nothing going on. These “WTG” transactions have an invocation rate of only 250-500 per week. Looking at the graph, we see that the transaction is in fact running very sporadically. What this means for ABA is that it tends to score these frequent appearances and disappearances as “anomalies”.
The threshold for a meaningful KPI, based on experience, is (10,000) invocations per week. Metrics with fewer than (10,000) invocations (counts) are not consistent enough to base an operational threshold on. Metrics with more than (10,000) invocations are good candidate KPIs. Why does the IRS track money movement of $10,000 or greater? The other cash movements just don't matter! Apparently, the same rule applies to APM ;-)
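To make that cutoff concrete, here is a minimal sketch of the filter in Python. It assumes you have already pulled the weekly "Count" for each metric out of the Investigator into a simple name-to-count mapping; the metric names and counts below are made up for illustration, and treating exactly (10,000) as a pass is my assumption, borrowed from the "$10,000 or greater" analogy.

```python
# Threshold taken from the text: 10,000 invocations per week.
KPI_MIN_WEEKLY_INVOCATIONS = 10_000


def candidate_kpis(weekly_counts, threshold=KPI_MIN_WEEKLY_INVOCATIONS):
    """Return the metric names that meet the weekly threshold, busiest first.

    weekly_counts maps a metric name to its invocation count for one week.
    """
    busy = [(name, count) for name, count in weekly_counts.items()
            if count >= threshold]  # assumption: exactly 10,000 qualifies
    busy.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in busy]


# Hypothetical sample data, not taken from a real APM environment.
sample_counts = {
    "WTG_Login": 350,
    "WTG_Search": 480,
    "OrderService": 10_000_000,
    "InventoryLookup": 42_000,
}

print(candidate_kpis(sample_counts))
# prints ['OrderService', 'InventoryLookup']
```

The sporadic "WTG" entries drop out and only the high-volume components remain, which is the whole point of the exercise.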
If we apply the KPI technique, and focus on “Average”, for that same search scope, we get a very different insight into what is really significant for the application.
Here we find that other components, with an “Average Response Time”, are appearing at up to (10M) invocations per week! These are significant metrics and thus candidate KPIs. If we generate an ABA configuration from these metrics, we are going to be giving ABA a much better chance at detecting significant anomalies.
So what do we do next?
You need to start gathering KPIs about your applications. You need to do this anyway, just to do a better job of taking advantage of APM information.
If you want to take advantage of ABA technology, you really need to start gathering KPIs.
It's actually pretty easy, even mechanical, to get a comprehensive list of suspect KPIs. Then you need to feed that list into some code that churns out regex appropriate for the Analytic.properties file that is the ABA configuration. And then you are "cooking with gas"...
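Until the real thing is validated, here is a rough Python sketch of what "churning out regex" from a KPI list could look like. This is not the actual code, just one plausible approach: escape each metric name and join them into a single alternation. The function name and the sample metric names are hypothetical, and I am deliberately not showing the property key, since the exact format ABA expects in the properties file is not covered here.

```python
import re


def build_metric_regex(metric_names):
    """Build one alternation pattern that matches exactly the given names.

    Each name is escaped so characters like '|', '(' and ')' are treated
    literally rather than as regex syntax.
    """
    unique = sorted(set(metric_names))  # de-duplicate, keep output stable
    if not unique:
        raise ValueError("need at least one metric name")
    return "(" + "|".join(re.escape(name) for name in unique) + ")"


# Hypothetical KPI names for illustration only.
kpi_names = [
    "OrderService:Average Response Time (ms)",
    "InventoryLookup:Average Response Time (ms)",
]

pattern = build_metric_regex(kpi_names)
print(pattern)
```

One thing to watch: the escaping here follows Python's regex rules. If the pattern ends up in a Java-style properties file, backslashes generally need doubling, so check the result against your own environment before trusting it.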
I should probably share that process in my next post, which will be as soon as I finish validating the code... or as soon as this post hits (500) views... whichever happens first? Deal?