DX Application Performance Management

Why Does ABA Find Anomalies When There Is Nothing Wrong In Production?

By spyderjacks posted 09-25-2014 02:52 PM

  

So let's say it's a Friday afternoon, out in the land of IT.  Surprisingly, nothing is going wrong.  So you connect into a stream of Nicki Minaj's "Anaconda", rummage through the remains of your box lunch... and decide to go take a look at the ABA console - to see if there is anything fishy going on.

 

And then you see it....  "O M G !!!  Look-it   that    anomaly!!!"

 

But the incident queue is quiet.  Email is flowing.  There is nothing in your work queue. You prairie-dog up out of your cube and everybody seems quiet and cozy.  Why is this happening???

 

Why does the default ABA configuration FAIL?

 

The default configuration fails because it generates excessive and inaccurate anomalies, and it does so because the wrong types of metrics are being sent to ABA.

OK... so how do you know that?

Here is an example of an anomaly that is otherwise uncorrelated with an actual production performance problem:

[Screenshot: WTG_metrics_ABA.png]

 

You can see that we have a high score (93), a component count of (122) beginning with “WTG”, and a deviation of (100) units.  Looks like trouble!

If we go to the APM Workstation Investigator - currently necessary because it can show the “Min, Max and Count” - we can easily explain why this is happening.

[Screenshot: WTG_metrics_Investigator.png]

Our search scope, on the left, is the entire app instance/cluster, so we catch all of the “WTG” transactions.  We are searching on “WTG” and find, to no surprise, all of the “WTG” metrics.  These are the Front-End components that correspond to the CEM transactions that the default ABA configuration is looking for.  We are looking at a (7) day historical range, including the date of the anomaly, and basically there is nothing going on here!  There are no response times going crazy.

 

Really - nothing going on.  These “WTG” transactions have an invocation rate of 250-500 per week.  Looking at the graph, we see that the transaction is in fact running very sporadically.  What this means for ABA is that it tends to score these frequent appearances and disappearances as “anomalies”.

 

The threshold for a meaningful KPI, based on experience, is (10,000) invocations per week.  Metrics with fewer than (10,000) invocations (counts) are not consistent enough to base an operational threshold on.  Metrics with more than (10,000) invocations are good candidate KPIs.  Why does the IRS track money movements of $10,000 or greater?  The other cash movements just don't matter!  Apparently, the same rule applies to APM ;-)

 

If we apply the KPI technique and focus on “Average” for that same search scope, we get a very different insight into what is really significant for the application.

[Screenshot: ART_KPIs_Investigator.png]

Here we find that other components, with an “Average Response Time”, are appearing at up to (10M) invocations per week!  These are significant metrics and thus candidate KPIs.  If we generate an ABA configuration from these metrics, we are going to be giving ABA a much better chance at detecting significant anomalies.
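To make the selection concrete, here is a minimal sketch of that filtering step - not the validated process (that's a later post), just the idea.  It assumes you have already exported the “Average Response Time” metrics and their weekly invocation counts to a CSV; the file name and column names below are hypothetical:

```python
# Minimal sketch: keep only metrics that clear the 10,000 invocations/week bar.
# Assumes a CSV export with hypothetical columns "metric_name" and "weekly_count".
import csv

KPI_THRESHOLD = 10000  # invocations per week, per the rule of thumb above


def candidate_kpis(csv_path):
    """Return the metric names whose weekly invocation count clears the threshold."""
    kpis = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["weekly_count"]) >= KPI_THRESHOLD:
                kpis.append(row["metric_name"])
    return kpis


if __name__ == "__main__":
    for name in candidate_kpis("art_metrics_week.csv"):  # hypothetical export file
        print(name)
```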

 

So what do we do next?

 

You need to start gathering KPIs about your applications.  You need to do this anyway, just to do a better job of taking advantage of APM information.

If you want to take advantage of ABA technology, you really need to start gathering KPIs.

 

It's actually pretty easy, even mechanical, to get a comprehensive list of suspect KPIs.  Then you need to feed that list into some code that churns out regex appropriate for the Analytic.properties file that is the ABA configuration.  And then you are "cooking with gas"...
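To give a feel for that step, here is a minimal sketch - not the validated code referred to below.  The property key and value format are placeholders for illustration only; check your own Analytic.properties for the exact names ABA expects.  Only the idea matters: escape each KPI metric path and join the escapes into one regex.

```python
# Minimal sketch: turn a list of KPI metric paths into a regex fragment for an
# ABA configuration. The property key "metrics.regex.filter" is hypothetical.
import re


def kpi_regex(metric_names):
    """Build one alternation regex that matches exactly the supplied KPI metrics."""
    escaped = sorted(re.escape(name) for name in set(metric_names))
    return "(" + "|".join(escaped) + ")"


def write_properties(metric_names, path="Analytic.properties.fragment"):
    # Write a single property line; merge it into your real config by hand.
    with open(path, "w") as f:
        f.write("metrics.regex.filter=" + kpi_regex(metric_names) + "\n")


if __name__ == "__main__":
    write_properties([
        "Frontends|Apps|TradeService|Average Response Time (ms)",   # example names
        "Frontends|Apps|OrderService|Average Response Time (ms)",
    ])
```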

 

I should probably share that process in my next post, which will be as soon as I finish validating the code... or as soon as this post hits (500) views... whichever happens first? Deal?


Comments

10-06-2014 05:08 PM

Mostly we are on the same page, but all my exhortations to the customer that APM has tremendous benefits, but you have to work hard, fall on deaf ears (as you know well), so I am trying to push this string from the other side. I agree that we cannot know what is truly an anomaly from just the time-series data, but given a smarter & longer look at that data, we can start to refine what we need to look at. (Wish that ABA had a human feedback method, so we could give it an "attaboy" when it correctly indicates an anomaly.)

 

I have helped others filter the data going into ABA/Prelert along the same lines (different criteria) as you indicate. I have not been able to do this in a structured way, but really, really hope that your effort works, i.e., produces an answer (good, bad, indifferent).

 

My real issue is that Prelert had some really interesting features and algorithms (almost magic) that could be improved with some better (internal) knowledge of the data produced by APM. It does look like the current implementation takes 1/2 step forward and 2 steps back. I am not advocating for the perfect solution (real tuning/diagnostics takes work), but was hoping that ABA would do basic filtering/correlations and make life easier. I cheer your efforts, but was hoping that the "Big Data" hype might push ABA to bigger and better. Maybe later, but obviously not now! If there is anything that I can do to help your efforts, please let me know.

 

PS: Pigs will fly given a sufficiently large engine, just like an airplane but with different aerodynamics.

10-06-2014 04:46 PM

The alert definition capability for ABA DID NOT make it into the 9.6 release.

10-06-2014 04:43 PM

Mike,

When you're using ABA, it will notify you that it found an anomaly by showing an alert notification in WebView. You can then go to the workbench to see the event.

 

There is not an alert action in the Management Module that you can configure for this.

10-06-2014 04:34 PM

Hiko, I did not see that and will be looking for it at the next customer.

 

I did talk with Tony Wood who indicated that there was no alerting mechanism, so maybe a comment needs to go back inside CA to get everyone on the same page.

10-04-2014 05:31 PM

I agree that the intent was to eliminate the need to have someone opine as to what a 'quality' metric should be, and, if I may extend it, to determine a threshold to let you know that the 'quality' metric was in trouble or otherwise misbehaving.  But it wasn't "reasoned" - it was simply "plausible".

 

 

And I've heard this tune before...  it's the dream of all Infrastructure Management to find root cause with little or no additional information - other than what they are directly monitoring.  It's how IT folks lull themselves to sleep each night!   But it remains a dream.  Especially the dream where you can derive and maintain a baseline from time-series data alone!!!  That's just ****!

 

 

It is not just the repeated, industry-wide failures over the years.  And sure, there are always a couple of edge cases that keep hope alive - but they never succeed outside of the controlled environment.  It is a fundamental, mathematical certainty that says you cannot validate a system simply based on the data of that system.  You always need another source to validate - and it simply cannot be the very data you are trying to validate.  I'm talking Kurt Gödel here - foundations of mathematics.  Heavy, seminal thinking.

 

 

Now Kurt was speaking specifically to proofs in various systems of mathematics, but I feel, specifically as regards the 2nd incompleteness theorem, that it extends quite nicely to the issue at hand - that being "time-series data".  Can I ever expect to prove that a given time series has a specific meaning?  Absolutely not - without first correlating with some other source of information, which is itself independent of the time-series.  That's what correlating actually means!   So that's it - game over for auto-detecting anything of merit from time-series data alone.

 

 

Could I design, train and build an expert system, using all of the information sources available to a 'human' triage person (of some reasonable skill)?  Sure, that's achievable.  And I'm not asking for limitless sources of information.  I simply want to build correlations between APM, Configuration Management and Incident Management.  Now that's something I have confidence in - something way more than "plausible", which is what a lot of IT decisions seem to be made on, and what a lot of our products seem designed to deliver.  But it still needs to be built and validated - and organizationally we still appear stuck on "red, yellow, green" and a tools-centric approach...

 

 

I don't think that is sufficient - to simply bring forward solutions that are "plausible". That's why I write 'vendor neutral best practices' - it simply should not matter what technology you use - solving performance problems is the same process for everybody (excepting the work-arounds for bugs and missing functionality).  But with Prelert's "proprietary algorithm" and other super sophisticated technologies (magic?), I was calling "foul" on this one from day one.  I was really hoping to be proved wrong... but Kurt kept reminding me that there is no magic in software engineering that somehow removes it from the laws of mathematics.  There is nothing in the literature supporting this strategy of auto-correlating time-series - and that should really be a clue...  The pig will not fly.

 

 

So will my "autogeneraton of custom ABA configurations, based on KPIs" result in a better ABA experience?  I don't yet know.  But I have (4) pilot sites and a chance to prove it does, does not, or no difference.  I'm trying to move from "plausible" to "demonstrable".  No magic.  No secrets.  Facts or nothing.

 

 

I also know that KPIs + Baselines + Thresholds gives me a **** handy monitoring configuration, even if I am deprived of the golden promise of Analytics.  Using, of course, the very effort that IT wants to avoid.  I get that.  It falls to myself and my brother APM-SWAT to innovate and show how to automate the processes we already know are successful.  No magic, all meat and no fat (plausible) aspirations.  Let's hope it all stays in the road-map!

10-03-2014 02:04 PM

Hi Mike,
The alert is automatically generated as a console event on WebView.

10-03-2014 01:55 PM

Michael, I would re-iterate that Prelert made a reasoned attempt at predicting/showing anomalies based on time-series data... with two exceptions: the "no data" issue and the length of the time-series. Many years back David Seidman made an initial attempt at that same analysis with "baselining"; it never worked, but it was a good idea. Prelert continued with that effort, albeit with a different method, which certainly looked promising.

 

The whole point of both efforts was to eliminate the need to employ someone who could determine what was a "quality" metric and set up an alert based on it.

10-03-2014 01:43 PM

Hiko, I was at a customer over the summer who had the original Prelert and now ABA installed. We played with ABA but could not find any alert mechanism. I talked this week with Tony Wood who told me that current ABA does not have an alert feature and he could not tell me in which new version alerting would be implemented.

 

If you know of an alert mechanism, I would love to hear about it, as I am working with a customer this month who will install ABA.

10-02-2014 08:49 AM

See my earlier post on Big Data - what does it mean for APM - for some discussion on the limits of what APM data can do... all by itself.  We will be able to do a confirmation once we have a platform that establishes baselines and integrates other data sources - such as configuration and incident management.  You simply cannot look at time-series data alone and predict anything (without referencing an external baseline or data source).

 

Our current APM platform is not yet capable of such workflows and integrations, so we are stuck with manual processes, spreadsheets and python... until we stop chasing alerts and start managing software quality.

 

Regards 'useless' data... clean it up? Fat chance!! "Mo' metrics... mo' better!"

 

This is one mind-set that has (to date) resisted all efforts to adjust - and what leads weak APM practices into frustration and failure.  It is not the quantity but the quality of the metrics that matters.  Check out my earlier post "What are KPIs and how can I get some quick", and also docs like Collecting Baselines - Finding out which Metrics Matter.

10-02-2014 03:20 AM

It really needs to be an engine which correlates the data on its own without any "engineer type input"; the only input it should need is a confirmation of whether what it found is correct or not.

 

I personally would even prefer if all the data is taken into account (if you have useless data, it shouldn't be there in the first place).

10-01-2014 07:13 PM

Hey mrisser3,

We've missed ya here!

 

Anyways, doesn't the alert notification from ABA fulfill your idea about a primitive alert mechanism? Something different?

10-01-2014 05:13 PM

The whole point of Prelert (before ABA) was that you fed it APM data and Prelert would declare the anomalies and find the correlations (through statistics, without knowledge/analysis of the application... i.e., that you did not need someone trained in APM and the application to recognize a problem). But Prelert had a problem: it was adapted from another product written to handle data that was always present (stock market info, if I remember correctly), and when non-constant data was fed by APM, it would create false anomalies. I worked with a customer in New England and the Prelert creator and noted the problem then. Recognizing that Prelert/ABA has a processing limit (only 100K metrics), it becomes critical that you limit the input, and obviously the issue of the lack of constant data (which the article seems to peg at 10K/week) means that the algorithm has been improved but not corrected. This all but defeated the original purpose of Prelert/ABA.

 

The original Prelert product had a visual feedback mechanism (varying size/color/count of bubbles) that was meaningful once you understood what it was saying, and an alert mechanism (I never saw it, but was told that it was there), both lacking in the current ABA implementation. The analysis/correlation function is nice if you want to constantly watch WebView, but are we encouraging customers to do that? Even a primitive alert mechanism (alert: an anomaly happened) would be welcomed.