Background

Application monitoring demands automatic alerting; manually constructing alert thresholds and configurations is far too time-consuming to be practical, and rarely serves the main purpose, which is to notify operators in the event of a problem. Legacy approaches, used by CA and by its competitors, apply predictive mathematics and other analytics to notice when the actual measurements differ from the predicted ones. These approaches require prediction training, which can be as draining and difficult a proposition as manually configuring alerts. In addition, when the model doesn't match the actual measurements, it often isn't clear whether the application is broken or the model is broken. The more complex the mathematics become, the more the approach appears as a "magic show" to customers, who naturally feel they may be getting tricked.

Statistical control charts are the invention of Walter Shewhart, a quality control analyst who developed them while working for the Western Electric Company in the early 20th century. Control charts assume that the signal they monitor is stable within a particular range. By calculating the standard deviation, Shewhart showed that simple comparisons against bands of standard deviation could effectively identify points at which the signal exhibits uncontrolled variance, much the way an earthquake registers on a seismometer. This kind of control charting has come to be referred to as the Western Electric Rules.

CA researchers and engineers discovered that customer application latencies closely mirror the pattern of the dial tone that Shewhart was monitoring. In particular, when applications are well behaved, their latencies show stable measurements within a range. When an application is suffering from a problem, uncontrolled variance is a hallmark of the developing problem, and the Western Electric Rules are excellent at identifying it quickly.

While competitors are still trying to "predict the weather", CA has been using Shewhart's rules to quickly and simply identify drastic changes in climate. Customers can easily understand the math behind the approach, see it act in response to their metrics, and feel confident that the stream of alerts they receive accurately reflects problems in their myriad systems.

CA has customized the Western Electric Rules output to produce a gradient range of stability values, from zero (perfectly stable) to thirty (perfectly unstable). These values are painted in a special display, not discussed in this article.

Summary

This invention can monitor any application metric, but works best with metrics that have a consistent signal within a particular range. For simplicity, we will talk about latency metrics, which tend to have these properties.

Latency metrics produce measurements at a fixed interval. A double exponential smoothing algorithm with seasonality makes a prediction of the current latency based on past latency measurements (a learning period must elapse before this prediction becomes available).
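To make the prediction step concrete, here is a minimal sketch of additive Holt-Winters smoothing (level, trend, and seasonal components), one common form of double exponential smoothing with seasonality. The smoothing constants and season length below are illustrative assumptions, not CA's actual parameters.

```python
# Sketch of additive Holt-Winters smoothing: one-step-ahead latency
# predictions from level + trend + seasonal components.
# alpha, beta, gamma, and season_len are assumed values for illustration.

def holt_winters_predict(series, season_len=12, alpha=0.5, beta=0.1, gamma=0.1):
    """Return one-step-ahead predictions for each point after the learning period."""
    if len(series) < 2 * season_len:
        raise ValueError("learning period has not yet elapsed")

    # Initialize level, trend, and seasonality from the first two seasons.
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len])
             - sum(series[:season_len])) / season_len ** 2
    seasonal = [x - level for x in series[:season_len]]

    predictions = []
    for i in range(2 * season_len, len(series)):
        s = seasonal[i % season_len]
        predictions.append(level + trend + s)   # forecast before seeing series[i]
        last_level = level
        level = alpha * (series[i] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[i % season_len] = gamma * (series[i] - level) + (1 - gamma) * s
    return predictions
```

A perfectly flat latency signal, for instance, yields predictions equal to that constant once the learning period has passed.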

Once a prediction is available, a standard deviation is calculated, three bands of standard deviation are derived from it, and the actual measurement is compared against those bands. There are four rules that can be breached:

- 1. Any single data point falls outside the 3σ limit from the centerline.

- 2. Two out of three consecutive points fall beyond the 2σ limit on the same side of the centerline.

- 3. Four out of five consecutive points fall beyond the 1σ limit, on the same side of the centerline.

- 4. Six consecutive points steadily increasing or decreasing.

Weights are assigned to these rules (unitless integers), with the heaviest weight assigned to the first rule and progressively lighter weights to the following ones.
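The four rules and their weights can be sketched as follows. The deviations are measurement-minus-prediction values, and the weight values themselves are assumptions for illustration; the source does not state the actual integers used.

```python
# Sketch of the four Western Electric rule checks against the latest points.
# `deviations` holds (measurement - prediction) values, newest last;
# `sigma` is the signal's standard deviation.
WEIGHTS = {1: 8, 2: 4, 3: 2, 4: 1}   # assumed unitless weights, heaviest first

def breach_weight(deviations, sigma):
    """Sum the weights of all rules breached at the latest measurement."""
    d = deviations
    total = 0
    # Rule 1: the latest point falls outside the 3-sigma limit.
    if abs(d[-1]) > 3 * sigma:
        total += WEIGHTS[1]
    # Rule 2: two of the last three points beyond 2 sigma, same side.
    if len(d) >= 3:
        for side in (1, -1):
            if sum(1 for x in d[-3:] if side * x > 2 * sigma) >= 2:
                total += WEIGHTS[2]
                break
    # Rule 3: four of the last five points beyond 1 sigma, same side.
    if len(d) >= 5:
        for side in (1, -1):
            if sum(1 for x in d[-5:] if side * x > 1 * sigma) >= 4:
                total += WEIGHTS[3]
                break
    # Rule 4: six consecutive points strictly increasing or decreasing.
    if len(d) >= 6:
        last6 = d[-6:]
        pairs = list(zip(last6, last6[1:]))
        if all(b > a for a, b in pairs) or all(b < a for a, b in pairs):
            total += WEIGHTS[4]
    return total
```

For example, a single spike past 3σ triggers only rule 1, while a slow monotonic drift triggers only rule 4.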

An array of cells is constructed with a configurable size. Each cell represents a specific interval of measurement. The sum of the most recently measured rule breaches is assigned to the first cell. As new measurements are taken, the cells are aged. For example, if the window has twenty cells, then breach sums will be forgotten on the twentieth measurement following their introduction. [DAVE: off by one?]
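The aging window maps naturally onto a fixed-size double-ended queue: each cell holds one interval's breach sum, and appending a new cell evicts the oldest. This is a sketch under that assumption, using the twenty-cell window from the example above.

```python
# Sketch of the aging window of cells: a fixed-size deque where each cell
# holds the breach-weight sum for one measurement interval. Appending a
# new cell ages out the oldest one automatically.
from collections import deque

class BreachWindow:
    def __init__(self, size=20):
        self.cells = deque(maxlen=size)   # oldest cell evicted on overflow

    def record(self, breach_sum):
        """Store the breach sum for the newest interval, aging the rest."""
        self.cells.append(breach_sum)

    def total(self):
        """Sum of all breach sums currently in the window."""
        return sum(self.cells)
```

With a window of three cells, a breach sum recorded now still contributes to the total after two further measurements and is forgotten on the third.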

The sum of all breaches in the window directly impacts the variance intensity for that signal. Variance intensity is an arbitrarily chosen range of values where the lowest value indicates perfect stability and the highest indicates perfect instability.

Caution and danger thresholds define the transition points from stable to moderately unstable, and from moderately unstable to severely unstable. Thus, the variance intensity is calculated as a function of the total window breach sum, the caution threshold, and the danger threshold.
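One plausible way to realize this function is a piecewise-linear mapping onto the 0-30 intensity scale described earlier. The band boundaries at 10 and 20 are assumptions for illustration; the source only says the intensity is a function of the breach sum and the two thresholds.

```python
# Sketch of mapping a window breach sum onto the 0..30 variance intensity
# scale. The piecewise-linear form and band edges (10, 20) are assumptions.

def variance_intensity(breach_sum, caution, danger, max_intensity=30):
    """Map the total window breach sum to [0, max_intensity]."""
    if breach_sum <= caution:
        # Stable band: 0..10
        return 10 * breach_sum / caution
    if breach_sum <= danger:
        # Moderately unstable band: 10..20
        return 10 + 10 * (breach_sum - caution) / (danger - caution)
    # Severely unstable band: 20..30, capped at the maximum
    return min(max_intensity, 20 + 10 * (breach_sum - danger) / danger)
```

A breach sum at the caution threshold lands exactly at 10, at the danger threshold exactly at 20, and anything far beyond the danger threshold saturates at 30.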

In addition, CA introduced the notion of decay. Decay is a configurable sliding discount on the value of the window cells: the most recent cells have full value, and the least recent suffer the full discount. Intuitively, high levels of decay can "recover" a window more quickly, truncating any trailing variance intensity after a brief but severe incident. Without decay, the variance intensities will often drop suddenly when a few cells of high breach sums age out.
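A linear discount schedule is one simple way to implement the sliding discount described above; the linear shape is an assumption here, since the source only specifies full value at the newest cell and full discount at the oldest.

```python
# Sketch of decay as a sliding discount across the window cells.
# `cells` is ordered oldest first, newest last; `decay` is in [0, 1]:
# 0 means no discount anywhere, 1 means the oldest cell contributes nothing.
# The linear interpolation between those endpoints is an assumption.

def decayed_total(cells, decay):
    """Sum the window cells, discounting older cells linearly by age."""
    n = len(cells)
    if n <= 1:
        return sum(cells)
    total = 0.0
    for age, value in enumerate(reversed(cells)):   # age 0 = newest cell
        discount = decay * age / (n - 1)            # full discount at oldest
        total += value * (1.0 - discount)
    return total
```

With full decay, a three-cell window of equal sums counts the newest cell fully, the middle cell at half value, and the oldest not at all, so the intensity tapers off instead of dropping abruptly.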

Advantages over Existing Approaches

Once variance intensities are available, CA can generate alerts, realize the intensities in user interface visualizations, detect the beginning of complex problems, and in general greatly simplify the task of “automatically alerting” or performing stability assessments for a particular business service.

Multi-variate and uni-variate baselining approaches that rely on predictive models have been largely unsuccessful, requiring complex configuration to work in real environments. CA's new approach greatly simplifies the effort required of the customer, while still producing high-quality information about the stability of the underlying systems.

In addition, variance intensities can be collapsed into scalars that describe stability of a signal over time. These collapsed values make triage of highly complex systems much easier to accomplish.


I would prefer to do a recorded performance.

Live webcasts can be exciting, but the pace is inevitably very slow compared to a carefully orchestrated recording. I suspect that many more people will listen to and benefit from a recording than from a live webcast. It sounds like there is interest, so I will make a video about Differential Analysis. If there is *still* interest in a webcast, let's discuss then.

Ok, great. Thanks Chris_Kline, I really think that would help. I think 10.x customers need to familiarize themselves with DA, and what better way to do it than the APM Community lending a helping hand.

If there is anything I can do to help then please let me know.

Cheers

This is a great post, but I'd have to agree with nkarthik on presenting this during an APM webcast or scheduling one. If this has already been done, then please provide the recording for customers so they can use it for review and/or reference.

PoetryFan Chris_Kline Guenter_Grossberger

Thanks

Manish

Hi Aaron,

- the learning period is in the range of 8-10 minutes if I remember correctly. Just look at the DA typeview!
- the array of cells is also called window. You can configure the Window Length and other parameters in the "Differential Analysis" Management Module in the Management tab of WebView:
- Danger and Caution threshold: if the slider is further to the right/+ fewer alerts will be generated.
- Window Length: how long should a deviation influence the variation intensity? Bigger window = slower influence of current vs older values.
- Decay: how much influence shall older values in the window have (move slider to right/+ yields less influence).
- Which rules shall be applied?

- As you can see here, standard deviation is applied in the first 3 rules. The fourth rule fires if the value is always increasing over 10 intervals. Or 6? PoetryFan please clarify!

In our tests the default values provided the best results. In that regard DA is EPIC as you should never have to change from the default values.

Of course, you can do your own tests, e.g. by running test cases where the ART first is steady for the learning period (~10 min) and then

- suddenly gets worse
- slowly gets worse
- has some "outliers"

Then you can play around with one parameter at the time and compare the results - and share them with us!

Ciao,

Guenter

Really a good explanation of DA and clearly something to build on, thanks a lot. A few points:

- learning period - how long is that period, and what kind of learning is it? E.g. single-metric baselining, or does it only refer to the array of cells (this still seems like magic)?
- σ - stands for standard deviation, for anyone who does not know it by heart.
- The whole configurable part is still a bit blurry. As far as I understand, you can configure the array of cells, decay, variance intensities, and a few other things. A list of parameters, which of them I can and can't configure, and how they influence each other would be nice.
- Is this correct: the standard deviation is used by the 4 different rules, and each rule has a weight assigned, which gives a total score that is put in relation to the variance intensities and accordingly creates an alert?
- Numbering in the blog is screwed up

But in the end DA should be E.P.I.C, and if we have to write a blog post to understand it, it still hasn't reached the Easy part of E.P.I.C. So what are possible ways to improve this even further?

Great job Dave! AaronRitter this post is for you!

I've posted a video on the inner workings of Differential Analysis.

Understanding Differential Analysis in APM 10