Application monitoring demands automatic alerting. Manually constructing alert thresholds and configuration is far too time-consuming to be practical, and it rarely serves the main purpose, which is to notify operators in the event of a problem. Legacy approaches, used by CA and by its competitors, use predictive math and other analytics to notice when actual measurements differ from predicted ones. These approaches require prediction training, which can be as draining and difficult a proposition as manually configuring alerts. In addition, when the model doesn’t match the actual measurements, it often isn’t clear whether the application is broken or the model is broken. The more complex the mathematics gets, the more the approach appears as a “magic show” to customers, who naturally feel like they may be getting tricked.
Statistical control charts are the invention of Walter Shewhart, a quality control analyst who developed them while working for the Western Electric Company in the early 20th century. Control charts assume that the signal they monitor is stable within a particular range. By calculating the standard deviation, Shewhart showed that simple comparisons against bands of standard deviation could effectively identify points at which the signal exhibits uncontrolled variance, something like how an earthquake registers on a seismometer. This kind of control charting has come to be referred to as the Western Electric Rules.
CA researchers and engineers discovered that customer application latencies closely mirror the pattern of the dial tone that Shewhart was monitoring. In particular, when applications are well behaved, their latencies stay within a stable range. When an application is suffering from a problem, uncontrolled variance is a hallmark of the developing problem, and the Western Electric Rules are excellent at identifying it quickly.
While competitors are still trying to “predict the weather”, CA has been using Shewhart’s rules to quickly and simply identify drastic changes in climate. Customers can easily understand the math behind the approach, see it act in response to their metrics, and feel confident that the stream of alerts they receive accurately reflects problems in their myriad systems.
CA has customized the Western Electric Rules output to produce a gradient range of stability values, from zero (perfectly stable) to thirty (perfectly unstable). These values are painted in a special display, not discussed in this article.
This invention can monitor any application metric, but works best with metrics that have a consistent signal within a particular range. For simplicity, we will talk about latency metrics, which tend to have these properties.
Latency metrics produce measurements at a fixed interval. A double exponential smoothing algorithm with seasonality included makes a prediction of the current latency based on latency measurements in the past (a learning period must elapse in order for this prediction to be available).
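The prediction step can be sketched as follows. This is a minimal illustration, not CA's implementation: it assumes an additive level, trend, and seasonal form of exponential smoothing, treats the first season as the learning period, and uses the hypothetical names `seasonal_predictions`, `season_len`, `alpha`, `beta`, and `gamma` with arbitrary default values.

```python
def seasonal_predictions(series, season_len, alpha=0.3, beta=0.1, gamma=0.1):
    """Return one-step-ahead predictions for each point in `series`.

    Entries are None during the learning period (assumed here to be the
    first full season). Smoothing factors are illustrative assumptions.
    """
    predictions = [None] * min(len(series), season_len)
    if len(series) <= season_len:
        return predictions  # still learning, nothing to predict yet

    # Initialise from the first season: flat trend, additive seasonal offsets.
    first_season = series[:season_len]
    level = sum(first_season) / season_len
    trend = 0.0
    seasonal = [x - level for x in first_season]

    for t in range(season_len, len(series)):
        s = seasonal[t % season_len]
        # Predict the current point before looking at it.
        predictions.append(level + trend + s)

        # Then fold the actual measurement into level, trend, and season.
        x = series[t]
        new_level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (x - new_level) + (1 - gamma) * s
        level = new_level

    return predictions
```

For a perfectly repeating signal with no trend, the predictions after the learning period track the measurements exactly, which is the "stable signal within a range" case the approach relies on.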
Once a prediction is available, a standard deviation is calculated, three bands of standard deviation are derived from it, and the actual measurement is compared against them. There are four rules that can be breached:
- 1. Any single data point falls outside the 3σ limit from the centerline.
- 2. Two out of three consecutive points fall beyond the 2σ limit on the same side of the centerline.
- 3. Four out of five consecutive points fall beyond the 1σ limit, on the same side of the centerline.
- 4. Six consecutive points are all increasing or all decreasing.
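The four rules above can be sketched as a single check over the most recent points. This is an illustrative sketch: the function name `breached_rules` is hypothetical, and the centerline and standard deviation are assumed to be supplied by the caller (for example, from the prediction step).

```python
def breached_rules(points, center, sigma):
    """Return the set of rule numbers (1-4) breached by the latest points.

    `points` is ordered oldest to newest; `sigma` must be positive.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    # Signed distance of each point from the centerline, in sigmas.
    z = [(p - center) / sigma for p in points]

    def same_side_count(window, limit):
        """Largest count of points beyond `limit` on one side of the centerline."""
        above = sum(1 for v in window if v > limit)
        below = sum(1 for v in window if v < -limit)
        return max(above, below)

    breached = set()
    # Rule 1: the newest point is beyond 3 sigma.
    if z and abs(z[-1]) > 3:
        breached.add(1)
    # Rule 2: two of the last three points beyond 2 sigma, same side.
    if len(z) >= 3 and same_side_count(z[-3:], 2) >= 2:
        breached.add(2)
    # Rule 3: four of the last five points beyond 1 sigma, same side.
    if len(z) >= 5 and same_side_count(z[-5:], 1) >= 4:
        breached.add(3)
    # Rule 4: the last six points strictly increasing or strictly decreasing.
    if len(points) >= 6:
        last6 = points[-6:]
        diffs = [b - a for a, b in zip(last6, last6[1:])]
        if all(d > 0 for d in diffs) or all(d < 0 for d in diffs):
            breached.add(4)
    return breached
```

A single measurement can breach several rules at once, which is why the result is a set rather than a single rule number.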
Weights (unitless integers) are assigned to these rules, with the heaviest weight assigned to the first rule and gradually lighter weights to those that follow.
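As a small illustration of the weighting, the breach sum for one measurement is the total weight of the rules it breached. The specific weights below are hypothetical placeholders chosen only to respect the ordering described (rule 1 heaviest); the actual values are not given in this article.

```python
# Hypothetical weights: heaviest for rule 1, lighter for each later rule.
RULE_WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1}


def breach_sum(breached, weights=RULE_WEIGHTS):
    """Sum the weights of the rules breached by a single measurement."""
    return sum(weights[rule] for rule in breached)
```

For example, a measurement that breaches rules 1 and 3 contributes `4 + 2 = 6` under these assumed weights, while a measurement that breaches nothing contributes zero.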
An array of cells is constructed with a configurable size. Each cell represents a specific interval of measurement. The sum of the most recently measured rule breaches is assigned to the first cell. As new measurements are taken, the cells are aged. For example, if the window has twenty cells, then a breach sum will be forgotten on the twentieth measurement following its introduction.
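The aging behaviour can be sketched with a bounded double-ended queue. The class name `BreachWindow` and its methods are hypothetical; the sketch only assumes that the newest breach sum enters the first cell and the oldest cell falls off the end once the window is full.

```python
from collections import deque


class BreachWindow:
    """Fixed-size window of per-measurement breach sums, newest first."""

    def __init__(self, size=20):
        # A deque with maxlen discards from the opposite end when full.
        self.cells = deque(maxlen=size)

    def record(self, breach_sum):
        """Place the newest breach sum in the first cell, aging the rest."""
        self.cells.appendleft(breach_sum)

    def total(self):
        """Sum of all breach sums currently held in the window."""
        return sum(self.cells)
```

With twenty cells, a breach sum recorded now is still counted after nineteen further measurements and is dropped by the twentieth, which matches the example in the text.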
The sum of all breaches in the window directly impacts the variance intensity for that signal. Variance intensity is an arbitrarily chosen range of values where the lowest value indicates perfect stability and the highest indicates perfect instability.
Caution and danger thresholds define the transition points from stable to moderately unstable, and finally to severely unstable. Thus, the variance intensity is calculated as a function of the total window breach sum, the caution threshold, and the danger threshold.
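The article does not give the exact function, so the following is only one plausible shape, stated as an assumption: a piecewise-linear mapping onto the zero-to-thirty scale in which the caution threshold lands at one third of the scale, the danger threshold at two thirds, and anything further is clamped at the maximum. The name `variance_intensity` and the parameter `max_intensity` are hypothetical.

```python
def variance_intensity(total, caution, danger, max_intensity=30.0):
    """Map a window breach sum onto a 0..max_intensity stability scale.

    Assumed shape (not from the source): linear from 0 to the caution
    threshold, linear from caution to danger, then clamped at the maximum.
    """
    if not 0 < caution < danger:
        raise ValueError("thresholds must satisfy 0 < caution < danger")

    third = max_intensity / 3
    if total <= caution:
        # Stable region: 0 .. one third of the scale.
        value = third * total / caution
    else:
        # Moderately unstable and beyond, continuing with the same slope.
        value = third + third * (total - caution) / (danger - caution)
    return max(0.0, min(max_intensity, value))
```

Under these assumptions, with a caution threshold of 10 and a danger threshold of 20, a breach sum of 10 maps to an intensity of 10, a breach sum of 20 maps to 20, and very large breach sums saturate at 30.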
In addition, CA introduced the notion of decay. Decay is a configurable sliding discount on the value of the window cells: the most recent cells keep their full value, and the least recent suffer the full discount. Intuitively, high levels of decay can “recover” a window more quickly, truncating any trailing variance intensity after a brief but severe incident. Without decay, the variance intensities will often drop suddenly when a few cells of high breach sums age out.
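One way to picture decay is as a linear discount across the window. This is a sketch under that assumption; the source does not specify the shape of the discount, and the function name `decayed_total` and the convention that `decay` ranges from 0.0 (no discount) to 1.0 (oldest cell fully discounted) are hypothetical.

```python
def decayed_total(cells, decay):
    """Sum window cells with a linear discount from newest to oldest.

    `cells[0]` is the newest cell and keeps full value; the last cell is
    reduced by the full `decay` fraction (0.0 = no decay, 1.0 = ignored).
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be between 0.0 and 1.0")
    n = len(cells)
    if n == 0:
        return 0.0
    if n == 1:
        return float(cells[0])
    return sum(
        value * (1.0 - decay * index / (n - 1))
        for index, value in enumerate(cells)
    )
```

With full decay, a large breach sum contributes less and less as it ages toward the end of the window, so the total tapers off instead of dropping abruptly when that cell finally ages out.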
Advantages over Existing Approaches
Once variance intensities are available, CA can generate alerts, realize the intensities in user interface visualizations, detect the beginning of complex problems, and in general greatly simplify the task of “automatically alerting” or performing stability assessments for a particular business service.
Multi-variate and uni-variate baselining approaches that rely on predictive models have been unsuccessful, and they require complex configurations to succeed in real environments. CA’s new approach greatly simplifies the effort required on the part of the customer, while still producing top quality information about the stability of the underlying systems.
In addition, variance intensities can be collapsed into scalars that describe stability of a signal over time. These collapsed values make triage of highly complex systems much easier to accomplish.
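The article does not say how the collapse is performed. As a hedged illustration only, a series of variance intensities could be reduced to a few summary scalars such as the mean, the peak, and the fraction of intervals at or above a severity level. The function name `collapse` and the default `severe_level` of 20 are assumptions, not part of the described invention.

```python
from statistics import mean


def collapse(intensities, severe_level=20.0):
    """Reduce a series of variance intensities to summary scalars.

    The choice of summaries and the `severe_level` default are illustrative.
    """
    if not intensities:
        raise ValueError("need at least one intensity value")
    severe = sum(1 for value in intensities if value >= severe_level)
    return {
        "mean": mean(intensities),
        "peak": max(intensities),
        "fraction_severe": severe / len(intensities),
    }
```

Scalars like these could then be sorted across many signals so that the least stable ones are examined first during triage.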