I stumbled over a problem with alerts and their evaluation.
A short example to explain the problem: I use a metric grouping on a metric whose value is either 0 or 1. 0 means there is no problem, 1 means there is a problem. An alert is defined on this metric grouping with the following settings: Resolution: 15 sec; Comparison Comparator: Equal To; Trigger Alert Notification: When Severity Increases; Combination: All; Notify by Individual Metric: disabled; Danger => Threshold: 1; Periods Over Threshold / Observed Periods: both 1. The danger action list contains an email notification action which reports that there is a problem. Caution (default values) => Threshold: 0; Periods Over Threshold / Observed Periods: 1.
As I understand it, the email notification should be sent when the value of the metric changes from 0 to 1. Unfortunately, the email is sometimes sent again even though the problem still exists (the metric still reports 1). I had a closer look at the metric and found that in some cases no value is reported for the metric in one interval. You can see this behaviour in the image below at 01:00:30. There is no data point.
I had the idea of increasing Periods Over Threshold and Observed Periods both to 2, because I thought this would filter out the wrong evaluation when data is missing for one period. Unfortunately, it doesn't.
I looked deeper into the documentation and found that Introscope provides metrics which show the alert status. The following picture shows this metric for the alert above. The alert was in danger status (value 3), as expected. During the one period of missing data it has the value 0 (not reporting), so it shows the value described in the documentation.
Taking these facts into consideration, I think that Introscope includes "not reporting" periods in alert evaluation. This results in an email being sent at 01:01:00, because the severity increases again after it decreased due to the missing data in one period.
What do you guys think? Does Introscope include the not reporting periods in alert evaluation? Am I using the alerts incorrectly?
Basically, what I want is that Introscope does not fire the action again after periods of missing data when the alert is effectively still in danger status.
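The suspected behaviour can be reproduced with a small simulation. This is only a sketch of the hypothesis above, not Introscope's actual evaluation code: the function name `notifications` and the sample series are made up for illustration, and it assumes that a period without data is treated as "no danger" and that a notification fires whenever the state goes from not-danger to danger (1 of 1 periods).

```python
from typing import List, Optional, Sequence


def notifications(samples: Sequence[Optional[int]],
                  danger_threshold: int = 1) -> List[int]:
    """Return the indices of the periods in which a notification fires.

    Hypothetical model: one sample per 15 s period, None = no data reported.
    """
    fired = []
    previous_danger = False
    for i, value in enumerate(samples):
        # Assumption: a non-reporting period counts as "normal" (not danger).
        in_danger = value is not None and value == danger_threshold
        # "When Severity Increases": fire only on the transition into danger.
        if in_danger and not previous_danger:
            fired.append(i)
        previous_danger = in_danger
    return fired


# The problem starts in period 1; period 3 has no data point.
samples = [0, 1, 1, None, 1, 1]
print(notifications(samples))  # [1, 4]
```

Under these assumptions the second notification (period 4) is caused only by the gap in period 3, which matches the repeated email described above.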
Yes, Introscope does include non-reporting periods in alert evaluation, treating them as the normal state, and hence in your case the alert is retriggered.
To achieve what you want, you will need to use the approach outlined above, i.e. instead of using 1 for Periods Over Threshold / Observed Periods, use a different number, e.g. 3 out of 4 observed periods. However, there was a bug in this logic on the EM side in older versions (any release < 9.5.3), which is likely why it didn't work for you during testing. If you are using an older version of the EM, upgrade and modify the alert settings as suggested to get the desired behavior.
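Why a setting like 3 out of 4 tolerates a single gap can be sketched with a sliding window. Again, this is a simplified model and not the EM's real logic: the function name `notifications`, its parameters `over` and `observed`, and the sample data are assumptions for illustration, and a missing period is counted as "not over threshold".

```python
from collections import deque
from typing import List, Optional, Sequence


def notifications(samples: Sequence[Optional[int]], over: int, observed: int,
                  threshold: int = 1) -> List[int]:
    """Indices of the periods in which a 'severity increases' notification fires.

    Hypothetical model: danger holds while at least `over` of the last
    `observed` periods matched the threshold; None (no data) never matches.
    """
    window = deque(maxlen=observed)  # keeps only the last `observed` periods
    fired = []
    previous_danger = False
    for i, value in enumerate(samples):
        window.append(value is not None and value == threshold)
        in_danger = sum(window) >= over
        if in_danger and not previous_danger:
            fired.append(i)
        previous_danger = in_danger
    return fired


# The problem starts in period 1; period 5 has no data point.
samples = [0, 1, 1, 1, 1, None, 1, 1]
print(notifications(samples, over=1, observed=1))  # [1, 6]
print(notifications(samples, over=2, observed=2))  # [2, 7]
print(notifications(samples, over=3, observed=4))  # [3]
```

In this model, any setting where both numbers are equal (1 of 1, 2 of 2) still drops out of danger on a single missing period, while 3 of 4 leaves room for one gap. The price is a later first notification, here period 3 instead of period 1.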
Thanks for your fast and helpful response. I will try the approach you described. We use version 9.5.3, so it should work out.
However, to me it looks like a workaround for a wrong alert evaluation (clearly my opinion). So, is there any good reason why "non-reporting" periods are used in the evaluation? From my point of view this is wrong behaviour. For such a period, Introscope cannot tell whether the threshold is exceeded or not; the question simply cannot be answered. If the alert is in caution or danger status, the missing period is nevertheless interpreted as "there is no problem", which isn't true. As a result, the severity decreases and then increases again, and the action is fired a second time. So why not skip these periods during evaluation?
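The behaviour asked for here, skipping non-reporting periods instead of counting them as normal, could look like the following sketch. It illustrates the proposal only and is not something Introscope is known to offer: the function name `notifications_skipping_gaps` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence


def notifications_skipping_gaps(samples: Sequence[Optional[int]],
                                danger_threshold: int = 1) -> List[int]:
    """Indices of the periods in which a notification fires when gaps are skipped.

    Hypothetical model: a period without data (None) leaves the alert
    state unchanged instead of resetting it to normal.
    """
    fired = []
    previous_danger = False
    for i, value in enumerate(samples):
        if value is None:
            continue  # no data: keep the previous state, evaluate nothing
        in_danger = value == danger_threshold
        if in_danger and not previous_danger:
            fired.append(i)
        previous_danger = in_danger
    return fired


# The problem starts in period 1; period 3 has no data point.
print(notifications_skipping_gaps([0, 1, 1, None, 1, 1]))  # [1]
# A real recovery (value 0) followed by a new problem still fires again.
print(notifications_skipping_gaps([0, 1, None, 0, 1]))     # [1, 4]
```

One trade-off of this design is that an alert would stay in danger for as long as the metric stops reporting altogether, so a separate check for missing data would probably still be needed.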