I have the requirement to model the alerting on an end-to-end measurement (e2e) in the following way. Unfortunately this requirement comes from upper management, so it would be nice if I could realize it:
Fortunately we only have 2 metrics that we need to report like that (growing to about 10 at most), but I don't think I can model something like that with Introscope Alerts (or can I?).
Is there any way to extend Introscope to solve this? I am thinking of a solution where I have a metric that keeps the current status of the problem: I read the source metrics myself and change the status metric's value whenever there are two consecutive above-threshold events or two consecutive below-threshold events. I could even have an additional signaling metric that helps me register that the problem has been solved (raise the metric to 1, then reset it to 0), so I can put an alert onto that... but I am rambling.
So do you see any way of solving this?
Thank you so much,
I might be rambling as well, and it might get hairy, but hear me out:
Now, as long as "StefansMetric" is below the warning threshold, you're fine, and you would set the "AlertGroupX" metrics to 0. As soon as "StefansMetric" exceeds a given threshold, you add a warning entry to your array. If in the next interval "StefansMetric" is down to normal again, you remove the warning entry from the array. The "AlertGroupX" metrics are still 0 in these two intervals.
If "StefansMetric" stays above the critical threshold, you add new entries to the array, and you always count how many entries you have. If "StefansMetric" stays in a critical state for 4 intervals, you set "AlertGroup2" to 1 and send an email to this group using a normal alert. And so it goes on and on.
As soon as you're back to normal for more than two intervals, you check which of your "AlertGroupX" metrics have the value 1 and set all of them to 2. The alert would claim that this exceeds the "danger" level, but in fact you send an email saying everyone can sleep well again. In the next interval you set all of the "AlertGroupX" metrics back to 0 again...
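To make the bookkeeping a bit more concrete, here is a rough Python sketch of the state handling described above. It is only an illustration, not Introscope code: the class name `EscalationTracker`, the counter that stands in for the entry array, and the escalation step for "AlertGroup1" are my own assumptions (only the 4 intervals for "AlertGroup2" come from the description), and how you read "StefansMetric" and publish the "AlertGroupX" values is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationTracker:
    threshold: float
    # Consecutive breaching intervals before a group is notified.
    # "AlertGroup2": 4 is from the description; "AlertGroup1": 2 is assumed.
    escalate_after: dict = field(
        default_factory=lambda: {"AlertGroup1": 2, "AlertGroup2": 4}
    )
    recover_after: int = 3  # "more than two" normal intervals
    breaches: int = 0       # stands in for the length of the entry array
    normals: int = 0        # consecutive normal intervals seen so far
    groups: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every "AlertGroupX" metric starts at 0 (nothing to report).
        self.groups = {name: 0 for name in self.escalate_after}

    def update(self, value):
        """Feed one interval's value, return the AlertGroupX metric values."""
        # Groups that sent the "all clear" (2) last interval go back to 0.
        for name in list(self.groups):
            if self.groups[name] == 2:
                self.groups[name] = 0

        if value > self.threshold:
            self.breaches += 1
            self.normals = 0
            for name, needed in self.escalate_after.items():
                if self.breaches >= needed:
                    self.groups[name] = 1  # a normal alert mails this group
        else:
            self.normals += 1
            alerted = [n for n, s in self.groups.items() if s == 1]
            if not alerted:
                # Nobody was notified yet, so just drop the entries.
                self.breaches = 0
            elif self.normals >= self.recover_after:
                for name in alerted:
                    self.groups[name] = 2  # "danger" level used as all-clear
                self.breaches = 0
        return dict(self.groups)


if __name__ == "__main__":
    tracker = EscalationTracker(threshold=90)
    for sample in [95, 95, 95, 95, 80, 80, 80, 80]:
        print(sample, tracker.update(sample))
```

Each call to `update` corresponds to one reporting interval. The alerts on the "AlertGroupX" metrics themselves would still be plain Introscope alerts with thresholds at 1 and 2.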
Call me if you have any questions.
Thanks for sharing this approach. You are right, this is possible and should not be too hard to realize.
We will try this one out.
You might be able to use staged alerts on the same metric group, but it would require a bit of thinking to get all the periods in line to get the behavior you would like.
The basic behaviors of the simple and summary alerts, used in a cascading fashion, might work. But if you continue to have such complicated alerting behavior, you might want to look into an alert management offering such as PagerDuty.
On the alerts (this isn't a full solution, it just gives you a notion of what you could do with native alerts):
First, you have to come up with a drop-dead interval: the point at which, if the problem is still occurring, you are in the final critical state. This would be the Observed Periods setting of your second alert.
So alert one would have the following settings
Danger Threshold : 95
Periods Over Threshold: 5
Observed Periods: 12 (three minutes)
Caution Threshold: 85
Periods Over Threshold: 2
Observed Periods: 20 (five minutes)
Trigger: Whenever Severity Changes
This alert would provide your initial notification. You could trim the Periods Over Threshold relative to the Observed Periods to get a closer ratio.
Periods Over Threshold: 12
Trigger: Whenever Severity Changes
Periods Over Threshold: 20
This alert would provide your ongoing notification. Again, you could trim the Periods Over Threshold relative to the Observed Periods to get a closer ratio. For example, a ratio of 2 of 3 would give you the "two good periods means good" behavior. It might help to plot out the state of the alert and run through something like a truth table to figure out the behavior.
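If you want to run through that truth table without clicking through the Workstation, a few lines of Python are enough. This is only a sketch under my own assumption of how a simple alert evaluates (the threshold counts as breached when at least "Periods Over Threshold" of the last "Observed Periods" samples are above it). The function name `alert_states` and the sample values are made up for illustration.

```python
from collections import deque


def alert_states(values, threshold, periods_over, observed_periods):
    """Return, per period, whether the threshold counts as breached."""
    # Keep only the most recent `observed_periods` over/under flags.
    window = deque(maxlen=observed_periods)
    states = []
    for value in values:
        window.append(value > threshold)
        # Breached when enough periods in the window were over threshold.
        states.append(sum(window) >= periods_over)
    return states


if __name__ == "__main__":
    samples = [96, 96, 80, 80, 80]
    # "2 of 3": two periods over out of three observed.
    for sample, breached in zip(samples, alert_states(samples, 95, 2, 3)):
        print(sample, "breached" if breached else "ok")
```

With 2 of 3, the state clears on the second good sample in a row, which is the "two good periods" behavior. Changing `periods_over` and `observed_periods` lets you compare ratios before touching the real alert.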
Third Alert (summary)
You would include your initial-stage and ongoing alerts in this summary alert and then set the "for" to be "All Alerts" instead of "Any Alert", so that if either of them is in error you will get a notice (add your alert actions to the summary), and it will only send the good notification when both of them are in a good state.
There is also a community Idea topic on CA improving the alert management within their products, so you might want to add a vote to it.
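The intended outcome of the summary can be sketched like this. It models only the behavior described above (notice as soon as either member is in error, good only when both are good) and is not a statement about how Introscope evaluates summary alerts internally. The function name `summary_notifications` and the 0/1/2 state encoding are assumptions for the example.

```python
def summary_notifications(initial, ongoing):
    """Yield (period, state) whenever the combined state changes.

    States are assumed to be 0 = normal, 1 = caution, 2 = danger.
    """
    events = []
    previous = 0
    for period, (first, second) in enumerate(zip(initial, ongoing)):
        # The summary is as bad as its worst member, so it only
        # returns to 0 once both member alerts are back to 0.
        combined = max(first, second)
        if combined != previous:
            events.append((period, combined))
            previous = combined
    return events


if __name__ == "__main__":
    initial_alert = [0, 2, 2, 0, 0]
    ongoing_alert = [0, 0, 2, 2, 0]
    for period, state in summary_notifications(initial_alert, ongoing_alert):
        print("period", period, "-> state", state)
```

In this run the initial alert clears in period 3, but no good notification goes out until the ongoing alert clears as well.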
Hope this helps,