DX Application Performance Management

Alerting: Complicated alerting requirement to starting and ending a "problem" and sending mails on problem start & solved

  • 1.  Alerting: Complicated alerting requirement to starting and ending a "problem" and sending mails on problem start & solved

    Posted 08-19-2014 10:54 AM

    Hi,

     

    I have the requirement to model the alerting on an end-to-end measurement (e2e) in the following way. Unfortunately this requirement comes from upper management, so it would be nice if I could realize it:

    • A problem situation starts as soon as two consecutive intervals are above the threshold for the measurement
    • The problem is fixed as soon as two consecutive intervals are below the threshold
    • Note that the problem is only considered as fixed as soon as two consecutive intervals are below the threshold. This means that the sequence AABABABABA (A: Above, B: Below) would be considered to still have a problem
    • The group of people to inform grows with the duration of the problem: say, after 4 intervals of a problem group x is informed, and after 8 intervals group y is informed
    • After the problem is solved (2 consecutive intervals below) an email should be sent to everyone that was informed of the problem
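
    The rules listed above can be sketched as a small state machine. This is only an illustration in plain JavaScript, not Introscope code: the function name createProblemTracker, the escalation format, and the event objects are all hypothetical. It also assumes that a value equal to the threshold counts as "below" and that the problem duration is counted from the first of the two intervals that opened the problem.

```javascript
// Hypothetical helper (not an Introscope API): tracks one measurement.
// escalation example: [{ after: 4, group: "x" }, { after: 8, group: "y" }]
function createProblemTracker(threshold, escalation) {
  let inProblem = false;
  let aboveStreak = 0;      // consecutive intervals above the threshold
  let belowStreak = 0;      // consecutive intervals at or below it (assumption)
  let problemIntervals = 0; // how long the current problem has lasted
  let informed = [];        // groups that were told about the current problem

  return function onInterval(value) {
    const events = [];
    if (value > threshold) { aboveStreak++; belowStreak = 0; }
    else { belowStreak++; aboveStreak = 0; }

    if (!inProblem) {
      if (aboveStreak >= 2) {
        inProblem = true;
        problemIntervals = 2; // assumption: both opening intervals count
      }
    } else if (belowStreak >= 2) {
      // Two consecutive good intervals: tell everyone who was informed.
      inProblem = false;
      events.push({ type: "solved", groups: informed });
      informed = [];
      problemIntervals = 0;
    } else {
      problemIntervals++; // a single good interval does not end the problem
    }

    if (inProblem) {
      for (const step of escalation) {
        if (problemIntervals >= step.after && !informed.includes(step.group)) {
          informed.push(step.group);
          events.push({ type: "notify", group: step.group });
        }
      }
    }
    return events;
  };
}
```

    Feeding the sequence AABABABABA from the list above into such a tracker never produces a "solved" event, because no two consecutive intervals are below the threshold. An escalation entry with after: 2 would send a mail right at problem start.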

     

    Fortunately we only have 2 metrics that we need to report like that (growing to about 10 at most), but I don't think I can model something like that with Introscope Alerts (or can I?).

     

    Is there any way to extend Introscope to solve this? I am thinking of a solution where I have a metric that keeps the current status of the problem: I read the measured metrics myself and change the status metric's value when there are two consecutive above events or two consecutive below events. I could even have an additional signaling metric that helps me register the problem being solved (increase the metric to 1 and then reset it to 0), so I can put an alert onto that... but I am rambling.

     

    So do you see any way of solving this?


    Thank you so much,

    Stefan



  • 2.  Re: Alerting: Complicated alerting requirement to starting and ending a "problem" and sending mails on problem start & solved

    Posted 08-26-2014 12:00 PM

    Hi Stefan!

    The normal alerting functionality would, in my opinion, not be able to cover the functionality that you are looking for. But you might be able to do something using a combination of alerts and javascript calculators.

    I might be rambling as well, and it might get hairy, but hear me out:

    With a javascript calculator you can read the value of your metric and generate new ones. Let's call your original metric "StefansMetric" and your new metrics "AlertGroup1", "AlertGroup2", etc.

    You can keep the previous values of your metric in an array to check if it has been in a specific state over x periods. Also, you would need to define the thresholds of "StefansMetric" and keep some information within the javascript calculator.

     

    Now, as long as "StefansMetric" is below the warning threshold, you're fine, and you would set the "AlertGroupX" metrics to 0. As soon as "StefansMetric" exceeds a given threshold, you add a warning entry to your array. If in the next interval "StefansMetric" is down to normal again, you remove the warning entry from the array. The "AlertGroupX" metrics are still 0 in these two intervals.

    If, however, it still exceeds the warning level in the next interval, you set "AlertGroup1" to 1. You define a normal alert on that metric and send an email to the first group for exceeding the "caution" level. The idea is to keep track of what alerting level you're at in the javascript calculator.

    If "StefansMetric" stays above the critical threshold, you add new entries to the array, and you always count how many entries you have. If "StefansMetric" stays in a critical state for 4 intervals, you set "AlertGroup2" to 1 and send an email to this group using a normal alert. And so it goes on and on.

    As soon as you're back to normal for more than two intervals, you check which of your "AlertGroupX" metrics have the value 1, and set all of them to 2. The alert would claim that this exceeds the "danger" level, but in fact you send an email saying everyone can sleep well again. In the next interval you set all of the "AlertGroupX" metrics back to 0 again...
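
    The 0 / 1 / 2 bookkeeping described above could look roughly like the sketch below. It is plain JavaScript and deliberately leaves out how the function would be wired into an actual javascript calculator. The name createAlertGroupCalculator and the threshold value are made up for illustration, and the sketch makes two interpretive assumptions: the recovery fires on the second consecutive normal interval, and a single normal interval during an open problem neither adds nor removes array entries.

```javascript
// Hypothetical sketch of the AlertGroupX bookkeeping (not an Introscope API).
// groupAfter maps a metric name to the number of array entries needed,
// e.g. { AlertGroup1: 2, AlertGroup2: 4 }.
function createAlertGroupCalculator(threshold, groupAfter) {
  let entries = [];     // one entry per interval above the threshold
  let normalStreak = 0; // consecutive intervals back to normal
  const groups = {};
  for (const name of Object.keys(groupAfter)) groups[name] = 0;

  return function step(value) {
    // A group that signalled "solved" (2) last interval returns to 0.
    for (const name in groups) if (groups[name] === 2) groups[name] = 0;

    const open = Object.values(groups).includes(1);
    if (value > threshold) {
      entries.push(value);
      normalStreak = 0;
      for (const name in groups) {
        // 1 = "caution": this group should be mailed about the problem.
        if (entries.length >= groupAfter[name]) groups[name] = 1;
      }
    } else {
      normalStreak++;
      if (!open) {
        entries = []; // a single spike that recovers is forgotten
      } else if (normalStreak >= 2) {
        // 2 = "danger" in alert terms, but used here as the all-clear mail.
        for (const name in groups) if (groups[name] === 1) groups[name] = 2;
        entries = [];
      }
    }
    return { ...groups }; // values to publish as the AlertGroupX metrics
  };
}
```

    Each returned object holds the values that would be published for that interval; a normal alert with caution at 1 and danger at 2 on every "AlertGroupX" metric would then drive the mails.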

     

    I warned you, this might get a little hairy... muaaaah. And I haven't really tried it out myself yet. But if you feel like doing some javascripting, it might be funny to give it a try.

    Call me if you have any questions.

     

    Cheers,

    Stig



  • 3.  Re: Alerting: Complicated alerting requirement to starting and ending a "problem" and sending mails on problem start & solved

    Posted 08-27-2014 02:51 AM

    Hi Stig,

     

    Thanks for sharing this approach. You are right, this is possible and should not be too hard to realize.

     

    Pros:

    • A way to solve our problem
    • Easy way to configure the Alerts using standard Introscope Alerts
    • Easy way to configure the mails to be sent using standard Introscope Actions

     

    Cons:

    • Thresholds are configured within the javascript,
      • which is usually harder to change than just adapting an alert within management modules.
      • which is not transparent to the user
    • Two additional metrics per Alert Type are necessary
    • A bit of magic: there is no real "hard coupling" of the metrics, alerts and actions, since the metrics the alerts work on are created automatically by javascript

     

    We will try this one out.

     

    Thank you,

    Stefan



  • 4.  Re: Alerting: Complicated alerting requirement to starting and ending a "problem" and sending mails on problem start & solved

    Posted 08-28-2014 09:32 AM

    Stefan,

     

    You might be able to use staged alerts on the same metric group, but it would require a bit of thinking to get all the periods in line to get the behavior you would like.

     

    The basic behaviors of simple and summary alerts, used in a cascading fashion, might work.  But if you continue to have such complicated alerting behavior, you might want to look into an alert management offering such as PagerDuty.

     

    On the alerts (this isn't a full solution, it just gives you a notion of what you could do with native alerts):

    First, you need to come up with a drop-dead interval: the point at which a problem that is still occurring is considered to be in its final critical state.  This would be the observed periods setting of your second alert.

     

    So alert one would have the following settings

     

    First Alert

         Danger Threshold : 95

              Periods Over Threshold: 5

              Observed Periods: 12 (three minutes)

              No Actions

     

         Caution Threshold: 85

              Periods Over Threshold: 2

              Observed Periods: 20 (five minutes)

              Trigger: Whenever severity Changes

              No actions

    This alert would provide your initial notification. You could trim the ratio of periods over threshold to observed periods to be closer.

     

    Second Alert

         Danger Threshold : 95

              Periods Over Threshold: 12

              Observed Periods: 12 (three minutes)

              Trigger: Whenever Severity Changes

              No Actions

     

         Caution Threshold: 85

              Periods Over Threshold: 20

              Observed Periods: 20 (five minutes)

              No actions

    This alert would provide your ongoing notification. You could trim the ratio of periods over threshold to observed periods to be closer.  So you could use a ratio of 2 of 3 to get the "two good periods means good" behavior.  It might help to plot out the state of the alert and run through it, sort of like a truth diagram, to figure out the behavior.
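
    To run through such a truth diagram without touching a live system, a few lines of JavaScript can simulate one alert level. This sketch assumes that "Periods Over Threshold: K" with "Observed Periods: N" means "at least K of the last N periods were over the threshold"; the function name createWindowAlert is hypothetical and the code is not part of Introscope.

```javascript
// Hypothetical simulator for a single alert level (assumed K-of-N semantics).
function createWindowAlert(threshold, periodsOver, observedPeriods) {
  const recent = []; // true/false per period, newest last

  return function evaluate(value) {
    recent.push(value > threshold);
    if (recent.length > observedPeriods) recent.shift(); // keep last N only
    const over = recent.filter(Boolean).length;
    return over >= periodsOver; // true = this level is triggered
  };
}

// Truth diagram for a 2-of-3 ratio on an example sequence (A = above, B = below).
const alert2of3 = createWindowAlert(85, 2, 3);
const diagram = "AABABABABABB"
  .split("")
  .map((c) => (alert2of3(c === "A" ? 90 : 50) ? "T" : "F"))
  .join("");
console.log(diagram);
```

    Under these assumptions an alternating A/B input makes a 2-of-3 alert flip between triggered and not triggered, which is exactly the kind of behavior that is easier to spot in a plotted run than by reasoning about the settings.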

     

    Third Alert (summary)

        You would include your initial-stage and ongoing alerts in this summary alert and then set the "for" to "All Alerts" instead of "Any Alert", so that if either of them is in error you will get a notice (add your alert actions to the summary), and it will only send the good notice when both of them are in a good state.

     

    There is also a community Idea topic on CA improving the alert management within their products, so you might want to add a vote to it.

     

    Hope this helps,

     

    Billy