DX Unified Infrastructure Management

  • 1.  Best practice for alerting only after multiple occurrences?

    Posted Jul 27, 2009 06:14 PM
    I have a number of cases where I need to monitor for multiple
    occurrences of an event (or performance counter) within a period of
    time. I know that I can use ntevl to look for the event and can use the
    NAS to hide/show the alert and send an email only after multiple
    occurrences. To do this I believe it will require the following (please
    correct me if I'm wrong):
    • A profile in the probe (ntevl or ntperf) to monitor for the event
    • An AO profile to make the first alert invisible
    • An AO profile to make the alert visible if the Message Counter exceeds x
    • An AO profile to clear the alert if there are no additional events within y minutes
    Setting this up once isn't that big of a deal; however, by the time I finish
    setting up all of our monitoring I anticipate having dozens of such
    cases.

    My concern is that there is no clear link between the
    probe profile and the various AO profiles. This means that a month from
    now it will be very challenging for me - and especially anyone else -
    to figure out how things work.

    Are there any best practices for setting up monitoring of this type?

    Thanks,
    Chris

    BTW - I realize that this functionality is built into the CDM probe, but
    that applies to only a microscopic subset of performance counters.


  • 2.  Best practice for alerting only after multiple occurrences?

    Posted Jul 27, 2009 07:35 PM
    Chris,

    I believe the first AO profile you mention would actually be a pre-processing rule.  That is probably what you had in mind, but I wanted to make sure.

    You could use a special subsystem ID in all of the ntevl watchers that you want to treat this way.  Then your two AO profiles and one pre-processing rule can be configured to match on subsystem ID.  Then as you add more watches that need this special treatment, you do not have to make more AO profiles or pre-processing rules.

    As far as how to associate them with each other, you have the names and categories.  Coming up with a convention for those will help.

    -Keith


  • 3.  Best practice for alerting only after multiple occurrences?

    Posted Jul 27, 2009 09:34 PM
    Hi Keith,
    Yes, you're correct about the first profile being a pre-processing rule.

    Couple of questions -
    If one of the other admins gets an alert and wants to make a change, normally they would just right-click on the alert and select "configure probe...". Do you know of a good way to help them discover the other pieces of the puzzle at work in getting the alert to the console or their inbox? Also, this is complicated by the fact that for a scenario that came up this morning I will need one more AO profile to run a script which restarts a service.

    My other challenge is that sometimes I will need to suppress 2 events before raising the alert and other times 3, 4, 10, 20 or more events depending on the circumstances. It seems to me that this would require separate AO profiles for each case. Am I wrong?

    Thanks for the help!

    -Chris


  • 4.  Best practice for alerting only after multiple occurrences?

    Posted Jul 27, 2009 11:06 PM
    Chris,

    As far as finding the other pieces of the puzzle, I think that will be less of an issue if you create AO profiles that can be used for multiple watchers rather than using a 1:1 relationship.  It will still be an issue, and I think the only way to address it is good documentation.  I personally prefer to use a wiki.

    Yes, you would need separate AO profiles for anything you want treated differently.  To simplify a bit, you might be able to encode the count into the subsystem ID.  So if you were using 3.1 as your base subsystem ID, then 3.1.X could mean to wait for X events before making the alarm visible.  You still have to create at least one AO profile for each value of X you wish to support, but that makes it easier for anyone to remember the meanings of the subsystem IDs.  I suppose another possibility would be to create a single AO profile that calls a Lua script, and the script could look at the subsystem ID to get X and use that as the counter.
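
    A minimal sketch of the subsystem-ID idea above, in plain Lua. It assumes the subsystem ID is available to the script as a string such as "3.1.10"; the helper name threshold_from_subsystem is hypothetical and is not part of the NAS API.

```lua
-- Hypothetical helper (not a NAS function): extract the occurrence
-- threshold X from a subsystem ID of the form "<base>.X", where the
-- base "3.1" is the example value used in the post.
local function threshold_from_subsystem(sid, base)
  -- Escape punctuation in the base so the dots match literally.
  local escaped = base:gsub("%p", "%%%0")
  local x = sid:match("^" .. escaped .. "%.(%d+)$")
  -- Returns a number, or nil when the ID does not follow the convention.
  return x and tonumber(x) or nil
end

print(threshold_from_subsystem("3.1.10", "3.1"))
print(threshold_from_subsystem("2.4.7", "3.1"))
```

    A single AO profile could then call a script that uses the returned value as the counter limit, and skip alarms for which the helper returns nil.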

    If creating separate AO profiles, you might also want to limit the options for the values of X.  For example, it might not make a lot of sense to support both 3 and 4 because they are so close, but 10 and 20 are very different and probably worth supporting separately.  It just depends on how flexible those making the rules can be.

    -Keith


  • 5.  Best practice for alerting only after multiple occurrences?

    Posted Jul 30, 2009 06:33 PM
    Thanks Keith.

    Time to play!


  • 6.  Best practice for alerting only after multiple occurrences?

    Posted Oct 02, 2009 09:06 PM
    I got sidetracked, but now I'm on the job.

    I've managed to set this up and most of it is working fine. Rather than using subsystem, I'm putting the tag in the Message because I want the admin to know that the notification is because there were x events in y minutes. The string I'm adding looks like this .

    I have a pre-processing rule that sets the alert to invisible if the Message string matches /(?i)3\+.in.10min/.

    I have an AO profile that sets the alert to visible if the Message string matches /(?i)3\+.in.10min/ AND the Message Counter is greater than 2 (on arrival).

    Finally, I have another AO profile that runs on overdue age of 10 minutes AND closes the alert if the Message string matches /(?i)3\+.in.10min/ AND Message counter is less than 3.
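
    To make the interaction of the three rules easier to follow, here is a rough model of them in plain Lua. It is only an illustration: matches_tag and decide are hypothetical names, the trigger strings "arrival" and "overdue" are assumptions, and nothing here calls the NAS. Lua patterns have no (?i) flag, so the message is lower-cased before matching.

```lua
-- Hypothetical model of the pre-processing rule and the two AO profiles.
local function matches_tag(message)
  -- Rough Lua-pattern equivalent of /(?i)3\+.in.10min/
  return message:lower():find("3%+.in.10min") ~= nil
end

-- trigger is "arrival" (a new occurrence came in) or "overdue"
-- (the 10-minute overdue-age profile fired).
local function decide(message, count, trigger)
  if not matches_tag(message) then return "ignore" end
  if trigger == "overdue" then
    -- Close profile: fewer than 3 occurrences when the timer expires.
    if count < 3 then return "close" end
    return "unchanged"
  end
  -- On arrival: visible once the counter is greater than 2.
  if count > 2 then return "visible" end
  return "invisible"
end

print(decide("Service failed [3+ in 10min]", 2, "arrival"))
print(decide("Service failed [3+ in 10min]", 3, "arrival"))
print(decide("Service failed [3+ in 10min]", 2, "overdue"))
```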

    This works well with one exception. When the "Close" AO profile runs it completely closes the alert and resets the message counter. This is a problem in the following scenario:

    (Time: 0 minutes)
    T:0   One alert received - Message Count=1 - (alert set to invisible)
    T:5   One alert received - Message Count=2 - (alert set to invisible)
    T:10  Alert closed by AO profile on overdue age 10 minutes
    T:11  One alert received - Message Count=1 - (alert set to invisible)
    T:12  One alert received - Message Count=2 - (alert set to invisible)

    I would expect an alert to be Visible at T:12 because there were 3 events within 10 minutes (T:5, T:11, T:12). However, because the AO profile ran at T:10 and closed the alert the Message Count was reset and therefore the count at T:12 is only 2 instead of 3.
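
    The difference between the two behaviours can be reproduced with a small simulation in plain Lua. This is a sketch under stated assumptions: the event times are the ones from the timeline above, the "reset" model assumes the alert is closed once 10 minutes have passed since its first occurrence, and the function names are hypothetical.

```lua
local events = {0, 5, 11, 12}   -- arrival times in minutes, from the timeline
local window, threshold = 10, 3

-- Model 1: the counter restarts when the alert has been closed,
-- i.e. `window` minutes after the first occurrence of the current alert.
local function reset_counts(times, win)
  local counts, start, n = {}, nil, 0
  for i, t in ipairs(times) do
    if start == nil or t - start >= win then
      start, n = t, 0           -- previous alert was closed; a new one begins
    end
    n = n + 1
    counts[i] = n
  end
  return counts
end

-- Model 2: count every occurrence in the `window` minutes before each arrival.
local function sliding_counts(times, win)
  local counts = {}
  for i, t in ipairs(times) do
    local n = 0
    for j = 1, i do
      if times[j] >= t - win then n = n + 1 end
    end
    counts[i] = n
  end
  return counts
end

local r, s = reset_counts(events, window), sliding_counts(events, window)
for i, t in ipairs(events) do
  print(string.format("T:%-2d reset=%d sliding=%d visible=%s",
    t, r[i], s[i], tostring(s[i] >= threshold)))
end
```

    With these inputs the reset model never gets past a count of 2, while the sliding count reaches 3 at T:12, which is the behaviour expected in the post.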

    Make sense?

    Any ideas on how to work around this?

    Thanks,
    Chris


  • 7.  Best practice for alerting only after multiple occurrences?

    Posted Oct 03, 2009 12:51 AM
    You could probably just set up a single AO profile that runs a script upon matching your message expression. Something along these lines (untested, it's just an example):

    local minutes = 10 -- length of the look-back window, in minutes
    local maxcnt = 3   -- occurrences required before the alarm is made visible
    local a = alarm.get()

    -- The following check is commented out because it is superfluous:
    -- if timestamp.diff(a.supptime, "%m") > minutes then -- last suppression time is older than the window
    --   action.close(a.nimid)
    -- else

    -- Count this alarm's transaction-log entries with time >= a.time_arrival - minutes
    local b = alarm.query("select count(*) as cnt from NAS_TRANSACTION_LOG where nimid = '" .. a.nimid .. "' and time >= datetime('" .. a.time_arrival .. "','-" .. minutes .. " minute')")
    if b.cnt == 0 then
      -- no candidate alarm within the window: close the alarm
      action.close(a.nimid)
    elseif b.cnt >= maxcnt then
      -- maxcnt or more occurrences within the window: make the alarm visible
      action.visibility(true, a.nimid)
    else
      -- there are candidate alarms, but maxcnt has not been reached yet
      action.visibility(false, a.nimid)
    end

    -- end

    P.S - In this example the alarm will only be deleted if the last suppression time is greater than the specified minutes.

    Update: Modified the script to work a little better and added some comments to describe the flow


  • 8.  Best practice for alerting only after multiple occurrences?

    Posted Oct 05, 2009 05:08 PM
    Chris,

    We do something similar for certain alarms, and we decided that "resetting" the counter on an alarm after the time limit expires from the first occurrence was acceptable.  When I realized this would result in some alarms being closed when they should not, I made sure to discuss this with our engineers and managers.  They decided that the risk of missing an important issue was very low, so we decided to proceed.

    Of course, that was before Lua was added to the NAS, so now you do have other options for processing the alarms.  If it is important to you to get the alarm in the scenario you described, I would recommend trying the script above.  If you really wanted to be fancy, you could even have the code read the count and minutes right out of the alarm message rather than setting them at the top!
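
    As a rough illustration of reading the count and minutes out of the message, the sketch below uses plain Lua string matching. It assumes the tag has the form "N+ in Mmin", which is inferred from the expression /(?i)3\+.in.10min/ used earlier in the thread; parse_tag is a hypothetical helper, not a NAS function.

```lua
-- Hypothetical helper: pull the occurrence count and the window length
-- out of a message containing a tag such as "3+ in 10min".
local function parse_tag(message)
  -- Lower-case first because Lua patterns have no case-insensitive flag.
  local count, minutes = message:lower():match("(%d+)%+.in.(%d+)min")
  if not count then return nil end   -- no tag found in this message
  return tonumber(count), tonumber(minutes)
end

local maxcnt, minutes = parse_tag("Service stopped unexpectedly [3+ in 10min]")
print(maxcnt, minutes)
```

    The two returned values could replace the constants set at the top of the script, with a fallback to fixed defaults when parse_tag returns nil.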

    -Keith