DX Unified Infrastructure Management

  • 1.  Auto-Operator best practice

    Posted Sep 21, 2015 03:52 PM

    I have recently created an auto operator rule that modifies a high number of alerts on arrival and it seems to be causing an excessive amount of alarm list refreshes.  It makes it difficult to concentrate on an alert without pausing the alarm window.  I am wondering if there is something I could have done differently to be more efficient in how the alerts are being handled.

     

    My desire is to set values on several of the custom_# fields of every alert upon arrival, however due to the impact, I have currently limited the action to only minor and greater alerts.  Even with that additional filter, I am seeing the alarm window refresh continuously.

     

    I don't believe I can use a pre-processing script as the script needs to be able to pull details of the alarm (severity and origin time at the moment) in order to set proper values in the custom fields.

     

    The profile has an action type of script, with an action mode of "On message arrival" with a filter for Minor, Major and Critical and a count of less than 2.  In theory it should match every alert that is Minor or greater, but only on the first occurrence of the alert.  However, when I look through the NAS logs, I'm seeing the same alert get matched every 5 seconds until another instance of the alert comes through. 

     

    1. Why are alerts being matched multiple times?
    2. What is the best way to modify these fields with the least impact on the overall system? I need to modify them as early in the process as possible as I will eventually be adding automation to create tickets on message arrival.

     

    I've attached a screenshot of my profile settings below:

    Auto Operator profile.png



  • 2.  Re: Auto-Operator best practice

    Posted Sep 22, 2015 04:01 AM

    I can't see anything wrong in your profile, really. Your message count should prevent it from recurring; then again, I'm sure the alert isn't coming in every 5 seconds either. Are you 100% positive the log entry is coming from the AO and not the script?

     

    You could also try "overdue 1s" to see if there's any difference.

     

    -jon



  • 3.  Re: Auto-Operator best practice

    Posted Sep 22, 2015 02:26 PM

    Hi Jon,

     

    The log entries are definitely coming from my script, but my thought was that the script should not trigger on an alarm more than once with my settings.  It should trigger when the alarm arrives, and only when the count is less than 2.  Since I get a log entry from my script every 5 seconds, that tells me the AO profile is triggering multiple times on the same alarm.  Here's a sample of log entries that I created while trying to troubleshoot this.  As you can see, the count is 0 both times, but each alarm ID was processed twice, 5 seconds apart.  The logs continue like this until the alarm reaches a higher count.  (The count in the log entry is actually the alarm.suppcount field, so it shows 0 here, but the alarm shows a count of 1 in the UI.)

     

    Sep 21 12:53:23:604 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63740; count: 0

    Sep 21 12:53:23:605 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63554; count: 0

    Sep 21 12:53:23:606 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63403; count: 0

    Sep 21 12:53:23:607 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63513; count: 0

    Sep 21 12:53:23:607 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63817; count: 0

    Sep 21 12:53:23:608 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63901; count: 0

    Sep 21 12:53:28:609 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63740; count: 0

    Sep 21 12:53:28:609 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63554; count: 0

    Sep 21 12:53:28:610 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63403; count: 0

    Sep 21 12:53:28:617 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63513; count: 0

    Sep 21 12:53:28:618 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63817; count: 0

    Sep 21 12:53:28:618 [8608] nas: Set Custom3-Custom5: Alarm ID: HM52157069-63901; count: 0

     

    I tried your suggestion of overdue 1s, and that definitely seems to quiet the logs and allow things to run more smoothly.  It's unfortunate that I need a delay at all, as it means I will need to apply a second (larger) delay when processing alerts for automatic ticket creation.  I think I can make it work for now, but if you come up with any other suggestions on how to add data to the custom fields as early in the process as possible, I'm ready to listen.

     

    Thanks for the suggestion!



  • 4.  Re: Auto-Operator best practice
    Best Answer

    Posted Sep 27, 2015 10:26 PM

    Note that "on arrival" applies to the arrival of the message that triggered the AO. So if your script modifies the alarm, that alarm is put back on the bus and "arrives" again. So, essentially what you do in this situation is create a loop where every time you update the message it causes another update.
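
The feedback loop described above can be illustrated with a toy model. This is only a sketch in Python, not NAS code: the `run_bus` function, the dictionary-based alarm, and the `custom_3` field name are all assumptions made for illustration. It shows a handler that always updates the alarm being re-triggered until a cap is hit, while a handler that skips the update when the value is already set stops after two deliveries. Whether such a guard is sufficient in a real NAS script is untested here.

```python
from collections import deque


def run_bus(alarm, handler, max_deliveries=10):
    """Toy bus: deliver the alarm to the handler; an update makes it 'arrive' again."""
    queue = deque([alarm])
    deliveries = 0
    while queue and deliveries < max_deliveries:
        current = queue.popleft()
        deliveries += 1
        # The handler returns True when it modified the alarm.
        if handler(current):
            queue.append(current)  # the modified alarm goes back on the bus
    return deliveries


def always_update(alarm):
    """Hypothetical script that writes the custom field on every arrival."""
    alarm["custom_3"] = alarm["severity"]
    return True


def update_if_changed(alarm):
    """Hypothetical script that only writes when the value would change."""
    if alarm.get("custom_3") == alarm["severity"]:
        return False
    alarm["custom_3"] = alarm["severity"]
    return True


looping = run_bus({"severity": "major"}, always_update)
guarded = run_bus({"severity": "major"}, update_if_changed)
print(looping, guarded)
```

In this model the unguarded handler runs until the delivery cap stops it, which mirrors the repeated log entries earlier in the thread.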

     

    The "On overdue age" option is badly named but translates to "Run once the alert is older than this". It fires only once because the alert is only ever older than the specified value once.

     

    I've also found that small overdue age values are unreliable. It seems that the NAS takes a time stamp, writes the alarm to the local database, and then looks up everything that should run; if the commit to the database took longer than that one second, the alarm is missed. In the whole scheme of things, 5 seconds is probably better than one.
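
The timing issue guessed at above can be sketched as a toy model. This is a hypothetical reading of the behavior, not documented NAS internals: the `fires` function, the once-per-second check, and the 1.5 second commit delay are all assumptions. The model fires an alarm at the first check after its age crosses the overdue threshold, but only if the alarm has already been committed by then.

```python
def fires(arrival, commit_delay, overdue, check_times):
    """Return True if the alarm is picked up at the check where it becomes overdue."""
    due = arrival + overdue          # moment the alarm becomes older than the threshold
    visible = arrival + commit_delay  # moment the alarm is committed to the database
    previous = float("-inf")
    for now in check_times:
        # The alarm crosses the threshold exactly once, in one check window.
        if previous < due <= now:
            # If it is not committed yet at that check, it is missed for good.
            return visible <= now
        previous = now
    return False


checks = range(1, 11)  # assumed: one check per second for ten seconds
one_second = fires(arrival=0, commit_delay=1.5, overdue=1, check_times=checks)
five_seconds = fires(arrival=0, commit_delay=1.5, overdue=5, check_times=checks)
print(one_second, five_seconds)
```

Under these assumptions the 1 second setting misses an alarm whose commit takes 1.5 seconds, while the 5 second setting still catches it.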

     

    Also note that the NAS only runs one profile at a time, so if you have a bunch firing, you could see some unexpected results as some get delayed because of the others.
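
As a rough illustration of the delay caused by running one profile at a time, the following Python sketch computes how long each queued profile would wait if profiles ran strictly in sequence. The `start_delays` function and the run times are made up for the example; real ordering and timing inside the NAS may differ.

```python
from itertools import accumulate


def start_delays(run_times):
    """Given run times of profiles queued in order, return how long each one waits."""
    if not run_times:
        return []
    finished = list(accumulate(run_times))  # time at which each profile finishes
    # Each profile starts when the previous one finishes; the first starts at once.
    return [0.0] + finished[:-1]


# Assumed run times in seconds for three profiles that fire together.
delays = start_delays([2.0, 0.5, 1.0])
print(delays)
```

A slow script early in the queue pushes back every profile behind it, which is worth keeping in mind when layering a ticket-creation profile on top of a field-setting one.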

     

    -Garin



  • 5.  Re: Auto-Operator best practice

    Posted Sep 29, 2015 11:34 AM

    Thanks for the info Garin, that definitely explains the behavior I was seeing.  I'll have to take a closer look at all of the AO profiles that are currently configured to ensure they are properly optimized and running at appropriate times.