DX Unified Infrastructure Management

  • 1.  The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 21, 2019 06:02 AM
    The email alerts are triggering 3 to 4 times for same alert


  • 2.  RE: The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 21, 2019 09:57 AM
    Some probing questions to help:
    Are the emails coming from the same source?
    What is sending the emails?
    Are the alarm contents exactly the same?

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 3.  RE: The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 21, 2019 11:15 AM
    Careful with the term "same" - likely they are not. Check the suppression counter - bet it's different each time.

    Couple things to look at - first on the AO profile, are you using "on arrival" or "overdue age"? On arrival will give you a message each time a new message arrives - so if you have a probe that tests every 30 seconds, you will get an email every 30 seconds - it's what was configured after all. Overdue age sends an alert once the alert is a particular age and since one reaches an age only once, you will get only one email.
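
    To make the difference concrete, here is a minimal sketch of the two modes as described above. It is not nas code: the function name, the mode labels, and the `cleared_at` parameter are made up for illustration, and it assumes the alarm's age is measured from the first arrival.

```python
def emails_sent(arrivals, mode, overdue_age=0, cleared_at=None):
    """Return the times (in seconds) at which an email would go out.

    arrivals    -- times at which the probe reports the same problem
    mode        -- "on_arrival" or "overdue_age" (hypothetical labels)
    overdue_age -- age threshold in seconds, used only by "overdue_age"
    cleared_at  -- time the alarm was cleared, or None if still open
    """
    if not arrivals:
        return []
    if mode == "on_arrival":
        # Every arriving message triggers the action.
        return list(arrivals)
    if mode == "overdue_age":
        # The alarm reaches the threshold age exactly once.
        due = arrivals[0] + overdue_age
        if cleared_at is not None and cleared_at < due:
            return []  # problem went away before the alarm got old enough
        return [due]
    raise ValueError(f"unknown mode: {mode}")


# A probe that tests every 30 seconds and keeps finding the problem:
arrivals = [0, 30, 60, 90]
print(emails_sent(arrivals, "on_arrival"))                    # [0, 30, 60, 90]
print(emails_sent(arrivals, "overdue_age", overdue_age=80))   # [80]
```

    Under these assumptions, "on arrival" produces one email per probe cycle, while "overdue age" produces at most one per alarm.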

    Less likely is suppression keys - you need to look at your nas configuration to determine how that logic is set up because there are several options - more than can be gone into here at this point with so little information.

    It's also possible your alert is matching several AO profiles. Without any information about your configuration it's impossible to guess.


  • 4.  RE: The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 22, 2019 02:15 AM
    Hi ,

    We have currently configured the AO rule as "message on arrival".

    We are receiving the same email multiple times with the same content, i.e. the same CPU alert/memory alert for the same source. Please suggest a resolution for this.

    My question is about overdue age (30 s, 2 minutes, 5 minutes):

    Which one should I configure as the overdue age: 30 s, 2 min, or 5 min?

    What will the impact be?


  • 5.  RE: The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 22, 2019 02:26 AM
    Why don't you follow the recommendations in your duplicate entry: https://community.broadcom.com/enterprisesoftware/communities/community-home/digestviewer/viewthread?GroupId=1315&MessageKey=1040c422-1d71-4055-ad55-8d5f8a8b2335 ?


  • 6.  RE: The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 22, 2019 03:26 AM
    Hi Garin,

    Please let me know what overdue age will help me resolve this issue.

    If overdue age is set to 30 s, 2 min, or 5 min, does that postpone the alert by 30 s / 2 min / 5 min? We are a financial institution and need each alert as soon as it actually triggers. So, after setting an overdue age, will there be any impact on the alerts that trigger?

    My concern: the email alert should arrive as soon as the threshold is breached, and it should not be repeated 3 or 4 times. Please let me know what needs to be done.


  • 7.  RE: The email alerts are triggering 3 to 4 times for same alert

    Posted Aug 22, 2019 10:19 AM
    Bottom right on most of the configuration GUIs there's a "Help" button - will take you to the online documentation - generally. It is very helpful.

    Here is the nas article:

    https://docops.ca.com/ca-unified-infrastructure-management-probes/ga/en/alphabetical-probe-articles/nas-alarm-server/nas-im-configuration

    Pulling the text out of the section on Action Mode (sorry to the copyright people...):

    #########################

    Action mode

    On messages arrival

    Performs the selected action immediately when the alarms arrive.

    Note that this time setting is disabled for some of the actions (close, command, new_alarm and escalate_level), as it is not advisable to perform these actions if the same alarm message (with the same source, sub-system and severity) arrives hundreds of times (Message suppression).

    On overdue age

    Performs the selected action when the age of the alarm exceeds the specified threshold. Select one of the predefined values in the list or type another value of your own choice (use the same format as used in the list).

    On every AO interval

    Performs the selected action on every Auto Operator check interval.

    On every interval

    Performs the selected action on every interval specified. Select one of the predefined values in the list or type another value of your own choice (use the same format as used in the list).

    Important! If you set On every interval, the action type defined in the profile will occur at the set interval as long as the matching criteria is true. For example, if you create a profile that sends an email when an alarm is at critical severity and set On every interval to 1 minute, an email will be sent every minute until the alarm severity changes.

    On trigger

    Select this option when the Trigger mode is selected in the action category.

    The lower portion of this screen displays the options for setting triggers.

    You can select one or more triggers. The Auto operator performs the selected action immediately when the trigger specified is true.

    Example:

    Provided that the profile is not de-activated due to operating period and/or scheduler settings:

    You select Action type = script and Action mode = trigger. You select a script to be executed and a trigger to trigger the action.

    Imagine that the properties dialog for the selected trigger is set to trigger on Message string *Oslo*, the selected script will be run as soon as an alarm message containing the word "Oslo" in the message text appears.

    Note that this choice will restrict the number of Action types available.

    ###########################

    Your CDM probe will create an alarm every time it detects the problem - if you have CDM set to test every 60 seconds, you will get an alarm every 60 seconds. That's the way UIM works. The probes don't generally keep track of state (some do but that's the exception) - they just keep reporting the status of the situation. 

    The nas then has the idea of "suppression" to take all these messages coming in and determine if it's just a message about an existing problem continuing to occur or if it is a new one and then to react appropriately.
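
    A rough sketch of that idea, for illustration only. It assumes messages count as "the same alarm" when source, sub-system and severity match (the wording from the documentation excerpt above); the real nas suppression logic is configurable and may key on other fields. The function and field names are hypothetical.

```python
def suppress(messages):
    """Fold repeated probe messages into one alarm per suppression key.

    Each message is a dict with "source", "subsystem", "severity" and "text".
    Returns a dict mapping key -> {"text": ..., "count": ...}, where "count"
    is how many times that alarm has been reported so far.
    """
    alarms = {}
    for msg in messages:
        # Assumed suppression key; the real nas configuration may differ.
        key = (msg["source"], msg["subsystem"], msg["severity"])
        if key in alarms:
            # Existing problem still occurring: bump the counter only.
            alarms[key]["count"] += 1
        else:
            # First report of this problem: open a new alarm.
            alarms[key] = {"text": msg["text"], "count": 1}
    return alarms


cpu = {"source": "srv01", "subsystem": "cpu", "severity": "critical",
       "text": "CPU usage above threshold"}
mem = {"source": "srv01", "subsystem": "memory", "severity": "major",
       "text": "Memory usage above threshold"}

alarms = suppress([cpu, cpu, mem, cpu])
print(len(alarms))                                       # 2
print(alarms[("srv01", "cpu", "critical")]["count"])     # 3
```

    Three CPU reports become one alarm with a counter of 3, which is why repeated emails that look identical usually differ in that counter.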

    This should answer your question about the behavior and why your configuration is wrong based on the little description about how you want it to work.

    Were I in your shoes and I wanted to get one email message when a problem occurred, I would make the action mode "on overdue age" and set the time to 80 seconds. 

    You expressed a concern that introducing 30 seconds into the email delivery would damage your process. Keep in mind that most of the UIM probes test for a problem periodically - are you going to be cycling these tests every second? No - you will have some interval the tests are run at. When a problem is found, the probe just creates the event and stores it in a local queue. That queue is periodically polled - it's not event driven - so you will have delay there. When the event gets to your central hub it will go into another queue for the nas. That is also polled and is serial - if the nas is unable to keep up with the volume the queue will back up - exactly what it is designed to do. Then your message gets converted to email and sent. SMTP is inherently polled also - and it's not guaranteed delivery - lots of opportunity for loss of messages there. Then your email client - probably also polling though some (like Outlook) are event driven. And then you have the human that needs to respond. We are definitely based on polling. Can you guarantee that there will be a zero second delay in your employee getting that email message while they've gone off to the bathroom? Or your oncall guy is at dinner on a first date? 

    The reality of the whole process is that whether it's 2 seconds or 10 minutes, your average response time isn't going to be impacted.

    What I can tell you is that UIM, in what I have been told is a very complicated configuration that I have, does all of its stuff in about 90 seconds. It works pretty well in that sense of things. The humans using the information generated are the problem with response times. 

    And additionally, you will find that if you get too close to the problem time wise you will get a lot of noise. I intentionally delay every alert in my system by 80 seconds. Most of my tests cycle at either 30 or 60 seconds. Delaying 80 seconds means that I generally get at least two detections of the issue before telling anybody. That makes a huge difference in how those pesky humans perceive the messaging - people reboot systems all the time - if someone kicked a power cord out and then plugged it back in and the server recovers fine on its own, do you really need to wake someone at 2am to look at the server with no issue? As a manager you might say yes but you'd be wrong.
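
    The arithmetic behind "at least two detections" can be sketched as below. It assumes the alarm's age starts at the first detection and that the probe reports again on every test cycle while the problem persists; the function name is made up for this example.

```python
def detections_before_notify(test_interval, delay):
    """Count detections seen by the time the alarm reaches `delay` seconds.

    Detections happen at t = 0, test_interval, 2 * test_interval, ...
    so the count is every multiple of the interval up to `delay`, plus
    the first detection at t = 0.
    """
    return delay // test_interval + 1


print(detections_before_notify(60, 80))  # 2 (t = 0 and 60)
print(detections_before_notify(30, 80))  # 3 (t = 0, 30 and 60)
```

    With a delay shorter than the test interval (say 30 seconds against a 60-second cycle) the count drops to 1, so a single blip would still page someone.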
