DX Infrastructure Manager

NAS - Auto-Operator Rules Not Working Correctly

  • 1.  NAS - Auto-Operator Rules Not Working Correctly

    Posted 11-29-2017 06:40 AM

    Here are the details of our environment:

     

    CAUIM VERSION: 8.2

    NAS VERSION: 4.91

    SNGTW VERSION: 2.12

     

    In our monitoring environment, we use CAUIM for network management and ServiceNow for ticket management (ITSM). We have an integration between CAUIM and ServiceNow that is performed by the SNGTW probe. The operation is simple:

     

    •       If a Major or Critical alarm is generated by CAUIM, after 5 minutes it is assigned by the Auto-Operator to the optimal user. The SNGTW probe uses the optimal user to send the alarm to ServiceNow. Done! An incident is opened using this simple workflow.

     

    Auto-Operator Rule, we made one rule for each hub:

     

     

    SNGTW Probe using the optimal user.

     

    I think we have used this integration since 2013, and the rule is that only MAJOR or CRITICAL alarms can be assigned to the optimal user. However, today an alarm in CLEAR status was assigned to the optimal user, and therefore an incident was generated with severity CLEAR and message CLEAR.

     

    On 28-11-2017 at 08:04:02 a CLEAR alarm was assigned to optimal user, and I do not know why.  

     

    The incident was opened in ServiceNow with a CLEAR message and CLEAR severity, yet everything looks OK in the NAS configuration.

     

    nas LOG

    Nov 28 08:04:02:034 [16632] nas: ExecEvent: OVERDUE rule='Service - HS1B-Unidas', trigger='internal', nimid=BC25433479-57555, age=301s, ACTION:assign
    Nov 28 08:04:02:034 [16632] nas: ExecEvent: Rule='Service - HS1B-Unidas' on nimid='BC25433479-57555' with ACTION:assign, age:301s, status:OK

     

    NAS.CFG

    <Service - HS1B-Unidas>

    active = yes

    action = assign optimal

    overdue = 5m

    level = major,critical

    hub = HS1B-Unidas

    visible = 1

    order = 16

    break = no

    </Service - HS1B-Unidas>
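Read as plain logic, the rule above could be modeled roughly as follows. This is only a minimal sketch, not the actual NAS matching code: the `Alarm` and `OverdueRule` classes and their fields are hypothetical names that mirror the nas.cfg keys, and it assumes the level filter is evaluated at the moment the overdue timer fires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    """Hypothetical alarm record; not the real NAS data model."""
    nimid: str
    hub: str
    level: str            # e.g. "clear", "minor", "major", "critical"
    age_seconds: int
    assigned_to: Optional[str] = None


@dataclass
class OverdueRule:
    """Hypothetical model of an 'overdue' Auto-Operator rule."""
    name: str
    hub: str
    levels: frozenset
    overdue_seconds: int
    assignee: str

    def matches(self, alarm: Alarm) -> bool:
        # All three conditions must hold when the rule is evaluated.
        return (
            alarm.hub == self.hub
            and alarm.level in self.levels
            and alarm.age_seconds >= self.overdue_seconds
        )

    def apply(self, alarm: Alarm) -> bool:
        # Assign the alarm only if it matches; report whether it did.
        if self.matches(alarm):
            alarm.assigned_to = self.assignee
            return True
        return False


rule = OverdueRule(
    name="Service - HS1B-Unidas",
    hub="HS1B-Unidas",
    levels=frozenset({"major", "critical"}),
    overdue_seconds=5 * 60,
    assignee="optimal",
)
```

Under this model, a CLEAR alarm that is 301 seconds old would never be assigned by the rule itself, which is why the assignment seen in the log is unexpected.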

     

    Searching in ServiceNow, I found other cases similar to this one (Severity = CLEAR): 2300 cases since 2015, of which 1464 incidents were opened in 2017 alone.

     

    Please, would anyone have any suggestions or ideas about this case?



  • 2.  Re: NAS - Auto-Operator Rules Not Working Correctly

    Posted 11-29-2017 09:38 AM

    I suspect the problem is not NAS failing to match the correct severity, but the NAS configuration. From a couple of tests I've done, if an AO rule matches an alarm correctly and, as in your case, assigns it to the defined user after 5 minutes, the alarm keeps that assignment whenever it "changes" severity. When the alarm is "cleared" by a clear alarm sent by the probe, what happens is that its severity changes from Critical or Major to Clear. So it is to be expected that the clear alarm stays assigned, even though the AO should only assign Critical and Major alarms.

     

    Now, looking at your screenshot:

     

    What I can assume from it is that you have the NAS set to display clear alarms. By default you don't see them.

    Below is the configuration where you set this (Unticking the box below enables clear alarms to be stored in the NAS_ALARMS table)

     

     

    I'm not familiar with how SNGTW scans the alarms, but if it reads the NAS_ALARMS table, that might explain the behavior: SNGTW finds an alarm that is assigned to 'optimal' and hence triggers the ticket.
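To illustrate the idea, here is a small sketch of a gateway that selects alarms purely by assignee, next to one that also checks the current severity. It is an assumption about how such a scan might behave, not a description of SNGTW; the row keys are illustrative and are not the real NAS_ALARMS columns.

```python
# Hypothetical alarm rows; keys are illustrative only.
rows = [
    {"nimid": "A-1", "level": "critical", "assigned_to": "optimal"},
    # Assigned while critical, then cleared; the assignment was kept.
    {"nimid": "A-2", "level": "clear", "assigned_to": "optimal"},
    {"nimid": "A-3", "level": "major", "assigned_to": None},
]

TICKET_LEVELS = frozenset({"major", "critical"})


def select_by_assignee(rows, user):
    """Pick every alarm assigned to `user`, whatever its current severity."""
    return [r["nimid"] for r in rows if r["assigned_to"] == user]


def select_by_assignee_and_level(rows, user, levels=TICKET_LEVELS):
    """Pick alarms assigned to `user` whose current severity still qualifies."""
    return [
        r["nimid"]
        for r in rows
        if r["assigned_to"] == user and r["level"] in levels
    ]
```

With the sample rows, the first function would also return the cleared alarm `A-2`, while the second would skip it.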

     

    So the question is: do you have "Accept automatic acknowledgment" of alarms unticked in your NAS? If yes, is that for a reason?

     

    If none of the above helps, another thing I would check is whether you have any AO script that is also assigning alarms, as I see "assigned by administrator" in the transaction history (when it should only be auto-operator if it's an AO that is assigning the alarm).

     



  • 3.  Re: NAS - Auto-Operator Rules Not Working Correctly

    Posted 11-29-2017 12:54 PM

    Marco,

     

    Thank you very much for your tests and your answer.

     

    I think you understood that Major and Critical alarms must be assigned to the user defined via AO even if the alarm has been CLEARED before 5 minutes, as it should be checked by the specialist. Am I right?

     

    However, for as long as we have used Nimsoft, if an alarm is CLEARED before 5 minutes, it is not assigned to the user defined via AO. I have many examples of this; it is the behavior we expect for our daily operational routine.

     

    And yes, your guess is correct, CLEARED alarms are displayed on our alarm console.
    The "Accept Automatic Acknowledgment" option in our NAS probe is unticked.

     

    We use clear alarms as a premise, so that specialists and engineers know that the incident has been resolved and are allowed to close the incident in ServiceNow.

     

    About your last question, regarding "assigned by administrator": I do not understand this behavior in our NAS probe. We do not have scripts running in the NAS probe, and I cannot tell you why "assigned by administrator" appears there.



  • 4.  Re: NAS - Auto-Operator Rules Not Working Correctly

    Posted 11-30-2017 05:07 AM

    Hi Jean, got it. Does this happen on a regular basis, or have you only seen it once? Does it happen only with cisco_monitor alarms? Are you able to reproduce the problem?



  • 5.  Re: NAS - Auto-Operator Rules Not Working Correctly

    Posted 11-30-2017 05:30 AM

    Marco, hello.

     

    We identified this behavior in the last week. It is not the behavior we expect. The expected behavior is that only Major / Critical alarms are assigned by the AO to the optimal user.

     

    I generated some reports in ServiceNow; here are the numbers. In 2017, Nimsoft opened 52350 incidents in ServiceNow. Of those 52350 incidents, 1464 were opened incorrectly with Severity CLEAR.

     

    This does not occur only with cisco_monitor; we have seen it with all probes in our environment, such as cdm, interface_traffic, snmpget, etc.

     

    Unfortunately I cannot reproduce this behavior, since the assignment is automated by the AO and involves no action on my part.

     

    Regards,

    Jean Gomes



  • 6.  Re: NAS - Auto-Operator Rules Not Working Correctly

    Posted 11-30-2017 07:02 AM

    My hypothesis of what's happening there:

    The critical alarm came in at 7:59:00, and after 5 minutes (when age=301s, at 8:04:02:034) the alarm was assigned; at almost the same time (seconds later) a clear came in (this could be a corner case). The result is what I explained in my first comment: the alarm is assigned but has changed severity. I would analyze different alarms and see what happened in those other cases. NAS should not be matching the clear alarm as such; rather, it matches the critical alarm and assigns it if the alarm is still there after 5 minutes. I would also check whether an administrator manually interfered with this alarm.
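The timing hypothesis can be replayed with a tiny simulation. This is a sketch under stated assumptions, not NAS behavior taken from documentation: `replay` is a hypothetical helper, the overdue check is assumed to run once, and a clear is assumed to change only the severity while leaving any existing assignment in place.

```python
def replay(clear_at, overdue=300, check_at=301, assignee="optimal"):
    """Replay one alarm: critical at t=0, an optional clear at `clear_at`
    seconds, and a single overdue check at `check_at` seconds.
    Returns the final (level, assigned_to) pair."""
    level, assigned_to = "critical", None
    events = [(check_at, "check")]
    if clear_at is not None:
        events.append((clear_at, "clear"))
    for t, kind in sorted(events):  # process events in time order
        if kind == "clear":
            # Severity changes; the assignment is deliberately left untouched.
            level = "clear"
        elif level in ("major", "critical") and t >= overdue:
            assigned_to = assignee
    return level, assigned_to


early = replay(clear_at=120)   # clear arrives before the 5-minute check
late = replay(clear_at=303)    # clear arrives seconds after the assignment
```

In this model the early clear leaves the alarm unassigned, while the late clear produces exactly the combination seen in the incident: severity CLEAR with the alarm assigned to `optimal`.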

     

    However, regardless of all of the above, I believe you will still run into this issue, because clear alarms that are assigned will be visible. For example, when the clear comes in after 5 minutes, you will get tickets opened.

     

    I think you can resolve this issue by enabling "Accept automatic acknowledgment". If this option is enabled, clear alarms won't be visible, so no ticket will ever be created for clear alarms.

     

    But you say that you keep it disabled so that "specialists and engineers know that the incident has been resolved and are allowed to close the incident in ServiceNow."

     

    However, as I understand it, the probe is able to automatically change the incident status to Closed or Resolved when the corresponding alarm is closed (only when the alarm is "Acknowledged", which means closed).

     

    If you enable "Accept automatic acknowledgment", alarms that are cleared will behave as follows:

     

    1. If clear arrives before 5 minutes since the critical or major > it will CLEAR and CLOSE the alarm so NO ticket will be raised in SNOW

    2. If clear arrives after 5 minutes since the critical or major > it will CLEAR and CLOSE the assigned alarm and, as a consequence, it will also close the SNOW ticket.
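The two cases above can be written down as a small decision function. This is only a restatement of the behavior described in this post, assuming the option is enabled and the gateway closes tickets on acknowledgment; `expected_outcome` and its dictionary keys are hypothetical names.

```python
def expected_outcome(clear_after_seconds, overdue_seconds=300):
    """Expected ticket outcome when a clear arrives `clear_after_seconds`
    after the critical or major alarm, with automatic acknowledgment on."""
    if clear_after_seconds < overdue_seconds:
        # Case 1: alarm is closed before the AO rule can assign it.
        return {"ticket_opened": False, "ticket_closed": False}
    # Case 2: alarm was assigned and ticketed, then closed by the clear.
    return {"ticket_opened": True, "ticket_closed": True}
```

For example, `expected_outcome(120)` describes case 1 and `expected_outcome(600)` describes case 2.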

     

    Are you aware of this? Is there any reason why you are not using this feature?

     

     



  • 7.  Re: NAS - Auto-Operator Rules Not Working Correctly

    Posted 12-03-2017 10:58 AM

    I'll toss in my experience here, as I've seen something similar, though not with the ServiceNow gateway.

     

    The problem we ran into was that, while the NAS AO operations are guaranteed to be chronological in nature, the downstream processes are not.

     

    That was combined with the fact that some probes operate on a copy of the data and some only get a reference to the current data.

     

    What happened occasionally to us was that an alarm would come in and fire the AO profile associated with the create. This then queued an "open case" event for another process to pick up and operate on. This hand-off only passed the NIMID of the event. 

     

    Most of the time, things were fast enough that the data referred to by the NIMID was static between the AO running and the other process acting on the data. But in cases where the NAS alarm queue got backed up, you could have thousands of creates and clears processed in a couple of seconds, which bogged down the downstream process and its work queue.

     

    If an event had an open and a close in a batch of quickly processed data like this, then when the downstream process got the "open event" request, it looked up the alarm data using the NIMID and read an alarm that was already in a cleared state. It then happily processed this as an open event and created "closed" cases as new ones.

     

    This further confused things because the "close event" process only tried to close cases that were not already in a clear (closed) status. So when, a fraction of a second later, this process got the "close event" request, it did nothing, because it was being asked to close an open event that already had a closed status.

     

    Essentially it was an ugly race condition. 

     

    We wound up addressing this by introducing logic at each step of the process that checked to make sure the data being operated on was sensible and hadn't been changed by something else along the way. And when it was appropriate, the process made its own local copy of data and stopped referencing the source data by NIMID.
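A rough sketch of that kind of guard, next to the naive handler that caused the problem, might look like the following. Everything here is hypothetical and written for illustration: the in-memory `store`, the request shape, and the handler names are assumptions, not our actual code or any UIM API.

```python
import copy


def make_open_request(store, nimid):
    """Queue entry that carries its own snapshot instead of only the NIMID."""
    return {"nimid": nimid, "snapshot": copy.deepcopy(store[nimid])}


def handle_open_naive(store, nimid, cases):
    """Looks the alarm up by NIMID at processing time, so it may read a
    state that changed after the rule fired."""
    alarm = store[nimid]
    cases[nimid] = {"severity": alarm["level"], "status": "open"}


def handle_open_guarded(store, request, cases):
    """Checks that the alarm still makes sense before opening a case and
    works from the snapshot taken when the rule fired."""
    current = store.get(request["nimid"])
    if current is None or current["level"] == "clear":
        return False  # cleared or removed in the meantime: open nothing
    snap = request["snapshot"]
    cases[request["nimid"]] = {"severity": snap["level"], "status": "open"}
    return True


# Replay of the race: rule fires, then the clear lands before processing.
store = {"N-1": {"level": "critical"}}
request = make_open_request(store, "N-1")
store["N-1"]["level"] = "clear"

naive_cases, guarded_cases = {}, {}
handle_open_naive(store, "N-1", naive_cases)
opened = handle_open_guarded(store, request, guarded_cases)
```

In this replay the naive handler opens a case with severity `clear`, while the guarded handler opens nothing. Whether to skip the case or open it from the snapshot is a policy choice; the point is that the decision is made explicitly rather than by accident.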

     

    -Garin