DX NetOps

  • 1.  CRITICAL Chassis down alarm did not clear after router back online

    Posted Jan 10, 2018 02:18 PM

    From time to time we notice a critical chassis down alarm that does not clear by itself.  How do we troubleshoot this to find the root cause?  BGP was re-established a few minutes later.  The alarm shows as STALE, but it should have cleared by itself.

    ITGC2W000145%/d/CA_Spectrum/vnmsh > show alarms -a mh=0x50e171

    ID        Date        Time      PCauseId  MHandle   MName             MTypeName  Severity  LastOccurDate&Time   Ack  Stale  Assignment  Status
    46955415  01/05/2018  02:12:41  0x10f71   0x50e171  VNDDRTR1.vfc.com  Rtr_Cisco  MAJOR     01/05/2018 02:12:41  No   Yes
    13845619  05/20/2017  15:28:40  0x1030a   0x50e171  VNDDRTR1.vfc.com  Rtr_Cisco  OK        01/06/2018 06:06:18  Yes  Yes
    46955413  01/05/2018  02:12:41  0x10f69   0x50e171  VNDDRTR1.vfc.com  Rtr_Cisco  CRITICAL  01/05/2018 02:12:41  No   Yes

    ITGC2W000145%/d/CA_Spectrum/vnmsh > ./disconnect
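To keep an eye on stale alarms over time, the `show alarms` listing above could be captured to a file and filtered with a small script. The following is only a sketch: the function name `parse_alarms` and the sample text are illustrative, and it assumes every record follows the column order shown above with no whitespace inside any field (a model name containing spaces would not match).

```python
import re

# One alarm record as printed by `show alarms`: ID, date, time, PCauseId,
# MHandle, MName, MTypeName, Severity, last-occurrence date and time,
# Ack, Stale. Assumption: no field contains whitespace.
ALARM_RE = re.compile(
    r"(?P<id>\d+)\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<cause>0x[0-9a-fA-F]+)\s+(?P<mh>0x[0-9a-fA-F]+)\s+"
    r"(?P<name>\S+)\s+(?P<mtype>\S+)\s+(?P<severity>[A-Z]+)\s+"
    r"(?P<last_date>\d{2}/\d{2}/\d{4})\s+(?P<last_time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<ack>Yes|No)\s+(?P<stale>Yes|No)"
)

# Sample text copied from the listing in this post.
SAMPLE = """
46955415 01/05/2018 02:12:41 0x10f71 0x50e171 VNDDRTR1.vfc.com Rtr_Cisco MAJOR 01/05/2018 02:12:41 No Yes
13845619 05/20/2017 15:28:40 0x1030a 0x50e171 VNDDRTR1.vfc.com Rtr_Cisco OK 01/06/2018 06:06:18 Yes Yes
46955413 01/05/2018 02:12:41 0x10f69 0x50e171 VNDDRTR1.vfc.com Rtr_Cisco CRITICAL 01/05/2018 02:12:41 No Yes
"""


def parse_alarms(text):
    """Return one dict per alarm record found anywhere in `text`."""
    return [match.groupdict() for match in ALARM_RE.finditer(text)]


if __name__ == "__main__":
    for alarm in parse_alarms(SAMPLE):
        # Report unacknowledged alarms that are flagged stale.
        if alarm["stale"] == "Yes" and alarm["ack"] == "No":
            print(alarm["id"], alarm["severity"], alarm["cause"],
                  alarm["date"], alarm["time"])
```

Because the pattern matches on whitespace rather than on line breaks, it also works when the captured output has been flattened onto a single line.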

     

    Events Tab

    Jan 5, 2018 2:14:30 AM EST    VNDDRTR1.vfc.com    A "cbgpFsmStateChange" event has occurred, from Rtr_Cisco device, named VNDDRTR1.vfc.com.

    The BGP cbgpFsmStateChange notification is generated for every BGP FSM state change. The bgpPeerRemoteAddr value is attached to the notification object ID.

    bgpPeerLastError = 0.0
    bgpPeerLastError.bgpPeerRemoteAddr = 10.255.28.29
    bgpPeerState = established
    cbgpPeerLastErrorTxt =
    cbgpPeerPrevState = openconfirm
    Jan 5, 2018 2:14:30 AM EST    VNDDRTR1.vfc.com    A bgpEstablished trap has been received for this device.  The peer router is 10.255.28.29, the current state is established, and the LastError is 0.0.
    Jan 5, 2018 2:14:30 AM EST    VNDDRTR1.vfc.com_Se0/0/0    The BGP Peering session from VNDDRTR1.vfc.com to US MPLS Sprint AS1803 is established.



  • 2.  Re: CRITICAL Chassis down alarm did not clear after router back online

    Broadcom Employee
    Posted Jan 11, 2018 10:37 AM

    The BGP notification isn't configured to clear chassis alarms...unless you were just posting that to note the device was back online.  If the alarm is stale, that may be why it didn't clear.  Were you able to manually clear it?

    Cheers

    Jay 



  • 3.  Re: CRITICAL Chassis down alarm did not clear after router back online

    Posted Jan 11, 2018 05:08 PM

    I can clear it manually.  I posted the BGP message only to show that the device had connectivity again about 2 minutes after the alarm timestamp.  I don't understand why this particular alarm did not clear on its own.  I also notice other minor/major/critical stale alarms.  I can clear all of those and start monitoring for when they occur.

    Since the device seems to have been reachable within 3 minutes of the alarm timestamp, I'm just curious how to investigate why the chassis down alarm did not clear on its own.



  • 4.  Re: CRITICAL Chassis down alarm did not clear after router back online
    Best Answer

    Broadcom Employee
    Posted Jan 12, 2018 03:53 PM

    If the device comes back online, the chassis down alarm should clear on its own.  The only time I've seen it fail to clear is when there was a customization on the Chassis event/alarm instead of the default (check the /custom/Events/EventDisp).  If you have the default event configuration and this keeps happening, you may need to open a case so we can review the data.

    Cheers

    Jay
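The check suggested above could be scripted roughly as follows. This is a sketch under stated assumptions, not a documented procedure: it assumes the install root is available in a `SPECROOT` environment variable, that comment lines in EventDisp start with `#`, and that a customization affecting the alarm would mention its probable cause ID (0x10f69 for the CRITICAL alarm in the listing earlier in this thread) somewhere on the line. The helper name `lines_mentioning` is made up for illustration.

```python
import os
import re
import sys

HEX_RE = re.compile(r"0x[0-9a-fA-F]+")


def lines_mentioning(lines, code):
    """Return (line_number, text) for non-comment lines containing `code`.

    Hex tokens are compared by value, so 0x10f69 also matches 0x00010f69.
    """
    wanted = int(code, 16)
    hits = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no rules.
        if not text or text.startswith("#"):
            continue
        if any(int(token, 16) == wanted for token in HEX_RE.findall(text)):
            hits.append((number, text))
    return hits


if __name__ == "__main__":
    # Assumption: SPECROOT points at the Spectrum install directory.
    specroot = os.environ.get("SPECROOT", "")
    path = os.path.join(specroot, "custom", "Events", "EventDisp")
    code = sys.argv[1] if len(sys.argv) > 1 else "0x10f69"
    if not os.path.isfile(path):
        print(f"No custom EventDisp found at {path}")
    else:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, text in lines_mentioning(handle, code):
                print(f"{path}:{number}: {text}")
```

No output for a given code only means the custom file does not mention it; it says nothing about why an alarm stayed stale.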



  • 5.  Re: CRITICAL Chassis down alarm did not clear after router back online

    Posted Jan 15, 2018 10:04 AM

    OK, we don't have an EventDisp under /custom/Events.  I'll keep an eye out from this point forward to note when stale alarms occur, especially right before and after a SpectroSERVER shutdown/restart, as I know that is one situation in which this can occur.