Brocade Fibre Channel Networking Community

Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

  • 1.  Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

    Posted 03-25-2013 09:09 AM

    We have two Brocade 48000 directors running Fabric OS v5.3.2c (yes, I know we're behind). At the same time yesterday morning, BOTH switches logged an error on CP0 and successfully failed over to CP1.

    From the CP0 error log:

    2013/03/24-03:54:55, , 772, FFDC, WARNING, IBM_2109_M48, kSWD: Detected unexpected termination of: ''zoned:0'RfP=925,RgP=925,DfP=0,died=1,rt=125771772,dt=83402,to=50000,aJc=125670272,aJp=125636971,abiJc=-1027802072,abiJp=-1027835372,aSeq=7636,kSeq=0,kJc=0,kJp=0,J=1256

    2013/03/24-03:54:55, , 773,, WARNING, IBM_2109_M48, HA State out of sync.

    2013/03/24-03:54:55, , 774,, INFO, IBM_2109_M48, First failure data capture (FFDC) event occurred.

    2013/03/24-03:54:56, , 775,, WARNING, M48_Switch_2, Switch status changed from HEALTHY to MARGINAL.

    2013/03/24-03:54:56, , 776,, WARNING, M48_Switch_2, Switch status change contributing factor CP: CP non-redundant.

    2013/03/24-03:58:16, , 777,, WARNING, IBM_2109_M48, Trace dump available (Slot 5)! (reason: PANIC)

    2013/03/24-03:58:16, , 778,, WARNING, IBM_2109_M48, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.

    2013/03/24-03:58:17, , 779,, INFO, IBM_2109_M48, Processor rebooted - Software Fault:Software Watchdog

    2013/03/24-03:58:36, , 780,, INFO, M48_Switch_2, Config file change from task:PDMIPC

    2013/03/24-03:58:36, , 781,, INFO, IBM_2109_M48, HA State is in sync.

    2013/03/24-03:54:55, , 3168,, WARNING, IBM_2109_M48, HA State out of sync.

    2013/03/24-03:55:53, , 3169,, INFO, M48_Switch_2, iSNS Client Service is disabled.

    2013/03/24-03:55:53, , 3170,, INFO, M48_Switch_2, Config file change from task:PDMIPC

    2013/03/24-03:55:58, , 3176,, INFO, M48_Switch_2, Previous message repeated 6 time(s)

    2013/03/24-03:56:06, , 3177,, INFO, M48_Switch_2, Config file change from task:PDMIPC

    2013/03/24-03:56:06, , 3178,, ERROR, IBM_2109_M48, CP in Slot 5 set to faulty because CP ERROR asserted.

    2013/03/24-03:56:06, , 3179,, INFO, M48_Switch_2, Config file change from task:PDMIPC

    2013/03/24-03:56:09, , 3182,, INFO, M48_Switch_2, Previous message repeated 3 time(s)

    2013/03/24-03:56:43, , 3183,, WARNING, M48_Switch_2, Switch status changed from HEALTHY to MARGINAL.

    2013/03/24-03:56:43, , 3184,, WARNING, M48_Switch_2, Switch status change contributing factor CP: CP non-redundant.

    2013/03/24-03:57:26, , 3185,, INFO, IBM_2109_M48, Resetting standby CP (double reset may occur)

    2013/03/24-03:57:29, , 3186,, INFO, IBM_2109_M48, CP in slot 5 not faulty, CP ERROR deasserted.

    2013/03/24-03:58:10, , 3187,, WARNING, IBM_2109_M48, Trace dump available (Slot 5)! (reason: PANIC)

    2013/03/24-03:58:10, , 3188,, WARNING, IBM_2109_M48, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.

    2013/03/24-03:58:10, , 3189,, INFO, M48_Switch_2, Config file change from task:PDMIPC

    2013/03/24-03:58:35, , 3193,, INFO, M48_Switch_2, Previous message repeated 4 time(s)

    2013/03/24-03:58:37, , 3194,, INFO, IBM_2109_M48, HA State is in sync.

    2013/03/24-03:58:39, , 3195,, INFO, M48_Switch_2, Switch status changed from MARGINAL to HEALTHY.

    More Info:

    M48_Switch_2:admin> firmwareshow

    Slot Name     Appl     Primary/Secondary Versions               Status

    ------------------------------------------------------------------------

      5  CP0      FOS      v5.3.2c                                  STANDBY *

                           v5.3.2c                                 

      6  CP1      FOS      v5.3.2c                                  ACTIVE

                           v5.3.2c                                 

    M48_Switch_2:admin> hashow

    Local CP (Slot 5, CP0): Standby

    Remote CP (Slot 6, CP1): Active

    HA enabled, Heartbeat Up, HA State synchronized

    What happened? How did this happen on BOTH switches at once? They're not connected to each other.
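
    When comparing the error logs of two chassis, it can help to reduce each log to its WARNING and ERROR entries in time order. The following is a minimal Python sketch of that idea. The field layout (timestamp, an empty field, sequence number, flags, severity, name, message) is an assumption inferred only from the lines quoted in this post, and the names RasEntry, parse_line and notable are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class RasEntry(NamedTuple):
    timestamp: str   # e.g. "2013/03/24-03:54:55"
    seq: int         # message sequence number
    flags: str       # e.g. "FFDC", often empty
    severity: str    # INFO, WARNING, ERROR, ...
    source: str      # chassis or switch name
    message: str


# Assumed layout, inferred only from the log lines quoted in this thread:
# <timestamp>, , <seq>, <flags>, <severity>, <name>, <message>
_LINE = re.compile(
    r"^\s*(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}),\s*,\s*(\d+),\s*([^,]*),\s*"
    r"(\w+),\s*([^,]+),\s*(.*)$"
)


def parse_line(line: str) -> Optional[RasEntry]:
    """Parse one log line; return None if it does not match the assumed layout."""
    m = _LINE.match(line)
    if m is None:
        return None
    ts, seq, flags, sev, src, msg = m.groups()
    return RasEntry(ts, int(seq), flags.strip(), sev, src.strip(), msg.strip())


def notable(lines: List[str]) -> List[RasEntry]:
    """Return WARNING and ERROR entries ordered by timestamp, then sequence."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    keep = [e for e in entries if e.severity in ("WARNING", "ERROR")]
    # The fixed-width timestamp format sorts correctly as a plain string.
    return sorted(keep, key=lambda e: (e.timestamp, e.seq))


sample = [
    "2013/03/24-03:54:55, , 773,, WARNING, IBM_2109_M48, HA State out of sync.",
    "2013/03/24-03:56:06, , 3178,, ERROR, IBM_2109_M48, CP in Slot 5 set to faulty because CP ERROR asserted.",
    "2013/03/24-03:58:36, , 781,, INFO, IBM_2109_M48, HA State is in sync.",
]
for e in notable(sample):
    print(e.timestamp, e.severity, e.message)
```

    Running the same filter over the log of each switch and placing the results side by side makes it easier to see whether both chassis reported the same daemon termination at the same second.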


  • 2.  Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

    Posted 03-25-2013 09:19 AM

    The failover seems to have been triggered by a daemon that failed. Check the CORE file that was created to see which daemon failed, and whether it was the same one on both switches (intensive SNMP polling could be a cause, for example).

    rgds,

    F


  • 3.  Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

    Posted 03-25-2013 09:24 AM

    Would the CORE files be on CP0 or CP1? How do I read them?

    Thank You


  • 4.  Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

    Posted 03-25-2013 09:35 AM

    First of all you should generate a fresh supportsave.

    Within the supportsave, there should be a file named XXXX.CORE_FFDC.tar.gz. In it, you may find info about the daemon that failed.

    Rgds
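
    Once the supportsave has been copied to a workstation, the archive can be inspected without unpacking it by hand. The sketch below uses Python's standard tarfile module to list regular files under a directory named "panic" and to search one member for a string. The assumption that the interesting files sit under a "panic" directory comes only from this thread, and the function names are hypothetical.

```python
import tarfile
from typing import IO, List


def list_panic_files(fileobj: IO[bytes]) -> List[str]:
    """Return names of regular files that sit under a 'panic' directory."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return [
            m.name
            for m in tar.getmembers()
            if m.isfile() and "panic" in m.name.split("/")[:-1]
        ]


def grep_member(fileobj: IO[bytes], member: str, needle: str) -> List[str]:
    """Return the lines of one archive member that contain 'needle'."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        extracted = tar.extractfile(member)
        if extracted is None:  # member exists but is not a regular file
            return []
        # Core text may contain stray bytes, so decode leniently.
        text = extracted.read().decode("utf-8", errors="replace")
    return [ln for ln in text.splitlines() if needle in ln]


# Usage, with the placeholder file name from the post above:
#   with open("XXXX.CORE_FFDC.tar.gz", "rb") as f:
#       print(list_panic_files(f))
```

    Both functions take an open binary file object rather than a path, so they work equally on a file on disk or on data already held in memory.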


  • 5.  Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

    Posted 03-25-2013 10:30 AM

    Thank you. Un-tarring the core file created several directories. In one called "panic" there was a file called core.pd1364111880, in which I found the text below. Is the "No space left on device" error the problem that caused the panic?

    == Dumping debug information ==

    2013/03/24-03:54:55, , 772, FFDC, WARNING, IBM_2109_M48, kSWD: Detected unexpected termination of: ''zoned:0'RfP=925,RgP=925,DfP=0,died=1,rt=125771772,dt=83402,to=50000,aJc=125670272,aJp=125636971,abiJc=-1027802072,abiJp=-1027835372,aSeq=7636,kSeq=0,kJc=0,kJp=0,J=1256

    2013/03/24-03:54:55, , 773,, WARNING, IBM_2109_M48, HA State out of sync.

    shmInit: shmget failed: No space left on device

    shmInit: shmget failed: No space left on device

    F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

    4     0     1     0  16   0   1700   580 -      S    ?          0:03 init

    1     0     2     1  34  19      0     0 -      SN   ?          0:16 ksoftirqd/0

    1     0     3     1  10  -5      0     0 -      S<   ?          0:00 events/0

    1     0     4     1  20  -5      0     0 -      S<   ?          0:00 khelper

    1     0     5     1  11  -5      0     0 -      S<   ?          0:00 kthread

    1     0    27     5  10  -5      0     0 -      S<   ?          0:00  \_ kblockd/0

    1     0    56     5  15   0      0     0 -      S    ?          0:00  \_ pdflush

    1     0    59     5  14  -5      0     0 -      S<   ?          0:00  \_ aio/0

    1     0    60     5  14  -5      0     0 -      S<   ?          0:00  \_ xfslogd/0

    1     0    61     5  14  -5      0     0 -      S<   ?          0:00  \_ xfsdatad/0

    1     0    62     5  10  -5      0     0 -      S<   ?          0:00  \_ xfsbufd

    1     0    69     5  14  -5      0     0 -      S<   ?          0:00  \_ kseriod

    1     0   844     5  15   0      0     0 -      S    ?          0:00  \_ pdflush

    1     0    58     1  15   0      0     0 -      S    ?          0:00 kswapd0

    1     0   265     1  15   0      0     0 -      S    ?          0:00 kjournald

    1     0   283     1 -100  -   1680   396 -      Ss   ?          0:00 wdtd

    1     0   345     1  15   0      0     0 -      S    ?          0:08 kjournald

    5     1   495     1  16   0   1692   428 -      Ss   ?          0:00 portmap

    5     0   513     1  16   0   2120   636 -      Ss   ?          0:00 inetd

    1     0   537     1  16   0   1812   616 -      Ss   ?          0:06 crond

    1     0   538     1  15   0   1948   652 -      Ss   ?          0:00 syslogd

    5     0   544     1  16   0   1704   380 -      Ss   ?          0:00 klogd

    1     0   550     1  15   0      0     0 -      S    ?          0:01 RASLOGK_TH

    1     0   681     1  15   0      0     0 -      S    ?          2:44 module-99-th

    1     0   684     1  19   0      0     0 -      S    ?          0:00 module-107-th

    1     0   687     1  19   0      0     0 -      S    ?          0:00 module-126-th

    1     0   712     1  19   0      0     0 -      S    ?          0:00 kmtracer

    0     0   787     1  16   0  61312  5248 -      Dl   ?          0:22 raslogd

    1     0  5847   787  16   0  61316  2196 -      Ss   ?          0:00  \_ raslogd

    0     0  5848  5847  18   0   8648  2012 -      R    ?          0:00      \_ tracedump

    5     0   789     1  16   0  16620  1708 -      Ssl  ?          7:24 ipadmd

    1     0   792     1  16   0   7444  1264 -      Ss   ?          0:00 telnetmond

    0     0   800     1  16   0  17088  2472 -      Sl   ?          0:00 sysctrld

    4     0   898   800   0 -16  12084  1112 -      S

    4     0   899   800  16   0  51548  3292 -      Sl   ?          0:00  \_ pdmd

    0     0   900   800  16   0  23640  1884 -      Sl   ?          0:00  \_ hmond

    0     0   901   800  16   0   8100  1788 -      S    ?          0:00  \_ diagd

    0     0   902   800  18   0   8044  1780 -      S    ?          0:00  \_ porttestd

    0     0   903   800  16   0  77004  3876 -      Sl   ?        128:33  \_ emd

    0     0   905   800  16   0  50248  2344 -      Sl   ?          0:00  \_ bmd

    4     0   906   800  16   0  24484  2176 -      Sl   ?          0:00  \_ hamd

    0     0   923   800  16   0  59116  2996 -      Sl   ?          0:00  \_ essd

    4     0   924   800   6 -10  53820  3708 -      S

    0     0   926   800  16   0  51504  2976 -      Sl   ?          9:06  \_ fspfd

    0     0   927   800  15   0  72484  6376 -      Sl   ?         10:38  \_ nsd

    0     0   929   800  16   0  33800  2476 -      Sl   ?          0:00  \_ arrd

    0     0   930   800  16   0  65708  4096 -      Sl   ?         34:39  \_ msd

    0     0   931   8000:'nsd:0' PS:3(C2013/03/24-03:54shmInit: shmget failed: No spaceshmInit: shmget failed: No space left on device

    sysctrld tries to remove shm with id 0

    Assert failure: /vobs/pWarning: bad ps chassis0(0): ACTIVE(0), Required

    local = SYN_SUCC, prev = SYN_STime=7:54:56-480Time=7:54:56-480output of memshoshmInit: shmget failed: No space left on device

                 total       used       free     shared    buffers     cached

    Mem:     520110080  486871040   33239040          0   33091584  273969152

    Swap:            0          0          0

    shmInit: shmget failed: No space left on device

    2013/03/24-03:54:56, , 775,, WARNING, M48_Switch_2, Switch status changed from HEALTHY to MARGINAL.

    uSWD: End of Data ==========


  • 6.  Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

    Posted 03-26-2013 01:13 AM

    Hi,

    Yes, it seems that the daemons ran out of memory, or maybe out of physical space on the flash.

    You should monitor the free space (commands df and memshow) to prevent this situation from recurring.

    As you know, an upgrade would be a good option.

    rgds
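
    The memshow monitoring suggested in this reply could be automated along the following lines. This is a minimal Python sketch that assumes the memshow text has already been captured by some other means and that it has the column layout shown in the core dump earlier in the thread. The function names are hypothetical. Two caveats apply: on Linux, buffers and page cache can generally be reclaimed, so the "free" column alone understates available memory. Also, on Linux, shmget fails with "No space left on device" when a shared memory limit is reached rather than when a filesystem is full. Whether that is what happened on Fabric OS v5.3.2c is not established here.

```python
from typing import Dict


def parse_memshow(text: str) -> Dict[str, int]:
    """Extract the 'Mem:' row; columns assumed as in the thread's dump."""
    fields = ("total", "used", "free", "shared", "buffers", "cached")
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 7 and parts[0] == "Mem:":
            return dict(zip(fields, map(int, parts[1:7])))
    raise ValueError("no 'Mem:' row found")


def free_fraction(mem: Dict[str, int]) -> float:
    """Share of total memory reported as strictly free."""
    return mem["free"] / mem["total"]


def reclaimable_fraction(mem: Dict[str, int]) -> float:
    """Share of total memory that is free or held as buffers/cache."""
    return (mem["free"] + mem["buffers"] + mem["cached"]) / mem["total"]


# Values copied from the core dump quoted earlier in the thread.
sample = """\
             total       used       free     shared    buffers     cached
Mem:     520110080  486871040   33239040          0   33091584  273969152
Swap:            0          0          0
"""
mem = parse_memshow(sample)
print(f"free: {free_fraction(mem):.1%}")
print(f"free + buffers + cached: {reclaimable_fraction(mem):.1%}")
```

    For the dump in this thread, roughly 6% of memory is strictly free, while roughly 65% is free or held in buffers and cache, so a threshold on the second figure is likely to produce fewer false alarms than one on the first.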

