ESXi

SCSI Device Sense Codes : SCSIDeviceIO (...) failed & nmp_ThrottlelogForDevice messages in vmkernel.log for all Dell EMC Unity 650f VMFS volumes

  • 1.  SCSI Device Sense Codes : SCSIDeviceIO (...) failed & nmp_ThrottlelogForDevice messages in vmkernel.log for all Dell EMC Unity 650f VMFS volumes

    Posted Aug 17, 2020 10:59 AM

    An excerpt from my vmkernel.log

    We had storage incidents on our Unity 650F: controller crashes caused by two unresolved bugs, a read-cache merge problem and a backend driver issue.

    Dell is working on a 'custom' fix for both issues. In the meantime we were asked to mitigate the controller auto-resets and to upgrade to OE 5.0.3, which they agree will not resolve the current controller resets. After complaints from our side, they dug into every component of our infrastructure to mitigate the impact on their storage. The following SCSI sense codes were found in the vmkernel log, and we were referred to host support to suppress the 'illegal SCSI commands'.

    According to them, these need to be addressed because they contribute to the crashes of a controller node: seen from the storage side, the hosts are attempting target resets.

    2020-08-17T01:49:15.677Z cpu108:65805)ScsiDeviceIO: 3015: Cmd(0x439e4341dfc0) 0xfe, CmdSN 0xbbc2e5 from world 65687 to dev "naa.60060160e8004b00e2ca985c1400127d" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.

    2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e4368a5c0, 65687) to dev "naa.60060160e8004b007e9d9a5cf732ff8e" on path "vmhba2:C0:T11:L47" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

    2020-08-17T01:49:15.677Z cpu65:3876714)ScsiDeviceIO: 3015: Cmd(0x439e43720dc0) 0xfe, CmdSN 0x719b30 from world 65687 to dev "naa.60060160e8004b007e9d9a5cf732ff8e" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

    2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e435e02c0, 65687) to dev "naa.60060160e8004b0058fbdc5d4a63c4ba" on path "vmhba3:C0:T10:L102" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

    2020-08-17T01:49:15.677Z cpu65:3876714)ScsiDeviceIO: 3015: Cmd(0x439e43502940) 0xfe, CmdSN 0x382ef0 from world 65687 to dev "naa.60060160e8004b0058fbdc5d4a63c4ba" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

    2020-08-17T01:49:15.877Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e435079c0, 65687) to dev "naa.60060160e8004b00de1a995cd70e3c6a" on path "vmhba3:C0:T12:L30" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

    I entered those values into

    https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/?host=8&device=0&plugin=0&sensekey=80&asc=41&ascq=0&opcode=

    So according to this, the HBA resets the target, which in this case is a Dell EMC Unity FC port.
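The H:, D: and P: fields in the log lines above are the host, device and plugin status of the failed command. As a rough sketch of what the online decoder does for the host status, the snippet below pulls these fields out of a vmkernel.log line. The function name `decode_line` and the regular expression are illustrative assumptions, and the status table lists only a subset of the host status names.

```python
import re

# Subset of ESXi host status names (H: field); unknown codes map to "UNKNOWN".
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x4: "BAD_TARGET",
    0x5: "ABORT",
    0x6: "PARITY",
    0x7: "ERROR",
    0x8: "RESET",
}

# Matches both ScsiDeviceIO lines (no path) and NMP lines (with a path).
STATUS_RE = re.compile(
    r'dev "(?P<dev>naa\.[0-9a-f]+)"'
    r'(?: on path "(?P<path>[^"]+)")?'
    r'.*?H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)',
    re.IGNORECASE,
)


def decode_line(line):
    """Return device, path and status codes for one log line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host = int(match.group("h"), 16)
    return {
        "device": match.group("dev"),
        "path": match.group("path"),  # None for ScsiDeviceIO lines
        "host": host,
        "host_name": HOST_STATUS.get(host, "UNKNOWN"),
        "device_status": int(match.group("d"), 16),
        "plugin_status": int(match.group("p"), 16),
    }


if __name__ == "__main__":
    sample = ('ScsiDeviceIO: 3015: Cmd(0x439e4341dfc0) 0xfe, CmdSN 0xbbc2e5 '
              'from world 65687 to dev "naa.60060160e8004b00e2ca985c1400127d" '
              'failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.')
    print(decode_line(sample))
```

Running this over a whole vmkernel.log and counting results per device or per path would show whether the aborts and resets cluster on one HBA or one target port.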

    And now what?

    dev "naa.6006" is DELL EMC Storage, in my case Dell EMC Unity 650f running OE 4.5.1 (UWDC01) & OE 5.0.3 (UWDC02)

    Current Dell EMC Unity Target code is OE 5.0.3
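The reason a "naa.6006..." device can be attributed to a vendor is that NAA type 5 and type 6 identifiers carry a 24-bit IEEE OUI directly after the first hex digit. The sketch below extracts it. The helper name `naa_oui` is hypothetical, and the vendor names in the small lookup table are assumptions that should be checked against the IEEE OUI registry.

```python
# Assumed OUI-to-vendor mapping; verify against the IEEE registry before relying on it.
KNOWN_OUIS = {
    "006016": "Dell EMC (CLARiiON/VNX/Unity)",
    "00000e": "Fujitsu",
}


def naa_oui(identifier):
    """Return (naa_type, oui_hex) for an NAA type 5 or type 6 identifier.

    Accepts device names such as "naa.6006..." as well as bare WWNs
    such as "5006016249e4121e".
    """
    hex_id = identifier.strip().lower()
    if hex_id.startswith("naa."):
        hex_id = hex_id[4:]
    naa_type = int(hex_id[0], 16)
    if naa_type not in (5, 6):
        raise ValueError("only NAA types 5 and 6 embed an OUI at this position")
    # The six hex digits after the type nibble are the 24-bit OUI.
    return naa_type, hex_id[1:7]


if __name__ == "__main__":
    for ident in ("naa.60060160e8004b00e2ca985c1400127d",
                  "5006016249e4121e",
                  "500000e0da81df29"):
        naa_type, oui = naa_oui(ident)
        print(ident, naa_type, oui, KNOWN_OUIS.get(oui, "unknown"))
```

The same function works on the target port WWNs in the HBA's FC Target-Port List, which helps to tell the Unity ports apart from other arrays zoned to the same HBA.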

    [root@esx070:~] esxcfg-scsidevs -m | grep "naa.60060160de004b005f5d2a5fbbcad438"

    naa.60060160de004b005f5d2a5fbbcad438:1                                     /vmfs/devices/disks/naa.60060160de004b005f5d2a5fbbcad438:1                                     5f2a5df1-f99059c6-eed8-20040ff4978e  0  UWDC01_IT-PROD-WDC_V005

    [root@esx070:~] esxcfg-scsidevs -m | grep "naa.60060160e8004b00e595255ec02cf074"

    naa.60060160e8004b00e595255ec02cf074:1                                     /vmfs/devices/disks/naa.60060160e8004b00e595255ec02cf074:1                                     5e2596ab-2bec8188-f141-20040ff4978e  0  UWDC02_IT-PROD-WDC_V103

    [root@esx070:~] vmkchdev -l | grep vmhba

    0000:00:11.5 8086:a1d2 1734:1230 vmkernel vmhba0

    0000:00:17.0 8086:a182 1734:1230 vmkernel vmhba1

    0000:17:00.0 1077:2261 1077:029b vmkernel vmhba2 ----------------> FC HBA

    0000:6d:00.0 1077:2261 1077:029b vmkernel vmhba3 ----------------> FC HBA

    [root@esx070:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -d

    Dumping all key-value instance names:

    Key Value Instance:  vmhba3/qlogic

    Key Value Instance:  vmhba2/qlogic

    Key Value Instance:  vmhba1/vmw_ahci

    Key Value Instance:  vmhba0/vmw_ahci

    Key Value Instance:  MOD_PARM/qlogic

    [root@esx070:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -l -i vmhba2/qlogic
    Listing keys:
    Name:   ADAPTER
    Type:   string
    value:
    QLogic 16Gb 1-port FC to PCIe Gen3 x8 Adapter for QLE2690:
            FC Firmware Version: 8.05.61 (d0d5), Driver version 2.1.73.0

    Host Device Name vmhba2

    BIOS version 3.61
    FCODE version 4.11
    EFI version 6.11
    Flash FW version 8.05.61
    ISP: ISP2261, Serial# RFD1722T35676
    MSI-X enabled
    Request Queue = 0x4309f6548000, Response Queue = 0x4309f6569000
    Request Queue count = 2048, Response Queue count = 512
    Number of response queues for CPU affinity operation: 4
    CPU Affinity mode enabled
    Total number of MSI-X interrupts on vector 0 (handler = 23) = 26676
    Total number of MSI-X interrupts on vector 1 (handler = 24) = 2186
    Total number of MSI-X interrupts on vector 2 (handler = 25) = 1090148738
    Total number of MSI-X interrupts on vector 3 (handler = 26) = 583007145
    Total number of MSI-X interrupts on vector 4 (handler = 27) = 2055128386
    Total number of MSI-X interrupts on vector 5 (handler = 28) = 1406005796
    Device queue depth = 0x8
    Number of free request entries = 1271
    FAWWN support: disabled
    FEC support: Disabled
    Total number of outstanding commands: 0
    Number of mailbox timeouts = 0
    Number of ISP aborts = 0
    Number of loop resyncs = 29
    Host adapter:Loop State = [READY], flags = 0x20ae200
    Link speed = [16 Gbps]

    Dpc flags = 0x0
    Link down Timeout =  010
    Port down retry =  010
    Login retry count =  010
    Execution throttle = 2048
    ZIO mode = 0x6, ZIO timer = 1
    Commands retried with dropped frame(s) = 297

    Product ID = 4953 5020 2261 0001

    NPIV Supported : Yes
    Max Virtual Ports = 254

    SCSI Device Information:
    scsi-qla0-adapter-node=20000024ff149042:160a00:0;
    scsi-qla0-adapter-port=21000024ff149042:160a00:0;

    Name:   TARGET
    Type:   string
    value:
    Driver version 2.1.73.0

    Host Device Name vmhba2

    FC Target-Port List:
    scsi-qla0-target-0=500000e0da81df29:122300:0:Online;
    scsi-qla0-target-1=500000e0da81df39:142300:1:Online;
    scsi-qla0-target-2=5006016249e4121e:140000:2:Online;
    scsi-qla0-target-3=5006016349e0121e:120000:3:Online;
    scsi-qla0-target-4=5006016849e4121e:140100:4:Online;
    scsi-qla0-target-5=5006016a49e4121e:120200:5:Online;
    scsi-qla0-target-6=5006016249e415ff:0e0000:6:Online;
    scsi-qla0-target-7=5006016349e015ff:100000:7:Online;
    scsi-qla0-target-8=5006016849e415ff:100100:8:Online;
    scsi-qla0-target-9=5006016a49e415ff:0e0100:9:Online;
    scsi-qla0-target-10=5006016249e41688:0e0500:a:Online;
    scsi-qla0-target-11=5006016349e01688:100200:b:Online;
    scsi-qla0-target-12=5006016849e41688:100300:c:Online;
    scsi-qla0-target-13=5006016a49e41688:0e0300:d:Online;

    Name:   NPIV
    Type:   string
    value:
    Driver version 2.1.73.0

    Host Device Name vmhba2

    NPIV Supported : Yes

    Looking at the QLogic site (Marvell nowadays) for the QLE2690, we are one version behind the latest firmware.

    QLogic / Marvell Driver Download

    README

                       Read1st for Cavium Flash Image Package

                         --------------------------------------

                       **** ONLY FOR 268x/269x/27xx Series Adapters ****

    1. Contents Of Flash Package

    --------------------------------

    The files contained in this Flash image package are zipped into a file that

    will expand to provide the following versions for the 268x/269x/276x Series Adapters.

    *  Flash Image Version 01.01.91

       BK010191.BIN contains:

       ----------------------

         Bootcode FC

           FC BIOS       v3.62

           FC FCode      v4.11  (Initiator)

           FC FCode      v4.10  (Target)

           FC EFI        v7.00  (Signed)

         FC Firmware   v8.08.231

         MPI Firmware  v1.00.19

         PEP Firmware(Quad-port)        v1.0.27

         PEP Firmware(Single/Dual port) v2.0.12

         PEP SoftROM(Quad port)         v1.0.16

         PEP SoftROM(Single/Dual port)  v2.0.11

         EFlash tool  v1.18



  • 2.  RE: SCSI Device Sense Codes : SCSIDeviceIO (...) failed & nmp_ThrottlelogForDevice messages in vmkernel.log for all Dell EMC Unity 650f VMFS volumes

    Posted Aug 17, 2020 10:38 PM

    Some quick questions:

    - Is your firmware up to date?
    - Have you tried upgrading all drivers?
    - Is it possible that there is an issue on an SFP?
    - Does this issue spread across multiple hosts?

    Would it be possible for you to disable ATS heartbeating? See: Disabling ATS Heartbeat - Huawei SAN Storage Host Connectivity Guide for VMware ESXi - Huawei

    Let me know if you found this helpful



  • 3.  RE: SCSI Device Sense Codes : SCSIDeviceIO (...) failed & nmp_ThrottlelogForDevice messages in vmkernel.log for all Dell EMC Unity 650f VMFS volumes

    Posted Aug 18, 2020 08:02 AM

    SAN

    SANSW23:xxxx> porterrshow 4
               frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs    uncor
            tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                  tx    rx     err    err
      4:    3.9g   2.2g   0      0      0      0      0      0      0      8      0      0      0      0      0      0      0      0      0

    SANSW22:xxxx> porterrshow 10
               frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs    uncor
            tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                  tx    rx     err    err
    10:    3.9g   2.1g   0      0      0      0      0      0      0     16      0      0      0      0      0      0      0      0      0

    Very few errors. C3 discards are frames that were queued for the destination, then expired and were dropped.

    Likely cause: the buffer credits got exhausted, i.e. a flow control issue in the fabric. This can be caused by HBA speed mismatches along the path from esx070 to the Unity 650f SP port, but here the esx070 speed of 16 Gbit matches the Unity 650f (16 Gbit as well).

    Given the high tx/rx frame counts, this is a very low figure.
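To put a number on "very low", here is a small sketch that parses one porterrshow data row and relates the C3 discards to the total frame count. It assumes the k/m/g suffixes are decimal multipliers and that the counters appear in the column order of the header shown above; the function names are illustrative.

```python
# Suffixes as they appear in the porterrshow output; assumed to be decimal.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

# Position of each counter after the port number, following the header columns.
FRAMES_TX, FRAMES_RX, DISC_C3 = 0, 1, 9


def parse_counter(token):
    """Convert a counter such as '3.9g' or '8' to a float."""
    token = token.strip().lower()
    if token and token[-1] in SUFFIXES:
        return float(token[:-1]) * SUFFIXES[token[-1]]
    return float(token)


def c3_discard_summary(row):
    """Summarise frames and C3 discards for one porterrshow data row."""
    port, *fields = row.split()
    counters = [parse_counter(field) for field in fields]
    frames = counters[FRAMES_TX] + counters[FRAMES_RX]
    discards = counters[DISC_C3]
    return {
        "port": port.rstrip(":"),
        "frames": frames,
        "disc_c3": discards,
        "ratio": discards / frames if frames else 0.0,
    }


if __name__ == "__main__":
    row = "4:    3.9g   2.2g   0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0"
    summary = c3_discard_summary(row)
    print(f"port {summary['port']}: {summary['disc_c3']:.0f} C3 discards "
          f"in {summary['frames']:.2e} frames (ratio {summary['ratio']:.2e})")
```

Because the frame counters are rounded in the switch output, the ratio is only an order-of-magnitude estimate.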

    Indeed, we see a lot of

    "Lost access to volume xxxx due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"

    vRealize Log Insight regularly posts these messages for many hosts!

    Masses of those are sent during the night, when the Veeam backup proxies attach the VMFS volumes for backup transfer purposes.

    I presume that when the VMFS gets dismounted, some ESXi hosts (some more than others) report that the temporary device naa.xxxxxx is inaccessible.

    When I look up the VMFS volume with esxcfg-scsidevs -m | grep naa.xxxxx, these devices no longer exist after they have been declared inaccessible. In fact, these volumes are recognised by vSphere as snap<hex value (?)>-<VMFS label>.

    [root@esx070:~] vmware -v
    VMware ESXi 6.5.0 build-15256549

    Imageprofile ESXi-6.5.0-20191204001-standard

    https://esxi-patches.v-front.de/ESXi-6.5.0.html

    According to the Marvell site, our firmware is one version behind the latest.

    I see an important Emulex (lpfc) update in a build newer than ours.

    Still, we have a QLogic QLE2690, not an Emulex adapter.

    2020-07-30

    Imageprofile ESXi-6.5.0-20200704001-standard (Build 16576891)

    lpfc | 11.4.33.26-14vmw.650.3.138.16576891 | VMW | Updates the ESX 6.5.0 lpfc | bugfix | important | ESXi650-202007403-BG

    The last image profile that updated the QLogic driver, which is older than our build:

    2019-07-02 (Update 3)

    Imageprofile ESXi-6.5.0-20190702001-standard (Build 13932383) includes the following updated VIBs:

    The relevant entry from the full list underneath:

    qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG

    We have this exact driver!
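The comparison made here (driver VIB build versus host build) can be sketched as follows. It assumes that the last dot-separated field of a VIB version string is the ESXi build it shipped with, which matches the image profile build numbers quoted in this post. The helper names are hypothetical.

```python
import re


def host_build(vmware_v_output):
    """Extract the build number from `vmware -v` output."""
    match = re.search(r"build-(\d+)", vmware_v_output)
    if match is None:
        raise ValueError("no build number found")
    return int(match.group(1))


def parse_vib_version(version):
    """Split a VIB version into (driver_version, build).

    Assumes the trailing dot-separated field is the ESXi build number.
    """
    driver, _, release = version.partition("-")
    build = int(release.rsplit(".", 1)[1])
    return driver, build


if __name__ == "__main__":
    host = host_build("VMware ESXi 6.5.0 build-15256549")
    vibs = {
        "qlnativefc": "2.1.73.0-5vmw.650.3.96.13932383",
        "lpfc": "11.4.33.26-14vmw.650.3.138.16576891",
    }
    for name, version in vibs.items():
        driver, build = parse_vib_version(version)
        state = "newer than host" if build > host else "already covered by host build"
        print(f"{name}: driver {driver}, build {build} ({state})")
```

The driver part of the qlnativefc VIB version can then be matched against the "Driver version" line reported by vmkmgmt_keyval for the HBA.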

    Name | Version | Vendor | Summary | Category | Severity | Bulletin
    bnxtnet | 20.6.101.7-23vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 bnxtnet | enhancement | important | ESXi650-201907216-UG
    brcmfcoe | 11.4.1078.25-14vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 brcmfcoe | bugfix | important | ESXi650-201907218-UG
    esx-base | 6.5.0-3.96.13932383 | VMware | Updates the ESX 6.5.0 esx-base | bugfix | critical | ESXi650-201907201-UG
    esx-tboot | 6.5.0-3.96.13932383 | VMware | Updates the ESX 6.5.0 esx-tboot | bugfix | critical | ESXi650-201907201-UG
    esx-ui | 1.33.4-13786312 | VMware | VMware Host Client | security | important | ESXi650-201907103-SG
    i40en | 1.8.1.9-2vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 i40en | enhancement | important | ESXi650-201907214-UG
    igbn | 0.1.1.0-4vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 igbn | bugfix | important | ESXi650-201907206-UG
    ixgben | 1.7.1.15-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 ixgben | enhancement | important | ESXi650-201907204-UG
    lpfc | 11.4.33.25-14vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lpfc | bugfix | important | ESXi650-201907217-UG
    lsi-mr3 | 7.708.07.00-3vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-mr3 | enhancement | important | ESXi650-201907209-UG
    lsi-msgpt2 | 20.00.06.00-2vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt2 | bugfix | moderate | ESXi650-201907212-UG
    lsi-msgpt3 | 17.00.02.00-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt3 | bugfix | important | ESXi650-201907210-UG
    lsi-msgpt35 | 09.00.00.00-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt35 | bugfix | important | ESXi650-201907211-UG
    lsu-hp-hpsa-plugin | 2.0.0-16vmw.650.3.96.13932383 | VMware | Updates the ESX 6.5.0 lsu-hp-hpsa-plugin | bugfix | important | ESXi650-201907215-UG
    misc-drivers | 6.5.0-3.96.13932383 | VMW | Updates the ESX 6.5.0 misc-drivers | bugfix | important | ESXi650-201907203-UG
    nenic | 1.0.29.0-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 nenic | enhancement | important | ESXi650-201907219-UG
    nvme | 1.2.2.28-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 nvme | enhancement | important | ESXi650-201907207-UG
    qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG
    smartpqi | 1.0.1.553-28vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 smartpqi | enhancement | important | ESXi650-201907213-UG
    tools-light | 6.5.0-2.92.13873656 | VMware | Updates the ESX 6.5.0 tools-light | security | important | ESXi650-201907102-SG
    vmkusb | 0.1-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 vmkusb | bugfix | important | ESXi650-201907202-UG
    vmw-ahci | 1.1.6-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 vmw-ahci | bugfix | important | ESXi650-201907220-UG
    vmware-esx-esxcli-nvme-plugin | 1.2.0.36-3.96.13932383 | VMware | Updates the ESX 6.5.0 vmware-esx-esxcli-nvme-plugin | enhancement | important | ESXi650-201907208-UG
    vsan | 6.5.0-3.96.13371499 | VMware | Updates the ESX 6.5.0 vsan | bugfix | critical | ESXi650-201907201-UG
    vsanhealth | 6.5.0-3.96.13530496 | VMware | ESXi VSAN Health Service | bugfix | critical | ESXi650-201907201-UG


  • 4.  RE: SCSI Device Sense Codes : SCSIDeviceIO (...) failed & nmp_ThrottlelogForDevice messages in vmkernel.log for all Dell EMC Unity 650f VMFS volumes

    Posted Aug 18, 2020 11:15 AM

    Please try this:

    Upgrade the firmware and drivers. If that doesn't work, disable ATS heartbeating.



  • 5.  RE: SCSI Device Sense Codes : SCSIDeviceIO (...) failed & nmp_ThrottlelogForDevice messages in vmkernel.log for all Dell EMC Unity 650f VMFS volumes

    Posted Aug 18, 2020 12:55 PM

    It's ESXi 6.5, and disabling ATS heartbeating is only valid for 5.5 and 6.0.

    We have the latest driver on the ESXi side.

    MARVELL :

    Flash Image Version 01.01.91

    However, this requires the QConvergeConsole CLI, which is not available on ESX.

       BK010191.BIN contains:

       ----------------------

         Bootcode FC

           FC BIOS       v3.62

           FC FCode      v4.11  (Initiator)

          FC EFI        v7.00  (Signed)

         FC Firmware   v8.08.231

    Our current version

    BIOS version 3.61

    FCODE version 4.11

    EFI version 6.11

    Flash FW v8.05.61
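The gap between the installed components and Flash Image Version 01.01.91 can be checked mechanically. A minimal sketch, assuming dotted numeric version strings and using the values quoted in this post (the dictionary keys and function names are illustrative):

```python
def parse_version(text):
    """Turn 'v8.05.61' or '3.61' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def outdated_components(installed, available):
    """Return the names whose installed version is older than the available one."""
    return sorted(
        name
        for name in installed
        if name in available
        and parse_version(installed[name]) < parse_version(available[name])
    )


# Values as reported by the HBA (installed) and by the flash package README.
installed = {"BIOS": "3.61", "FCode": "4.11", "EFI": "6.11", "Firmware": "8.05.61"}
available = {"BIOS": "v3.62", "FCode": "v4.11", "EFI": "v7.00", "Firmware": "v8.08.231"}

if __name__ == "__main__":
    for name in outdated_components(installed, available):
        print(f"{name}: {installed[name]} -> {available[name]}")
```

Comparing tuples of integers avoids the pitfalls of string comparison, where "8.08.231" versus "8.5.61" style values would sort incorrectly.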

    Let me see what I can do to schedule this update.

    I checked the ServerView Update DVD for the Fujitsu PRIMERGY RX4770 M4, and they released

    COMMENT_PUBLIC

    --------------

    bk016042.BIN contains:

    ----------------------

    Bootcode FC

    FC BIOS       v3.61

    FC FCode      v4.11 b2

    FC EFI        v6.14 (Fujitsu) Signed

    FC Firmware            v8.08.231

    MPI Firmware           v1.03.17

    PEP Firmware (Baker)   v1.0.24

    PEP Firmware (Qlipper) v2.0.14

    PEP SoftROM  (Baker)   v1.0.14

    PEP SoftROM  (Qlipper) v2.0.09

    So I will schedule an intervention on this ESX host next week and keep you posted.