
Strange device controller phenomenon after update to ESXi 7.0 U1d

  • 1.  Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 07, 2021 05:27 PM

    I've just updated the ESXi hosts of my vSAN cluster from 7.0U1c to 7.0U1d.

    After the reboot, my cache device (Optane P4801X 100GB, on HCL) 'moved' from vmhba3 to vmhba64. The interface type also changed from PCIE to SCSI, and the controller name changed from "NVMe Datacenter SSD (Optane)" to "NVM Express Optane 4800X". There is no longer a PCI ID.
    The attached image shows hosts before the patch (U1c, green) and hosts after the latest patch (U1d, yellow).
    The disk device (Optane P4801X 100GB) is listed as a supported P4801X on the HCL.
    System is a Supermicro E300-9D-8CN8TP
    Has anybody seen something similar?



  • 2.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Broadcom Employee
    Posted Feb 08, 2021 09:53 AM

    Moved to the vSAN forum, as some of the other vSAN users (or support people) may have seen this.



  • 3.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 11, 2021 09:24 AM

    Thanks, Duncan!

    I wasn't sure where to post it, because it's host hardware and vSAN.



  • 4.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 08, 2021 11:08 AM

    Hey,

    I am not saying this is the solution, but your issue definitely matches parts of the following KB. I think it is also related to the driver that ships with the new version, but it should not be an issue at all: https://kb.vmware.com/s/article/2127274



  • 5.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Broadcom Employee
    Posted Feb 10, 2021 01:08 PM

    Hi

    I recall seeing this behaviour once on older 7.0 builds with vSAN-qualified P4800 series devices.
    Could you please post the precise driver and firmware versions you are using for this device?

    You may want to file an SR with support and highlight that all the PCI IDs report zeros after the upgrade.

    Has this negatively impacted anything since the upgrade or does this appear to be a display issue?

     



  • 6.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 10, 2021 07:55 PM

    We have seen this a handful of times in GSS and, IIRC (I never had a case myself), it was attributable to two sets of drivers identifying a path to the device, which leaves extraneous paths with zeroed IDs (e.g. the vmhba3 alias is still there with the correct IDs, but there is also a blank vmhba64, which is what vSAN Health picks up). I think this was remediated, as mentioned above, by removing the extraneous unused paths/aliases. However, this doesn't work the same way in 7.0: while esx.conf still exists, it no longer stores PCI device aliases, as these are now stored in ConfigStore. It should just be a case of removing them from there, but the process in that KB isn't going to work in 7.0.

    If you can open a Support Request, we can likely do this for you. Otherwise I will have to check whether this process is already documented in an internal KB (ikb): if it is, whether we can make it public; if not, I will aim to write a KB with the process (provided manual modification of ConfigStore is something we can publish externally).



  • 7.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 10, 2021 09:18 PM

    >>> provided manual modification of ConfigStore is something we can publish externally
    Some KB articles for how to modify the ConfigStore have already been published, so it may just be a matter of support.
    Anyway, with an active support contract, I'd recommend opening a support case, especially if this is a production system. This may not only help solve the issue, but also help the developers identify and fix the bug that's causing such issues.

    To find out about the current device configuration, run the following command:
    configstorecli config current get -c esx -g system -k device_data

    André

     



  • 8.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 10, 2021 10:02 PM

    Thanks as always for your (always useful) input.

    Yes, I had a look at what is currently publicly published in this area and can only find this single kb https://kb.vmware.com/s/article/81722.

    When making KBs publicly available, we (and anyone, really) should carefully consider the worst possible outcome when someone doesn't fully comprehend the impact of the changes they are making, or makes them incorrectly. Where specifics need to be targeted (e.g. using specific ConfigStore IDs as opposed to --all, as per the second option in that KB), or where getting the correct syntax is not straightforward, this can result in the knowledge remaining internal as ikbs. Don't get me wrong, I am all for sharing as much understanding as possible, but there are lines and these can be hazy.

    While I can't state any date/release, from what I have read (just now, as it had been a long time since I last looked at the relevant PRs), the source of this issue appears to have been identified and resolved in an upcoming release.

    But for now I would advise anyone encountering this to open a case with us - I will see what I can do about a kb and update here if this is possible.



  • 9.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 11, 2021 10:57 AM

    Thank you Andre

    Here's the result: (shortened, removed vmnic information)

    ESX01: (updated)

    configstorecli config current get -c esx -g system -k device_data
    [
    {
    "alias": "vmhba1",
    "alias_pending": false,
    "bus_address": "p0000:00:17.0",
    "bus_type": "pci",
    "cs_generated_id": "52 03 4d dc a6 22 ed 78-49 6f 77 ab 8b 96 ee 3d"
    },
    {
    "alias": "vmhba64",
    "alias_pending": false,
    "bus_address": "pci#s00000007.00#0",
    "bus_type": "logical",
    "cs_generated_id": "52 2e 5d f7 1f 4f 44 7b-9e e6 90 0b 99 6c 81 2e"
    },
    {
    "alias": "vmhba0",
    "alias_pending": false,
    "bus_address": "p0000:00:11.5",
    "bus_type": "pci",
    "cs_generated_id": "52 6a 1c e6 72 64 b0 55-49 61 41 ae 73 17 0a bd"
    },
    {
    "alias": "vmhba3",
    "alias_pending": false,
    "bus_address": "logical#pci#s00000007.00#0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 85 e6 b1 ec c7 f6 df-ed 86 95 76 dc 90 44 4b"
    },
    {
    "alias": "vmhba2",
    "alias_pending": false,
    "bus_address": "s00000001.00",
    "bus_type": "pci",
    "cs_generated_id": "52 a8 55 e3 54 47 d8 ef-00 de 4c 41 bf fa fb fe"
    },
    {
    "alias": "vmhba2",
    "alias_pending": false,
    "bus_address": "logical#pci#s00000001.00#0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 ce 26 f3 a4 41 a5 53-14 b2 76 7b c7 e1 51 0e"
    },
    {
    "alias": "vmhba3",
    "alias_pending": false,
    "bus_address": "s00000007.00",
    "bus_type": "pci",
    "cs_generated_id": "52 de 2b 74 02 20 d9 7a-49 2a 03 b5 1a ce 36 2d"
    },
    {
    "alias": "vmhba1",
    "alias_pending": false,
    "bus_address": "pci#p0000:00:17.0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 e3 b3 4b 55 78 82 c2-22 39 55 a9 22 16 80 37"
    },
    {
    "alias": "vmhba0",
    "alias_pending": false,
    "bus_address": "pci#p0000:00:11.5#0",
    "bus_type": "logical",
    "cs_generated_id": "52 ec 69 08 68 b7 44 c6-6f 59 8d d7 19 f3 c0 83"
    }
    ]

     

    ESX02 (not updated) (also shortened without vmnic information)

    configstorecli config current get -c esx -g system -k device_data

    [
    {
    "alias": "vmhba2",
    "alias_pending": false,
    "bus_address": "s00000005.00",
    "bus_type": "pci",
    "cs_generated_id": "52 1a 04 5f 6d 26 0c dd-39 36 1b 74 34 17 5a 84"
    },

    {
    "alias": "vmhba0",
    "alias_pending": false,
    "bus_address": "pci#p0000:00:11.5#0",
    "bus_type": "logical",
    "cs_generated_id": "52 41 05 11 53 b6 28 70-83 04 27 f4 e1 aa 47 5d"
    },
    {
    "alias": "vmhba1",
    "alias_pending": false,
    "bus_address": "pci#p0000:00:17.0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 4a fa 26 d5 01 32 5c-b5 f7 d9 7f d8 c6 98 35"
    },
    {
    "alias": "vmhba3",
    "alias_pending": false,
    "bus_address": "logical#pci#s00000007.00#0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 5f 21 f7 78 8a aa ee-89 12 2e a3 14 ea 55 43"
    },
    {
    "alias": "vmhba1",
    "alias_pending": false,
    "bus_address": "p0000:00:17.0",
    "bus_type": "pci",
    "cs_generated_id": "52 7f bd 5e 12 86 f1 87-c6 d7 09 0a 4a 7e 09 f8"
    },
    {
    "alias": "vmhba0",
    "alias_pending": false,
    "bus_address": "p0000:00:11.5",
    "bus_type": "pci",
    "cs_generated_id": "52 98 57 51 27 11 e9 3f-c0 0f a8 f0 e3 18 56 6c"
    },
    {
    "alias": "vmhba3",
    "alias_pending": false,
    "bus_address": "s00000007.00",
    "bus_type": "pci",
    "cs_generated_id": "52 9a c4 7c 67 fc 28 60-68 8b d1 de 3f d0 c0 16"
    },
    {
    "alias": "vmhba2",
    "alias_pending": false,
    "bus_address": "logical#pci#s00000005.00#0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 c3 e0 d4 3a 9a e9 51-a7 46 b3 5f 89 07 30 9f"
    }

    ]



  • 10.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 11, 2021 07:47 PM

    As I mentioned above, this (vmhba64) is essentially the duplicate entry being picked up:

    vmhba64 intel-nvme-vmd link-n/a pscsi.vmhba64 (0000:65:00.0) Intel Corporation NVM Express Optane 4800X

     

    This results in two logical mappings for that device (note that all the others have 1:1 pci:logical entries), both pointing to the same PCI bus address (pci#s00000007). This is also how one could confirm which of vmhba[64-67] maps to which of vmhba[0-4], if several such entries were present:

    "alias": "vmhba64",
    "alias_pending": false,
    "bus_address": "pci#s00000007.00#0",
    "bus_type": "logical",
    "cs_generated_id": "52 2e 5d f7 1f 4f 44 7b-9e e6 90 0b 99 6c 81 2e"
    },
    {
    "alias": "vmhba3",
    "alias_pending": false,
    "bus_address": "logical#pci#s00000007.00#0#0",
    "bus_type": "logical",
    "cs_generated_id": "52 85 e6 b1 ec c7 f6 df-ed 86 95 76 dc 90 44 4b"
    },
    {
    "alias": "vmhba3",
    "alias_pending": false,
    "bus_address": "s00000007.00",
    "bus_type": "pci",
    "cs_generated_id": "52 de 2b 74 02 20 d9 7a-49 2a 03 b5 1a ce 36 2d"

    These extraneous listings can be removed with 'configstorecli config current delete' (followed by a reboot), but it needs to be stated once more that caution is advised. I have seen this applied using the cs_generated_id but not the vmhba alias, and I will be brutally honest and say I am currently unaware whether there is a difference (ConfigStore has pretty much been a 99% SysOps thing until recently, and I am the vSAN guy), but I may be able to find out.
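
    The duplicate-mapping check described above (more than one alias pointing at the same underlying bus address) can be sketched in Python against the JSON that configstorecli prints. This is only a minimal sketch: the function names are made up for illustration, and the address patterns are inferred solely from the bus_address strings posted in this thread, not from any ConfigStore documentation. It only reads data and does not modify ConfigStore.

```python
import json
import re
from collections import defaultdict

# Physical address forms as they appear in this thread's output,
# e.g. "p0000:00:17.0" and "s00000007.00" (assumed, not an official format).
PHYSICAL = re.compile(
    r"p[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]|s[0-9a-f]{8}\.[0-9a-f]{2}"
)


def physical_address(bus_address):
    """Return the physical address embedded in a bus_address string, or None."""
    for token in bus_address.split("#"):
        if PHYSICAL.fullmatch(token):
            return token
    return None


def conflicting_aliases(entries):
    """Map each physical address to its aliases if more than one alias uses it."""
    aliases = defaultdict(set)
    for entry in entries:
        address = physical_address(entry["bus_address"])
        if address is not None:
            aliases[address].add(entry["alias"])
    return {addr: sorted(names) for addr, names in aliases.items() if len(names) > 1}


# Entries copied from the esx01 output in this thread (other fields omitted).
SAMPLE = json.loads("""
[
  {"alias": "vmhba64", "bus_address": "pci#s00000007.00#0", "bus_type": "logical"},
  {"alias": "vmhba3", "bus_address": "logical#pci#s00000007.00#0#0", "bus_type": "logical"},
  {"alias": "vmhba3", "bus_address": "s00000007.00", "bus_type": "pci"},
  {"alias": "vmhba1", "bus_address": "p0000:00:17.0", "bus_type": "pci"},
  {"alias": "vmhba1", "bus_address": "pci#p0000:00:17.0#0", "bus_type": "logical"}
]
""")

if __name__ == "__main__":
    # Prints the addresses that are claimed by more than one alias.
    print(conflicting_aliases(SAMPLE))
```

    With the esx01 entries above, only s00000007.00 is reported, with both vmhba3 and vmhba64 attached to it. The vmhba1 entries share one alias and are not flagged.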



  • 11.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Mar 11, 2021 06:52 PM

    There's an interesting update on the issue.

    As I mentioned above, I reverted all hosts but one to v7U1c.

    I left esx01 on v7U1d for research purposes.

    Today I've upgraded my hosts to v7.0.2.

    v7.0.1c -> v7.0.2 : vmhba for Optane still correct (vmhba3)

    v7.0.1d -> v7.0.2 : vmhba for Optane remained renumbered (vmhba64)

    I solved the problem (kind of) by redeploying esx01 with the new ESXi image 7.0.2.

    I guess there was something between 7.0.1c and 7.0.1d that caused the renumbering of the vmhba. Whatever it was, it's no longer part of 7.0.2.

     



  • 12.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Mar 11, 2021 07:56 PM

    While I haven't confirmed it first-hand, I did hear from one of my colleagues today, in an update on a case he had, that this had been resolved in 7.0 U2.

    It could be that once the issue occurs and the entry is added to ConfigStore, the entry persists even after updating to a release with the fix.



  • 13.  RE: Strange device controller phenomenon after update to ESXi 7.0 U1d

    Posted Feb 11, 2021 10:50 AM

    Thank you all.

    I'd like to provide some more information. I've looked at two of my hosts:

    esx01 (updated to v7U1d) and esx02 (not updated, v7U1c)

    First I looked at the two host clients. In both cases the device ID is 0000:65:00.0

    Only the name has changed from "NVMe Datacenter SSD [Optane]" (before) to "NVM Express Optane 4800X"

    Then I checked on the CLI:

    [root@esx01:~] vmkchdev -l | grep vmhba
    0000:00:11.5 8086:a1d2 15d9:0986 vmkernel vmhba0
    0000:00:17.0 8086:a182 15d9:0986 vmkernel vmhba1
    0000:65:00.0 8086:2701 8086:3907 vmkernel vmhba3
    0000:66:00.0 144d:a808 144d:a801 vmkernel vmhba2

    [root@esx02:~] vmkchdev -l | grep vmhba
    0000:00:11.5 8086:a1d2 15d9:0986 vmkernel vmhba0
    0000:00:17.0 8086:a182 15d9:0986 vmkernel vmhba1
    0000:65:00.0 8086:2701 8086:3907 vmkernel vmhba3
    0000:66:00.0 144d:a808 144d:a801 vmkernel vmhba2

    Interestingly, the original vmhba3 number has been kept here on the updated host (esx01).

    Let's look at the drivers:

    [root@esx01:~] esxcli storage core adapter list
    HBA Name  Driver          Link State  UID            Capabilities  Description
    --------  --------------  ----------  -------------  ------------  -----------
    vmhba0    vmw_ahci        link-n/a    sata.vmhba0                  (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
    vmhba1    vmw_ahci        link-n/a    sata.vmhba1                  (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
    vmhba2    nvme_pcie       link-n/a    pcie.6600                    (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
    vmhba64   intel-nvme-vmd  link-n/a    pscsi.vmhba64                (0000:65:00.0) Intel Corporation NVM Express Optane 4800X

    [root@esx02:~] esxcli storage core adapter list
    HBA Name  Driver     Link State  UID          Capabilities  Description
    --------  ---------  ----------  -----------  ------------  -----------
    vmhba0    vmw_ahci   link-n/a    sata.vmhba0                (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
    vmhba1    vmw_ahci   link-n/a    sata.vmhba1                (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
    vmhba2    nvme_pcie  link-n/a    pcie.6600                  (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
    vmhba3    nvme_pcie  link-n/a    pcie.6500                  (0000:65:00.0) Intel Corporation NVMe Datacenter SSD [Optane]

    Here we can see that the updated host uses a different driver (intel-nvme-vmd) for the Optane device and a new UID (pscsi.vmhba64).

    Let's get some driver details:

    [root@esx01:/proc] vmkload_mod -s intel-nvme-vmd
    vmkload_mod module information
    input file: /usr/lib/vmware/vmkmod/intel-nvme-vmd
    Version: 2.0.0.1146-1OEM.700.1.0.15843807
    Build Type: release
    License: BSD
    Required name-spaces:
    com.vmware.vmkapi#v2_6_0_0
    Parameters:
    SNT_COMPAT: bool
    SCSI-to-NVMe Compatibility mode. Set to false to use VMware non-compliant translations

    [root@esx02:~] vmkload_mod -s nvme_pcie
    vmkload_mod module information
    input file: /usr/lib/vmware/vmkmod/nvme_pcie
    Version: 1.2.3.9-2vmw.701.0.0.16850804
    Build Type: release
    License: BSD
    Required name-spaces:
    com.vmware.nvme#0.0.0.1
    com.vmware.vmkapi#v2_7_0_0
    Parameters:
    nvmePCIEFakeAdminQSize: uint
    NVMe PCIe fake ADMIN queue size. 0's based
    nvmePCIEDma4KSwitch: int
    NVMe PCIe 4k-alignment DMA
    nvmePCIEDebugMask: int
    NVMe PCIe driver debug mask
    nvmePCIELogLevel: int
    NVMe PCIe driver log level

     

    I still don't understand why I get different results for the device on the updated host. One query returns vmhba3 and another vmhba64.
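
    For anyone who wants to spot this kind of disagreement across several hosts, the comparison above can be sketched in Python by matching both command outputs on the PCI address. This is just a rough sketch under assumptions: the function names are hypothetical, and the parsing relies only on the line layout of the 'vmkchdev -l | grep vmhba' and 'esxcli storage core adapter list' output pasted in this post.

```python
import re

PCI_ADDR = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]"


def aliases_from_vmkchdev(text):
    """PCI address -> alias, from `vmkchdev -l | grep vmhba` lines."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and re.fullmatch(PCI_ADDR, fields[0]):
            result[fields[0]] = fields[4]
    return result


def aliases_from_adapter_list(text):
    """PCI address -> HBA name, from `esxcli storage core adapter list` rows."""
    result = {}
    for line in text.splitlines():
        match = re.match(r"(vmhba\d+)\s.*\((" + PCI_ADDR + r")\)", line.strip())
        if match:
            result[match.group(2)] = match.group(1)
    return result


def mismatches(vmkchdev_text, adapter_text):
    """Addresses where the two commands report different vmhba names."""
    pci = aliases_from_vmkchdev(vmkchdev_text)
    adapters = aliases_from_adapter_list(adapter_text)
    return {
        addr: (pci[addr], adapters[addr])
        for addr in pci
        if addr in adapters and pci[addr] != adapters[addr]
    }


# Output copied from esx01 in this post.
VMKCHDEV = """\
0000:00:11.5 8086:a1d2 15d9:0986 vmkernel vmhba0
0000:00:17.0 8086:a182 15d9:0986 vmkernel vmhba1
0000:65:00.0 8086:2701 8086:3907 vmkernel vmhba3
0000:66:00.0 144d:a808 144d:a801 vmkernel vmhba2
"""

ADAPTERS = """\
vmhba0 vmw_ahci link-n/a sata.vmhba0 (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
vmhba1 vmw_ahci link-n/a sata.vmhba1 (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
vmhba2 nvme_pcie link-n/a pcie.6600 (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
vmhba64 intel-nvme-vmd link-n/a pscsi.vmhba64 (0000:65:00.0) Intel Corporation NVM Express Optane 4800X
"""

if __name__ == "__main__":
    print(mismatches(VMKCHDEV, ADAPTERS))
```

    On the esx01 data this flags only 0000:65:00.0, where vmkchdev reports vmhba3 and the adapter list reports vmhba64. It shows where the two views disagree but says nothing about why.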