vSAN1

  • 1.  New setup questions

    Posted Jan 25, 2025 07:58 PM

    Hello,

    We have 6 servers that went through a crash on another vendor's HCI, and we are now evaluating VMware.

    I have VMware experience from the past, but there are a couple of new things I am facing right now, including vSAN.

    So, I have a few small questions.

    1) Is there a Go/NoGo for LACP for vSAN? We set it up successfully, vSAN is working with it, and storage is not complaining. I will be setting up HCIBench to test, though I am not sure how much that will help. The NIC is a Broadcom BCM57414 2-port 25G.

    2) I have other NICs in the server, and vSphere is complaining that they are not vSAN ESA compatible (yes, I have set up vSAN ESA, since our drives and the NIC used for vSAN are compatible). The check comes up when I activate RDMA: these NICs pop up in the list of incompatible hardware. That makes no sense, since I am not using them for vSAN. Can I somehow work around that, other than just silencing the alert?

    Thank you



  • 2.  RE: New setup questions

    Posted Jan 27, 2025 03:37 PM

    Hello @st-ops,

    1. LACP for vSAN is supported unless you are using RDMA (which your question 2 indicates you are), so that is a NoGo from a supportability perspective. Sure, it might 'work', but it may just work until it doesn't, and in general, if you want to use unsupported configurations, then you are deciding to provide support for them yourself.

    "vSAN with RDMA supports NIC failover, but does not support LACP or IP-hash-based NIC teaming."

    https://docs.vmware.com/en/VMware-vSphere/8.0/vsan-network-design-guide/GUID-6E3EEA25-E77D-4E6B-BDC8-488672FFC41B.html

    2. Pretty sure there is an open/outstanding PR with engineering to address that, as it isn't intended behaviour and/or was an oversight. I'll aim to check on that tomorrow when I am logged in. Otherwise, or until that is fixed, you can safely ignore/silence the health check for that as it is of no material consequence; I am not aware of any other workaround.
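
The support statement quoted in point 1 can be turned into a small pre-flight check. The following is only a minimal sketch of that one rule: the policy labels (`lacp`, `ip_hash`, `explicit_failover`) and the `VsanUplinkConfig` type are hypothetical names made up for illustration, not vSphere API identifiers, and the check says nothing about what else may be unsupported.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical policy labels for this sketch; they are NOT vSphere API identifiers.
# The set encodes only the quoted rule: no LACP and no IP-hash teaming with RDMA.
UNSUPPORTED_WITH_RDMA = {"lacp", "ip_hash"}


@dataclass
class VsanUplinkConfig:
    """Illustrative description of the uplinks carrying vSAN traffic."""
    teaming_policy: str  # e.g. "lacp", "ip_hash", "explicit_failover"
    rdma_enabled: bool


def rdma_teaming_problems(cfg: VsanUplinkConfig) -> List[str]:
    """Return findings derived solely from the quoted support statement."""
    policy = cfg.teaming_policy.strip().lower()
    if cfg.rdma_enabled and policy in UNSUPPORTED_WITH_RDMA:
        return [f"vSAN over RDMA does not support '{policy}' teaming"]
    return []


if __name__ == "__main__":
    candidates = (
        VsanUplinkConfig("lacp", True),
        VsanUplinkConfig("explicit_failover", True),
        VsanUplinkConfig("lacp", False),
    )
    for cfg in candidates:
        print(cfg, rdma_teaming_problems(cfg) or "no finding")
```

An empty result only means the quoted rule was not violated; it is not a statement that the configuration as a whole is supported.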




  • 3.  RE: New setup questions

    Posted Jan 29, 2025 04:07 AM
    Edited by st-ops Jan 29, 2025 04:20 AM

    Thank you. So LACP is gone. Also good. However, there is a serious issue I am facing. I have already rebuilt the POC cluster twice because I thought I was doing something wrong, but it doesn't seem like it. The first time, I had VCSA on vSAN, which was a mistake, at least when POCing vSAN. So now only two hosts out of 3 are participating in vSAN. In the end, the cluster would consist of 6 hosts.

    The problem is: if I activate RDMA on vSAN, all servers that are participating in vSAN start PSOD-ing. The only way to recover is to shut down the other two hosts, bring vCSA up on the server that runs it (not on vSAN), deactivate RDMA, and then boot the other two.

    I've made sure that the driver I am using is certified for vSAN 8.0 U3, which is what I am running:

    Broadcom-bnxt-Net-RoCE_228.0.216.0-1OEM.800.1.0.20613240_22868439

    The NIC is a BCM57414-based N225P. The server is an ASUS RS720-E10-RS24U, custom built by a company in Germany.

    It has 16 NVMe flash drives.

    Our switch came pre-configured with policies for Azure Stack HCI; it's a Dell S5248F-ON (OS10).

    Also, checking Skyline Health before activating RDMA, I have 100%, no issues.

    I am at the limit of my understanding of DCBX, PFC and ETS as they relate to RDMA setup.

    But all that... I don't think that would or should cause ESXi to PSOD. Actually, none of the external settings should cause the PSOD, am I right?

    The worst that should actually happen is vCenter giving me errors that RDMA is not working, PFC wrong policy, DCBX not IEEE or not working, but no PSOD.

    We would very much like to get RDMA working stably. In our case, with 16 of our quite potent NVMe drives (Micron MAX) and a lot of IOPS due to SQL Server, RDMA should be used to lower the CPU usage. Although... CPU usage isn't a general problem in our case; our clusters are mostly space-limited, and our CPUs must have lots of cache, but neither CPU nor RAM is "full".

    Just thinking, actually... how important is RDMA really in our case?




  • 4.  RE: New setup questions

    Posted Jan 29, 2025 04:20 AM

    Here is a screenshot of the PSOD:




  • 5.  RE: New setup questions

    Posted Jan 29, 2025 04:36 AM
    Edited by st-ops Jan 29, 2025 12:56 PM

    I will also say that from initial consultations, we were led to believe that our ASUS servers are compatible with vSAN ESA, because the NIC and disks are compatible. However, today we got a hard no from Broadcom, because our servers are not ReadyNodes.

    We cannot proceed if Broadcom/VMware denies us support.

    I have also been thinking that maybe I should try vSAN OSA. But right now, I am waiting for the reply.




  • 6.  RE: New setup questions

    Broadcom Employee
    Posted Jan 30, 2025 06:25 PM

    Make sure the device driver and firmware match an entry for your vSphere version in the compatibility guide for vSAN over RDMA: https://compatibilityguide.broadcom.com/search?program=rdmanic&persona=live&column=brandName&order=asc&keyword=BCM57414&brandName=%5BBroadcom%5D&activePage=1&activeDelta=20

    You may need to query your host for the NIC SSID, SVID, VID, DID info so you can find the correct entry in the compatibility guide.
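
As a rough illustration of pulling those four IDs out of host output, here is a minimal sketch. It assumes each device line has the shape `<pci-address> VID:DID SVID:SSID <owner> vmnicN`, which is how `vmkchdev -l` output is commonly shown; verify that against your own host before relying on it. The sample line used in the demo is invented and does not describe a real adapter.

```python
import re
from typing import Dict, Optional

# Assumed line shape (an assumption, check it on your host):
#   <pci-address> <VID>:<DID> <SVID>:<SSID> <owner> vmnicN
LINE_RE = re.compile(
    r"^(?P<addr>\S+)\s+"
    r"(?P<vid>[0-9a-fA-F]{4}):(?P<did>[0-9a-fA-F]{4})\s+"
    r"(?P<svid>[0-9a-fA-F]{4}):(?P<ssid>[0-9a-fA-F]{4})\s+"
    r"\S+\s+(?P<name>vmnic\d+)\s*$"
)


def parse_nic_ids(line: str) -> Optional[Dict[str, str]]:
    """Extract VID, DID, SVID and SSID from one device line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Lowercase so the values compare cleanly against other lowercase hex strings.
    return {key: value.lower() for key, value in match.groupdict().items()}


if __name__ == "__main__":
    # Invented sample line; the hex values are placeholders, not a real NIC.
    sample = "0000:3b:00.0 1234:abcd 5678:ef01 vmkernel vmnic4"
    print(parse_nic_ids(sample))
```

The resulting VID, DID, SVID and SSID values are the fields to compare against the matching entry in the compatibility guide.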

    Correct, only server chassis that have been approved as ESA ReadyNodes are allowed to run ESA. Not that it has to be branded a "ReadyNode", but it needs to be the same server model, etc., to help prevent unnecessary PSODs, ensure proper drive backplane architecture, etc. We call them "ReadyNode emulated" hosts.

    Hope this helps.




  • 7.  RE: New setup questions

    Posted Jan 31, 2025 03:32 AM

    I actually found the cause of the issue. The issue also happens on OSA, btw, so it does not depend on the vSAN type.

    The cause was the teaming setup. I believe a wrong combination of teaming settings on the NICs that carry vSAN, together with RDMA, is what caused the PSODs. I switched back and forth to confirm. I believe it was the setting of two active NICs and virtual port.

    I now have it on physical NIC load with a single adapter (but I can also do active/active).




  • 8.  RE: New setup questions

    Posted Jan 28, 2025 03:36 PM

    Hello @st-ops,

    I checked the PR I mentioned regarding the fix for this behaviour, and it looks to be shipping with ESXi 8.0 U3 P05, which should be released soon.




  • 9.  RE: New setup questions

    Posted Jan 29, 2025 04:08 AM
    Edited by st-ops Jan 29, 2025 12:55 PM

    Excellent, sounds good! Thank you!