VMware vSphere

  • 1.  Several ESXi hosts randomly have NIC drop alert

    Posted Jul 13, 2025 07:41 AM

    So, I have had this odd issue for a while and have done some research, but have found no resolution. We migrated from Cisco UCS with SAN storage to Dell vSAN Ready Nodes about two years ago. All of this is currently running VCF 5.2.1. We have two sites with the same basic configuration, although this happens more often at the production site than at the DR site. What happens is that, at random, there will be a NIC alert:

    hostname.domain.com: The NIC in Slot 6 Port 1 network link is down.

    It is always the same slot and port, no matter which host it is. It is only 1 host. These are Dell R750 vSAN Ready Nodes with 100G SFPs to Cisco switches. We have dual paths. The slot 7 NICs never show an alert. I have tried reseating the SFPs, switching them between slots 6 and 7, switching the cables, etc. The alert may only occur once every couple of weeks. If you are looking at the Cisco switch when it happens, the link light goes off and comes back on. Then a host may do it a dozen times over the course of a day or so and suddenly stop. There does not seem to be a firmware update for the NICs. I have seen some posts on a Dell forum where others have seen similar behavior, but those are a few years old. So far, it has not caused any issues; it is just a little frustrating. Is anyone else seeing anything like this?
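
    To see whether the alerts really cluster on one slot and port, the alert text could be tallied per host and per slot/port. The following is only a minimal sketch: the regular expression is built from the single alert line quoted above, and the host names and sample lines are made up for illustration. Other iDRAC message variants are not handled.

```python
import re
from collections import Counter

# Pattern based only on the alert text quoted in this thread;
# other message variants are not handled (assumption).
ALERT_RE = re.compile(
    r"^(?P<host>\S+): The NIC in Slot (?P<slot>\d+) Port (?P<port>\d+) "
    r"network link is down\.$"
)


def tally_link_down(lines):
    """Count link-down alerts per host and per (slot, port)."""
    by_host = Counter()
    by_port = Counter()
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a link-down alert
        by_host[match["host"]] += 1
        by_port[(int(match["slot"]), int(match["port"]))] += 1
    return by_host, by_port


# Hypothetical sample lines; the host names are invented.
sample = [
    "esx01.example.com: The NIC in Slot 6 Port 1 network link is down.",
    "esx02.example.com: The NIC in Slot 6 Port 1 network link is down.",
    "esx01.example.com: The NIC in Slot 6 Port 1 network link is down.",
    "esx03.example.com: Some unrelated alert.",
]
by_host, by_port = tally_link_down(sample)
print(by_host.most_common())
print(by_port.most_common())
```

    If one (slot, port) pair dominates across many hosts, that points away from a single bad card and toward something the hosts have in common.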



    ------------------------------
    Rodney Barnhardt
    vExpertPro
    ------------------------------


  • 2.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 14, 2025 09:17 AM

    Please clarify this statement:
    "It is always the same slot and port, no matter which host it is. It is only 1 host."
    Is this happening to only 1 host or multiple hosts?

    My hunch would be that the NIC is faulty or overheating. If it is just 1 host, I would lean towards a bad NIC. If it is multiple hosts, I would lean towards overheating causing the NIC to reset or fail temporarily. I have experienced NICs that just stop and work fine once replaced. I have also experienced 10G copper NICs that drop to 1G, and the theory is that they are overheating.

    You mentioned you swapped cables and SFPs but the issue stays in slot 6. Are the NICs identical? Can you swap them between slots 6 and 7? If so, and the issue moves to slot 7, you know you have a faulty NIC. If not, there may be an airflow issue: not necessarily a problem with the system, but perhaps not enough cooling for that NIC. Are there any other free slots you can move the slot 6 NIC to?

    Did you buy it in its current configuration, or did you add any NICs to it? The reason I ask is that Dell generally tests that a particular configuration will support the hardware purchased with it. There have been cases where I ordered Dell servers and the configuration changed, for example by adding fans or not allowing as much memory, and these changes were all due to cooling. I was wondering if perhaps it was ordered one way and hardware was added later that requires additional cooling.
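
    The swap test described above boils down to a small decision rule. This is just a sketch of that reasoning, not a diagnostic tool: the function name and return strings are made up for illustration, and it assumes the two cards are swapped between slots 6 and 7.

```python
def interpret_swap(fault_slot_before, fault_slot_after, swapped_slots=(6, 7)):
    """Interpret where the link-down alert appears after swapping two NICs.

    fault_slot_after is None if no alert has recurred since the swap.
    """
    slot_a, slot_b = swapped_slots
    if fault_slot_before not in swapped_slots:
        raise ValueError("the faulty slot must be one of the swapped slots")
    other_slot = slot_b if fault_slot_before == slot_a else slot_a
    if fault_slot_after == other_slot:
        # The alert moved with the card.
        return "fault followed the card: suspect the NIC"
    if fault_slot_after == fault_slot_before:
        # The alert stayed put even though the card changed.
        return "fault stayed with the slot: suspect the slot, e.g. airflow"
    return "inconclusive"


print(interpret_swap(6, 7))
print(interpret_swap(6, 6))
print(interpret_swap(6, None))
```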




  • 3.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 14, 2025 09:25 AM

    It is multiple hosts, and it has happened at more than one location. The hosts are all configured identically. So, no matter which host generates the error, it is always the same slot number and port. The alert comes from the iDRAC itself but can also be seen in the vCenter logs. The NICs are identical, but I have not swapped them yet. A bad NIC seemed unlikely since it happens across multiple hosts and at more than one location. These are in a well-cooled data center with hot and cold aisles. The nodes are as built and shipped from Dell.



    ------------------------------
    Rodney Barnhardt
    vExpertPro
    ------------------------------



  • 4.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 15, 2025 07:56 AM

    I would agree that a faulty NIC is unlikely, since the problem shows up across identical hosts. There have been instances where a particular batch of hardware has a flaw, and if you purchase servers at the same time you are more likely to get NICs from the same batch. However, the NICs in slots 6 and 7 are the same and are likely from the same batch, yet only slot 6 alerts.

    You mentioned there are no firmware updates for the NICs. What about the BIOS and the iDRAC/Lifecycle Controller? Are those up to date?

    Do you have any other slots you can move it to? Perhaps slot 6 is right next to a hot CPU or something similar, and even though it came from Dell configured that way, it may not be getting the cooling it needs.

    Are these still under support? Have you reached out to Dell? With several of your hosts experiencing the same issue, you would think Dell would have had reports of this already.




  • 5.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 15, 2025 09:36 AM

    Rodney, what exact NIC do you use? 




  • 6.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 15, 2025 09:52 AM

    The NIC's are: Broadcom BCM57508 2x100G QSFP PCIE 

    The transceivers are: FTLC9555REPM3-E5

    I have spoken to Dell in the past about it. At one point there was an ESX firmware update that was causing another issue. I was hoping that would also fix this problem. They pointed more toward the Cisco switch as a possible cause. However, the network team says the problem would not move around to different hosts or sites if it were the switches.



    ------------------------------
    Rodney Barnhardt
    vExpertPro
    ------------------------------



  • 7.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 16, 2025 07:25 AM

    Ahh. The network team. As they continue to dodge your issue, could they at least grab the logs out of your switches for review? Those logs could reveal additional telemetry that your hosts otherwise would not be aware of.
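
    Once the switch logs are in hand, one way to use them is to line up switch events against the host alerts by timestamp. Below is a rough sketch of that idea only. It assumes both sides have already been parsed into (timestamp, label) tuples and that the clocks are reasonably in sync (for example via NTP); the function name, the 5-second tolerance, and all sample data are invented for illustration and are not taken from real logs.

```python
from datetime import datetime, timedelta


def correlate(host_events, switch_events, tolerance=timedelta(seconds=5)):
    """Pair each host alert with switch events within `tolerance` of it.

    Both inputs are lists of (timestamp, label) tuples. Returns a list of
    (host_event, matching_switch_events) pairs; an empty match list means
    the switch logged nothing close to that host alert.
    """
    results = []
    for host_time, host_label in host_events:
        nearby = [
            (sw_time, sw_label)
            for sw_time, sw_label in switch_events
            if abs(sw_time - host_time) <= tolerance
        ]
        results.append(((host_time, host_label), nearby))
    return results


# Hypothetical, hand-written sample data.
host_events = [
    (datetime(2025, 7, 13, 7, 41, 2), "esx01 Slot 6 Port 1 link down"),
    (datetime(2025, 7, 13, 9, 15, 30), "esx02 Slot 6 Port 1 link down"),
]
switch_events = [
    (datetime(2025, 7, 13, 7, 41, 0), "switch-a port down"),
    (datetime(2025, 7, 13, 7, 41, 4), "switch-a port up"),
]
pairs = correlate(host_events, switch_events)
for host_event, matches in pairs:
    print(host_event[1], "->", [label for _, label in matches])
```

    A host alert with no matching switch event, or the reverse, would be worth raising with both vendors.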




  • 8.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 17, 2025 09:06 AM

    I ran into a similar issue back when 10G NICs were first hitting the scene. It took a coordinated effort from IBM (servers), Cisco (switches), and VMware to get it resolved. Ultimately, they had to roll out targeted fixes across the blade servers, switches, and the ESXi build to fully address the problem.

    In your case, I'd suggest asking Dell to assemble a cross-vendor team (Dell, Cisco, and Broadcom) to investigate. After all, these servers are certified vSAN nodes, and they deserve proper attention.




  • 9.  RE: Several ESXi hosts randomly have NIC drop alert

    Posted Jul 17, 2025 09:34 AM

    That may be what I need to do. Recently, a server in another cluster/workload domain raised the following alert:

    The NIC in Slot 7 Port 1 network link is down.

    This is a different slot. The interesting part is that the hardware in this cluster is slightly different because it is dedicated to SAP. One of the differences is the NIC: these servers have Intel NICs rather than Broadcom NICs. Slot 7 also goes to a different ToR switch. So that seems to rule out this being an issue with Broadcom NICs alone.



    ------------------------------
    Rodney Barnhardt
    vExpertPro
    ------------------------------