  • 1.  ESXi Network problem

    Posted Apr 16, 2015 04:31 PM

    I am having a strange networking problem with ESXi. My infrastructure is described below:

    I am running vCenter 5.5 on a VCSA (5.5.0 Update 2d, Build 2442330). The environment is a mix of ESXi 5.1 (mostly U2) and 5.5U1a. We have 1 primary site with 5 remote sites, all connected over GRE VPN tunnels. The VCSA is in the primary site, and each remote site has at least 1 ESXi server managed by the VCSA. Each remote site has independent internet, so at each site the core router (which terminates the GRE VPN) routes inside networks across the tunnel and has a default route to the local firewall. All ESXi boxes are configured with the core router as the default gateway; no custom static routes have been added to ESXi.

    At two of my remote sites (one site has two ESXi servers, the other site has one) the ESXi servers keep losing contact with vCenter. Nothing else on those networks ever has a problem talking to the main site, including talking to the VCSA. All three servers are varying versions of 5.1. Here’s what I’ve done for troubleshooting:

    1. ESXi can ping virtually any host across the tunnel back to the main site except the VCSA.
    2. I isolated my management network in ESXi to one physical NIC and did a SPAN on the switch port. A packet capture revealed that when pinging the VCSA, ESXi puts the packet on the wire with a destination MAC of the firewall. For all other hosts on the main site subnet, ESXi sends the packet with a destination MAC address of the core router. Packet captures on ESXi itself (using tcpdump-uw) show the same thing.
    3. ARP tables on all the hosts show correct MAC addresses for the core router and firewall.
    4. Doing a "vmkping -I vmk0 -N <core router> <VCSA>" succeeds (destination MAC is correct in this case).
    5. Adding a static route to ESXi for the address of the VCSA to force it to the core router does not work; ESXi still sends the packet to the firewall MAC. These static routes were removed after this workaround failed.
    6. Rebooting the host will resolve the issue for a short time.
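
    The check in step 2 boils down to reading the destination MAC from each captured frame and seeing which next hop it belongs to. Here is a minimal Python sketch of that check. The MAC addresses and the function name `next_hop_of` are made up for illustration, and it assumes untagged Ethernet II frames; it is not taken from the actual capture.

    ```python
    import struct

    # Hypothetical MAC addresses for illustration only; substitute your own.
    CORE_ROUTER_MAC = "00:11:22:33:44:01"
    FIREWALL_MAC = "00:11:22:33:44:02"


    def format_mac(raw: bytes) -> str:
        """Render 6 raw bytes as a lowercase colon-separated MAC string."""
        return ":".join(f"{b:02x}" for b in raw)


    def next_hop_of(frame: bytes) -> str:
        """Label the next hop a frame was sent to, based on its destination MAC.

        Assumes an untagged Ethernet II frame: 6 bytes destination MAC,
        6 bytes source MAC, 2 bytes EtherType.
        """
        if len(frame) < 14:
            raise ValueError("frame shorter than an Ethernet header")
        dst, _src, _ethertype = struct.unpack("!6s6sH", frame[:14])
        mac = format_mac(dst)
        if mac == CORE_ROUTER_MAC:
            return "core router"
        if mac == FIREWALL_MAC:
            return "firewall"
        return f"unknown ({mac})"


    # Example: a frame header addressed to the (hypothetical) firewall MAC.
    header = bytes.fromhex("001122334402" "aabbccddeeff" "0800")
    print(next_hop_of(header))  # prints "firewall"
    ```

    In the failed state, pings to the VCSA would come back labelled as the firewall, while pings to any other main-site host would be labelled as the core router.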

    Some more info:

    Of these three servers, one is an IBM x3300 M4 using Intel I350 gigabit NICs; the other two are Dell R620s. The R620s each have one 4-port Broadcom BCM5720 and one 4-port Intel 82580. IPv6 has been disabled on all three hosts. Right now vmk0 is bound to an Intel NIC on all three servers, but I’ve only just started watching this, so I’m not sure if we’ve seen the problem with vmk0 bound to a Broadcom.

    The Dells are inherited, so we did not do the installs, but I reinstalled one of them yesterday (it was 5.1 U2; I reinstalled with 5.1 U3) at the request of VMware support. Support noted that firmware and drivers were out of date and did not match the HCL, so those were updated when I reinstalled.

    The IBM was purchased and installed by us last summer, firmware and drivers match the HCL except for the BIOS version which is a version ahead (it matches the HCL for 5.5).

    I have a support case open, SR 15633456803. Originally they tried to blame the network. Although it’s clear to me that this is not a network issue, I put a Cisco case in anyway. Cisco quickly looked over the evidence I had gathered and made the same determination. The only way this could be a network issue is with ProxyARP. ProxyARP is disabled on my firewalls, ESXi has correct ARP tables, and when talking to anything on the main network besides the VCSA it is sending packets to the proper MAC address, so this is clearly not a ProxyARP issue. At this point VMware has basically said they are out of ideas. I’m hoping my reinstalled host stays online for over a week; then I will reinstall the other two hosts and hopefully call it a day, but I’m preparing for the possibility that may not happen.

    Anyone ever seen anything like this or have any ideas?

  • 2.  RE: ESXi Network problem

    Posted Jun 04, 2015 02:31 PM


    A friend suggested I try installing a vanilla ESXi load in VMware Workstation. I realize this is unsupported, and I'm not running any VMs on it, but it abstracts away hardware, drivers, and the OEM installations. I did so and it experiences the same problem. I was able to capture the test server in a failed state in a snapshot. A reboot resolves the issue, and now I can force it back into a failed state for troubleshooting.

    I also found the following forum post, "ESXi route to vcenter is incorrect", so I know I'm not the only one experiencing this. I still have no substantial response from support. Anyone with any thoughts at all?

  • 3.  RE: ESXi Network problem
    Best Answer

    Posted Jun 04, 2015 04:09 PM

    I have figured out the problem: ESXi does not properly release routes learned from ICMP redirects. You can force the stale redirect to clear by restarting the vmk0 interface (or whichever VMkernel port is your management interface); see "VMware KB: Internet Control Management Protocol Redirects". For a permanent fix I will be disabling ICMP redirects on my routers.
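
    To illustrate why the static route from the first post made no difference, here is a toy Python model of a host that keeps redirect-learned next hops in a separate table. This is only a sketch under assumptions: the class `ToyHost`, its methods, and the documentation-range addresses are invented, and the rule that a redirect entry is consulted before static routes is an assumption chosen to match the symptoms. It is not a description of ESXi's actual network stack.

    ```python
    import ipaddress


    class ToyHost:
        """Toy next-hop selection with a cache of ICMP-redirect entries."""

        def __init__(self, default_gateway: str):
            self.default_gateway = default_gateway
            self.static_routes = {}  # ip_network -> gateway
            self.redirects = {}      # ip_address -> gateway learned via redirect

        def add_static_route(self, cidr: str, gateway: str) -> None:
            self.static_routes[ipaddress.ip_network(cidr)] = gateway

        def receive_icmp_redirect(self, destination: str, new_gateway: str) -> None:
            # Assumption: the entry never expires on its own (the stale case).
            self.redirects[ipaddress.ip_address(destination)] = new_gateway

        def restart_interface(self) -> None:
            # Models the workaround: bouncing the interface drops learned entries.
            self.redirects.clear()

        def next_hop(self, destination: str) -> str:
            dst = ipaddress.ip_address(destination)
            # Assumption: redirect-learned entries win over everything else.
            if dst in self.redirects:
                return self.redirects[dst]
            matches = [net for net in self.static_routes if dst in net]
            if matches:
                # Longest-prefix match among static routes.
                best = max(matches, key=lambda net: net.prefixlen)
                return self.static_routes[best]
            return self.default_gateway


    # Hypothetical addresses: 192.0.2.1 = core router, 192.0.2.254 = firewall,
    # 198.51.100.10 = VCSA.
    host = ToyHost(default_gateway="192.0.2.1")
    host.receive_icmp_redirect("198.51.100.10", "192.0.2.254")
    host.add_static_route("198.51.100.10/32", "192.0.2.1")
    print(host.next_hop("198.51.100.10"))  # prints "192.0.2.254"
    host.restart_interface()
    print(host.next_hop("198.51.100.10"))  # prints "192.0.2.1"
    ```

    In this model only the redirected destination is affected, which matches what I saw: every other host on the main site subnet kept going to the core router, and only the VCSA went to the firewall until the interface was restarted.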