
ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine


  • 1.  ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Jul 16, 2012 05:35 AM

    Hi All,

    This one is very similar to this post.

    Really hoping the community might be able to provide some direction on some networking issues I've been getting since the ESXi 5 upgrade for my site. Some details:

    • ESXi 5
    • 8x HP DL380p Gen8 servers
    • HP ProLiant networking infrastructure

    Basically, since the upgrade (or I should say, fresh installation of ESXi 5) two networking issues have occurred.

    1. Randomly a vmnic will lose connectivity to the physical network.

    2. The physical network can no longer talk to the VM network through a vSwitch

    The network configuration has 4 links going to 2 separate switches (not aggregated). They tag some VLANs, but ignore that element for now (and yes, the default VLANs are the same).

    I'll start with issue 1, as I've been working through a support case with VMware that has gone nowhere at this stage and can't progress until the issue occurs again. This morning I came to site and found that one of the ESX servers in my HA/DRS cluster was disconnected. A ping from my workstation suggested the machine was off the network. When I went to the host's console I restarted management services and found everything was OK again - with the exception that some VMs' network connectivity was still down.

    When I jumped into vSphere I found that one of the 4 vmnics could NOT see any observed IP range - the rest were OK. This is a single NIC too.

    I then jumped into vMA and found that the VMs without network connectivity were also on this vmnic. As a workaround, I set this vmnic to Not Used on the vSwitch and on the inherited port groups those VMs belonged to, and they then had connectivity. I'm willing to bet that the management interface was on that vmnic before the restart of services.

    So right now you're thinking: faulty NIC, or a switch configuration variation on that port? Perhaps, but what makes this odd is that this exact same issue occurred on another server with another NIC (same model, however). After that, I decided to do some network troubleshooting with mirrored ports. Some results:

    Host A (physical) pings Host B (VM on vmnic3)

    ARP broadcast gets forwarded from the switch to the ESX host; however, the VM doesn't get the request

    Host B pings Host A

    ARP broadcast leaves the vSwitch, goes out the uplink and makes it to Host A. Host A responds, and on the mirror port I can see the reply get sent back to the vSwitch, but it doesn't make it to the VM

    Host A pings Host B again

    So now it has the physical address/IP mapping (ARP), so a directed ICMP echo is sent. It gets sent to the vSwitch but never hits the VM

    Host B pings Host A again

    I had to add a static ARP entry to get the ICMP happening, but the ICMP goes out to the physical device, an ICMP reply comes back to the host, but it never reaches the VM.

    Weird huh? VMware support said the same thing.

    So I've only been testing this with the failed vmnic, so it's not going through other vmnics. I can talk across the vSwitch, but not out to the physical network (or rather, the physical network's responses aren't making their way back through the vSwitch).

    ~ # esxcfg-nics -l

    Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
    vmnic0  0000:03:00.00 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:64 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
    vmnic1  0000:03:00.01 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:65 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
    vmnic2  0000:03:00.02 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:66 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
    vmnic3  0000:03:00.03 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:67 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet


    ~ # ethtool -i vmnic3

    driver: tg3
    version: 3.120h.v50.2
    firmware-version: 5719-v1.24 NCSI v1.0.60.0
    bus-info: 0000:03:00.3

    I've checked the HCL a number of times, and the server and NIC hardware and firmware versions are supported. I did have to use the HP ESX image, however, but I'm told that's still supported by VMware. I've also taken VLANs out of the mix here. I've also swapped switches, cables and ports (both new ones and already working ESX links) to rule out anything non-VMware.

    In the end, I have to enter maintenance mode and restart the server to get the NIC working again. I can only assume it's a VMware issue, the hardware is not actually supported (although the HCL says it is), or I've got really unlucky and there's a bad batch of Broadcom NICs getting about.

    Now, as for item 2, that's a bit more intermittent. Basically, vSphere administrators find we can't manage VMs through the VMware console. We find that when this occurs, if we ping the ESX host's management interface we don't get a response. Other parts of the network seem OK as they have that ARP lookup in their cache. This is likely why HA remains OK.

    We see that the ARP request again makes it to the uplink switch and seems to get to the management vmk0, and the ARP reply goes back (I confirmed this via tcpdump on the SSH console). From there I can't determine if it makes it to the vSwitch, but in any event it doesn't make it to the pinging workstation.

    This goes on for a few minutes and then after a time, everything starts working OK. Usually triggered by another host making a connection to that host.

    Any help here would be great! I've raised 2 cases with VMware but I'm not getting anywhere, and I'd rather not have to wait for the issues to occur again. To make matters worse, we're looking at upgrading our control systems virtual infrastructure and calling in contractors from overseas to support that process. I have to delay that until I can get to the bottom of this issue.

    Let me know if I've been too vague or if more specific information is needed.

    Thanks muchly!

    Message was edited by: Daza, topic update



  • 2.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Jul 16, 2012 06:04 AM

    Are you using the HP customized VMware ISO?



  • 3.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Jul 16, 2012 06:06 AM

    Yep - "I did have to use the HP ESX image however.."

    Sorry, I might not have made that clear enough.

    Cheers



  • 4.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Aug 24, 2012 07:16 AM

    Hi Daza,

    Did you find the root cause of this problem?

    I have the same problem as you, but with two Dell Blade M620 + Broadcom BCM5719, and I can't understand what happened.



  • 5.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Aug 24, 2012 10:12 AM

    This seems weird really, just as you said.

    So your HP-branded BCM5719 chips are actually the HP 331i or HP 331FLR onboard NICs of the Gen8?

    One thing I noticed in the HCL is that only tg3 driver version 3.123b.v50.1 is supported. You are running 3.120h.v50.2.

    http://partnerweb.vmware.com/comp_guide2/detail.php?deviceCategory=io&productid=21472&vcl=true

    http://partnerweb.vmware.com/comp_guide2/detail.php?deviceCategory=io&productid=21446&vcl=true

    The release notes state that this version adds support for the new Gen8 NICs:

    http://www.hp.com/swpublishing/MTX-782bca6458364148b98e34c5a5

    Supported Devices and Features:

    This software supports the following network adapters:

    • HP NC325m Quad Port PCIe Gigabit Server Adapter
    • HP NC326i PCIe Dual Port Gigabit Server Adapter
    • HP NC326m Dual Port PCIe Gigabit Server Adapter
    • HP Ethernet 1Gb 2-port 330i Adapter
    • HP Ethernet 1Gb 4-port 331FLR Adapter
    • HP Ethernet 1Gb 4-port 331i Adapter
    • HP Ethernet 1Gb 4-port 331T Adapter
    • HP Ethernet 1Gb 2-port 332T Adapter

    Try updating the driver accordingly first (wonder why VMware support hasn't mentioned that to you yet).
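    A minimal sketch of that check, for anyone who wants to script it across hosts. The helper name `driver_version` is made up for illustration, and the sample input is the `ethtool -i` output posted at the top of the thread; the supported version string is the one from the HCL pages linked above.

```shell
# Hypothetical helper: print the "version:" field from `ethtool -i` output on stdin.
# The "firmware-version:" line is skipped because its field name differs.
driver_version() {
    awk -F': *' '$1 == "version" { print $2 }'
}

supported="3.123b.v50.1"   # tg3 version listed on the HCL, per this post

# Sample input copied from the first post; on a host you would pipe
# `ethtool -i vmnic3` into driver_version instead.
running=$(printf '%s\n' \
    'driver: tg3' \
    'version: 3.120h.v50.2' \
    'firmware-version: 5719-v1.24 NCSI v1.0.60.0' \
    'bus-info: 0000:03:00.3' | driver_version)

if [ "$running" = "$supported" ]; then
    echo "tg3 driver $running matches the HCL"
else
    echo "tg3 driver $running differs from HCL version $supported"
fi
```

    This only compares strings, so it flags any mismatch rather than deciding which version is newer.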

    When you capture traffic, do you mirror the physical uplink of the vSwitch as well as the vNIC-Port + the GuestOS view? That is if you use a dvSwitch.

    I have no idea why a vSwitch would just "blackhole" layer 2 broadcasts such as ARP requests.



  • 6.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Aug 24, 2012 04:21 PM

    Hi MKGuy,

    I tried the new driver version but the problem still occurs: the vmnic stops passing traffic.

    ethtool -i vmnic2
    driver: tg3
    version: 3.123b.v50.1
    firmware-version: FFV7.2.14 bc 5719-v1.29
    bus-info: 0000:03:00.0

    I found some log entries in /scratch/log/vmkernel.log

    2012-08-23T16:10:08.445Z cpu9:4155)<6>tg3 : vmnic2: RX NetQ allocated on 1
    2012-08-23T16:10:08.445Z cpu9:4155)<6>tg3 : vmnic2: NetQ set RX Filter: 1 [00:50:56:b7:41:52 0]
    2012-08-23T16:10:38.447Z cpu12:4155)<6>tg3 : vmnic2: NetQ remove RX filter: 1
    2012-08-23T16:10:38.447Z cpu12:4155)<6>tg3 : vmnic2: Free NetQ RX Queue: 1

    Note: if I reboot the host the problem is solved, but I want to know the root cause so I can fix it for good.

    Please give me some advice!



  • 7.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Aug 24, 2012 07:01 PM

    Those logs themselves don't really say much to me, at least without context. Are you sure this was logged during a problematic time window and it affected vmnic2? ESXi logs use UTC time, so if you're UTC+2 like me for example, the logs would be from 18:10 local time.
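    To line a vmkernel.log timestamp up with local time, something like the following could be run on a workstation. It is only a sketch: it assumes GNU `date` (the `-d` option), and `utc_to_local` is a made-up helper name. The zone is passed as a POSIX TZ string, where the sign is inverted, so `LOCAL-2` means two hours ahead of UTC.

```shell
# Hypothetical helper: convert an ESXi log timestamp (UTC) to a given zone.
# $1: timestamp such as 2012-08-23T16:10:08.445Z
# $2: POSIX TZ string, e.g. LOCAL-2 for UTC+2 (POSIX inverts the sign)
utc_to_local() {
    ts=${1%Z}       # drop the trailing Z
    ts=${ts%.*}     # drop fractional seconds, if present
    TZ=$2 date -d "${ts}Z" '+%Y-%m-%d %H:%M:%S'
}

# First log line from the post above, shown in UTC+2
utc_to_local '2012-08-23T16:10:08.445Z' 'LOCAL-2'
```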

    Maybe you can avoid a full reboot if you just restart the management agents or set the vmnic down/up via esxcfg-nics/vicfg-nics.

    At this stage, if you have analyzed the issue as thoroughly as Daza, there's not much we can help you with. Contact support, or maybe Daza can give us an update on the case he had open? Besides, I can't imagine this being a general issue with the Gen8 onboard NICs, as it's such a popular platform, albeit new. There must at least be some kind of special trigger for it in your cases.



  • 8.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Aug 30, 2012 04:42 PM

    I am having disconnect issues with that same card, but I am thinking it is a hardware level problem at this point.  I have 3 identical DL380 Gen8s with the 331FLR (4 port with BCM5719) and also an add on NC365T (4 port with Intel 82580).

    * All 4 ports on each of the 331FLRs are touchy - You bump a cable and it will lose link or drop to 100Mbps. I have swapped cables (with different brand), hard set speed/duplex, and the ports are connected to two different switch brands.  (HP and Juniper).  I can trigger the drops just by moving the end slightly.

    * All 4 ports on each of the NC365Ts are bulletproof.  I can ham fist them with no drops.

    These are so sensitive that "don't touch the cables" is not an answer.  I have lost link on occasion without anyone near the rack.

    HP just sent me replacement 331FLRs.  I put one in and it is subject to the exact same problem.  I am still working the case with them.

    Thought I would throw this out there in case this lines up with your problem.



  • 9.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Aug 30, 2012 04:44 PM

    Hi All,

    I'll be out of the office through to Sept 10th on annual leave.

    I'll respond back on my return. Thanks.

    Nick



  • 10.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 02, 2012 04:00 PM

    Disabling NetQ should solve the problem until a new version of the driver is out.

    # esxcfg-module -s force_netq=0,0,0,0 tg3

    # reboot

    (from VMware support)

    We had the same problem with HP DL380p Gen8
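    A rough sketch for building that option string on hosts with a different number of tg3 ports. It assumes the string takes one 0 per tg3 NIC, which the thread does not confirm, so check with support before relying on it. The helper names `count_tg3` and `netq_off_option` are made up, and the sample input is the `esxcfg-nics -l` listing from the first post. The script only prints the command; it does not run it.

```shell
# Hypothetical helper: count NICs whose Driver column is tg3
# in `esxcfg-nics -l` output read from stdin.
count_tg3() {
    awk '$3 == "tg3" { n++ } END { print n + 0 }'
}

# Hypothetical helper: build "force_netq=0,0,..." with one 0 per NIC.
# ASSUMPTION: the option takes one value per tg3 port.
netq_off_option() {
    n=$1
    opt="force_netq="
    i=0
    while [ "$i" -lt "$n" ]; do
        if [ "$i" -gt 0 ]; then
            opt="$opt,"
        fi
        opt="${opt}0"
        i=$((i + 1))
    done
    echo "$opt"
}

# Sample listing from the first post; on a host, pipe `esxcfg-nics -l` instead.
nics=$(printf '%s\n' \
    'Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU' \
    'vmnic0  0000:03:00.00 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:64 1500' \
    'vmnic1  0000:03:00.01 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:65 1500' \
    'vmnic2  0000:03:00.02 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:66 1500' \
    'vmnic3  0000:03:00.03 tg3         Up   1000Mbps  Full   2c:76:8a:51:e6:67 1500' \
    | count_tg3)

# Print the command for review instead of executing it.
echo "esxcfg-module -s $(netq_off_option "$nics") tg3"
```

    After the reboot, `esxcfg-module -g tg3` should show the option string that was set.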



  • 11.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:30 PM

    Any fix for this yet?  I am having the exact same issue with Gen8 servers and Broadcom 5719 NICs, using the latest HP ESX ISO as the installation media.

    This has happened to me twice, on 2 different ESXi 5 hosts.



  • 12.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:39 PM

    Thanks for the info bobanveljanos!

    I tried disabling NetQ just for another data point - It did not help my issue.  (I can reproduce it before the OS is even booted so I am not surprised.)

    That said, if anyone is having lockup/drop issues where they do not lose link, disabling NetQ sounds like the way to go.



  • 13.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:45 PM

    6 days after disabling NetQ it is still working. Usually the NIC stopped receiving traffic after 4 days.



  • 14.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:45 PM

    Would changing the vSwitch's Failover Detection to Beacon Probing instead of Link State make the VMs use a different pNIC if the pNIC they were on stops passing traffic even though the link state is UP?

    I have mission-critical VMs on these hosts, and twice now, on different hosts, production clients have lost connectivity to the VMs due to this issue.  Each time I have to answer the question of why this happens and why I can't prevent it from happening again, especially in a 4-NIC team.



  • 15.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:48 PM

    Here is my info on all 8 (4 in LOM, 4 in quad port card) NICs

    driver: tg3
    version: 3.123b.v50.1
    firmware-version: 5719-v1.29 NCSI v1.0.80.0
    bus-info: 0000:03:00.2



  • 16.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:52 PM

    Are you seeing the symptoms I am (layer 1 - jiggle of doom) or what bobanveljanos and others saw (runs fine for a few days then locks up)?  If the latter, try disabling NetQ.  If the former, call HP.



  • 17.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 02:57 PM

    No jiggle of doom.  The pNIC just stops passing traffic and the VMs using that pNIC are obviously unable to access the network. 

    At VMworld last week, I talked to a VMware tech and he recommended using Beacon Probing on the vSwitch to help with the issue.  He didn't say anything about disabling NetQ.  What are the ramifications of disabling NetQ?



  • 18.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 09:31 PM

    The symptom is that VMs using the problematic vmnic stop receiving traffic from the network after 2-5 days of uptime. Traffic goes out to the switch from the VM, and the ARP and mac-address-table entries are present on the switch, but traffic does not return to the VM. Restarting the host solves the issue. It always happens on the same vmnic. The mezzanine card was replaced by HP and the problem moved from vmnic0 to vmnic2.

    In the vSphere Client you can see that CDP information is not available from the problematic vmnic.



  • 19.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 04, 2012 10:14 PM

    After talking with VMware support, they informed me that this issue is pretty widespread and has to do with the Broadcom NIC driver.  VMware has an open ticket with Broadcom to update their driver.  Until then, they recommend disabling NetQueue as stated earlier in this thread.

    Here is another link to the issue... http://rcmtech.wordpress.com/2012/08/15/vmware-esxesxi-issues-with-broadcom-5719-quad-port-gb-nic/



  • 20.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 10, 2012 03:03 AM

    Thanks TheNuts,

    I've seriously spent days on the phone to support regarding this. VMware ended up stating the issue was with our network.

    Heck, there was no link aggregation; we're using singular links. I spoke with 5 different guys, from 1st level technicians to senior engineers and support team managers.

    I seemed to get forwarded through to an Indian call center. I don't know if there's a knowledge gap between them and those mentioned in:

    http://rcmtech.wordpress.com/2012/08/15/vmware-esxesxi-issues-with-broadcom-5719-quad-port-gb-nic/

    Going to fire up VMware support. So angry.



  • 21.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Jul 10, 2013 02:47 PM

    http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&externalId=2035701&sliceId=2&docTypeID=DT_KB_1_1&dialogID=436382127&stateId=1%200%20436384894

    Answered here. We set up a laptop with a crossover cable to prove the issue. After testing and more testing, only adding the new NIC driver let us bind the IP address to the port, which resolved the issue.



  • 22.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine



  • 23.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 26, 2012 02:19 PM

    Thanks for the KB article.  Looks like they finally addressed it.  Now if only they would come out with a fix instead of a workaround... or Broadcom would come out with a new driver.

    So, I suppose on any new ESX host that is built with non-10Gb Broadcom 5719/5720 NICs, NetQueue will need to be disabled, since it is enabled by default.



  • 24.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Sep 26, 2012 02:32 PM

    Looks like there is an updated driver.  Does anyone know if it fixes the issue without disabling NetQueue?

    https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI50-Broadcom-tg3-3124cv501&productId=229#product_downloads



  • 25.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 12:22 PM

    I discovered this driver pack. Has anyone tried this yet?

    https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESX5X-Broadcom-bnx2x-17417v501&productId=229#product_downloads

    In the release notes there are hints about ESX 6.0 ;)



  • 26.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 12:25 PM

    Is that driver different than the one I linked above?

    I installed the one above yesterday



  • 27.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 12:54 PM

    The Gen8 onboard 331FLR NIC (Broadcom NetXtreme I) runs with the tg3 driver as per HCL:

    http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=21446&deviceCategory=io&releases=171,168&keyword=331FLR&deviceTypes=6&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

    The bnx2 or bnx2x driver linked above is meant for, and will only work with, other NICs (Broadcom NetXtreme II based), so it is not applicable to the Gen8 NICs.



  • 28.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 01:37 PM

    Our G8s do not have NICs embedded in the system board.  They came with a 4-port LOM: Broadcom 5719s.  I just upgraded to the driver I posted.  I am now running tg3 driver version 3.124c.v50.1 on all my NICs.



  • 29.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 01:44 PM

    The default Gen8 LOM you got there *is* exactly this HP 331FLR NIC I mentioned. It's just an HP NIC using the Broadcom 5719 chip, which requires the tg3 driver.

    http://h18004.www1.hp.com/products/servers/networking/331FLR/index.html

    Let's just hope that driver update fixes this issue.



  • 30.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 02:41 PM

    Hello All,

    The current workaround for the issue is to disable NetQ on the adapters.

    Looks like Broadcom is working on releasing a patch/fix for it.



  • 31.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 02:44 PM

    According to the VMware tech I was working with when I had these issues, the latest driver should correct the issue.  I did disable NetQ prior to installing the latest Broadcom driver, and I will leave it disabled now that I have upgraded the driver.  It was recommended to disable it if the NICs are not 10Gb.



  • 32.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 02:51 PM

    It is a known issue for the 5720/5719 to drop connections. The engineering team at VMware is working with Broadcom to provide a fix.



  • 33.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Oct 18, 2012 10:11 PM

    Yes, the same in our environment. We haven't put the "workaround" in place as I'm led to believe that's for the older driver version.

    We haven't seen a repeat of our issues since.

    2-3 weeks and counting.



  • 34.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Nov 08, 2012 03:14 PM

    In our environment the workaround seems to help.

    Has anyone tested with 5.1 yet?



  • 35.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Dec 06, 2012 11:46 AM

    Try updated firmware for the Broadcom NIC - this solved my issue: download here

    Or use the HP Service Pack for ProLiant 10/2012 (http://h18004.www1.hp.com/products/servers/service_packs/en/index.html) to create a bootable USB key for upgrading the firmware.

    BR



  • 36.  RE: ESXi 5 vmnic stops passing traffic - HP DL380p Gen8 - HCL fine

    Posted Jun 14, 2013 12:03 AM

    This should be fixed by this:

    http://kb.vmware.com/kb/2035701

    For good measure, also do a firmware update using the latest PSP/SUM from HP, and check whether the Linux online NIC firmware update *.scexe files are included in the repository.