VMware vSphere

10GB network port locks after Veeam backup

  • 1.  10GB network port locks after Veeam backup

    Posted Dec 25, 2019 05:30 AM

    I have a 1GB network connection on my Lenovo X3650 M5 server and was originally backing up a couple of VMs using the community version of Veeam 9.5 Update 4b. This works perfectly and has no issues. I have exactly the same setup on an HPE DL360 Gen 9 and it also works perfectly.

    I upgraded the network on both and put in a Mellanox 10GB card and set it up to have a management port on the 10GB connection. All fine so far.

    When I run the backup using Veeam on the HPE 10GB management network it works fine and backs up the VMs. However, when I do the same with the Lenovo, it completes the backup and then the IP stops responding: can't ping it, can't get into the UI, nothing. Using the 1GB management network and checking the network configuration, everything looks okay, but to actually get it working again I have to shut down and restart ESXi.

    The ESXi version is the Lenovo build of 6.7 Update 3, but it has done this with Update 1 and 2 as well.

    If I use the 1GB management network for the backup everything is fine.

    I deleted the 10GB network configuration from ESXi and recreated it, and the same issue occurred.

    Any ideas on what may be happening would be appreciated.



  • 2.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 03:46 AM

    This requires logs to be checked.

    This is not a NIC capacity issue.

    Could you please check the NIC stats

    esxcli network nic stats get -n vmnicX
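
    A quick way to pull out just the error and drop counters, and to watch the vmkernel log while the backup runs (a minimal sketch, assuming the affected 10GB uplink is vmnic4 - substitute your own vmnic number):

    # show only the error/drop counters for the uplink
    esxcli network nic stats get -n vmnic4 | grep -iE "error|drop"

    # follow the vmkernel log for entries mentioning the uplink during the backup
    tail -f /var/log/vmkernel.log | grep -i vmnic4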



  • 3.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 07:59 PM

    Could you please check the NIC stats

    esxcli network nic stats get -n vmnicX

    Here it is. This was just after a backup and the port was not responding.

    NIC statistics for vmnic4
       Packets received: 28963399
       Packets sent: 87516039
       Bytes received: 29253117297
       Bytes sent: 121131004446
       Receive packets dropped: 0
       Transmit packets dropped: 0
       Multicast packets received: 811274
       Broadcast packets received: 0
       Multicast packets sent: 0
       Broadcast packets sent: 0
       Total receive errors: 0
       Receive length errors: 0
       Receive over errors: 0
       Receive CRC errors: 0
       Receive frame errors: 0
       Receive FIFO errors: 0
       Receive missed errors: 0
       Total transmit errors: 0
       Transmit aborted errors: 0
       Transmit carrier errors: 0
       Transmit FIFO errors: 0
       Transmit heartbeat errors: 0
       Transmit window errors: 0



  • 4.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 06:46 AM

    As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check that you have the correct VIB installed for the card. I assume you have checked the HCL?
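
    A couple of commands make the VIB and HCL check easier. A sketch, assuming the Mellanox uplink is vmnic4:

    # show the installed Mellanox driver VIB and its version
    esxcli software vib list | grep -i mlx

    # get the PCI vendor:device IDs for the uplink, to look the card up on the VMware HCL
    vmkchdev -l | grep vmnic4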



  • 5.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 02:26 PM

    As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check that you have the correct VIB installed for the card. I assume you have checked the HCL?

    I am a Linux noob, and an ESXi one as well, but I will see if I can track them down.



  • 6.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 08:10 PM

    As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check that you have the correct VIB installed for the card. I assume you have checked the HCL?

    I have just used the Lenovo-specific ISO to install, so I would assume it has the correct driver for its own network card. Not sure what you mean about the HCL?

    Ran the backup, and after the lockup checked the vmkernel log; the only reference to vmnic4 (the 10GB connection) is:

    2019-12-30T19:44:29.367Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
    2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
    2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1

    and a bit further down

    2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
    2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1
    2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

    The only thing in hostd that looks unusual is:

    2019-12-30T20:14:15.688Z info hostd[2099523] [Originator@6876 sub=Libs opID=5c5015c6] NetstackInstanceImpl: congestion control algorithm: newreno
    2019-12-30T20:14:17.566Z info hostd[2098897] [Originator@6876 sub=Vimsvc.TaskManager opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Task Created : haTask--vim.vslm.host.CatalogSyncManager.queryCatalogChange-539196836
    2019-12-30T20:14:17.567Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Transfer to exception error code: 403, message:
    2019-12-30T20:14:17.568Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] AdapterServer caught exception: N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound
    --> )



  • 7.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 10:53 AM

    Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.



  • 8.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 02:23 PM

    Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

    I have both a 1GbE and a 10GbE management connection. The 1GbE works perfectly but is slow when copying the backups, which is why I want the 10GbE connection working. It is only a small number of VMs, but I still don't like it taking so long because I start the backup manually. The 10GbE connection locks up. I had two 10GbE connections configured for failover, but it did not fail over. Removing the failover allowed the network to run again until the next backup, and then the remaining connection locked.



  • 9.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 03:46 PM

    Check if you're using the latest driver and firmware for your NICs. This sounds like a driver/firmware issue.



  • 10.  RE: 10GB network port locks after Veeam backup

    Posted Dec 30, 2019 08:23 PM

    I just did an esxcli network nic get -n vmnic4 and the output is below. It says Pause RX: true and Pause TX: true.

    Does this mean that something has paused the NIC, and if so, how do I unpause it?

    [root@esxi67:~] esxcli network nic get -n vmnic4
       Advertised Auto Negotiation: true
       Advertised Link Modes: 1000None/Half, 1000None/Full, 10000None/Half, 10000None/Full, 40000None/Half, 40000None/Full, Auto
       Auto Negotiation: false
       Cable Type:
       Current Message Level: -1
       Driver Info:
             Bus Info: 0000:06:00:0
             Driver: nmlx4_en
             Firmware Version: 2.11.500
             Version: 3.17.13.1
       Link Detected: true
       Link Status: Up by explicit linkSet
       Name: vmnic4
       PHYAddress: 0
       Pause Autonegotiate: false
       Pause RX: true
       Pause TX: true
       Supported Ports:
       Supports Auto Negotiation: true
       Supports Pause: true
       Supports Wakeon: false
       Transceiver: external
       Virtual Address: 00:50:56:5b:ad:25
       Wakeon: None
    [root@esxi67:~]
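
    From what I can find, Pause RX/TX here reflect the NIC's Ethernet flow-control settings rather than something having administratively "paused" the port, so they may be a red herring. If I want to test with flow control disabled on the uplink, it looks like it would be something like the sketch below (assuming vmnic4 again, and that the pauseParams namespace is present on this 6.7 build):

    # show the current flow-control (pause) settings for all uplinks
    esxcli network nic pauseParams list

    # disable RX and TX flow control on the uplink as a test
    # (check 'esxcli network nic pauseParams set --help' for the exact option names on your build)
    esxcli network nic pauseParams set -n vmnic4 -r false -t false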



  • 11.  RE: 10GB network port locks after Veeam backup
    Best Answer

    Posted Dec 31, 2019 02:20 AM

    Okay, it looks like it is a driver issue. After much more searching and reading I found a knowledge base article detailing the issue, last updated 21 October. It says:

    Symptoms

    • An ESXi host is experiencing full traffic loss
    • All Virtual Machine traffic using a Mellanox adapter stops
    • Mellanox adapter driver in use is nmlx4_en 3.15.11.6, 3.16.11.6, or 3.17.13.1
    • Traffic is not passing over a Mellanox adapter but the link status shows as active
    • Both the vmkernel and VMs go unresponsive on the network.
    • Network Card MT27500 Family [ConnectX-3 and ConnectX-3 Pro Devices]

    Cause

    This is a driver related issue.

    Impact / Risks

    All network traffic can be lost when using this adapter and driver combination.

    Resolution

    This issue is resolved in later versions of the driver.
    nmlx4_en 3.15.11.10 (6.0 driver)
    nmlx4_en 3.16.11.10 (6.5 driver) or new releases (6.7 driver)

    My current version is 3.17.13.1, so it is clearly affected by this. The resolution says to use a driver later than 3.16.11.10, which I clearly already am, so it does not make sense.

    One workaround suggested was to downgrade the driver to 3.15.5.5. I have a BIOS/firmware update to do on my server, so if it is still having issues after that I may try the downgrade.

    Link to the knowledge base article is https://kb.vmware.com/s/article/60421?lang=en_US
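
    If anyone else hits this, the rough steps I would use to confirm the driver version in use and install a specific nmlx4_en release from an offline bundle look like this (the bundle path below is only a placeholder for whatever package you download for your ESXi build):

    # confirm the driver/firmware currently in use and the installed driver VIBs
    esxcli network nic get -n vmnic4
    esxcli software vib list | grep -i nmlx

    # install a specific driver version from a downloaded offline bundle, then reboot the host
    # (placeholder path - substitute the actual bundle you downloaded)
    esxcli software vib install -d /vmfs/volumes/datastore1/nmlx4_en-driver-offline-bundle.zip
    reboot

    Depending on the versions involved, the existing VIB may need to be removed first (esxcli software vib remove -n <name from the vib list output>) before an older driver will install.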



  • 12.  RE: 10GB network port locks after Veeam backup

    Posted Dec 31, 2019 04:56 AM

    In light of the issue, and since it looks like it has been going on for a while with no fix, I am going to swap the network card out for one that uses an Intel chipset.

    My HPE is using a 10GbE Intel chipset and is not having any issues, so I will swap the card out.