iSCSI random disconnections ....where to look first?

  • 1.  iSCSI random disconnections ....where to look first?

    Posted Feb 10, 2016 01:08 AM

    Hi all,

    I'm experiencing a very unpleasant issue...

    I have three IBM x3650 M3 hosts in a cluster, each with two NICs (BMC 7905) dedicated to iSCSI. On the storage side there is an IBM DS3524 (our old SAN) and a NetApp 2240 (our new SAN).

    I get constant warning messages about iSCSI performance and, from time to time, a severe disconnection (the last one was this Saturday) that all but brings down the affected host and its VMs. The disconnection does not always occur on the same host.

    We already had this issue before upgrading our switches to Nexus 5k, so I don't think the network is the problem.


    All three hosts are running VMware vSphere 5.0 with the latest patches. Using the NetApp vCenter add-on, I set all the parameters to the recommended values.

    VMware support says "It's the network"

    Cisco says "The network is fine. Stable. No issues"

    NetApp says "I can see that the connection dropped from the other side...not here"

    The IBM SAN (where we still have some LUNs) says "Connection dropped unexpectedly". We don't have support on the IBM SAN.

    The iSCSI vmkernel ports are configured according to "best practices" (individual IPs, one active and one unused NIC, no failover).

    My next step is to upgrade the firmware on the boxes.

    Any clue where I should start looking? Is there any "special" extra setting to configure?

    Thanks!!

    /var/log # vmware -vl
    VMware ESXi 5.0.0 build-3086167
    VMware ESXi 5.0.0 Update 3

    # ethtool -i vmnic5
    driver: igb
    version: 2.1.11.1
    firmware-version: 3.18-0
    bus-info: 0000:15:00.1

    ~ # esxcli network nic get -n vmnic1
       Advertised Auto Negotiation: true
       Advertised Link Modes: 10baseT/Half, 10baseT/Full, 100baseT/Half, 100baseT/Full, 1000baseT/Full
       Auto Negotiation: true
       Cable Type: Twisted Pair
       Current Message Level: -1
       Driver Info:
             Bus Info: 0000:0b:00.1
             Driver: bnx2
             Firmware Version: bc 6.2.0 NCSI 2.0.11
             Version: 2.0.15g.v50.11-5vmw
       Link Detected: true
       Link Status: Up
       Name: vmnic1
       PHYAddress: 1
       Pause Autonegotiate: true
       Pause RX: false
       Pause TX: false
       Supported Ports: TP
       Supports Auto Negotiation: true
       Supports Pause: true
       Supports Wakeon: true
       Transceiver: internal
       Wakeon: MagicPacket(tm)
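
    Since the disconnection moves between hosts, it may help to compare NIC driver, firmware and flow-control settings across all three of them. Below is a minimal Python sketch, assuming the `esxcli network nic get` output has been saved as text from each host. The function names are made up for this example, and the second host's values are hypothetical, not taken from the environment described above.

```python
def parse_nic_info(text):
    """Turn the 'Key: value' lines of `esxcli network nic get` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # headers such as "Driver Info:" carry no value and are skipped
            info[key] = value
    return info


def nic_differences(a, b, keys=("Driver", "Version", "Firmware Version",
                                "Pause RX", "Pause TX")):
    """Return {field: (value_a, value_b)} for the chosen fields that differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Excerpt of the vmnic1 output posted above.
HOST_A = """
   Driver Info:
         Driver: bnx2
         Firmware Version: bc 6.2.0 NCSI 2.0.11
         Version: 2.0.15g.v50.11-5vmw
   Pause RX: false
   Pause TX: false
"""

# Hypothetical second host, invented purely to show a mismatch being reported.
HOST_B = HOST_A.replace("Pause RX: false", "Pause RX: true")

print(nic_differences(parse_nic_info(HOST_A), parse_nic_info(HOST_B)))
```

    An empty result means the compared fields match. Any entry is only a lead for further checking, not a diagnosis.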



  • 2.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 10, 2016 08:50 AM

    I would look closer at your switch configuration. Can you confirm that there are no port channels or anything along these lines configured on the Nexus switches?



  • 3.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 10, 2016 04:59 PM

    Hi Nick,

    Thank you for being interested!

    All iSCSI ports are access ports with autonegotiation on. There is no routing involved: all the ESXi hosts and the SANs reside on the same IP network.

    There are 2 port channels on these switches, one peer-link between the switches (trunking everything) and the other to our core L3 switch (trunking everything too).

    I have set up iSCSI according to a best-practices document:

    - One vmk port per interface, each with its own IP. On the NIC Teaming tab, "Override switch failover order" is checked, that interface is set as "Active Adapter" and the other NIC as "Unused Adapter".

    - On the storage adapter, I'm using the iSCSI Software Adapter, and on the Network Configuration tab I have both vmk ports selected. They are both green and port group policy compliant.

    - Path policies on the targets: RR (Round Robin) on the NetApp and MRU (Most Recently Used) on the IBM (according to the IBM documentation, RR is not supported).
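
    The teaming rule in the first item (exactly one active uplink and no standby uplink per bound vmk port group) can be checked from saved command output rather than by clicking through each host. This is a rough Python sketch, assuming the text comes from `esxcli network vswitch standard portgroup policy failover get -p <portgroup>` and that it contains "Active Adapters" and "Standby Adapters" lines. The sample text is illustrative, not captured from these hosts, and the function names are made up.

```python
def parse_failover_policy(text):
    """Parse 'Key: value' lines of a port group failover policy into a dict."""
    policy = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            policy[key.strip()] = value.strip()
    return policy


def adapters(policy, field):
    """Split a comma-separated adapter field into a list of NIC names."""
    return [name.strip() for name in policy.get(field, "").split(",") if name.strip()]


def binding_problems(policy):
    """List violations of the one-active, no-standby rule for iSCSI port binding."""
    problems = []
    active = adapters(policy, "Active Adapters")
    standby = adapters(policy, "Standby Adapters")
    if len(active) != 1:
        problems.append(f"expected exactly 1 active adapter, found {len(active)}")
    if standby:
        problems.append("standby adapters present: " + ", ".join(standby))
    return problems


# Illustrative output for one iSCSI port group (assumed format, hypothetical values).
SAMPLE = """
   Failback: false
   Active Adapters: vmnic1
   Standby Adapters:
   Unused Adapters: vmnic2
"""

print(binding_problems(parse_failover_policy(SAMPLE)))
```

    An empty list means the port group matches the rule. Running the same check against every iSCSI port group on every host would rule out a teaming misconfiguration on a single host.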


    All paths show as Active, Active (I/O) or Standby.

    I read about a bug where iSCSI traffic tried to go out through an unused NIC and was therefore discarded, but that bug should have been fixed. As a workaround, it was suggested to explicitly set the NIC teaming Failback option to No on each vmk port. Still, the behaviour looks just like that bug: at some point, traffic tries to go out but never makes it.

    We don't have any L2 issues, but when this problem happens, we "fix" it by disconnecting one of the iSCSI NICs on each ESXi host. That triggers something that redirects all the traffic through the other NIC (the link that is still up), and everything reconnects.



  • 4.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 10, 2016 06:29 PM

    Hi,

    I'm assuming here that the two NICs you are using are dedicated to iSCSI traffic and that they are in the same vSwitch. If so, then try this: separate the NICs reserved for iSCSI into their own vSwitches so you don't need to have a NIC set as unused.

    This would give you two vSwitches, each with one active NIC and a VMkernel port for iSCSI use. From here, continue to configure iSCSI as before.

    This may prevent the situation you are facing.

    Kind regards.



  • 5.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 10, 2016 09:47 PM

    I can do that, but it will increase the complexity of the targets and of the routes used to reach the storage.

    According to VMware, independent vSwitches for iSCSI vmk ports should be configured with different IP networks. So I would have to split everything between two IP subnets (still using the same L2).

    VMware KB: Considerations for using software iSCSI port binding in ESX/ESXi


    In my case, my target has several IPs, and I'm using port binding....


    A couple of weeks ago I set up a Splunk server to collect logs and, after checking them, it appears that only the connection to the IBM SAN is dropping. I checked our FW version and we are a couple of subversions behind the latest one. Need to learn how to upgrade the FW now.
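
    The finding that only the IBM SAN connection drops can be made explicit by tallying drop events per host and per array. This is a minimal Python sketch, assuming the events have already been extracted from the collected logs as (host, target IP) pairs. The host names, IP addresses and events below are entirely hypothetical, and no particular log format is assumed.

```python
from collections import Counter


def drops_per_array(events, array_by_ip):
    """Count connection-drop events per (host, array).

    events: iterable of (host, target_ip) pairs extracted from the logs.
    array_by_ip: mapping from iSCSI target IP to a readable array name.
    """
    counts = Counter()
    for host, target_ip in events:
        counts[(host, array_by_ip.get(target_ip, "unknown"))] += 1
    return counts


# Hypothetical target portal addresses (documentation range) and events.
ARRAY_BY_IP = {"192.0.2.10": "IBM DS3524", "192.0.2.20": "NetApp 2240"}
EVENTS = [
    ("esx1", "192.0.2.10"),
    ("esx2", "192.0.2.10"),
    ("esx1", "192.0.2.10"),
]

for (host, array), count in sorted(drops_per_array(EVENTS, ARRAY_BY_IP).items()):
    print(host, array, count)
```

    If every non-zero row names the same array, the array or its path configuration is a stronger suspect than the shared switches.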





  • 6.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 11, 2016 03:03 AM

    Hi there,

    Sorry about that. I was pretty sure it used to be VMware's recommendation to separate the iSCSI NICs across vSwitches, but this now says otherwise: vSphere 5.5 Documentation Center

    Here, as you mentioned, VMware recommends that port binding only be used if the network adapters reside in the same virtual switch. Apologies for the false information, as I certainly don't want to advocate additional complexity, especially around your storage connectivity.

    Kind regards.



  • 7.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 11, 2016 05:26 PM

    Please, no need to apologize!

    In one of IBM's Redbooks, an example is provided using two NICs on the VMware host, each NIC using a different VLAN to the storage.



  • 8.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 11, 2016 08:59 AM

    "I checked our FW version and we are a couple of subversions behind the latest one. Need to learn how to upgrade the FW now."

    Check on the VMware compatibility list which array firmware revisions are supported for your storage array.



  • 9.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 10, 2016 09:48 PM

    Is there any special config for the vSwitch I could be missing??



  • 10.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 28, 2016 05:31 PM

    It looks like the issue was triggered by some path "misunderstanding" between the IBM DS3524 SAN and the host. Apparently, when VMware has trouble connecting to one SAN over iSCSI, it causes a big performance hit that also affects the other SANs: the host becomes unresponsive, VMs lose their storage connections, etc.

    So last week I upgraded the IBM SAN firmware from 7.38 to the latest (8.20) and, to enforce path selection, selected ALUA as the "host type" on the array. On the host side, I verified with an esxcli storage command that ALUA was also showing as the policy for the LUNs.

    Since then I have performed several datastore and host migration operations with no issues (so far).
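
    The post does not name the exact command used for the host-side check. One way to do it would be `esxcli storage nmp device list`, where ALUA appears in the "Storage Array Type" field as VMW_SATP_ALUA. Below is a small Python sketch, assuming that command's output has been saved as text. The device identifiers and values in the sample are hypothetical, and the function names are made up.

```python
def parse_nmp_devices(text):
    """Parse `esxcli storage nmp device list` output into {device: {field: value}}.

    Device identifiers start at column 0; their fields are indented below them.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            current = line.strip()
            devices[current] = {}
        elif current is not None:
            key, sep, value = line.strip().partition(":")
            if sep:
                devices[current][key.strip()] = value.strip()
    return devices


def non_alua_devices(devices):
    """Return the devices whose SATP is anything other than VMW_SATP_ALUA."""
    return sorted(dev for dev, fields in devices.items()
                  if fields.get("Storage Array Type") != "VMW_SATP_ALUA")


# Hypothetical sample; identifiers and values are invented for illustration.
SAMPLE = """\
naa.example0001
   Device Display Name: Example iSCSI Disk 1
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_MRU
naa.example0002
   Device Display Name: Example iSCSI Disk 2
   Storage Array Type: VMW_SATP_DEFAULT_AA
   Path Selection Policy: VMW_PSP_FIXED
"""

print(non_alua_devices(parse_nmp_devices(SAMPLE)))
```

    Any LUN from the IBM array that turns up in this list after the host type change would still be claimed by a different SATP and would be worth a second look.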



  • 11.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 28, 2016 07:36 PM

    It seems that these kinds of issues are generally reported when more than one iSCSI array, mixing active/standby and active/active configurations, connects through a single software iSCSI adapter.


    I don't think it is mentioned anywhere that such a configuration is not supported by VMware. To understand more about this issue, though, a complete understanding of the iSCSI sessions and of vmkernel behaviour while connecting to different targets is required. If you still have a support case open with VMware, I suggest you ask for more info on this.



  • 12.  RE: iSCSI random disconnections ....where to look first?

    Posted Feb 28, 2016 11:03 PM

    Glad the issue is resolved now.