VMware vSphere


VM Ping/ARP issue

MikeOD

Oct 17, 2012 10:10 PM

  • 1.  VM Ping/ARP issue

    Posted Oct 12, 2012 11:43 PM

    We are having a problem with some of our virtual machines intermittently losing communication with each other, and I’m at a loss as to the source.

    We have about 250 VM’s running on about 20 HP BL465C blades installed in two HP C7000 chassis, using the HP Virtual Connect interconnect modules.  The blade chassis are connected to our core Cisco 6500 switches.  The VMWare hosts are at 5.0; the guest VM’s are a mix of Windows 2003, 2008, and 2008 R2.

    What’s going on is that everything seems to be OK, but then out of nowhere, we will get communication failures between specific machines.    It looks like it’s an ARP issue.  Using PING, it works fine in one direction, but we get an “unreachable” error when going the other way, unless we ping from the target back to the source first.

    For example: we have servers, “A” and “B”.   Ping A to B fails with “unreachable”. Ping “B” to “A” works fine.   However after pinging “B” to “A”, we can now ping “A” to “B”, at least for a while until the entry falls out of the ARP cache.  If we go into server “A” and set a static ARP entry (“arp –s”) for server “B”, everything works OK.  Through all this both server “A” and server “B” have no issues communicating with any other machines.

    We tried using vMotion to move the servers to a different host, different blade chassis, etc.  Nothing worked except when we put both VM’s on the same host.  Then everything worked OK.  Moving one of the servers back to a different host brought the problem back.

    It seems like either the ARP broadcast from the one server, or the reply back from the target, isn't making it through.  However, according to our networking group, there are no issues showing up on the Cisco switches.

    Early this year, we had an issue where it happened on about a third of machines at the same time (it caused significant outages to production systems!).   It seemed like it was limited to machines on one chassis (but not all of the machines on that chassis).  At that time, we opened up tickets with VMWare and HP.  Neither found anything wrong with our configurations, but somewhere in the various server moves, configuration resets, etc., everything started working.

    Since that time we’ve seen it very intermittently on a few machines, but then it seems to go away after a few days.

    The issue we found today was that the server we’re using for the Microsoft WSUS server hadn’t been receiving updates from a couple of the member servers.  We could ping from the WSUS to the member server, but not back from the member server unless we put a static ARP entry in the member server.  The member servers are working fine otherwise, talking to other machines OK, etc.   They are a production environment, so we’re limited on the testing we can do.

    Also, when it has happened, it seems like it's always been between machines on the same subnet.  However, most of our servers are on the same subnet, so it might just be coincidence.

    I’ve done a lot of internet searching, and have found some postings with similar issues, but haven’t found any solution.  I don’t know if it’s a VMWare issue, HP, Cisco, or Windows issue.

    Any assistance would be appreciated.

    Mike O'Donnell



  • 2.  RE: VM Ping/ARP issue

    Posted Oct 13, 2012 01:57 AM

    Hello MikeOD,


    Could you please let me know whether the problem exists on all three of the OS versions you mentioned?


    I had the same kind of problem with Windows Server 2008/R2, where it was a Windows Firewall problem. The issue was fixed by enabling 'File and Printer Sharing (Echo Request - ICMPv4-In)'.

    To do this, go to Start, type 'Windows Firewall with Advanced Security', open Inbound Rules, and enable 'File and Printer Sharing (Echo Request - ICMPv4-In)'. :)
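
    If you prefer the command line, something like this from an elevated prompt should enable the same rule (the exact rule name can vary slightly between OS versions, so check it first if the command reports no match):

        netsh advfirewall firewall set rule name="File and Printer Sharing (Echo Request - ICMPv4-In)" new enable=yes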


    Now try pinging between the systems in both directions.


    Regards,

    deemee1988



  • 3.  RE: VM Ping/ARP issue

    Posted Oct 13, 2012 03:24 AM

    Thanks for the response.

    This time it seems to only be 2008r2, but that's the majority of our servers anyway. I don't recall if the incident earlier this year had any other O/S, although I believe it did.

    However, the firewall is disabled on our servers, since they're internal on our domain.

    Also, it's not blocking all PINGs, just between certain servers, and then only intermittently. Each server can send and reply to other servers.

    It looks like it's an issue with the ARP responses either not making it back from the target server, or being ignored by the sending server. I just can't figure what's causing it.



  • 4.  RE: VM Ping/ARP issue

    Posted Oct 17, 2012 10:10 PM

    Anybody??



  • 5.  RE: VM Ping/ARP issue

    Posted Oct 18, 2012 03:46 AM

    Mike -- is the firmware on the VC modules up to date? Anything in the OA or VC logs?

    Also, are there any standalone blades in the chassis? If so, can we isolate this problem to just the VMs?

    ~Sai Garimella



  • 6.  RE: VM Ping/ARP issue

    Posted Oct 18, 2012 05:13 PM

    Nothing is showing up in the logs in VC or OA.

    VC is at 3.60, OA is at 3.56. Both of those are one release back, but the release notes for VC 3.70 and OA 3.60 don’t show anything fixed that would account for this. Besides we can’t go to VC 3.70, since we’re using some of the 1/10GB Enet modules; those aren’t supported past 3.60.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Mike O’Donnell

    Department of Technology

    (614) 645-6353 (voice)

    (614) 645-5444 (fax)



  • 7.  RE: VM Ping/ARP issue

    Posted Oct 19, 2012 12:11 AM

    We just began seeing the same issue, physical or virtual, and have narrowed it down to any flavor of Windows 2008. Clearing the ARP cache on the affected servers only briefly fixes the issue. A lot of people report success with the hotfix available at http://support.microsoft.com/kb/2582281, though I just found it today and plan to test in a maintenance window. By the description, there are hotfix versions for Vista through Windows 2008 R2 SP1. In the meantime, adding a static ARP entry is a temporary workaround. If you are not able to add static entries using the arp -s command, use the netsh interface ipv4 add neighbors command instead, which works when arp -s does not.
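
    For example, something along these lines, where the interface name, IP, and MAC are placeholders for your own environment:

        netsh interface ipv4 add neighbors "Local Area Connection" 192.168.1.50 00-50-56-aa-bb-cc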

    Other info: when we run a Wireshark trace on an affected server and filter for ARP, we see that no ARP broadcasts are sent or received; they appear to be filtered by the TCP/IP stack. When we clear the ARP cache we see one ARP broadcast get sent, and one directed reply. Then it stops working again until the ARP cache is cleared again or a static ARP entry is added.

    Hope something here helps.



  • 8.  RE: VM Ping/ARP issue

    Posted Oct 19, 2012 01:53 AM

    I had come across this before, but from the support article it seemed like it only related to clustering, so I didn’t think it applied.

    Also, you mention that the problem goes away briefly after clearing the ARP cache. Do you mean clearing it through netsh command?

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Mike O’Donnell

    Department of Technology

    (614) 645-6353 (voice)

    (614) 645-5444 (fax)



  • 9.  RE: VM Ping/ARP issue

    Posted Oct 19, 2012 02:12 AM

    The problem comes down to not processing gratuitous ARP, at least in our case. The hotfix is supposed to address that issue.

    To clear the arp cache you can use a netsh command though arp -d * is easier. Keep in mind that it will delete static as well as dynamic entries, if you have added any. Do some testing, but that is what we have found so far. We're hoping the hotfix works as adding static arp entries all over the place is not desirable.
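
    For example, from an elevated prompt, either of these will flush the cache:

        arp -d *
        netsh interface ipv4 delete arpcache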



  • 10.  RE: VM Ping/ARP issue

    Posted Oct 19, 2012 02:23 AM

    Thanks for the info. We have an outage window this weekend, I’ll apply it on some of the ones we’ve been having the issue with. I’ll post a message next week with the results.

    If this does fix it, that would be great. Of course we have probably 100+ Windows 2008/2008R2 servers, so it may take a while to apply to them all..

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Mike O’Donnell

    Department of Technology

    (614) 645-6353 (voice)

    (614) 645-5444 (fax)



  • 11.  RE: VM Ping/ARP issue

    Posted Oct 20, 2012 05:15 PM

    Unfortunately that hotfix didn't solve the issue.  I applied it on both the source and target servers that were having the ping/ARP issue, but I still get the same results, "Destination host unreachable" from one direction, but it works from the other.

    It does sound like it's not specific to the server, since if I put both machines on the same host, the ping works OK.  It's got to be something with the networking portion in the vSwitch, the HP chassis Virtual Connect, or the external CISCO switches.

    Some more information:

    It seems like we're only having this issue between machines that are both on the same subnet.  However, it's not ALL machines on that subnet, just some.   I don't know if the issue is limited to just the one subnet, though, since most of our machines are on that subnet.  The ones I'm seeing with issues are production servers, so I can't move them both to a different subnet to see if that fixes it.

    When we've seen this before, we've been able to go ahead and put in a static ARP entry using the ARP -s command.  However, on this one, when I try to do that, I get "The ARP entry addition failed: Access is denied".  I am running this from an Administrative command prompt, and I can add other ARP static entries, pointing to machines on other subnets, but any static I add in the same subnet gives me "access denied". 



  • 12.  RE: VM Ping/ARP issue

    Posted Oct 20, 2012 05:34 PM

    I did see another thread here in the communities forum where there was mention of a similar symptom, but due to a Broadcom driver. I do not have the link handy but the thread talked about VMware having a request open with Broadcom to update the driver. Perhaps an updated driver is available.

    Thank you for the feedback on your experience with the hotfix.

    As to the access denied issue adding static arp entries, use the netsh command instead. That worked for us when arp -s gave us the access denied error when entering certain IP addresses.



  • 13.  RE: VM Ping/ARP issue

    Posted Oct 20, 2012 05:54 PM

    The hosts are not using the Broadcom drivers. They are HP 465G7 blades that have two HP NC551i (Emulex) ports and two Intel 82571 ports. The drivers are current. The NIC, blade and chassis firmware were updated about a month ago and are one version back from current, but the release notes on the current versions don’t reference any fixes that seem to apply to this issue.

    It does seem like it’s in the HP Virtual Connect and/or external Cisco switches. If I eliminate those factors by putting both servers on the same blade host (that puts them in the same vSwitch) they work OK. It’s only when the ping/ARP has to leave the host that the “unreachable” shows up on some.

    I’ve seen some references to a similar issue when you end up with duplicated MAC addresses. I’m not seeing that on those two machines, but could one have a duplicate MAC with some other VM somewhere? Pretty much all of our VM’s are using the automatically generated MAC addresses. With multiple VMWare hosts, do the VMWare hosts communicate with each other to ensure that they don’t assign duplicate MACs to VMs?

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Mike O’Donnell

    Department of Technology

    (614) 645-6353 (voice)

    (614) 645-5444 (fax)



  • 14.  RE: VM Ping/ARP issue

    Posted Oct 20, 2012 06:14 PM

    To my knowledge the hosts do not communicate with each other to identify duplicate MACs when VMs are first created. I've seen duplicate MACs between VMs years ago, but that was back before vCenter even existed. I think the algorithm was changed to reduce the possibility, but it is still technically possible. The only way to know for sure is to dump the MAC address of every VM on every host. I am very much interested in knowing what you find.
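
    If it helps, a rough PowerCLI sketch along these lines should flag any duplicates (it assumes you are already connected to vCenter with Connect-VIServer):

        # List every VM network adapter's MAC and report any MAC that appears more than once
        Get-VM | Get-NetworkAdapter |
          Select-Object @{N='VM';E={$_.Parent.Name}}, MacAddress |
          Group-Object MacAddress |
          Where-Object { $_.Count -gt 1 }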



  • 15.  RE: VM Ping/ARP issue

    Posted Oct 20, 2012 06:34 PM

    I dug a little more into it, and I don’t think it’s a duplicate MAC within VMWare. I did an export from PowerCLI of all the VM MAC addresses and didn’t see any duplication.

    I did do some more checking on the server using the netsh and arp commands. On the machine that gets an “unreachable”, using the netsh command to show the “neighbors” list, it shows the target with a MAC of all zeros and type “unreachable”. As I said earlier, I get “access denied” when I try to create a static ARP entry, but if I use the “add neighbors” command in netsh, it DOES let me add the MAC address of the target machine, and everything works OK.

    If I do a “delete neighbors” or a “delete arpcache” command in NetSH, it removes all the entries EXCEPT the “unreachable” ones, and shows the addresses as zeros again.
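
    For reference, the commands I'm using look roughly like this (the interface name, IP, and MAC here are just examples):

        netsh interface ipv4 show neighbors "Local Area Connection"
        netsh interface ipv4 add neighbors "Local Area Connection" 10.1.1.20 00-50-56-aa-bb-cc
        netsh interface ipv4 delete arpcache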

    Is there a way to remove the “unreachable” entries in Netsh? I’m thinking that the issue might be that somehow certain IP’s have been marked as “unreachable” by the server, and then it won’t remove them, even if they are reachable later.



  • 16.  RE: VM Ping/ARP issue

    Posted Feb 09, 2013 03:26 PM

    Hi Mike

    Having seen this problem several times, I believe it is as simple as the gratuitous ARP not reaching the external switches when a VM is migrated. The vSwitch on the target host involved in the migration is supposed to send a RARP packet to the external switch. Given HP's abysmal record of VC firmware updates, I would bet the farm that the VC layer is dropping the RARP packet. Get VMware support to prove to you, firstly, that the RARP is being sent from the target host to notify the switches. Then get your network guys to see if they are receiving the RARP on the external switch ports. Either ESXi is not sending the "Notify Switches" (RARP) packet at all, or the VC layer is not passing it on to the blade uplinks (external switch ports). We have VC configured for tunneling (trunk mode), so this should not happen; however, we are still using VC-assigned MAC addresses, which have caused several major outages in the past. Every time networking breaks it is a combination of VC, the Emulex NICs, or both.

    The RARP packet theory is supported by the fact that if a VM is pinging a host on a different subnet, or the gateway, the problem never occurs during migration. Let me know what you find.

    Cheers

    Nick



  • 17.  RE: VM Ping/ARP issue

    Posted Mar 31, 2013 06:52 AM

    Just an update. The only way we have been able to "solve" this issue is to configure all hosts to use active/standby uplinks. The VMs then don't lose connectivity when they migrate. Not sure why, but it appears the active/active options no longer work (from ESXi 5 onwards); we have tried both load-based teaming and originating port ID in the active/active configuration, but neither seems to update the switches when a VM migrates. Both of these options worked previously. Something broke this in version 5. Now we only have half the available bandwidth being used by each host.

    Also, surprisingly, in this instance it is not HP Virtual Connect at fault (a first!), as our IBM x3650 hosts are doing exactly the same thing.

    Any feedback on why this is happening would be very much appreciated.



  • 18.  RE: VM Ping/ARP issue

    Posted Mar 31, 2013 08:29 AM

    In our case the issue was finally resolved when HP sent us new NICs of the same model but with a newer revision. They eventually said that there was a hardware problem related to the Qlogic chipset on the HP NC375T adapters. Prior to that we had tried several OS patches related to ARP and different combinations of drivers and firmware as instructed, to no avail. We have not had the problem since, nor have we seen it on the other NIC models we use.

    Regards.



  • 19.  RE: VM Ping/ARP issue

    Posted Apr 01, 2013 12:30 PM

    As for the issue being the NC375T, we're not using the QLogic-based NIC's; our blades have the onboard NC551i Emulex chipset.

    So far it hasn't come back since we updated all the firmware and re-configured Virtual Connect to use tunneling mode.



  • 20.  RE: VM Ping/ARP issue

    Posted Apr 01, 2013 01:22 PM

    Are the IPs of the VMs static or dynamic?



  • 21.  RE: VM Ping/ARP issue

    Posted Apr 01, 2013 01:23 PM

    Static.  We only use dynamic addresses when doing an initial setup.



  • 22.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 04:20 PM

    I may have spoken too soon when I said that we weren't having the issue any more.  Earlier today I found one VM that was unable to ping a different server.  Both are in the same subnet, but on different blades.  I could ping from the target to the source, and then ping from source to target worked OK for a while, then stopped working again.  I vMotioned the target to a blade in a different chassis and everything worked OK.

    I can't find it on anything else, but that doesn't mean it's not happening.  We have something like 300 servers, and most of the communication is NOT server-server , so short of having every server test ping every other server, it's not really going to show up..

    NV1, you mentioned that a workaround was to use active/standby NIC's.  Do you mean having one NIC on the virtual switch as active and the other as standby?  We are doing ours with all the NIC's on the virtual switch as active.  I'd really hate to lose the 10Gb bandwidth by setting one NIC to standby.



  • 23.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 05:53 PM

    Hi Mike

    That is correct. I have had to turn off Active/Active teaming on all dvSwitch port groups, as it is the only way I can "resolve" the problem. It is a major step backwards for the ESXi platform. The config I am now using is:

    All vm traffic port groups use vmnic0 active and vmnic1 standby and

    All vmkernel (management and vMotion) port groups use vmnic1 active and vmnic0 standby. At least this way both 10Gb NICs are being used, though not as efficiently as aggregating 20Gb and letting NIOC do its thing.
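
    For what it's worth, the per-port-group change in PowerCLI looks roughly like this (the port group and dvUplink names are just examples; yours will differ):

        # VM traffic port group: first uplink active, second standby
        Get-VDPortgroup -Name "dvPG-VM-Traffic" |
          Get-VDUplinkTeamingPolicy |
          Set-VDUplinkTeamingPolicy -ActiveUplinkPort "dvUplink1" -StandbyUplinkPort "dvUplink2"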

    This is major but I have yet to find a real "fix" for this in the VMware KB.

    Will also post links to serious bugs in the console network settings check and dvSwitch health check that have cost me days and days of lost time chasing my tail on problems that don't exist because of buggy code.

    Stay tuned and please update me if you make any progress.



  • 24.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 06:13 PM

    Hi Mike

    While I am at it here is another beauty from the big V.

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2037454

    The new distributed switch “Health Check” had me all excited once I had finished the 5.1 upgrade. It identifies misconfigurations between the virtual network ports and the physical network ports, except once again there is a bug that throws a serious error (randomly, on different hosts at different times, it would appear).

    Unfortunately this makes a very good new feature somewhat unreliable. The VLAN and MTU checks appear to work well, however, but I have turned off the other check (the teaming and failover check) until this bug is resolved by VMware.

    Currently neither issue has a patch, which is annoying, especially this one, which has been around since the product was released last year.

    Pass it on. Might save folks some time if they are not already aware of it.

    Also another one

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2047511

    Basically, when the “Test Network Settings” function is run from the ESXi console once, all is good. But if there is actually a misconfiguration (incorrect hostname, DNS or IP settings) and you run it again after fixing the issue, the test never completes properly. That is what happened with the first host I saw the problem on. That led me to retest the other 490c G7s I was building at the same time (the first test had been successful on those). I had already built the G6 490c hosts, and all their tests were only run once (successfully). The chase then started: the more hosts I tested, the bigger the problem appeared to be, throwing me off the real problem (the host or hosts dropping pings and disconnecting from vCenter).

    When I went back today and started testing hosts that had not had the problem previously, they all failed to resolve the ESXi hostname in DNS as described by the article (even though ping and nslookup always do). I still have at least one host that is dropping its connection to vCenter intermittently, but I will go back and confirm that it has a separate problem.

    Unfortunately VMware support were not aware of the test utility fault when I started seeing the problem back in early February so I was chasing my tail trying to solve a connectivity problem when it was just a bug in the “Testing” utility.



  • 25.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 06:30 PM

    Release quality and problem rectification are starting to become a real problem with the vSphere platform. Is it perhaps because we are now dealing with EMC, not VMware? The core technical team that invented "VMware" is now all gone (Dianne, Mendel and Steve), replaced by Cisco and EMC folks. Both are excellent hardware companies, but in my experience often dreadful software companies when they try to be all things to all people (as VMware is also now doing). Sadly, most of the vSphere features that have been released since the folks above left the company were in fact working in alpha code before they left. Unfortunately, the issues we have discussed here are a sign of things to come, i.e. "get the feature to market ASAP".

    I have spent the last 10 years specializing in this platform with great results. At the end of the day the only features that truly matter to my customers are:

    Quality, reliability and performance. They are certainly paying for this!

    Hopefully someone will give this feedback to Pat.



  • 26.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 06:40 PM

    That's what I thought you meant.  We're not using the distributed switch settings; each blade has its own configuration.

    About how many VM's are you running per host?  I realize 10Gb is a large pipe, and most servers don't even push a 1Gb NIC, but we're running about 20 VM's/host.  I'm really uneasy about running all that through a single NIC, even if it is 10Gb.



  • 27.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 06:44 PM

    I was seeing the same issue, but we decided to update our Virtual Connect firmware from version 3.51 to version 3.70 based on the HP recipe (http://vibsdepot.hp.com/hpq/recipes/December2012VMwareRecipe4.0.pdf).  I then cabled up a total of two ports per interconnect bay (X3 and X4 on each of bays 1 and 2) on my c7000 enclosure, each pair going to a single switch, and we set up LACP on them, which allowed the ports in the Shared Uplink Set to go Active/Active.  We are tagging the VLANs on the Ethernet networks in VC, and set my LOMs to Active/Active on the dvSwitches in ESXi 5; we haven't had the issue since.



  • 28.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 07:15 PM

    geeaib824

    This makes sense, as the problem started around the time LACP support was introduced in ESXi. Unfortunately it has broken the simple originating port ID and LBT teaming when the VC modules are not using LACP. We have each module connected to 1 x 10Gb port on the same switch. Redundancy is provided by a standby 1Gb link on each VC module to a separate switch. ESXi has no visibility of this; the failover is handled at the VC layer. VC is on the latest firmware, by the way.

    All ESXi is aware of is the 2 10Gb uplinks to the same switch. A very simple config that has always worked seamlessly in the past. We are also seeing exactly the same symptoms on our rack hosts that don't use VC. I suspect if we turn LACP on then the problem will disappear. We are however deliberately trying to keep things as simple as possible given the Nightmares we have experienced with Virtual Connect over the last 3 years.

    Fundamentally it appears that in introducing LACP to ESXi directly the other options have broken. Ironically we stopped using LACP at the VC modules because it previously broke the ESXi teaming!




  • 29.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 07:20 PM

    It seems like (at least for us), after any kind of major virtual connect change, it's about 5 or 6 months then the issue comes back.

    Our structure has a four-port LACP group going to each interconnect (1 and 2), and then the NIC's in VMware set to active/active.  When this first came up, we thought it was a firmware issue, so we made sure we were up to date on the firmware.  Then when it came back, since we originally had Virtual Connect splitting up the VLAN's and re-combining them, we updated the firmware again and converted the connections to tunneling mode.  That seemed like it took care of the issue, but I found one case today.

    The next thing I might try is what NV1 suggested, only having one NIC active in vmware and the other on standby.    I would hate losing half our bandwidth, but if that's what it takes...



  • 30.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 10:07 PM

    Hi Mike

    I feel your pain. It is very frustrating that problems that are fixed in one version of VC break again in the next version. From what you are saying, LACP at the VC layer does not work for you like it appears to have worked for geeaib824. That does not surprise me. This is why I have spent the last 2 years trying to dumb down VC as much as possible and keep both the VC and dvSwitch uplinks as simple as possible. However, we are now struggling with bugs at both the VC and ESXi networking layers, so it is very difficult to understand where to start.

    One thing we did about a year ago was implement 2 rack hosts for the Management Cluster which at least protects vCenter, AD, DNS etc whenever Virtual Connect falls over. Makes it easier to troubleshoot and get everything back online. In this instance it has proven to me that the problem is at the ESXi or CISCO layer as both the blade hosts and rack hosts are having the same problem.

    A massive investment in Rolls Royce blade infrastructure (6 x C7000 Enclosures) and we have to put in workarounds like the rack hosts to deal with the type of flakey networking issues that I have not seen for over 10 years (before working with HP Blades).

    Now it is even worse, with VMware forgetting about release management, regression testing and after-sales support. Not good. It certainly looks like this problem is a VMware/Cisco problem, however, not an HP one.



  • 31.  RE: VM Ping/ARP issue

    Posted Apr 16, 2013 07:48 PM

    I've been doing some more testing and I might be on to something..

    In our structure, each blade has four NIC's:  the two onboard Emulex ports, and two add-on (Intel) ports on the mezzanine card.

    Each NIC goes to its "own" interconnect (I'm using interconnects 1, 2, 5, and 6).  Our data center has two "core" 6500 switches.  On the back of the chassis, the horizontal interconnects go to opposite switches.  I'm using LACP for the uplink from the interconnect to the 6500, but each LACP group is on its own interconnect; I don't have the LACP groups spanning the interconnects.

    The end result is that in VMWare I have this structure:

    NIC0 - 6500A

    NIC1 - 6500B

    NIC2 - 6500A

    NIC3 - 6500B

    In VMWare, I had the virtual switch set with all four NIC's active, with the load balancing set to "route based on the originating virtual port ID".

    After trying different active/standby configurations, what I have now is:

    Active:

    NIC0 - 6500 A

    NIC1 - 6500 B

    Standby

    NIC2 - 6500 A

    NIC3 - 6500 B
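
    In PowerCLI terms, that change per blade is roughly the following (the host and vSwitch names are just examples; the vmnic numbers follow the layout above):

        # Make NIC0/NIC1 active and NIC2/NIC3 standby on the blade's vSwitch
        $vs = Get-VirtualSwitch -VMHost "blade01.example.local" -Name "vSwitch0"
        Get-NicTeamingPolicy -VirtualSwitch $vs |
          Set-NicTeamingPolicy -MakeNicActive vmnic0,vmnic1 -MakeNicStandby vmnic2,vmnic3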

    Before the configuration change I've been able to find some consistent, repeatable cases of VM's that can't ping certain other ones.

    I've set this configuration on a couple of the blades, and after moving VM's and giving it a day or so, the pings are working OK.

    Could the problem have been that the original configuration had two active NIC's going to the same 6500 switch?

    On another note, I just found out about a week ago that the Cisco 6500 switches are about 3 years behind on their firmware updates.  They're doing the updates this weekend; maybe that will have some effect on it.



  • 32.  RE: VM Ping/ARP issue

    Posted Apr 17, 2013 01:59 AM

    Hi Mike

    I agree there appears to be a problem using the Originating Port ID or LBT teaming when both ports are connected to the same external switch. Our configuration currently has that as we are using one CISCO Nexus for the 10Gb connectivity. Our redundancy is at the Virtual Connect modules where we have standby 1Gb uplinks on each module should the 10Gb switch fail.

    We have the Nexus switches on close to the latest firmware, so I don't think it is a Cisco firmware problem. It is more likely something to do with the vSphere load balancing without LACP on the same Cisco switch.

    I will ask our network guys to move one of the 10Gb uplinks to a different Nexus and then see if this resolves the problem. We have an engineering test enclosure I can do this with.

    Will get back to you with the results.



  • 33.  RE: VM Ping/ARP issue

    Posted Apr 04, 2013 06:53 PM

    Averaging around 15 on our smaller blades (490c) and around 30 on our big BL720c blades (2x 10 core, 256GB). Sometimes this is over 50 per host (during patching and upgrades etc when we do multiple remediations at the same time).

    Only 8 simultaneous vMotions (the default for 10Gb on 5.x) ever seem to push the pipe, but I am still not happy with the active/standby arrangement, as it is a major step backwards in functionality that had been working seamlessly since version 2.5.

    Regards

    Nick



  • 34.  RE: VM Ping/ARP issue

    Posted Nov 03, 2012 07:50 PM

    So, seeing how you used to have this problem and are having it again now, check that there are no static ARP mappings on the switches. Sometimes when people are desperately trying to troubleshoot something, they keep making changes trying to fix it but don't record and roll back the changes that are unsuccessful.

    Do you have any nested ESXi hosts on this subnet? They can be very annoying, as their VM config can report one address while their actual management MAC is different; you need to check the actual ESXi host in vCenter, not its VM.

    Log into each switch and inspect its ARP table. There are 3 scenarios on each:

    1) The ARP entry points in the correct direction

    2) The ARP entry points in the wrong direction suggesting duplicate MAC or incorrect static ARP

    3) There is no entry
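
    On the switches, checking for a given server boils down to something like this (the IP and MAC are examples; older IOS uses "show mac-address-table" instead):

        show ip arp 10.1.1.20
        show mac address-table address 0050.56aa.bbcc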



  • 35.  RE: VM Ping/ARP issue

    Posted Jun 17, 2013 03:22 PM

    Exactly the same problem here: Two VMs on two different ESX hosts connected with Nexus1K and a 2x10GbE portchannel (LACP) to a Nexus 5548.

    I tried a different Nexus 5548; the problem still exists.

    If someone has an idea, I have a small lab to test various scenarios.

    Rene Caspari

    Network Engineering



  • 36.  RE: VM Ping/ARP issue

    Posted Jun 18, 2013 06:17 PM

    We haven't seen this issue for several months.  We've done several updates/reconfigurations so I'm not sure what (if anything) fixed it.

    What we did was:

    -Ensured chassis and individual blades are at the latest firmware

    -Configured Virtual Connect to use VLan tunneling mode

    -Limited number of "active" NIC's on each vswitch.  Set redundant NIC's as "standby"

    -Network group updated firmware on 6500

    Hopefully we'll never see the ARP issue again...



  • 37.  RE: VM Ping/ARP issue

    Posted Jun 19, 2013 11:00 AM

    I found an ESX cluster (one out of six) which doesn't have this problem. So far the only difference is the NICs: the clusters with the bug use Broadcom, this one uses Intel.

    We finally opened a call; let's see if this helps.

    Kind regards,

    Rene



  • 38.  RE: VM Ping/ARP issue

    Posted Nov 19, 2014 05:37 PM

    Hi all, was there ever a firm cause/resolution surrounding this issue?



  • 39.  RE: VM Ping/ARP issue

    Posted Oct 18, 2012 01:05 PM

    We have exactly the same problem as you do, but we are running Debian/Ubuntu on our VM's. I tried migrating to the same host, and there it works with no problems. I found another person that had the same problem but no solution there either: http://communities.vmware.com/thread/345288  So either put them on the same host or add a static ARP entry. Let me know if you find a solution. :)



  • 40.  RE: VM Ping/ARP issue

    Posted Oct 18, 2012 04:23 PM

    If I put both VM’s on the same host, they ping/ARP fine.

    As for putting in a static ARP entry, that's what we've done on some of the others, but another odd twist on this is that when I tried to add it with “arp -s”, I got an “Access Denied”. However, if I added a static entry for a different IP/MAC address, it worked OK.

    I am running it from an admin command prompt, and I’m admin on the server, so I’m not sure what’s causing that. The server will be rebooted this weekend; I’m hoping that may fix it.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Mike O’Donnell

    Department of Technology

    (614) 645-6353 (voice)

    (614) 645-5444 (fax)



  • 41.  RE: VM Ping/ARP issue

    Posted Oct 26, 2012 12:10 AM

    I still haven't found a solution to this. I have done some more testing and research. I tried the Microsoft hotfix related to a "gratuitous arp" issue in Windows 2008. However, that didn't resolve it.

    I ran a script on all of our machines, doing a "netsh interface ip show neighbors" and searching for anything that had an "unreachable" entry.
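
    The script was nothing fancy, roughly along these lines (a simplified sketch; it assumes PowerShell remoting is enabled and servers.txt is the list of machines):

        # Check each server's neighbor (ARP) table for "Unreachable" entries
        Get-Content servers.txt | ForEach-Object {
            $neighbors = Invoke-Command -ComputerName $_ -ScriptBlock { netsh interface ipv4 show neighbors }
            if ($neighbors | Select-String -Pattern 'Unreachable') {
                Write-Output "$_ has unreachable ARP entries"
            }
        }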

    The issue did show up on multiple subnets, but in each source/target pair, the servers were in the same subnet, with no router.

    The ones with "unreachable" all had at least one of the servers in the VMWare environment, passing through the blade chassis Virtual Connect.

    If the two VM's were on the same host, the "unreachable" issue went away. Moving them back to different machines, and the "unreachable" came back.

    There were some repeats in the "targets", but most other servers could ping those target servers OK.

    We are using VLANs, and have Virtual Connect separate the networks before sending them to the blades. I recall seeing something about a Virtual Connect issue where, when the VC environment stripped the VLAN tag off the ARP packet, the resulting packet would be too small and would be dropped by other networking gear. However, I thought that was fixed in a later firmware release. Also, wouldn't it affect all ARP's going through Virtual Connect, and not just some?



  • 42.  RE: VM Ping/ARP issue

    Posted Oct 27, 2012 02:18 AM

    I cannot speak to HP VirtualConnect as we do not use it, but I can confirm that the garp hotfix does not fix our issue either, at least on the first server we put it on. Only some Windows 2008 servers are affected, and they are all plugged into the same pair of switches and firewalls, but across three different subnets and firewall interfaces (So far. Two more occurrences on a third subnet today).

    Our pain stems more from traffic suddenly not routing through the firewall due to missing arp entries there - that are normally learned dynamically - so our temporary patch is to add static arp entries on the ASA firewall which fixes the problem. This is not a viable long term solution for us though.

    There may be more arp entries missing for servers on the same subnet, but we do not have a lot of server to server traffic at layer two so I'm not sure.

    The big question is why this seemingly came up out of nowhere, with no change that we can find. We're still looking.



  • 43.  RE: VM Ping/ARP issue

    Posted Oct 30, 2012 07:51 PM

    I am curious as to whether you have found a solution to this problem. This problem is now seen on a Windows 2008 server on another vDS, and I just opened a support case to help me figure out whether something between the VM and the physical network is an issue. I think it is within the OS itself though as traces today show that the OS is not making ARP broadcasts for anything other than the default gateway.



  • 44.  RE: VM Ping/ARP issue

    Posted Nov 03, 2012 07:04 PM

    I never did find a clear cause or solution.

    However, I just completed a firmware update on the chassis, virtual connect and the blades. As part of the process, I also went and reconfigured Virtual Connect to use VLan tunneling, instead of the Shared Uplink Sets configuration that we had originally set up.

    As I understand it, using the shared uplink set method, Virtual Connect strips off the VLAN tag. Then if you send “multiple networks” to the blade NIC, it basically reassembles the packet with the VLAN tags.

    Using VLan tunneling, it sends all the packets coming into the VC module straight through to the blade and lets the O/S on the blade split out the different VLAN’s.

    I had come across an old posting stating that since the ARP packet was so small to begin with, once Virtual Connect stripped off the VLAN tag, sometimes the packet was too small and got discarded. Supposedly an earlier version of the VC firmware fixed the issue.

    Since we were sending all the VLAN’s to the blade NIC’s anyway, using the “multiple networks” config, it seems more efficient to just let Virtual Connect pass the packets through without splitting/recombining them.

    I’m hoping that will take care of the ARP issue, but since it’s so intermittent, it’s hard to tell. I tested it on two machines that showed the issue a couple of weeks ago (when I posted the original message). The ping/ARP worked OK, but then moving the machines back to some blades that I hadn’t updated showed it still working…

    If it does come back, I’ll open a case with VMWare and HP. Part of the reason for the firmware updates was so that if I did need to open a case, at least we’re running the current releases.



  • 45.  RE: VM Ping/ARP issue

    Posted Nov 04, 2012 09:14 AM

    I came across this article and will give it a try. It sounds like a reasonable explanation. The source of the problem has been difficult to track down as it is sporadic. I also have a case open for this issue.

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2012455



  • 46.  RE: VM Ping/ARP issue

    Posted Nov 04, 2012 08:44 PM

    Thanks for the comment, but I don't believe that's the issue for us. Our blades are using the NC553i, which is the Server Engines chipset, not the 5xx NetXen series.

    Also, the HP advisory is talking about losing network connectivity, but even when we had the ping/ARP issue on a server, it was only between specific ones. Both servers could communicate with other machines with no issues. Also, once the ARP entry was put in, both servers would work fine with each other.



  • 47.  RE: VM Ping/ARP issue

    Posted Nov 11, 2012 01:51 AM

    Hi Guys

    The workaround for this appears to be using "Originating Port" teaming and not "Physical NIC Load". Switch all dvSwitch port groups to "Originating Port" teaming and the problem goes away. Also be aware that the October SPP installs a newer version of the NC553i firmware than the "October HP Recipe".

    The "October2012VMwareRecipe3.0.pdf" lists the correct firmware version as 4.1.450.16; however, the SPP installs 4.1.450.7. You need to ensure the version listed in the recipe is loaded to comply with it.

    You need to change the teaming with either of these versions.
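
    In PowerCLI, switching a dvSwitch port group back to originating port ID teaming looks roughly like this (the port group name is just an example):

        Get-VDPortgroup -Name "dvPG-VM-Traffic" |
          Get-VDUplinkTeamingPolicy |
          Set-VDUplinkTeamingPolicy -LoadBalancingPolicy LoadBalanceSrcId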

    Not a fix but a workaround none the less.



  • 48.  RE: VM Ping/ARP issue

    Posted Nov 11, 2012 02:04 AM

    In my case all port groups are already set to originating port and we have the problem. We are currently going through firmware and driver updates to see if that fixes the issue. So far so good on half the hosts.



  • 49.  RE: VM Ping/ARP issue

    Posted Nov 11, 2012 08:34 PM

    That’s kind of what I’m hoping too: that getting the firmware and drivers all up to the current levels will help, along with switching the Virtual Connect configuration for the VMWare blades to use “VLAN tunneling” instead of the shared uplink set / “multiple networks” server profile NIC config.

    All of the blades now have the current drivers and 4.1.450.7 firmware. One of the two chassis has also been reconfigured to use VLAN tunneling. So far, I haven’t come across the ping/ARP issue. The second chassis has a few non-VMWare blades in it, so I can’t change its Virtual Connect configuration until our maintenance window next weekend.



  • 50.  RE: VM Ping/ARP issue

    Posted Jan 22, 2013 06:16 AM

    MikeOD,

    Have you had this issue at all since the firmware update?  I am currently on a call with HP trying to troubleshoot this same exact issue.  VMware tech pointed at the edge switch, networking team points to chassis.  I am at a loss and have been working this for a couple weeks now.  I just wonder if you have seen this come back up since the updates.  We are currently running OA 3.70, VC 3.51, and ESXi 5.1.

    thanks for your time.



  • 51.  RE: VM Ping/ARP issue

    Posted Nov 12, 2012 04:19 AM

    Our ARP problem did not go away until we back-revved the firmware to 4.1.450.16 as per the October recipe. I have seen this before in VC upgrades: one version breaks teaming or "Smart Link", the next version fixes it, and then they seem to forget and the very next version breaks it again. Enterprise networking? Virtual Connect? What a nightmare. For the other guy who is talking about VLAN mapping vs VLAN tunneling: mapping adds absolutely no value to a VMware solution and is yet another layer of abstraction (it tags and untags every packet with another header). I would also recommend reverting every enclosure to use the factory-assigned MAC addresses, as once again there is no need to move blades between device bays if they are all part of a vSphere cluster. The redundancy is provided by HA.

    I would strongly recommend that anyone who is looking at this and considering an HP VC or FlexFabric solution run a mile. Use standard Cisco or Brocade switch modules instead.



  • 52.  RE: VM Ping/ARP issue

    Posted Nov 11, 2012 08:24 PM

    We’re not using the distributed virtual switches (at least not yet); each blade has its “own” vSwitch. The teaming on those is set to “route based on the originating virtual port ID”.

    The firmware on our NIC’s is the 4.1.450.7 version. About two weeks ago I had updated them using the SPP so that it would bring the other parts (P4xx, BIOS, etc.) up to the current levels. I saw that 4.1.450.16 was out, but the “resolved issues” didn’t seem to address anything that applied to us.



  • 53.  RE: VM Ping/ARP issue

    Posted Jan 09, 2015 05:24 PM

    We were seeing very similar behavior to that of the original poster.  We also have a very similar design: two blade enclosures with Virtual Connect modules and two uplink sets connected to a Cisco 6500 core.  We were seeing one VLAN in particular where the traffic could pass ingress into the environment, but the traffic headed egress would drop between the LOM and the Virtual Connect uplink.  The weird part was other VLANs on the port-channel trunk were not affected, and even traffic on the problematic VLAN would occasionally work both ways.

    After having Cisco review the 6500 configuration, a call to HP support uncovered the problem.  The LOMs for a blade are mapped to particular interconnect bays.  This is important to note.  (This can be seen by looking at the server profile in the Virtual Connect manager.)  When a network is mapped to a LOM, that network needs to be associated with a shared uplink set that is on the same interconnect bay as the LOM, if the uplink is in the same enclosure as the blade.  That way, if the VC module interconnect bay fails, it takes down the uplink and the LOM too.  But in most cases, the blade will have a redundant LOM connected to the VC module's other interconnect bay, which should then map to an uplink on that interconnect bay, or possibly an uplink on an interconnect bay residing on a different Virtual Connect module in a different enclosure (if you have your VC modules stacked like we have).

    Once we sorted out the network -> uplink and network -> LOM mappings, the problem was resolved.  Traffic flowed in both directions for the problematic VLAN (which, by the way, had been added later, after the initial VC config was done).  Why the problem was sporadic and didn't affect all VLANs still blows my mind.



  • 54.  RE: VM Ping/ARP issue

    Posted Feb 17, 2015 11:38 PM

    Like most of you, we were baffled at first: DNS not resolving, can't ping; we finally traced it down to ARP. We are fortunate in that we have two "identical" sites with C7000 chassis, Windows and RHEL on the blades as well as on VMs. We have firewalls disabled and fresh OS installs and still get these issues. This allowed us to test a variety of scenarios, and we discovered issues on both physical blades and VMs. Our latest theory blames bad spanning-tree configs. One of our sites does not have these issues and one does. The site which experiences these issues is configured with:

    switchport

    switchport mode trunk

    switchport trunk native vlan XXX

    switchport trunk allowed vlan AA,BBB-CCC,DDD,EEEE,FFFF,GGGG

    spanning-tree port type edge

    In our other site, the spanning-tree config is a little more in depth:

    spanning-tree port type edge trunk

    spanning-tree guard root

    The "guard root" is just a little extra config, but the spanning-tree port type is huge so that spanning tree realizes that this is a trunk, and not an access port (it knows to expect multiple MACs).

    We are going to add the above configs whenever the network team can schedule the change and I will update!

    Jon



  • 55.  RE: VM Ping/ARP issue

    Posted Feb 27, 2015 07:48 PM

    I have excellent news. For us, updating our Cisco switch configs to include the following command fixed our problem.

    spanning-tree port type edge trunk

    We have not seen any issues on the network since making this change. We have confirmed this across RHEL and 2008R2 as well as VMs and physical servers.

    I know this is too late for some, but I hope this fixes issues for other people. If you're strictly a sys admin and don't know what the above Cisco command means, go back to your network admins and tell them "to fix the spanning tree configs on your switchports".