VMware vSphere

  • 1.  Load balancing configuration

    Posted Sep 13, 2011 10:36 AM

    Hi all,

    I'm fairly new to VMware, so I'm not sure if I have the right forum.

    I'm having issues with performance on a VMware farm running vSphere 4.1.0. The infrastructure is as follows:

    HP blade enclosure fully populated with 16 blades (hosts). 6 x stacked Cisco 3120 switches with EtherChannels created so that each host ends up with four gigabit physical NICs which, at the switch end, form an EtherChannel (each NIC is a trunk allowing two large VLAN subnets). Links from two of the switches in the stack are 10Gb fibre back to the network core switches. Back-end storage for all hosts is an HP EVA SAN with Fibre Channel disks...

    The vSwitch NIC teaming on each host has always been set to 'Route based on IP hash' load balancing with 'Link status only' network failover detection; all links are active, none set as standby. The problem is that performance on the VMs is very poor. In normal use the performance tab on the host/virtual server shows the network barely ticking over, and even when the server is loaded, or we are trying to back files up from it, the performance of the backup is appalling. Users will inevitably complain, and yet the performance doesn't show the NIC utilisation ramping up as you'd expect. As a test on a normal production server the other day, I ran a backup during the day when the host NIC was showing about 150Mb/sec. When I ran the backup (data being extracted through the LAN onto the backup server; SAN backups are not possible, unfortunately, due to lack of licences), the backup ran appallingly and the host for that server was still showing only 150Mb utilisation on the NIC.
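    In case it helps, this is roughly how I've been sanity-checking the per-NIC throughput from PowerCLI rather than just the performance tab. It's only a rough sketch - the vCenter and host names below are placeholders, not our real ones - but net.usage.average gives one instance per vmnic (plus an aggregate), reported in KBps:

    # Rough PowerCLI sketch - server/host names are placeholders.
    Connect-VIServer -Server vcenter.example.local

    $vmhost = Get-VMHost -Name "blade01.example.local"

    # Most recent real-time samples, one row per vmnic instance
    Get-Stat -Entity $vmhost -Stat "net.usage.average" -Realtime |
        Where-Object { $_.Instance -ne "" } |
        Sort-Object Timestamp -Descending |
        Select-Object -First 20 Timestamp, Instance, Value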

    Some months back a change was made on the VMware farm, to all hosts, to change the load balancing to 'Route based on source MAC hash' with beacon probing failover detection. This change was made by an administrator without any knowledge of the network setup. The hosts were changed gradually over a week and apparently the performance increase was enormous - the host NICs were showing 700Mb/s+ and backups were massively improved. However, when the last few hosts were changed at a weekend, by the Monday there was a massive network performance issue that seemed to be due to the beacon probing sending excessive broadcasts out.

    Having looked at the config I can see how the issues might have happened - each host has 4 NICs, each of which would be sending beacon broadcasts out to a large subnet (a class A!); across 16 hosts that's 64 NICs broadcasting to a lot of potential hosts. Needless to say, the changes were quickly backed out.

    I only started looking at this issue after that, so although I have done a lot of digging, I don't seem to be getting anywhere. Yesterday we discovered that the source MAC hash/beacon probing option wasn't backed out on two of the hosts, and the VMs on these hosts are running far better than on any of the others!
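    For what it's worth, this is the sort of PowerCLI audit I've started using to see exactly which hosts are still on source MAC hash/beacon probing and which are on IP hash/link status. Just a sketch, assuming a PowerCLI session connected to our vCenter (placeholder name below):

    # Sketch only - list load balancing and failover detection per vSwitch, per host.
    Connect-VIServer -Server vcenter.example.local   # placeholder vCenter name

    Get-VMHost | ForEach-Object {
        $esx = $_
        Get-VirtualSwitch -VMHost $esx | ForEach-Object {
            $pol = Get-NicTeamingPolicy -VirtualSwitch $_
            New-Object PSObject -Property @{
                Host              = $esx.Name
                vSwitch           = $_.Name
                LoadBalancing     = $pol.LoadBalancingPolicy
                FailoverDetection = $pol.NetworkFailoverDetectionPolicy
            }
        }
    } | Format-Table Host, vSwitch, LoadBalancing, FailoverDetection -AutoSize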

    I've checked and double-checked the EtherChannel setup on the switches - load balancing is set to src-dst-ip, and the EtherChannels are not trying to use LACP, i.e. they are forced 'on' and not negotiating. I don't feel that the switch end is an issue, as the performance problems are with getting data out of the farm, not sending it in! I've not looked in detail at the actual traffic coming in and out of the switch. Likewise, I don't believe the issues are related to the trunks back to the core network or to the disk infrastructure, as it's been demonstrated that performance can be massively better. It all points to the load balancing settings on the hosts.

    I'm probably going around in circles now and missing something obvious, but my unanswered questions are below...

    1. I've checked numerous documents and they all seem to say that our setup - IP hash load balancing with link status failover detection and EtherChannel switch connections - is correct, in which case why is performance so poor?

    2. I've found a forum entry (thread 120716) that states 'Route based on IP hash' is the only load balancing option that supports EtherChannel. In that case, what exactly is unsupported about using source MAC hash/beacon probing with EtherChannel, and why are the two hosts that are using it performing so much better? Also, why was the performance so much improved when all the hosts were changed, if that configuration (with EtherChannel) isn't supported/advised?

    3. I know there are a lot of NIC settings on the advanced tab for each host - they will all be on whatever the default is. Is this a case of tuning these to improve performance with the IP hash load balancing? (See the sketch after this list.)

    4. Is there a bug in VMware somewhere to do with load balancing? VMware has been upgraded a couple of times from older versions.

    5. When all the servers were changed to source MAC hash/beacon probing, was it an overload situation which caused the network issues, or did one host have an issue and send excessive broadcasts? I suspect we'll not find the answer to this, as we're not even sure exactly what all the host settings were before this change.
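    For question 3, rather than guessing at the defaults, I was thinking of dumping the Net.* advanced settings from one of the 'accidentally fast' hosts and from a normal one, and diffing them - something like the sketch below. The host names are placeholders, and this assumes a PowerCLI build with Get-AdvancedSetting (older builds use Get-VMHostAdvancedConfiguration instead):

    # Sketch only - compare Net.* advanced settings between two hosts.
    $good = Get-VMHost -Name "blade15.example.local"   # placeholder: a host still on src MAC/beacon
    $bad  = Get-VMHost -Name "blade01.example.local"   # placeholder: a host on IP hash/link status

    $goodNet = Get-AdvancedSetting -Entity $good -Name "Net.*" | Select-Object Name, Value
    $badNet  = Get-AdvancedSetting -Entity $bad  -Name "Net.*" | Select-Object Name, Value

    # Anything that differs between the two hosts shows up here.
    Compare-Object $goodNet $badNet -Property Name, Value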

    I realise I've only described the issue and not posted any configs to start with - I'll have to be careful about what I post anyway - but I thought this might give the techies something to think about! I'm also fairly new to VMware and don't have deep knowledge of how the software really works.

    Any ideas anyone? Any advice appreciated as this is a real problem. 

    thanks

    Chris



  • 2.  RE: Load balancing configuration

    Posted Sep 13, 2011 12:21 PM

    Chris,

    I'll give it a crack, so to speak.

    First, I would like to request more background information.

    Your backup server, how is it connected to the network? What disk is it using? Is it a bottleneck?

    More information on the network topology, including the core, and examples of your EtherChannel config - the port channel and the interfaces in the port channel.

    See this KB for more information on etherchannel and VMware.

    Next, I would like to know if you're VLANing your VMware environment, and if you can show us the network's IP address design and config. Use x.x. to make it safe to post.

    Last, are you sure you're not hitting SAN disk I/O limits? Do you monitor LUN I/O, datastore I/O, VM I/O, etc.?

    With answers to some of those questions, I think we can help you narrow this problem down.

    Thanks

    Roger Lund



  • 3.  RE: Load balancing configuration

    Posted Sep 13, 2011 12:41 PM

    I will be out of the office with limited to no access to emails on 9/13/11. I will return any messages upon my return.



  • 4.  RE: Load balancing configuration

    Posted Sep 13, 2011 03:56 PM

    Hi Roger,

    Thanks for the offer. The backup server is an HP DLxx connected to the network with 6 NICs in a 6Gb EtherChannel :) and running HP Data Protector. I recently did this as some of our backups are unfortunately LAN-based, due to someone forgetting to buy VMware SAN licences when the farm was put in! The backup server has fibre connections via the SAN to the tape library and to EVA disks for different types of backups. There are no issues with the backup server - backing up a physical server via the LAN gets very good throughput and barely touches its network capability!

    As I mentioned before, if you run a backup from the guests on the 'incorrectly' configured hosts (which are using source MAC hash/beacon probing), the backups to tape/disk on this backup server run considerably better than from the other VM guests.

    I'll read that link as soon as I've got a few mins, thanks.

    I can't honestly give any disk info from VMware or SAN disk I/O limits - not my area - but the disks are split between 3 EVAs and an XP 12000. I'm not sure if disk access could be an issue, particularly as the hosts/guests running the different load balancing method run considerably better than the others, as mentioned above. There are plenty of servers using the SAN and these arrays, and absolutely no issues anywhere else. I think the team have struggled to get anything useful out of the EVAs, but they said they'd have another go.

    The core switches are Cisco 6509s, connected to the blade enclosure's stacked 3120s with two 10Gb fibre pipes (10Gb fibre interfaces, not EtherChannel). I've just checked the switch log, the 10Gb interface stats, some sample port channel stats and the interfaces within those channels: there are no errors, no dropped packets, nothing obvious. There are no other issues on the network and no performance issues between any other servers. In fact, any performance complaints we get about servers are always about servers which are virtual guests on VMware!

    Sample EtherChannel config from the blade switch:

    interface Port-channel20
    switchport trunk allowed vlan 1x,x2,3x
    switchport mode trunk
    switchport nonegotiate
    spanning-tree portfast trunk

    interface GigabitEthernet2/0/13
    switchport trunk allowed vlan 1x,x2,3x
    switchport mode trunk
    switchport nonegotiate
    channel-group 20 mode on

    The EtherChannel for each host consists of four NICs, so four of the stacked switches are used for this; the others are used for VMware internal comms/management.

    As for the network address design, the hosts/VM guests have addresses in one of two subnets, both of which are allowed through the trunk as shown in the example interface above (sorry, I blanked out the actual VLAN numbers just in case - probably overkill!). Routing between subnets is done on the cores; there are no routing issues on the LAN at present.

    The hosts aren't VLANed any differently to any other server on the network; as mentioned above, they are in one of the same two subnets (there are two subnets only because we're gradually migrating servers to a smaller, neater one - there's no real functional difference between them).

    It would be nice to think that, since we have two 'incorrectly' configured hosts that are running better than everything else, we could just change all the others to match - but not without understanding why!
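    If we ever do understand the why and decide to standardise, I'd expect to script the change rather than click through 16 hosts - roughly something like this PowerCLI sketch. The vSwitch name and the policy values are just placeholders for whichever setting we end up on:

    # Sketch only - align the teaming policy on one named vSwitch across all hosts.
    Get-VMHost | Get-VirtualSwitch -Name "vSwitch0" |
        Get-NicTeamingPolicy |
        Set-NicTeamingPolicy -LoadBalancingPolicy LoadBalanceIP `
                             -NetworkFailoverDetectionPolicy LinkStatus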

    Apologies if I haven't posted as much config as you'd like; I'm trying to be careful and pick out the things that are relevant. Ideas appreciated!

    Chris