Hi all,
I'm fairly new to VMware, so I'm not sure if this is the right forum.
I'm having performance issues on a VMware farm running vSphere 4.1.0. The infrastructure is as follows:
- HP blade enclosure fully populated with 16 blades (hosts).
- 6 x stacked Cisco 3120 switches with EtherChannels configured, so each host ends up with 4 Gigabit physical NICs which at the switch end form an EtherChannel (each NIC is a trunk carrying two large VLAN subnets).
- Links from two of the switches in the stack are 10 Gb fibre back to the network core switches.
- Back-end storage for all hosts is an HP EVA SAN with Fibre Channel disks.
The vSwitch NIC teaming on each host has always been set to 'Route based on ip hash' load balancing with 'Link status only' network failover detection; all links are active, none set as standby. The problem is that performance on the VMs is very poor. In normal use, the performance tab on the host/virtual server shows the network barely ticking over, and even when the server is loaded, or we are trying to back files up from it, backup performance is appalling and users inevitably complain - yet NIC utilisation never ramps up as you'd expect. As a test the other day, I ran a backup on a normal production server during the day, when the host NIC was showing about 150 Mb/s. While the backup ran (data being pulled over the LAN to the backup server; SAN backups aren't possible, unfortunately, due to lack of licences) it performed appallingly, and the host NIC for that server was still showing only about 150 Mb/s.
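One thing worth noting about 'Route based on ip hash': it balances per source/destination IP pair, not per packet. The sketch below illustrates this using the commonly documented formula (XOR of the last octet of source and destination IP, modulo the number of uplinks) - I'm assuming that formula and the IP addresses here are invented for illustration, so treat it as a model rather than the exact vSwitch implementation.

```python
# Sketch of 'Route based on ip hash' uplink selection.
# ASSUMPTION: the hash is (last octet of src IP XOR last octet of
# dst IP) modulo the uplink count, as commonly documented.
# All IP addresses below are made-up examples.

NUM_UPLINKS = 4  # four physical NICs in the team

def ip_hash_uplink(src_ip: str, dst_ip: str, uplinks: int = NUM_UPLINKS) -> int:
    """Pick an uplink index for a given src/dst IP pair."""
    src_last = int(src_ip.rsplit(".", 1)[1])
    dst_last = int(dst_ip.rsplit(".", 1)[1])
    return (src_last ^ dst_last) % uplinks

# A backup job is a single src/dst pair, so every packet hashes to
# the SAME uplink - the flow can never exceed one 1 Gbit NIC.
print(ip_hash_uplink("10.0.0.25", "10.0.0.200"))

# Many different clients talking to one VM do spread across uplinks:
clients = [f"10.0.0.{i}" for i in range(1, 9)]
print({c: ip_hash_uplink(c, "10.0.0.200") for c in clients})
```

If this model is right, it would explain why a single big LAN backup stream stays pinned to one NIC regardless of the teaming, while day-to-day many-client traffic looks "barely ticking over" across four NICs.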
Some months back, a change was made to all hosts on the VMware farm: the load balancing was switched to 'Route based on source MAC hash' with beacon probing failover detection. The change was made by an administrator with no knowledge of the network setup. The hosts were changed gradually over a week, and apparently the performance increase was enormous - host NICs were showing 700 Mb/s+ and backups were massively improved. However, after the last few hosts were changed over a weekend, by the Monday there was a massive network performance problem that appeared to be caused by the beacon probing sending out excessive broadcasts.
Having looked at the config, I can see how the issue might have happened: each host has 4 NICs, each of which would be sending beacon broadcasts out to a large subnet (a class A!). That's 16 hosts x 4 NICs = 64 NICs broadcasting to a lot of potential hosts. Needless to say, the change was quickly backed out.
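The broadcast arithmetic above can be sketched roughly as follows. I'm assuming (hedged - the exact interval varies by ESX version) on the order of one beacon frame per NIC per VLAN per second; the point is the multiplication, since every broadcast is then flooded to every port in the VLAN.

```python
# Back-of-envelope estimate of beacon-probe broadcast volume.
# ASSUMPTION: roughly one beacon frame per NIC per VLAN per second;
# the real interval depends on the ESX version and config.

hosts = 16            # fully populated blade enclosure
nics_per_host = 4     # four teamed Gigabit NICs per host
vlans = 2             # each NIC trunks two VLAN subnets
beacons_per_sec = 1   # assumed rate per NIC per VLAN

total_nics = hosts * nics_per_host
broadcasts_per_sec = total_nics * vlans * beacons_per_sec

print(total_nics)           # 64 beaconing NICs in the enclosure
print(broadcasts_per_sec)   # 128 broadcast frames per second

# Each of those frames is flooded to every port in the VLAN, so on
# a class A subnet with thousands of attached hosts the effective
# frame count on the wire multiplies accordingly.
```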
I only started looking at this issue after that, so although I've done a lot of digging, I don't seem to be getting anywhere. Yesterday we discovered that the source MAC hash/beacon probing option wasn't backed out on two of the hosts, and the VMs on those hosts are running far better than on any of the others!
I've checked and double-checked the EtherChannel setup on the switches: load balancing is set to src-dst-ip, and the EtherChannels are not trying to use LACP, i.e. they are forced 'on', not negotiating. I don't feel the switch end is the issue, as the performance problems are with getting data out of the farm, not sending it in! I've not looked in detail at the actual traffic going in and out of the switches. Likewise, I don't believe the issues are related to the trunks back to the core network or to the disk infrastructure, as it's been demonstrated that performance can be massively better. It all points to the load balancing settings on the hosts.
I'm probably going around in circles now and missing something obvious, but unanswered questions below...
1. I've checked numerous documents and they all seem to say that our setup - ip hash load balancing with link status failover and EtherChannel switch connections - is correct. In which case, why is performance so poor?
2. I've found a forum entry (thread 120716) stating that 'Route based on ip hash' is the only load balancing option that supports EtherChannel. In that case, what exactly is unsupported about using source MAC hash/beacon probing with EtherChannel, and why are the two hosts still using it performing so much better? And why, when all the hosts were changed, was performance so much improved if that configuration (with EtherChannel) isn't supported/advised?
3. I know there are a lot of NIC settings on the Advanced tab for each host - is this a case of tuning those to improve performance with ip hash load balancing? They will all be on whatever the defaults are.
4. Is there a bug somewhere in VMware to do with load balancing? VMware has been upgraded a couple of times from older versions.
5. When all the servers were changed to source MAC hash/beacon probing, was the network issue an overload situation, or did one host misbehave and send excessive broadcasts? I suspect we'll never find the answer, as we're not even sure exactly what all the host settings were before the change.
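On question 2, a rough model of 'Route based on source MAC hash' may explain the behaviour of the two un-reverted hosts. I'm assuming (hedged) the policy selects an uplink from the last byte of the VM's MAC modulo the uplink count; the MAC addresses below are invented examples.

```python
# Sketch of 'Route based on source mac hash' uplink selection.
# ASSUMPTION: uplink index = last byte of the VM's source MAC,
# modulo the number of uplinks. MACs below are made-up examples.

NUM_UPLINKS = 4

def src_mac_uplink(mac: str, uplinks: int = NUM_UPLINKS) -> int:
    """Pick an uplink index from the last byte of a MAC address."""
    return int(mac.split(":")[-1], 16) % uplinks

vms = ["00:50:56:aa:00:01", "00:50:56:aa:00:02",
       "00:50:56:aa:00:03", "00:50:56:aa:00:04"]
print({m: src_mac_uplink(m) for m in vms})

# Each VM sticks to a single NIC, but different VMs land on
# different NICs, so aggregate host throughput can still scale
# across the team - one possible reason the two hosts that kept
# this policy feel faster, even if the combination with a static
# EtherChannel is not the supported configuration.
```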
I realise I've only described the issue and not posted any configs (I'll have to be careful about what I post anyway), but I thought this might give the techies something to think about! I'm also fairly new to VMware and don't have deep knowledge of how the software really works under the hood.
Any ideas anyone? Any advice appreciated as this is a real problem.
thanks
Chris