Hi all,
I have a cluster of about 10 hosts and 150 VMs. There are times when a Linux VM becomes unresponsive. The vCPU usage skyrockets and flat-lines at 100% until the VM is reset. The guest is not pingable, and the console displays nothing and doesn't respond to keyboard input. I've waited to see if the problem goes away on its own (up to 4 hours), but it does not - a hard reset of the VM is the only solution.
It happens to various VMs, but most frequently to those running 32-bit Debian 6.0. The problem is not isolated to one VM, though there is a set of four VMs that experience it more often than others.
Some of the VMs have 8 vCPUs, but some only have 1 or 2. VMware support keeps saying the number of vCPUs should be reduced to 1 with more added as needed, but we've gone through that already - our VMs have the number of vCPUs they need. I looked closely at the graphs of the 8-vCPU guests while the problem was happening. Sometimes 4 vCPUs were pegged and 4 were idle, sometimes 5 were pegged, etc. There doesn't appear to be a pattern.
Our environment consists of Dell PowerEdge servers (1950, R810, R900) connected to an EMC Clariion SAN using 4Gb FC. All hosts are running ESXi, either 4.1 or 5.0. VMs are stored on various LUNs on the SAN using a range of RAID levels (5, 6, 10). At one point I thought SAN utilization might be the source of the problem, so I set up a dedicated RAID-10 RAID group with a single LUN containing only one VM. That VM was still affected. Also, SAN controller utilization is usually around 15% and only peaks at about 30%.
All of the VMs use the vmxnet3 vNIC. On one occasion, after resetting the VM, I looked at the kernel log and it contained some vmxnet3 messages which led me to believe the problem might be caused by http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2005717. I moved some VMs to the host running ESXi 5.0 and the problem still occurred, so that theory was debunked. Also, this problem has happened quite a few times, but those vmxnet3 kernel messages only showed up on one occasion.
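In case anyone wants to run the same check on their guests after a reset, a minimal sketch along these lines (assuming Debian's standard /var/log/kern.log* rotation) would scan the rotated kernel logs for any vmxnet3 messages around a hang:

```python
#!/usr/bin/env python
# Minimal sketch: scan the guest's current and rotated kernel logs for
# vmxnet3 messages, to confirm whether they coincide with any other hang.
# Assumes Debian's default /var/log/kern.log* naming and .gz rotation.
import glob
import gzip
import re

PATTERN = re.compile(r'vmxnet3', re.IGNORECASE)

def open_log(path):
    # Rotated logs (kern.log.2.gz etc.) are gzipped; open them accordingly.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', errors='replace')
    return open(path, 'r', errors='replace')

for path in sorted(glob.glob('/var/log/kern.log*')):
    with open_log(path) as fh:
        for line in fh:
            if PATTERN.search(line):
                print('%s: %s' % (path, line.rstrip()))
```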
All VMs are running VMware tools. Some have the open-vm-tools Debian package, some use the official VMware version. I enabled verbose logging in VMware Tools but that didn't return any helpful info.
There is no gradual increase in CPU, memory, or I/O utilization of any kind that would indicate an over-worked VM. The guest will be humming along just fine and then, out of the blue, the CPU will immediately spike and flatline at 100%. The guests are monitored using Cacti and sar. Neither of these tools reports any utilization increase leading up to the event. Also, there is no pattern regarding when the problem happens. It could be in the middle of the workday when load is high or in the middle of the night when load is practically nil.
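To double-check the "no ramp-up" point from the sar side, something like this sketch could dump the sysstat history through sadf and flag any interval where idle CPU collapses. The semicolon column layout of sadf -d combined with -u is my assumption and may differ between sysstat versions:

```python
#!/usr/bin/env python
# Rough sketch: dump the sysstat history via sadf and flag intervals where
# idle CPU collapses, to confirm there is no gradual ramp-up before a hang.
# Assumes sadf -d (semicolon-separated output) combined with sar's -u flag;
# the exact column order may vary between sysstat versions.
import subprocess

IDLE_THRESHOLD = 5.0  # percent idle below which an interval counts as "pegged"

out = subprocess.check_output(['sadf', '-d', '--', '-u'], universal_newlines=True)
for line in out.splitlines():
    if line.startswith('#') or not line.strip():
        continue
    fields = line.split(';')
    # Assumed layout: hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
    try:
        timestamp, cpu, idle = fields[2], fields[3], float(fields[-1])
    except (IndexError, ValueError):
        continue
    if cpu in ('-1', 'all') and idle < IDLE_THRESHOLD:
        print('%s: only %.1f%% idle' % (timestamp, idle))
```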
I've experienced this problem off and on since I started using VMware several years ago. I'm just bringing it up now because it seems to be happening more often lately.
Since the problem cannot be reproduced on-demand or at least at regular intervals, it's difficult to determine what's causing it.
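Since I can't make it happen on demand, the next best thing would be to get notified the moment it does. Here is a rough polling sketch (pyVmomi here, but any vSphere SDK would do; the vCenter hostname, credentials and the 90%-for-3-samples heuristic are placeholders of mine, not anything VMware prescribes) that watches for VMs pinned near their CPU limit:

```python
#!/usr/bin/env python
# Rough polling sketch (pyVmomi): flag VMs whose CPU usage stays pinned near
# their maximum for several consecutive samples. Hostname, credentials and
# the 90% / 3-sample heuristic are placeholders, not VMware recommendations.
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Depending on the pyVmomi version, a self-signed vCenter certificate may
# require passing an sslContext or disabling certificate validation here.
si = SmartConnect(host='vcenter.example.com', user='monitor', pwd='secret')
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    strikes = {}
    while True:
        for vm in view.view:
            used = vm.summary.quickStats.overallCpuUsage or 0   # MHz
            limit = vm.summary.runtime.maxCpuUsage or 1          # MHz
            if used >= 0.9 * limit:
                strikes[vm.name] = strikes.get(vm.name, 0) + 1
            else:
                strikes[vm.name] = 0
            if strikes[vm.name] == 3:
                print('%s has been pegged for 3 samples - check it now' % vm.name)
        time.sleep(60)
finally:
    Disconnect(si)
```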
Here are the facts:
- It happens on various VMs running different applications; some are affected more than others.
- The guest OS is always Linux. It's happened on Debian 5.0, Debian 6.0, and Gentoo. It may affect Windows guests but we have very few of those.
- It happens on various hosts with differing hardware configurations.
- It happens on ESXi 4.1 and 5.0.
- It happens when the VM is stored on the Clariion SAN, regardless of which RAID level is used (5, 6, 10).
- It happens at random times, regardless of the guest utilization level.
- The guest doesn't "know" what happens. There are no log entries indicating anything before or after the event (except for the one isolated vmxnet3 case I described above). A sampler sketch to close that gap follows this list.
- It has been happening more often lately. It's hard to provide exact numbers, but it seems like it used to happen maybe once every 3 months. Now it happens about every other week.
- There haven't been any significant changes to the hardware or software infrastructure which correlate with this problem.
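Regarding the "guest doesn't know what happened" point: one idea is a tiny in-guest sampler that appends the busiest processes to a local log every 30 seconds and forces it to disk, so whatever was running right before the next hang is preserved. A minimal sketch (the log path and interval are arbitrary choices of mine):

```python
#!/usr/bin/env python
# Minimal in-guest sampler: every 30 seconds, append the busiest processes
# (by CPU ticks accumulated since the previous sample) to a local log and
# force it to disk, so the moments before a hang are preserved.
import glob
import os
import time

LOG = '/var/log/cpu-sampler.log'
INTERVAL = 30

def cpu_ticks():
    """Return {pid: (comm, utime+stime)} read from /proc/<pid>/stat."""
    ticks = {}
    for stat in glob.glob('/proc/[0-9]*/stat'):
        try:
            with open(stat) as fh:
                fields = fh.read().split()
            # Field 2 is comm in parentheses, fields 14/15 are utime/stime.
            # (A comm containing spaces will skew the offsets; good enough
            # for a sketch.)
            ticks[fields[0]] = (fields[1].strip('()'),
                                int(fields[13]) + int(fields[14]))
        except (IOError, IndexError, ValueError):
            continue  # process exited between listing and reading
    return ticks

prev = cpu_ticks()
while True:
    time.sleep(INTERVAL)
    cur = cpu_ticks()
    deltas = [(cur[pid][1] - prev[pid][1], pid, cur[pid][0])
              for pid in cur if pid in prev]
    deltas.sort(reverse=True)
    with open(LOG, 'a') as log:
        log.write('%s top: %s\n' % (
            time.strftime('%Y-%m-%d %H:%M:%S'),
            ', '.join('%s(%s)=%d' % (comm, pid, d) for d, pid, comm in deltas[:5])))
        log.flush()
        os.fsync(log.fileno())
    prev = cur
```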
Now, based on all of that information, logic tells me that something is wrong with VMware. It's the only common factor I can find across all of these situations. I opened a case and have been frustrated with the help VMware has provided.
1. I called while the problem was happening. A tech logged into the host and viewed CPU stats with esxtop. Then he put me on hold. I assumed he was looking something up in the KB or asking an associate for advice. Five minutes later he came back and said Debian 6.0 is only supported on hosts running ESXi 5.0 (it was running on a 4.1 host). He would provide no further support whatsoever. Alright, I understand that - it's not in the support matrix. But spending 10 minutes looking at the problem would have been appreciated.
2. I upgraded one of the hosts to 5.0 and experienced the same problem a few days later. I called support again and mentioned that the VM is now running in a supported configuration. The tech ran esxtop and noted the high vCPU usage. He stated there wasn't any more he could do; it must be a problem with the application running inside the guest. I specifically asked if he could run strace to see what the runaway processes were doing; he said no. I also asked if the VM could be forcibly crashed (coredump) to examine the aftermath. He said no and insisted the problem lies with some application running in the guest. His recommendations were to reduce the number of vCPUs and/or schedule regular reboots of the guests. I had a good chuckle over the latter, then became dismayed when I realized he was being dead serious.
3. I escalated the case to a supervisor, who repeated the technician's statements - there's nothing more VMware can do, something is wrong on our side. I explained that not all of the affected VMs run the same application, and that there is no pattern.
Honestly, that just doesn't make sense to me; the evidence points to a bug or some other problem in VMware.
Anyway, I'm an open-minded person and accept the fact that it may indeed be a problem with one of our internal applications. How can I troubleshoot this further to find the source of this problem?
I do have a 'test subject' available for poking and prodding. While the problem was affecting one of the VMs, I suspended it, copied the entire VM directory to another volume, then resumed the original VM. CPU usage spiked right back up to where it had been, so I reset the guest. But I still have the copy of the suspended VM, which can be resumed at any time to reproduce the problem.
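One avenue I'd like opinions on for that copy: I believe VMware provides a vmss2core utility that can convert a suspend checkpoint (.vmss, plus the .vmem if the memory is in a separate file) into a core file that crash or gdb can open against a matching Debian vmlinux, which would show what the runaway processes were doing without ever resuming the VM. A small wrapper sketch, assuming that positional-argument invocation and the default vmss.core output name (please double-check against the vmss2core version you actually have):

```python
#!/usr/bin/env python
# Sketch: turn the copied VM's suspend state into a core file for offline
# analysis. Assumes vmss2core takes the .vmss checkpoint (and optionally the
# .vmem memory file) as positional arguments and writes vmss.core in the
# current directory; verify against the vmss2core version you actually have.
import glob
import subprocess
import sys

vmdir = sys.argv[1]  # path to the copied VM directory

vmss = sorted(glob.glob('%s/*.vmss' % vmdir))
vmem = sorted(glob.glob('%s/*.vmem' % vmdir))
if not vmss:
    sys.exit('No .vmss checkpoint found in %s' % vmdir)

cmd = ['vmss2core', vmss[0]] + (vmem[:1] if vmem else [])
subprocess.check_call(cmd)
print('Wrote vmss.core - open it with crash/gdb and a matching vmlinux.')
```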
Any help would be greatly appreciated.