We use VMware ESXi to virtualize nested ESXi and Windows Hyper-V machines for training scenarios. We have dozens (sometimes hundreds) of these VMs in use across a large farm of ESXi hosts at any given time. Students can suspend their VMs at any time and return to them later. This worked flawlessly on ESXi 5.1; our troubles began when we upgraded to ESXi 5.5. After the upgrade, the Windows Hyper-V guests began crashing with a blue screen at some point during the suspend/resume cycle. We reverted two of our boxes to 5.1 and the problem stopped. We may be forced to revert all servers to 5.1, but we are really hoping to find a fix or workaround. We are opening a case with VMware, but thought we’d ask here as well.
Observations
- We are running the latest version of ESXi 5.5, build 1892794.
- Guest BSODs have occurred on every ESXi host we have. We run two very similar server hardware configurations (details below), and the BSODs occur with equal frequency on both.
- BSODs happen in both Windows 2012 R2 and Windows 2008 R2 VMs.
- EDIT: BSODs occur if both “hypervisor.cpuid.v0 = FALSE” is configured and the Hyper-V role is installed. No BSODs occur if Hyper-V is removed or if “hypervisor.cpuid.v0 = FALSE” is removed.
- The “hypervisor.cpuid.v0 = FALSE” setting makes Hyper-V believe it is running on physical hardware. Without it, Windows knows it is running in a VM and the BSODs go away, but Hyper-V then refuses to start nested VMs. We wish we didn't need the setting, but, alas, we do.
- We've never seen a BSOD that wasn't triggered by a suspend/resume.
- Not every suspend/resume causes a BSOD. During tests, we see a BSOD about 20% of the time that a suspend/resume is performed.
- This occurs on ESXi hosts whether they are busy or not. We took a server out of rotation for isolated testing and saw the exact same behavior.
- There are various BSOD messages. The most common is “IRQL_GT_ZERO_AT_SYSTEM_SERVICE” with a stop code of 0x0000004A. But there are several others.
- We thought this might have something to do with the Intel E5 processor bug (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2073791). However, we are using E5 V1 processors, not V2 processors. We updated to the latest Dell BIOS (2.2.3) just in case. It made no difference.
- We tried disabling all power management settings in the server BIOS, as well as within ESXi. This included setting everything to maximum performance and disabling C-states. No difference.
- We disabled every feature available in the BIOS of the VMs, including all the caching options. No difference.
- We use the following settings to enable nested virtualization:
  - CPU/MMU virtualization set to “Hardware”
  - vhv.enable = TRUE
  - hypervisor.cpuid.v0 = FALSE
- We tried using “windowsHyperVGuest” as the guest OS identifier instead of the above settings. Nested virtualization worked fine, but the BSODs still occurred at the same rate.
- EDIT: We tried upgrading the VMs from hardware version 9 to hardware version 10. This didn't help.
- EDIT: We tried upgrading VMware Tools from version 9.0.0.782409 to 9.4.6.1770165. This didn't help.
- We tried enabling CPU performance counter virtualization in the VMs. No help.
- We reverted two of our servers to 5.1 and the BSODs went away completely. No other changes were made to the servers or the VMs. Just reverting to 5.1 fixed the problem.
- We noticed a dramatic difference in the amount of time it takes the two versions of ESXi to suspend these VMs. ESXi 5.1 suspends these VMs in 2-3 seconds, while ESXi 5.5 takes 30 seconds to a minute. Something very different is definitely occurring during the suspend process. Suspend times in 5.5 are longer whether or not a BSOD occurs.
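For anyone trying to reproduce this, the nested-virtualization settings listed above live in each VM's .vmx file. Here is a minimal sketch; the monitor.* keys are, to the best of our knowledge, what the “Hardware CPU/MMU” selection in the client writes, so verify against your own .vmx before copying:

```
# Force hardware-assisted CPU and MMU virtualization ("Hardware CPU/MMU")
monitor.virtual_exec = "hardware"
monitor.virtual_mmu = "hardware"

# Expose Intel VT-x/EPT to the guest so Hyper-V can start nested VMs
vhv.enable = "TRUE"

# Hide the hypervisor CPUID leaf so Hyper-V thinks it is on bare metal;
# BSODs only occur when this is set AND the Hyper-V role is installed
hypervisor.cpuid.v0 = "FALSE"

# Alternative we tested in place of the two settings above (BSOD rate unchanged):
# guestOS = "windowsHyperVGuest"
```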
Our Hardware
- Dell PowerEdge R720xd OR Dell PowerEdge R620
- BIOS Version: 2.2.3
- CPUs: 2 - Intel Xeon E5-2670 0 @ 2.60GHz
- RAM: 384GB (24 Matched Dell 16GB DDR3 Synchronous Registered (Buffered) DIMMS)
- Controller: PERC H710P Mini 1GB NVRAM
- OS Drive: 2 - 240GB S3500 SSD drives in a RAID 1 Mirror (Slot 00 – 01)
- VM Storage Drive: 7 - 480GB S3700 SSDs in a RAID5 with a dedicated hot spare (Slot 02 - 09)
- 1 - Intel 2P X540/2P I350 rNDC
- 1 - Intel Gigabit 4P I350-t Adapter
The obvious fix is to roll back to 5.1. However, we have completely unrelated issues with 5.1 (a topic for a different post) that we don’t have with 5.5, and we would rather stay on 5.5. But these BSODs are a deal-breaker. Any help or direction in resolving them will be greatly appreciated!