VMware Workstation


Sanity Check - VM Freezing, No Obvious Reason (to Me)

  • 1.  Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 13, 2024 05:27 PM
      |   view attached

    I have a situation where one of my VMs keeps freezing at random times for no obvious reason.  I can't touch the VM in Workstation at all, not even to "cut the power," and it vanishes from the network completely.  But the vmware-vmx.exe process is still running (and throws "Access Denied" if I try to kill it).

     

    SETUP:

    I'm running Workstation Pro 17.5.1 build-23298084 on a Windows Server 2019 host, running 2 Windows Server 2019 guests.

    Host and guest OSes are the same version: Windows Server 2019, 64-bit (Build 17763.5458, 10.0.17763).

     

    HOST:

    • Dell PowerEdge R630 (iDRAC reports no hardware faults)
    • 2x 750W power supplies
    • 128 GB RAM (4 x 32 GB, 2 per physical CPU), Multi-bit ECC
    • 2x 8-Core CPUs (Intel Xeon E5-2667v4 @ 3.20 GHz), 16 physical cores / 32 logical cores
    • OS Drive - RAID-1 (2x 465 GB SSD, Adaptive Read Ahead, Write Back)
    • VM Drive - RAID-10 (6x 223 GB SSD, Adaptive Read Ahead, Write Back)

     

    2x GUESTS:

    • 1 processor (8 cores)
    • 32 GB RAM

     

    Now, according to my arguably uneducated understanding of the way virtualization works, I don't *think* I've over-scheduled the resources on my host: I've left 64 GB of RAM on the table (there are virtually no apps running on the metal, just a file share); I've used only half of the logical core count; and the VMs are stored entirely (.vmx and .vmdk files) on the second drive, not the host's OS drive.

     

    Is there a problem with my setup that would cause a VM to simply go AWOL?  Have I failed to understand proper resource scheduling somehow?  Or should I consider that my hardware may simply be failing?

     

    I've attached the log from today; something seems wrong just looking at it, but I'm not versed enough in diagnosing these things to know for sure what.

    Attachment(s)

    log
    vm.log   6 KB 1 version


  • 2.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 13, 2024 09:05 PM

    I've just now completed a full ePSA Hardware Test, and all tests passed.



  • 3.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 14, 2024 12:06 PM

    If you aren't using the virtual CD-ROM, leave it disconnected.

    If you are running both of those guests at the same time, you're using 16 cores, leaving your host OS nothing to spare for its own work.  Try reducing the number of cores per VM.
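
    For reference, both of those settings can also be made directly in each VM's .vmx file while the VM is powered off. A minimal sketch, assuming the virtual CD-ROM is the usual IDE device and using 4 vCPUs as an example value (the device name, e.g. ide1:0 vs. sata0:1, will depend on your configuration):

    numvcpus = "4"
    cpuid.coresPerSocket = "4"
    ide1:0.startConnected = "FALSE"

    The first two lines present the guest with 4 vCPUs as 4 cores in a single socket; the last keeps the CD-ROM device present but disconnected at power-on.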



  • 4.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 14, 2024 06:27 PM

    So from my research, I'm getting absolutely contradictory information on this.

     

    There seem to be two camps:

    1. You base your vCPU count off of the LOGICAL core count, and as long as loads aren't too high, you can even over-schedule.
    2. You base your vCPU count off of the PHYSICAL core count, and don't even THINK about over-scheduling under any circumstance.

     

    What's going on here?



  • 5.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 15, 2024 12:02 AM

    So far, the only significant things I've found that were actually available to change were (1) setting the processor affinity and (2) disabling the virtual CD-ROM in one VM (I had already disabled it in the other VM, the one that had been freezing).

     

    The server BIOS already had node interleaving disabled, Hyper-V / WSL was never enabled, and vmware.log showed CPL0 for the Monitor Mode.

     

    Furthermore, after verifying all of the things suggested above (EXCEPT changing my vCPU counts), I ran a GIMPS/Prime95 torture test inside both VMs simultaneously just to see what would happen to the host CPU usage.  With both VMs pushing 100% CPU usage at the same time, the host was reporting... 53% total utilization.  I kind of don't think I'm over-scheduling resources.  There was absolutely no instability anywhere during the test, either.



  • 6.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 15, 2024 12:29 AM

    Personally, I ignore #2. Intel VT-x has matured to the point that both the chance of VM exits (a VM having to drop back to the hypervisor, similar to the way an ordinary process context-switches with the OS) and the cost of each transition (the number of clock cycles spent) are much reduced. Guest RAM is also now managed by the CPU rather than through VMware software. And if you go back to the mid-2000s, when multicore CPUs were not so common and guest RAM was managed in software, that advice would have been useless anyway: you HAD to run multiple VMs, and sure, it could sometimes be as slow as molasses, but what choice did you have?

    And if Hyper-V is involved, things will be slower and the chances of VM transitions go up, because VMware has to go through the Hyper-V API instead of running its monitor at ring 0.

    It really boils down to the workload of the VM(s) and the host.

    Anyway, the 16-vCPU limit for your R630 is about RAM locality and L1/L2/L3 cache. L1/L2 reside in each core, shared by its 2 hyperthreads; L3 is shared across the cores in one socket. Once a process (whether the host's or a VM's) has to cross over to the other CPU, the L1/L2/L3 cache it built up is no longer available, and it has to go back to RAM, which this time sits on the other CPU.

    For example, you don't want a situation where you assign 24 vCPUs to a database/application-server VM: it retrieves a large amount of data on CPU0, then gets scheduled on CPU1 to do the sorting, and now has to ask CPU0 for data that was already retrieved into CPU0's RAM. CPU0 keeps getting disturbed to serve data from its RAM to CPU1 and cannot continuously do work for other processes (whether on the host or in a VM). This is just a simplistic illustration of the additional overhead once RAM locality and the L1/L2/L3 cache advantages are lost.

    If the two CPUs were 20c/40t each, you would still be fine creating a monster 32-vCPU VM, since RAM locality could still be achieved and there would still be a high chance of retaining the L1/L2/L3 cache.
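
    (To see that layout on your own host, one quick check, assuming Windows PowerShell on the host; note that the cache sizes reported are per-socket totals, not per-core detail:)

    Get-CimInstance Win32_Processor |
        Select-Object DeviceID, Name, NumberOfCores, NumberOfLogicalProcessors, L2CacheSize, L3CacheSize

    On the R630 above this should list two sockets, each with 8 cores and 16 logical processors.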

     



  • 7.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 15, 2024 04:53 AM

    I'll have to continue to monitor carefully, but it appears that setting the processor affinity for each VM made all the issues I was having disappear, including the "untouchable" black screens in Workstation (an issue about which I started a separate thread).  The freezing has stopped.



  • 8.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted 17 days ago

    My answer is a bit late, but I still hope it can help you in the future. Both statements are wrong, or at least incomplete. The goal is to never overburden the host CPU, and to leave enough CPU for the virtualisation tasks and for the host OS tasks that cannot be stopped and always run.

    The approximate amount of CPU power you have: each physical core counts as 100%, and every extra logical core counts as roughly 25%. (An exact fixed number is impossible, since the benefit of the second use of the same physical core depends on the instructions running on both logical cores and fluctuates between about 20% and 30%, but 25% is a very good average.)

    There is a second problem: a VM tries to claim all of its assigned cores whenever it gets a timeslice. So a host with only 2 VMs that each have half of the logical cores assigned will, even at low load, often end up with the VMs fighting over their cores. Over-scheduling works better and better when the host has many cores and many smaller VMs that each get a few cores, because the chance that all of those separate VMs need work done at exactly the same moment is lower than with just 2 VMs. So for cases where you run a lot of small VMs, the first rule makes sense; if you run only a few big VMs, the second starts to make more sense.

    Counting off the physical cores: say you have an 8-physical/16-logical CPU and you create 2 VMs with 4 cores each. Your host still has 16 - 8 = 8 logical cores left, which at 25% each is about 2 physical cores' worth of CPU power for its own tasks, even if everything has moments where it goes full blast.

    I have helped build many Citrix environments where we had to calculate how many machines of certain sizes we could put on ESX hosts, and we even ran load tests with automated workloads to verify the results. I know ESX is not exactly the same as Workstation, because the ESX host OS is very lean and purpose-built to be only a hypervisor, so the CPU needed by the host is a more predictable factor. With Workstation running on a Windows host you also need to factor in the silly resource use that Windows sometimes has (more overhead than with ESX).
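
    For illustration, applying that rough 25% rule to the host in this thread (2 sockets x 8 cores, 32 threads, two 8-vCPU guests); these numbers come from the rule above and are estimates, not measurements:

    Physical cores:               16             -> 16 x 100% = 16 core-equivalents
    Extra logical cores:          32 - 16 = 16   -> 16 x 25%  =  4 core-equivalents
    Approximate total capacity:                     ~20 core-equivalents
    Assigned to the two guests:   2 x 8 vCPUs    =   8
    Left for host OS + overhead:                    ~12 core-equivalents

    By that count the original setup is not over-scheduled, which lines up with the ~53% host utilization seen during the Prime95 test earlier in the thread.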




  • 9.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)
    Best Answer

    Posted Mar 14, 2024 01:47 PM

    Is Hyper-V, or any component that uses Hyper-V (such as VBS or WSL2), enabled on the host? Hyper-V on the host results in the slower Microsoft Hypervisor API being used instead of native ring-0 Intel VT-x calls. Check msinfo32 on the host, or the vmware.log of any VM, and look for "Monitor Mode": the text "ULM" indicates the slower hypervisor path, while "CPL0" is the ring-0 Intel VT-x path.
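
    (One quick way to check from the host, run from the VM's directory; the exact wording of the log line can vary slightly between Workstation builds:)

    Select-String -Path .\vmware.log -Pattern "Monitor Mode"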

    Since the host hardware has two CPUs, "node interleaving" should be set to "disabled" in the host UEFI for better performance. Windows Server 2019 is NUMA-aware, so it is OK to disable node interleaving.

    Next step: set CPU affinity for the 2 VMs. This can be done either from Task Manager (on the vmware-vmx.exe process) or via the vmx configuration, so that each VM keeps RAM locality and retains as much L1/L2/L3 cache as possible. A VM that gets scheduled on one CPU socket and then swapped to the other the next time loses whatever L1/L2/L3 cache its processes had built up, on top of paying for the more expensive cross-CPU RAM access instead of "local" RAM.

    One VM would have the following lines in its vmx configuration, while the other would have the TRUE and FALSE values flipped. So VM1 would always use CPU0 while VM2 uses CPU1. By extension, any VM on this specific host hardware should be configured with 16 vCPUs or fewer.

    Processor0.use = "TRUE"
    Processor1.use = "TRUE"
    Processor2.use = "TRUE"
    Processor3.use = "TRUE"
    Processor4.use = "TRUE"
    Processor5.use = "TRUE"
    Processor6.use = "TRUE"
    Processor7.use = "TRUE"
    Processor8.use = "TRUE"
    Processor9.use = "TRUE"
    Processor10.use = "TRUE"
    Processor11.use = "TRUE"
    Processor12.use = "TRUE"
    Processor13.use = "TRUE"
    Processor14.use = "TRUE"
    Processor15.use = "TRUE"
    Processor16.use = "FALSE"
    Processor17.use = "FALSE"
    Processor18.use = "FALSE"
    Processor19.use = "FALSE"
    Processor20.use = "FALSE"
    Processor21.use = "FALSE"
    Processor22.use = "FALSE"
    Processor23.use = "FALSE"
    Processor24.use = "FALSE"
    Processor25.use = "FALSE"
    Processor26.use = "FALSE"
    Processor27.use = "FALSE"
    Processor28.use = "FALSE"
    Processor29.use = "FALSE"
    Processor30.use = "FALSE"
    Processor31.use = "FALSE"
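
    As an alternative to clicking through Task Manager each time the VMs start, the same split can be applied to the running vmware-vmx.exe processes from an elevated PowerShell prompt. This is only a sketch: it assumes 64-bit PowerShell, that the first process by ID belongs to VM1, and the same CPU numbering as the vmx lines above (logical CPUs 0-15 on socket 0, 16-31 on socket 1); unlike the vmx settings, the masks reset when a VM is powered off and on again.

    # One vmware-vmx.exe process per running VM; check which is which before pinning.
    $vmx = Get-Process -Name vmware-vmx | Sort-Object Id
    # Pin the first VM to logical CPUs 0-15 (socket 0) ...
    $vmx[0].ProcessorAffinity = [IntPtr]0x0000FFFF
    # ... and the second VM to logical CPUs 16-31 (socket 1).
    $vmx[1].ProcessorAffinity = [IntPtr]0xFFFF0000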



  • 10.  RE: Sanity Check - VM Freezing, No Obvious Reason (to Me)

    Posted Mar 29, 2024 10:20 PM

    Just wanted to put one final update here — since I manually set the processor affinity for each virtual machine in their respective .vmx files, neither of my VMs has frozen even once in the last two weeks.

     

    Thank you, again, blue.