CASE REPORT
We recently encountered this problem while trying to deal with another issue: ZFS data corruption under Ubuntu 22 that occurs when, and only when, running under VMWare Workstation 17.0. In hopes of a fix, we upgraded to 17.5.0.
IMMEDIATELY, all guest VMs began experiencing lag of up to several seconds per keystroke, even when completely idle. System Monitor, top, and iotop indicated no unusual load.
INITIAL CONDITIONS
HOST
The host system is an Intel Core i7-8700K (0x6:0xE//0x6:0x9E:10) with 32 GB of RAM and 16 GB of SSD-backed swap (which is not normally used). The graphics hardware is an NVidia GeForce RTX 2060 with 12 GB of VRAM. Primary storage is a WD Black NVMe SSD, secondary storage is a Samsung Pro SATA SSD, and a large rotary HDD serves as tertiary backup storage.
The host system runs Rocky Linux 9. It uses X11, not Wayland, with the NVidia driver, presently UMD version NVIDIA 535.104.05. The display is a 4K, 60 Hz LG monitor connected via DisplayPort.
The host system has been THOROUGHLY exercised with benchmark and stress tests, all of which it passes. Thermal monitoring is conducted via OSPower (Konkor), Vitals, and the NVidia control panel. All readings appear nominal. Deliberately throttling CPU core count and/or speed does not appear to affect either issue.
GUESTS
The guest VMs are running Ubuntu 22. Only one is run at a time. The resources allocated are:
CPU topology: 2 packages of 3 cores each.
VT-d, IOMMU, and CPU performance counter passthroughs are all DISABLED.
Memory: 12-16 GB.
Video: 512 MB VRAM.
Windowing system: X11.
Resolution: Variable; 4K nominal.
Storage: 250 GB, multi-file, dynamically sized.
Host storage backing: SSD or NVMe; ZFS or ext4 filesystem.
Scope of duties: Very large software and firmware image builds (OpenWRT, Yocto, Buildroot, rebuilds of GCC compiler suite).
Notes:
VMs are running primarily from SSD storage, though we moved them to NVMe storage while investigating the Ubuntu 22 ZFS data corruption issue and observed no change in either that issue or this keyboard delay issue.
EXPERIMENTAL FINDINGS
1. Adding:
keyboard.allowBothIRQs = FALSE
keyboard.vusb.enable = TRUE
produced no apparent change.
2. Disabling 3D acceleration in the guest, WITHOUT the keyboard option changes, SIGNIFICANTLY IMPROVED keyboard latency from 1000-3500 ms to perhaps 30-70 ms.
3. With 3D acceleration disabled, additionally applying the option:
keyboard.allowBothIRQs = FALSE
FURTHER IMPROVED keyboard latency from perhaps 30-70 ms to 20-30 ms.
4. With both of the above improvements, additionally applying the option:
keyboard.vusb.enable = TRUE
FURTHER IMPROVED keyboard latency from 20-30 ms to effectively imperceptible.
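For reference, the three changes above can be consolidated into a single .vmx fragment. This is a sketch, not an official recipe: the keyboard.* lines are exactly the options tested above, while mks.enable3d is our assumption of the configuration-file equivalent of unchecking "Accelerate 3D graphics" in the VM settings. We applied the options unquoted; quoted values are the usual .vmx convention.

```
mks.enable3d = "FALSE"
keyboard.allowBothIRQs = "FALSE"
keyboard.vusb.enable = "TRUE"
```

Edit the .vmx while the VM is powered off, then power it back on for the settings to take effect; VMWare may rewrite or reorder the file on its own.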
OTHER FINDINGS OF INTEREST
1. Even when keyboard latency was at its worst, rolling the mouse wheel one click in Firefox produced an instant response. This is significant, as comparing these latencies is generally a very good test for determining overall X11 input latency.
2. Our ZFS data corruption issues, while still not fully resolved (we are falling back to ext4), were SIGNIFICANTLY EXACERBATED when the guests were using 3D acceleration. We also experienced frequent hangs and crashes when using Ubuntu and Rocky Linux 9 guests for Yocto builds while 3D acceleration was on. We have tried multiple versions and multiple packagings (official, distro, community, etc.) of the NVidia drivers, and observed no effect on the guests. Graphics on the HOST, even accelerated graphics, are very stable: a 3D game installed via Steam played successfully at punishing graphics settings while a compile ran in the background. We therefore believe the host graphics to be stable and sound.
OUR CONCLUSIONS
1. VMWare's current testing and Quality Assurance (QA) processes for Workstation are clearly and demonstrably inadequate. This thread alone provides strong evidence of serious deficiencies, on common host configurations, that were immediately obvious as soon as the guests were powered on.
2. The evidence strongly suggests that VMWare is treating testing results from ESXi or other products as adequate stand-ins for equally comprehensive testing of VMWare Workstation, which they CLEARLY are not. This is, frankly, starkly obvious: the configurations people are reporting problems on are very basic, very common configurations for Workstation, but ones which would not be common for other products. The only reasonable explanation for issues this severe sailing through testing is that at least some changes ARE NOT being adequately tested on Workstation. We find it unlikely that VMWare would skip testing entirely, so the reasonable middle-ground conclusion is that, during what is absolutely an economically challenging time in the market, VMWare has reduced co-testing of at least some critical changes across its product base. WHILE UNDERSTANDABLE GIVEN THE EVENTS OF RECENT YEARS, THIS STRATEGY IS *CLEARLY* NOT WORKING, AND NEEDS TO BE RECTIFIED. UNRESOLVED, THIS PRACTICE *WILL* PROVE MORTALLY DANGEROUS TO WORKSTATION AS A PRODUCT.
IN CLOSING
We rely on VMWare Workstation to provide permanently reproducible, sterile build environments for embedded systems customers. Our ability to do this efficiently and reliably directly affects both the security of these products and the environments into which they are placed. We take our clients' security and quality extremely seriously, and will push back hard when our vendors compromise our ability to protect their interests. Likewise, our reputation is extremely important to us, and we insist that products which act as an interface between us and our clients meet high standards.
We already know your people are of exceptional quality. But going forward, we must respectfully demand that we see that more clearly reflected in the products you ship.
Sincerely,
Matt Heck
Principal Firmware Engineer
USA Firmware LLC
/
President
Hard Problems Group, LLC
/
US INFRAGARD Southern Nevada