VMware Workstation


 Windows 10 guest dies because of NMI error with WinDbg connected over network

Pavel_A posted Oct 05, 2024 03:59 PM

Has anyone seen this problem with Workstation 17.6?

There's a Windows 10 x64 VM created with Workstation 17.1

It worked well for months on a Win10 x64 host (Dell Latitude 5440 PC).

Recently we updated the host to Win11 and Workstation to 17.6 (removed 17.2, rebooted, installed 17.6).

Since then, the guest sporadically gets an NMI while the VM is being debugged with WinDbg over Ethernet.

See stack dump below.

Without the debugger connected, the VM seems to run normally.

This did not occur before the update.

Example of stack trace:

 # Child-SP          RetAddr               Call Site
00 fffff803`4a897398 fffff803`47d18d82     nt!DbgBreakPointWithStatus
01 fffff803`4a8973a0 fffff803`47d18366     nt!KiBugCheckDebugBreak+0x12
02 fffff803`4a897400 fffff803`47bfdc57     nt!KeBugCheck2+0x946
03 fffff803`4a897b10 fffff803`47cb9f2a     nt!KeBugCheckEx+0x107
04 fffff803`4a897b50 fffff803`48c015b0     nt!HalBugCheckSystem+0x7a
05 fffff803`4a897b90 fffff803`47dbc58d     PSHED!PshedBugCheckSystem+0x10
06 fffff803`4a897bc0 fffff803`47cbe5e2     nt!WheaReportHwError+0x3dd
07 fffff803`4a897c90 fffff803`47d13af4     nt!HalHandleNMI+0x142
08 fffff803`4a897cc0 fffff803`47c0aec2     nt!KiProcessNMI+0x134
09 fffff803`4a897d10 fffff803`47c0ac52     nt!KxNmiInterrupt+0x82
0a fffff803`4a897e50 fffff803`47bf990f     nt!KiNmiInterrupt+0x212
0b fffff803`4a8746b8 fffff803`47bbcd14     nt!HalProcessorIdle+0xf
0c fffff803`4a8746c0 fffff803`47a4ca96     nt!PpmIdleDefaultExecute+0x14
0d fffff803`4a8746f0 fffff803`47a4b844     nt!PpmIdleExecuteTransition+0x10d6
0e fffff803`4a874af0 fffff803`47c02604     nt!PoIdle+0x374
0f fffff803`4a874c60 00000000`00000000     nt!KiIdleLoop+0x54

These NMIs occur at random moments, whether the guest is running arbitrary programs or just sitting idle (but the WinDbg debugger is active and moving data over the guest's Ethernet connection).

What can cause this, and how can it be resolved?

Broadcom Employee Dhairya Tomar

Please reproduce the issue and share the support bundle of the affected VM using Help->Support->Collect Support Data.

Pavel_A

@Dhairya thank you for the quick response. The support zip has been sent via private message.

I see another problem with the same host and guest combination:

The VM sporadically freezes without any error message, whether or not the Windows debugger is connected. The guest stops responding to keyboard and mouse. Suspending and resuming the VM revives it.

A co-worker uses Workstation 16.2.5 on a similar host (same Dell laptop, Win11 upgraded from Win10) without any problems. I'm going to install this version too.

Regards,

Pavel A.

Broadcom Employee Dhairya Tomar

Thanks for sharing the support bundle. A ticket has been raised internally, and the relevant team will look into it.

Pavel_A

Hello,

The same problem occurs with the latest update, 17.6.1, and a Windows 11 x64 guest.

Am I the only one who has this problem?

Stack trace below.

 # Child-SP          RetAddr               Call Site
00 fffff806`2d89a338 fffff806`2a758982     nt!DbgBreakPointWithStatus
01 fffff806`2d89a340 fffff806`2a758073     nt!KiBugCheckDebugBreak+0x12
02 fffff806`2d89a3a0 fffff806`2a629027     nt!KeBugCheck2+0xba3
03 fffff806`2d89ab10 fffff806`2a6f46e0     nt!KeBugCheckEx+0x107
04 fffff806`2d89ab50 fffff806`281d10c0     nt!HalBugCheckSystem+0x90
05 fffff806`2d89ab90 fffff806`2a7feabf     PSHED!PshedBugCheckSystem+0x10
06 fffff806`2d89abc0 fffff806`2a6f8d9c     nt!WheaReportHwError+0x38f
07 fffff806`2d89ac90 fffff806`2a6b6182     nt!HalHandleNMI+0x14c
08 fffff806`2d89acd0 fffff806`2a636b42     nt!KiProcessNMI+0x175762
09 fffff806`2d89ad10 fffff806`2a6368ae     nt!KxNmiInterrupt+0x82
0a fffff806`2d89ae50 fffff806`2a4d5d14     nt!KiNmiInterrupt+0x26e
0b ffffc907`752447b0 fffff806`2a4d5a6f     nt!ExAllocateHeapPool+0x274
0c ffffc907`752448c0 fffff806`2ac9d68d     nt!ExpAllocatePoolWithTagFromNode+0x5f
0d ffffc907`75244910 fffff806`2a8e27ff     nt!ExAllocatePool2+0xdd
0e ffffc907`752449c0 fffff806`2a9daa01     nt!RtlpNewSecurityObject+0x110f
0f ffffc907`75244cf0 fffff806`2a93689f     nt!SeAssignSecurity+0x71
10 ffffc907`75244d50 fffff806`2a8d5f75     nt!CmpCreateChild+0x3ef
11 ffffc907`75244e80 fffff806`2a8d75d5     nt!CmpDoParseKey+0x1f35
12 ffffc907`752452d0 fffff806`2a8df0a7     nt!CmpParseKey+0x2e5
13 ffffc907`752454c0 fffff806`2a8de4d2     nt!ObpLookupObjectName+0x697
14 ffffc907`75245660 fffff806`2a949671     nt!ObOpenObjectByNameEx+0x1f2
15 ffffc907`75245790 fffff806`2a9491d2     nt!CmCreateKey+0x481
16 ffffc907`75245a10 fffff806`2a63dce5     nt!NtCreateKey+0x52
17 ffffc907`75245a70 00007ffd`19aef1a4     nt!KiSystemServiceCopyEnd+0x25
18 00000040`514fe518 00000000`00000000     0x00007ffd`19aef1a4

Broadcom Employee Prajakta Malla

Hi Paul

Please check if this happens with APIC virtualization disabled. This can be done by adding the following config option to your VM's .vmx file:
monitor_control.disable_apichv="TRUE"
Please share the results and support bundle once your testing is done with this change.
Regards
Prajakta
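
For anyone who wants to script the .vmx change suggested above rather than edit the file by hand, here is a minimal sketch in Python. The helper name `set_vmx_option` is hypothetical and is not part of any VMware tool. It assumes that entries have the form `key = "value"`, one per line, and that key names can be matched case-insensitively. It is probably safest to apply this only while the VM is powered off and after backing up the .vmx file.

```python
def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set exactly once.

    Hypothetical helper: assumes one `key = "value"` entry per line and
    treats key names as case-insensitive (an assumption, not a documented rule).
    """
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name.lower() == key.lower():
            # Replace the first matching entry and drop any later duplicates.
            if not replaced:
                out.append(new_line)
                replaced = True
            continue
        out.append(line)
    if not replaced:
        # Option was not present: append it at the end of the file.
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Illustrative input only; a real .vmx file contains many more entries.
    sample = 'displayName = "Win10 x64"\n'
    print(set_vmx_option(sample, "monitor_control.disable_apichv", "TRUE"), end="")
```

To apply it to a real file, read the .vmx as text, pass the contents through the function, and write the result back.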

Pavel_A

Thank you, Prajakta, for this suggestion. I've made this change in the .vmx and am testing. The problem can take some time to reproduce: sometimes 5 minutes, sometimes several hours.

-- P.

Update: with 'disable_apichv' set, the NMI did not occur during about 5 hours of running.

But another problem persists:

  • The same Win10 x64 VM "freezes" after being inactive for some time.
  • It can be revived by clicking Pause & Play.
  • When this occurs, nothing notable appears in the event log (System, Application) or in the kernel debugger.