VMware vSphere

  • 1.  W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 15, 2020 04:16 PM

    Hello VMware-ers,

    I've struggled with this GPU issue for a while. See my setup below, and let me know if anyone has any ideas!

    Problem: Any sort of stress on the GPU crashes the guest, which then restarts. To use the GPU again after a crash (even just to run shell commands against it on the ESXi host), I have to reboot the host. Small memory dumps are uploaded below. With the GPU passed through to the VM, I have to use VNC to log in.

    Thank you for any help/suggestions!

    ESXi Host #1 ver. 6.5

    Dell Precision T5600
    Bios: Latest

    RAM: 24GB
    Multiple SSD/HDDs
    PCIe Slot 1 Nvidia GRID K2 (ECC off, nvidia-smi looks good)

    PCIe Slot 5 Nvidia Quadro K600 (set as primary video card in BIOS)

    Latest matching Nvidia drivers for host (injected as .VIB) and guest

    vSphere Enterprise Plus license

    ESXi Host #2 ver. 6.5

    Intel NUC

    Assorted VMs, including vCenter Server (deployed via GUI installer)

    vSphere Enterprise Plus license

    vCenter Server ver. 7.0

    Server 7 Standard

    Virtual Machine with vGPU assigned (installed on ESXi host #1)

    Windows 10 Enterprise LTSC ver. 1809

    Nvidia GRID vGPU grid_k220q (have tried k200 and k280q)

    Nested Virtualization Enabled

    Latest VM Tools installed (ver. 10272)

    hypervisor.cpuid.v0 = FALSE

    VirusTotal for minidump zip

    Message was edited by: Ryan (added photos)

    Message was edited by: Ryan (updated title)



  • 2.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 15, 2020 05:18 PM

    Hey, hope you are doing fine.
    This might sound silly, but do you have VMware Tools installed and up to date? What do the VM logs say?



  • 3.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 15, 2020 06:10 PM

    Hi nachogonzalez,

    I do have the latest VMware Tools installed. Which log? Are you talking about "Export System Logs..."?



  • 4.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 15, 2020 06:46 PM

    Hey, hope you are doing fine:

    can you please upload the following logs:

    VMkernel and VMKwarning logs --> ESXi Log File Locations

    VM log files: VMware Knowledge Base

    Thanks in advance

    Warm regards



  • 5.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 15, 2020 11:19 PM

    Please see enclosed. I'm going to take a look myself, now that I know what the primary logs are.



  • 6.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 16, 2020 12:50 PM

    Let me know if you need further assistance.



  • 7.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 16, 2020 04:17 PM

    I wasn't able to find much evidence of a crash or a resource limitation. Any tips on how to comb through these logs better?



  • 8.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 16, 2020 05:03 PM

    can you upload them please?



  • 9.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 16, 2020 05:06 PM

    Oh, maybe you can't see them? I uploaded them above. Nonetheless, I re-uploaded them.



  • 10.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 18, 2020 01:42 AM

    From the vmware.log, there are vmx settings that are mutually exclusive (i.e. one setting should not be used with another, either because they contradict each other or because one takes effect and makes the other useless).

    2020-09-15T13:30:10.423Z| vmx| I125: DICT  pciPassthru.use64bitMMIO = "TRUE"

    2020-09-15T13:30:10.423Z| vmx| I125: DICT pciPassthru.64bitMMIOSizeGB = "16"

    2020-09-15T13:30:10.423Z| vmx| I125: DICT             pciHole.start = "2048"

    2020-09-15T13:30:10.423Z| vmx| I125: DICT          pciHole.dynStart = "3072"

    There is no firmware setting in the vmx configuration, so I guess the VM is using virtual BIOS and not virtual UEFI, as I don't see this line in the vmware.log:

    firmware="efi"

    The use64bitMMIO and 64bitMMIOSizeGB settings only take effect if the VM is using EFI for its virtual firmware. You wouldn't use the pciHole settings once the VM is using 64-bit MMIO, as the MMIO address range is already above the 4GB address area. pciHole.start = "2048" means the MMIO address range starts at the 2GB address boundary.
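
    As a rough sketch of the check described above (this is not a VMware tool; the function names are hypothetical, the regular expression assumes the DICT line format shown in the excerpt, and a VM without a firmware entry is assumed to be booting with virtual BIOS), the DICT lines in vmware.log can be scanned for this combination of settings:

```python
import re

# Matches the 'DICT key = "value"' lines that vmware.log prints at power-on.
# The pattern is inferred from the excerpt quoted in this post.
DICT_LINE = re.compile(r'DICT\s+(\S+)\s*=\s*"([^"]*)"')


def parse_dict_lines(log_text):
    """Collect the vmx settings echoed into vmware.log as DICT lines."""
    settings = {}
    for line in log_text.splitlines():
        match = DICT_LINE.search(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_mmio_settings(settings):
    """Return warnings for the two conflicts discussed in this post."""
    warnings = []
    uses_64bit = settings.get("pciPassthru.use64bitMMIO", "").upper() == "TRUE"
    # Assumption: no firmware entry means the VM uses virtual BIOS.
    is_efi = settings.get("firmware", "bios").lower() == "efi"
    hole_keys = sorted(k for k in settings if k.startswith("pciHole."))

    if uses_64bit and not is_efi:
        warnings.append("64-bit MMIO is set but firmware is not efi")
    if uses_64bit and hole_keys:
        warnings.append(
            "pciHole settings used together with 64-bit MMIO: "
            + ", ".join(hole_keys)
        )
    return warnings


# The four lines quoted from the uploaded vmware.log.
sample = '''
2020-09-15T13:30:10.423Z| vmx| I125: DICT  pciPassthru.use64bitMMIO = "TRUE"
2020-09-15T13:30:10.423Z| vmx| I125: DICT pciPassthru.64bitMMIOSizeGB = "16"
2020-09-15T13:30:10.423Z| vmx| I125: DICT             pciHole.start = "2048"
2020-09-15T13:30:10.423Z| vmx| I125: DICT          pciHole.dynStart = "3072"
'''

if __name__ == "__main__":
    for warning in check_mmio_settings(parse_dict_lines(sample)):
        print("WARNING:", warning)
```

    Run against the four lines quoted from the log, this flags both the missing EFI firmware and the pciHole entries. It only reads the log and changes nothing in the .vmx file.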

    Have a read of this KB

    https://kb.vmware.com/s/article/2142307

    and also read this to understand what a "PCI Hole" is

    https://en.wikipedia.org/wiki/PCI_hole

    The GRID K2 does not have display output so I suppose you intend to use this as a compute device (such as for CUDA). It is better to use EFI as virtual firmware for the VM (and along with it the 64-bit MMIO settings).

    Note that you have to reinstall the guest OS from scratch if switching the virtual firmware from BIOS to EFI, as the VM will no longer boot: virtual EFI looks for GPT on the boot disk, while BIOS looks for MBR.



  • 11.  RE: W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

    Posted Sep 18, 2020 01:54 AM

    An alternative to reinstalling from scratch for a Windows 10 VM is to use the MBR2GPT tool, available in version 1703 and newer.

    Convert to GPT first and then change to EFI in the vmx settings.

    I have done a conversion successfully before for a Windows 10 VM in Workstation Pro 15.x.