VMware vSphere


VM Power on failures after adding PCIe device (GPU passthrough)

  • 1.  VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 15, 2024 07:07 AM

    Hello

    I'm experiencing VM power-on failures on vSphere 7 (and the same on the ESXi host directly) after adding an NVIDIA GPU as a PCIe device.

    - ESXi Task result

    jhyicraft_0-1707979846834.png

    - vSphere Power On Failures

    jhyicraft_1-1707979938264.png

    - ESXi Host Configure

    jhyicraft_2-1707980349084.png

    jhyicraft_3-1707980368991.png

    - VM Configure
    There are also no NVIDIA GRID vGPU profiles (the profile list is empty).

    jhyicraft_4-1707980441128.png

    I also cannot check the HCL for the Supermicro vendor in Lifecycle Manager.
    There is no vendor add-on (or anything else) listed for Supermicro.

    jhyicraft_5-1707980640797.png

     

    •  ESXi
      • VMware ESXi 7.0.3 (VMKernel Release Build 22348816)
      • Chassis
        • Supermicro SYS-221H-TNR
        • CPU Intel Xeon(R) Platinum 8480+ * 2EA
        • Memory 1TB
        • GPU
          • NVIDIA L40 * 2EA
          • Host Driver
            • NVIDIA-GRID-vSphere-7.0-535.129.03-537.70
    • vCenter
      • VMware vCenter Server Appliance 7.0.0.10300
      • VM on ESXi Host

    What should I do to resolve this?

    Thank you.



  • 2.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 15, 2024 03:23 PM

    Both L40s show 0 bytes - is the GPU manager installed correctly?

    Try running "nvidia-smi" on the console to check.
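    On the ESXi host shell (over SSH), the check might look like the following. This is a sketch; the exact VIB name varies by driver version, and the commands only work on an ESXi host with the NVIDIA driver present.

```shell
# List installed VIBs and filter for the NVIDIA host driver / GPU manager
esxcli software vib list | grep -i nvd

# Query the GPUs; both L40s should report their full memory, not 0 bytes
nvidia-smi
```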



  • 3.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 16, 2024 06:57 AM

    Thank you for the check.

    The GPU manager was not installed, so I installed it and rebooted the host.

    • nvidia-smi works fine on the esxi host
    • /etc/init.d/nvdGpuMgmtDaemon status showing
      daemon_nvdGpuMgmtDaemon is running

    Now I can see the vGPU profiles on the L40 GPUs!

    The memory of each L40 GPU also shows 44.9 GB now.

    But the VM still cannot start, neither with Direct I/O nor with a vGPU profile.

     

    ++

    I tried

    • disable all passthrough settings on the GPUs
    • add PCIe device on the guest VM with vGPU profile 'nvidia_l40-48q'

    but I still get the same error and cannot start the guest VM with the GPU.



  • 4.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 16, 2024 08:22 AM

    Have a look at the vmware.log; it might give a clue as to why the power-on is failing.
    Considering that each L40 has 48 GB of VRAM, it might be the MMIO size: 2 x 48 GB = 96 GB. The MMIO size has to be a power of 2, i.e. 32 GB, 64 GB, 128 GB, etc.

    Assuming the VM is already configured for EFI virtual firmware, you could try adding/editing the following lines in the .vmx to increase the MMIO size.

    pciPassthru.use64bitMMIO = "TRUE"
    pciPassthru.64bitMMIOSizeGB = "128"
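    As a quick sanity check on the sizing rule above, the smallest power of two (starting from 32 GB) that covers the combined VRAM of two 48 GB L40s can be computed like this (a sketch of the rule of thumb, not an official VMware formula):

```shell
# Smallest power-of-two MMIO size (in GB) covering the combined VRAM.
# Two L40s at 48 GB each -> total 96 GB -> next power of two is 128 GB.
total=$((48 * 2))   # combined VRAM in GB
size=32             # starting point mentioned above
while [ "$size" -lt "$total" ]; do
  size=$((size * 2))
done
echo "pciPassthru.64bitMMIOSizeGB = \"$size\""
```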

     



  • 5.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 16, 2024 08:34 AM

    Do you want to assign the L40 with passthrough or with vGPU profiles?

    With passthrough you assign the whole L40 to one VM; with profiles, many VMs can share one L40.

    AFAIK you don't need the GPU manager on the host for passthrough.

    If you want vGPU profiles, the GPU manager is required; you also need valid NVIDIA GRID licenses and an NVIDIA license server in CLS or DLS mode.



  • 6.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 19, 2024 12:35 AM

     

    The VM configuration parameters that you mentioned were already set.

    What is the vmware.log, and where can I find it?

    Neither the vSphere client's nor the ESXi host client's event log shows any details for the VM power-on failure.

    (vCenter is provisioned as a VM on one of the ESXi hosts in the cluster)

     

     

    I want to assign the L40 with passthrough first,

    but with passthrough enabled, the VM cannot start (without the GPU manager on the ESXi host).



  • 7.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 19, 2024 03:20 AM

    The vmware.log file is in the VM's directory on the datastore, in the same location where the VM's files are stored.
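    On the ESXi host shell, locating and inspecting it might look like this. This is a sketch: "MyVM" and "datastore1" are placeholders for your VM name and datastore, and the grep pattern is just a starting point for passthrough-related messages.

```shell
# Find the VM's vmware.log under the datastores ("MyVM" is a placeholder)
find /vmfs/volumes -type f -name vmware.log -path "*MyVM*" 2>/dev/null

# Inspect the most recent power-on attempt for PCI/MMIO-related errors
grep -i -E "pci|mmio|fail" /vmfs/volumes/datastore1/MyVM/vmware.log | tail -n 20
```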



  • 8.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 19, 2024 03:12 PM

    Have you tried "Dynamic DirectPath I/O"?

    Is the VM configured for EFI boot?

    Have you reserved all memory for the VM?
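    For reference, "Reserve all guest memory" in the UI corresponds roughly to these .vmx entries (a sketch; the memsize value is an example and must match the VM's configured memory):

```
memsize = "65536"            # VM memory in MB (example value)
sched.mem.min = "65536"      # reserve all of it
sched.mem.pinned = "TRUE"    # equivalent to "Reserve all guest memory" in the UI
```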



  • 9.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 21, 2024 07:27 AM

     

    I already tried Dynamic DirectPath I/O too.

    EFI boot and the memory reservation are also set up.

     

     

    I found that NVIDIA vGPU does not support the L40 on ESXi 7, but ESXi 8 should work with it.
    Supported Products :: NVIDIA Virtual GPU Software Documentation

    But I cannot find any reason why passthrough mode is still not working.

    Now I'm installing the GPU in a Windows bare-metal host (workstation), so I will check the vmware.log later.

     

    Thank you



  • 10.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 21, 2024 09:52 AM

    According to that link, the L40 is supported on both ESXi 7 and ESXi 8.

    I have 20 hosts, each with 3 Tesla GPUs, running on ESXi 7.0U3.



  • 11.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 21, 2024 04:44 PM

    BTW: your vCenter is 7.0a from May 2020, while your host is 7.0U3o from Sep 2023?



  • 12.  RE: VM Power on failures after adding PCIe device (GPU passthrough)

    Posted Feb 22, 2024 01:47 AM

     

    My vCenter is VMware vCenter Server Appliance 7.0.0.10300

    and the ESXi Host is ESXi-7.0U3g-20328353-standard