VMware vSphere

  • 1.  Vsphere 8 - VM hangs on startup with Nvidia T4 passthrough

    Posted Jan 26, 2023 03:58 PM

    Hi guys,

    I've been banging my head against a wall trying to get an NVIDIA Tesla T4 passthrough-enabled VM to boot. I have two ESXi hosts in a vSphere 8.0.0 setup (Enterprise Plus), each with one T4 card. These systems previously each ran a Quadro P620 via passthrough without issues. Moving to the T4 has been nothing but trouble.

    Each ESXi host boots properly and recognizes the card, and I am able to enable passthrough on it in the vSphere UI and add it to a VM configuration. However, when I try to start the VM (on either host), it hangs at 88% and eventually errors out. The vmware.log for the VM shows:

    2023-01-25T19:27:18.723Z In(05) vmx - MX: init lock: rank(PCIPassLCK_0)=0x3e7 lid=26
    2023-01-25T19:30:27.731Z In(05) vmx - AH Failed to find a suitable device for pciPassthru0
    2023-01-25T19:30:27.731Z In(05) vmx - Module 'DevicePowerOn' power on failed.
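
    To put a number on the hang, the gap between the PCIPassLCK_0 lock init line and the power-on failure can be computed from the two timestamps in the log excerpt. This is only a minimal Python sketch; parse_ts and stall_seconds are hypothetical helper names, not part of any VMware tooling.

```python
from datetime import datetime

# Timestamps copied from the vmware.log excerpt in the post.
LOCK_INIT = "2023-01-25T19:27:18.723Z"
POWER_ON_FAILED = "2023-01-25T19:30:27.731Z"


def parse_ts(ts: str) -> datetime:
    """Parse a vmware.log style UTC timestamp with millisecond precision."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def stall_seconds(start: str, end: str) -> float:
    """Return the number of seconds elapsed between two log timestamps."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


if __name__ == "__main__":
    print(f"{stall_seconds(LOCK_INIT, POWER_ON_FAILED):.3f} s")
```

    The result is a little over three minutes, which suggests the power-on waits on the passthrough device for a fixed period and then gives up, rather than failing immediately.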


    Some more things:

    • The VM is set to boot via EFI and boots up fine without the GPU passthrough device added - stock Ubuntu 22.04 install.
    • I've tried both DirectPath IO and Dynamic DirectPath IO to pass the card through; no difference.
    • Embedded virtualization is not enabled in the VM.
    • All VM memory is reserved.
    • I have also tried enabling and disabling the IOMMU in the VM (under CPU).
    • Tried autodetecting the video card, and manually specifying it.
    • Have tried restarting the host after enabling passthrough...have rebooted the host numerous times.

    I've also tried the below config parameters in the .vmx in varying combinations, with no success:

    pciPassthru.use64bitMMIO="TRUE"
    pciPassthru.64bitMMIOSizeGB="32" (as the card has 16 GB of memory)
    pciPassthru0.msiEnabled = "FALSE"
    hypervisor.cpuid.v0 = "FALSE"
    svga.guestBackedPrimaryAware = "FALSE" (seems to like to be set to TRUE by default)
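
    The 32 GB value above appears to follow a rule of thumb often quoted for passthrough of GPUs with large memory: multiply the number of GPUs by their memory size in GB, then round up to the next power of two. Treat that rule as an assumption rather than official sizing guidance. The sketch below only illustrates the arithmetic, and mmio_size_gb is a hypothetical helper name.

```python
def mmio_size_gb(gpu_count: int, gpu_memory_gb: int) -> int:
    """Return the smallest power of two strictly greater than the total GPU memory.

    Assumption: this mirrors a commonly quoted rule of thumb for
    pciPassthru.64bitMMIOSizeGB; it is not taken from the post itself.
    """
    if gpu_count <= 0 or gpu_memory_gb <= 0:
        raise ValueError("gpu_count and gpu_memory_gb must be positive")
    total = gpu_count * gpu_memory_gb
    size = 1
    while size <= total:
        size *= 2
    return size


if __name__ == "__main__":
    # One 16 GB T4 per VM, as in the post.
    print(mmio_size_gb(1, 16))
```

    For a single 16 GB card this gives 32, which matches the value tried in the .vmx.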

    The host systems are each a Supermicro SuperServer 5019D-FN8TP running an up-to-date BIOS (v1.8), and this model is listed as supporting the T4 in Supermicro's Qualified Platform List for GPUs. I do have the GPU plugged into an x16 riser, which adapts it to the x8 PCIe slot on the motherboard, but the T4 spec sheet says it supports PCIe 3.0 x8 and x16, so I didn't think this would be an issue.

    BIOS is as follows:

    Screenshot 2023-01-25 at 3.55.04 PM.jpg

     

    The GPU shows up in the vSphere UI as follows:

    Screenshot 2023-01-25 at 4.07.57 PM.png

     

    Screenshot 2023-01-25 at 4.07.30 PM.png

     

    The GPU shows up fine on the host via esxcli hardware pci list -c 0x300 -m 0xff:

    0000:65:00.0
    Address: 0000:65:00.0
    Segment: 0x0000
    Bus: 0x65
    Slot: 0x00
    Function: 0x0
    Vendor Name: NVIDIA Corporation
    Device Name: TU104GL [Tesla T4]
    Configured Owner: VM Passthru
    Current Owner: VM Passthru
    Vendor ID: 0x10de
    Device ID: 0x1eb8
    SubVendor ID: 0x10de
    SubDevice ID: 0x12a2
    Device Class: 0x0302
    Device Class Name: 3D controller
    Programming Interface: 0x00
    Revision ID: 0xa1
    Interrupt Line: 0x0b
    IRQ: 255
    Interrupt Vector: 0x00
    PCI Pin: 0x00
    Spawned Bus: 0x00
    Flags: 0x3001
    Module ID: 45
    Module Name: pciPassthru
    Chassis: 0
    Physical Slot: 7
    Slot Description: CPU SLOT7 PCI-E 3.0 X8
    Device Layer Bus Address: s00000007.00
    Passthru Capable: true
    Parent Device: PCI 0:100:0:0
    Dependent Device: PCI 0:101:0:0
    Reset Method: Bridge reset
    FPT Sharable: true
    NUMA Node: 0
    Hardware Label:
    Virtual Function:
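
    One thing that can look like a mismatch but is not: the listing gives the address as 0000:65:00.0 in hexadecimal, while the "Dependent Device: PCI 0:101:0:0" line and the pciPassthru0.id entry in the .vmx use bus 101, which looks like the same bus written in decimal. A minimal sketch of the conversion, assuming the usual segment:bus:slot.function layout (sbdf_hex_to_decimal is a hypothetical helper name):

```python
def sbdf_hex_to_decimal(addr: str) -> str:
    """Convert 'SSSS:BB:DD.F' (hex fields) to 'S:B:D:F' with decimal fields."""
    segment, bus, rest = addr.split(":")
    slot, function = rest.split(".")
    fields = (segment, bus, slot, function)
    return ":".join(str(int(field, 16)) for field in fields)


if __name__ == "__main__":
    # Address as reported by esxcli hardware pci list in the post.
    print(sbdf_hex_to_decimal("0000:65:00.0"))
```

    So 0x65 and 101 refer to the same bus, and the host and VM configuration agree on which device is being passed through.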


    Here's the .vmx file for the VM I'm trying to boot:

    .encoding = "UTF-8"
    config.version = "8"
    virtualHW.version = "20"
    nvram = "oc.nvram"
    svga.present = "TRUE"
    vmci0.present = "TRUE"
    hpet0.present = "TRUE"
    floppy0.present = "FALSE"
    numvcpus = "2"
    memSize = "16384"
    firmware = "efi"
    powerType.powerOff = "default"
    powerType.suspend = "default"
    powerType.reset = "default"
    tools.upgrade.policy = "manual"
    sched.cpu.units = "mhz"
    sched.cpu.affinity = "all"
    sched.cpu.latencySensitivity = "normal"
    vm.createDate = "1674612518956071"
    scsi0.virtualDev = "pvscsi"
    scsi0.present = "TRUE"
    sata0.present = "TRUE"
    scsi0:0.deviceType = "scsi-hardDisk"
    scsi0:0.fileName = "oc.vmdk"
    sched.scsi0:0.shares = "normal"
    sched.scsi0:0.throughputCap = "off"
    scsi0:0.present = "TRUE"
    sata0:0.deviceType = "cdrom-image"
    sata0:0.fileName = "/vmfs/volumes/9d696458-538d8b1c/iso/ubuntu-22.04-live-server-amd64.iso"
    sata0:0.present = "TRUE"
    ethernet0.allowGuestConnectionControl = "FALSE"
    ethernet0.virtualDev = "vmxnet3"
    ethernet0.dvs.switchId = "50 11 bd bf 4b da 72 f0-66 52 ed d6 5f 9a a5 b8"
    ethernet0.dvs.portId = "34"
    ethernet0.dvs.portgroupId = "dvportgroup-2041"
    ethernet0.dvs.connectionId = "1114659673"
    ethernet0.shares = "normal"
    ethernet0.addressType = "vpx"
    ethernet0.generatedAddress = "00:50:56:91:f3:77"
    ethernet0.uptCompatibility = "TRUE"
    ethernet0.present = "TRUE"
    displayName = "oc"
    guestOS = "ubuntu-64"
    chipset.motherboardLayout = "acpi"
    toolScripts.afterPowerOn = "TRUE"
    toolScripts.afterResume = "TRUE"
    toolScripts.beforeSuspend = "TRUE"
    toolScripts.beforePowerOff = "TRUE"
    uuid.bios = "42 11 41 c2 e2 4f 33 f8-bb e2 cc ae ec de ef e4"
    vc.uuid = "50 11 cd 21 85 bf 53 07-6b 03 95 46 2f 0d f0 99"
    migrate.hostLog = "oc-22261365.hlog"
    sched.cpu.min = "0"
    sched.cpu.shares = "normal"
    sched.mem.min = "16384"
    sched.mem.minSize = "16384"
    sched.mem.shares = "normal"
    migrate.encryptionMode = "opportunistic"
    ftcpt.ftEncryptionMode = "ftEncryptionOpportunistic"
    scsi0:0.ctkEnabled = "TRUE"
    ctkEnabled = "TRUE"
    sched.mem.pin = "TRUE"
    numa.autosize.cookie = "40012"
    numa.autosize.vcpu.maxPerVirtualNode = "4"
    cpuid.coresPerSocket.cookie = "4"
    sched.swap.derivedName = "/vmfs/volumes/611ffeaf-b4d4b252-6f7b-ac1f6b7d80aa/oc/oc-1416d0e7.vswp"
    pciBridge1.present = "TRUE"
    pciBridge1.virtualDev = "pciRootBridge"
    pciBridge1.functions = "1"
    pciBridge1:0.pxm = "0"
    pciBridge0.present = "TRUE"
    pciBridge0.virtualDev = "pciRootBridge"
    pciBridge0.functions = "1"
    pciBridge0.pxm = "-1"
    scsi0.pciSlotNumber = "32"
    ethernet0.pciSlotNumber = "34"
    sata0.pciSlotNumber = "35"
    scsi0:0.redo = ""
    scsi0.sasWWID = "50 05 05 62 e2 4f 33 f0"
    vmotion.checkpointFBSize = "16777216"
    vmotion.checkpointSVGAPrimarySize = "16777216"
    vmotion.svga.mobMaxSize = "16777216"
    vmotion.svga.graphicsMemoryKB = "16384"
    vmci0.id = "-320933916"
    monitor.phys_bits_used = "45"
    cleanShutdown = "TRUE"
    softPowerOff = "TRUE"
    tools.syncTime = "FALSE"
    guestInfo.detailed.data = "architecture='X86' bitness='64' distroName='Ubuntu 22.04 LTS' distroVersion='22.04' familyName='Linux' kernelVersion='5.15.0-58-generic' prettyName='Ubuntu 22.04
    toolsInstallManager.updateCounter = "1"
    extendedConfigFile = "oc.vmxf"
    sata0:0.startConnected = "FALSE"
    bios.bootDelay = "5000"
    vmx.buildType = "debug"
    svga.autodetect = "TRUE"
    svga.guestBackedPrimaryAware = "TRUE"
    uuid.location = "56 4d f0 8d e1 dc 65 db-8e 50 1a 54 63 4b f8 3e"
    svga.vramSize = "16777216"
    vvtd.enable = "TRUE"
    viv.moid = "f0c3d812-d205-4ee9-a1c6-452994dc9e42:vm-48044:A4Ad6e0tdI/Qwq+qN/eDfKIP6+cMXGD5Y6L6z5MTXBk="
    pciPassthru.use64bitMMIO="TRUE"
    pciPassthru.64bitMMIOSizeGB="32"
    pciPassthru0.id = "00000:101:00.0"
    pciPassthru0.deviceId = "0x1eb8"
    pciPassthru0.vendorId = "0x10de"
    pciPassthru0.systemId = "5c7944bd-360d-25c6-d570-ac1f6b7d80aa"
    pciPassthru0.present = "TRUE"
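
    For anyone comparing their own configuration against this one, the settings described in the post (full memory reservation, EFI firmware, 64-bit MMIO, passthrough device present) can be checked mechanically. This is a rough sketch, not a VMware tool: parse_vmx and passthrough_checks are hypothetical names, the sample lines are copied from the .vmx above, and the check list reflects only what the post says was configured.

```python
import re

# key = "value" pairs; whitespace around '=' is optional in the file above.
VMX_LINE = re.compile(r'^\s*([^=\s]+)\s*=\s*"(.*)"\s*$')

SAMPLE = '''
memSize = "16384"
firmware = "efi"
sched.mem.min = "16384"
sched.mem.pin = "TRUE"
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32"
pciPassthru0.present = "TRUE"
'''


def parse_vmx(text: str) -> dict:
    """Parse .vmx text into a dict, skipping lines that are not key = "value"."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:  # truncated lines without a closing quote are ignored
            settings[match.group(1)] = match.group(2)
    return settings


def is_true(settings: dict, key: str) -> bool:
    return settings.get(key, "").upper() == "TRUE"


def passthrough_checks(settings: dict) -> dict:
    """Report whether the passthrough-related settings from the post are in place."""
    return {
        "memory fully reserved": settings.get("sched.mem.min") == settings.get("memSize"),
        "memory pinned": is_true(settings, "sched.mem.pin"),
        "EFI firmware": settings.get("firmware") == "efi",
        "64-bit MMIO enabled": is_true(settings, "pciPassthru.use64bitMMIO"),
        "passthrough device present": is_true(settings, "pciPassthru0.present"),
    }


if __name__ == "__main__":
    for name, ok in passthrough_checks(parse_vmx(SAMPLE)).items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

    Every check passes for the file above, which is consistent with the failure coming from the host or device side rather than from a missing VM setting.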


    Items like svga.vramSize, vmotion.*, and svga.present were added automatically by VMware. If I change from DirectPath to Dynamic DirectPath, the pciPassthru0 items become:

    pciPassthru0.allowedDevices = "0x10de:0x1eb8"
    pciPassthru0.present = "TRUE"


    Thank you for any help on this matter! Would love to get these cards working over the Quadros. 



  • 2.  RE: Vsphere 8 - VM hangs on startup with Nvidia T4 passthrough

    Posted Jan 26, 2023 09:11 PM

    Which VMware Tools version is running? I was not able to find it in the .vmx file.

    I am not sure that is where the problem lies, though. Ubuntu is a bit complicated for me.



  • 3.  RE: Vsphere 8 - VM hangs on startup with Nvidia T4 passthrough

    Posted Oct 04, 2023 01:00 PM