VMware vSphere

  • 1.  BSOD Windows Server 2022 NVidia vgpu

    Posted Feb 01, 2024 06:49 AM

    Hello Everyone,

    We have had a major issue since migrating from build 21930508 (7.0.3n) to 22380479 (8.0.2), on both vCenter and the ESXi hosts.

    We are using Dell PowerEdge R750 ESXi hosts, each with one NVIDIA A16 GPU. Every host runs 6 Citrix VDAs (Windows Server 2022 with the NVIDIA vGPU profile a16-8a). Each Citrix VDA is sized for 14 users.

    When we reach around 12 users on the Citrix VDAs, they crash with a bluescreen. The VMware log file shows this:

    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: NVOS status 0x59
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: Assertion Failed at 0xf0176a3b:143
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: 8 frames returned by backtrace
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv009081vgpu+0x35) [0x14f01774f5]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv004971vgpu+0x1ad) [0x14f0153d0d]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv011570vgpu+0x1a2b) [0x14f0176a3b]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv007364vgpu+0x3fe) [0x14f0212c1e]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x6876d) [0x14f011b76d]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /bin/vmx(+0x95d3f9) [0x14aaeaa3f9]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /lib64/libpthread.so.0(+0x7a82) [0x14ed0d7a82]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /lib64/libc.so.6(clone+0x3f) [0x14ed1e2eef]
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0): VGPU message 19 failed, result code: 0x59
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0): 0x0, 0xc1d0096e, 0xff020000, 0xff030000, 0x2080,
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0): 0x0, 0x0
    2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0):
    2024-01-09T08:08:25.438Z Er(02) vthread-2265659 - vmiop_log: NVOS status 0x59
    2024-01-09T08:08:25.438Z Er(02) vthread-2265659 - vmiop_log: Assertion Failed at 0xf0176a3b:143
    2024-01-09T08:08:25.438Z Er(02) vthread-2265659 - vmiop_log: 8 frames returned by backtrace

     

    Because of these logs, I opened a case with NVIDIA. They say it is not NVIDIA's fault but VMware's, because of the following change in build 22388125 (8.0.1c) or 22348816 (7.0.3o):

     

    Certain applications might take too many ESXi file handles and cause performance aggravation

    In very rare cases, applications such as NVIDIA virtual GPU (vGPU) might consume so many file handles that ESXi fails to process other services or VMs. As a result, you might see GPU on some nodes to disappear, or report zero GPU memory, or performance degradation.

    This issue is resolved in this release. The fix reduces the number of file handles a vGPU VM can consume.

     


    NVIDIA Support says VMware has to fix this, or I have to go back to 7.0.2. The file handle maximum was introduced in one of the 7.0.3 versions at 16k; now it is 2k. Only 7.0.2 has no such limit.

    For now I have uninstalled the NVIDIA driver in the Citrix VDAs, and we are running stable.

     

    Does anyone know if VMware is working on this?

    Or is there a way to disable this file handle maximum?

    I hope someone can help. Many thanks.

     



  • 2.  RE: BSOD Windows Server 2022 NVidia vgpu

    Posted Feb 16, 2024 03:01 PM

    Our customer has contacted NVIDIA too, and there is now a KB article from NVIDIA:

    vGPU: File handle leak on VMware ESXi host causes blue screen crash (nvidia.com)

    He has now opened a case with VMware...