Hello everyone,
We have had a big issue since migrating our vCenter and ESXi hosts from build 21930508 (7.0.3n) to build 22380479 (8.0.2).
We are using Dell PowerEdge R750 ESXi hosts, each with one NVIDIA A16 GPU. Each host runs 6 Citrix VDAs (Windows Server 2022) with the NVIDIA profile a16-8a. Each Citrix VDA is sized for 14 users.
When a Citrix VDA reaches around 12 users, it crashes with a bluescreen. The VMware log file shows this:
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: NVOS status 0x59
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: Assertion Failed at 0xf0176a3b:143
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: 8 frames returned by backtrace
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv009081vgpu+0x35) [0x14f01774f5]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv004971vgpu+0x1ad) [0x14f0153d0d]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv011570vgpu+0x1a2b) [0x14f0176a3b]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv007364vgpu+0x3fe) [0x14f0212c1e]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x6876d) [0x14f011b76d]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /bin/vmx(+0x95d3f9) [0x14aaeaa3f9]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /lib64/libpthread.so.0(+0x7a82) [0x14ed0d7a82]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: /lib64/libc.so.6(clone+0x3f) [0x14ed1e2eef]
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0): VGPU message 19 failed, result code: 0x59
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0): 0x0, 0xc1d0096e, 0xff020000, 0xff030000, 0x2080,
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0): 0x0, 0x0
2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: (0x0):
2024-01-09T08:08:25.438Z Er(02) vthread-2265659 - vmiop_log: NVOS status 0x59
2024-01-09T08:08:25.438Z Er(02) vthread-2265659 - vmiop_log: Assertion Failed at 0xf0176a3b:143
2024-01-09T08:08:25.438Z Er(02) vthread-2265659 - vmiop_log: 8 frames returned by backtrace
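If anyone wants to check whether their own VDAs hit the same errors, here is a minimal sketch that counts the NVOS status codes and assertion failures in a VM's vmware.log. It assumes the lines have the same layout as the excerpt above; the function name `summarize_vmiop_errors` is only something I made up for illustration, not part of any VMware or NVIDIA tool.

```python
import re
from collections import Counter

# Matches vmware.log error lines in the layout shown above, e.g.
# 2024-01-09T08:08:25.384Z Er(02) vthread-2265659 - vmiop_log: NVOS status 0x59
# Assumption: other builds may format these lines differently.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+Er\(\d+\)\s+(?P<thread>\S+)\s+-\s+vmiop_log:\s*(?P<msg>.*)$"
)
STATUS_RE = re.compile(r"NVOS status (0x[0-9a-fA-F]+)")
ASSERT_PREFIX = "Assertion Failed at "


def summarize_vmiop_errors(lines):
    """Count NVOS status codes and assertion locations in vmiop_log error lines.

    `lines` is any iterable of strings (an open file works too).
    Returns two Counters: (statuses, assertions).
    """
    statuses = Counter()
    assertions = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a vmiop_log error line
        msg = match.group("msg")
        status = STATUS_RE.search(msg)
        if status:
            statuses[status.group(1)] += 1
        elif msg.startswith(ASSERT_PREFIX):
            assertions[msg[len(ASSERT_PREFIX):]] += 1
    return statuses, assertions
```

To use it, pass an open log file, for example `summarize_vmiop_errors(open("vmware.log", errors="replace"))`, and compare the counts between a VDA that crashed and one that did not.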
Because of these logs, I opened a case with NVIDIA. They say it is not NVIDIA's fault but VMware's, because of this change in build 22388125 (8.0.1c) and build 22348816 (7.0.3o):
Certain applications might take too many ESXi file handles and cause performance aggravation
In very rare cases, applications such as NVIDIA virtual GPU (vGPU) might consume so many file handles that ESXi fails to process other services or VMs. As a result, you might see GPU on some nodes to disappear, or report zero GPU memory, or performance degradation.
This issue is resolved in this release. The fix reduces the number of file handles a vGPU VM can consume.
NVIDIA support says VMware has to fix this, or I have to go back to 7.0.2. This file handle limit was introduced in one of the 7.0.3 releases at 16k. Now it is 2k. Only 7.0.2 has no limit.
For now I have uninstalled the NVIDIA driver in the Citrix VDAs, and we are running stable.
Does anyone know if VMware is working on this?
Or is there a way to disable this file handle limit?
I hope someone can help. Many thanks.