VMware vSphere

ESXi PSOD

1. ESXi PSOD

croit55
Posted Jul 17, 2023 01:04 PM

Hi,
For the past few days I have been troubleshooting an issue with VMware ESXi 8.0U1a installed on an HPE ProLiant DL385 Gen10+. The host is connected to the vCenter Server but is not part of a cluster.
The main error is NOT_IMPLEMENTED. From what I have found (Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956)), it basically means that some component is asking the vmkernel to do something it was not designed to do. Other discussions of this error have not helped in my case, since I have already tried reinstalling and upgrading ESXi itself.
The backtrace is as follows:
2023-07-15T10:42:52.185Z cpu0:2097242)@BlueScreen: NOT_IMPLEMENTED bora/vmkernel/main/world.c:2294
2023-07-15T10:42:52.185Z cpu0:2097242)Code start: 0x420017400000 VMK uptime: 11:13:33:15.324
2023-07-15T10:42:52.185Z cpu0:2097242)0x453882d1bc00:[0x420017514d31]PanicvPanicInt@vmkernel#nover+0x1f5 stack: 0x100
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bcb0:[0x4200175153a0]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453882d1bd10
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd10:[0x4200175158ad]Panic_OnAssertAt@vmkernel#nover+0xba stack: 0x8f600000000
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd90:[0x42001756855f]Int6_UD2Assert@vmkernel#nover+0x260 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bdc0:[0x420017561067]gate_entry@vmkernel#nover+0x68 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x4310dc600000
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bea0:[0x420017547251]WorldGroupCleanup@vmkernel#nover+0xe6 stack: 0x453882d1bef0
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bec0:[0x4200174f1dee]InitTable_Cleanup@vmkernel#nover+0x27 stack: 0x430f4ec01220
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bee0:[0x42001754cd46]World_TryReap@vmkernel#nover+0x3d3 stack: 0x45389e01f000
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfa0:[0x420017517582]ReaperWorkerWorld@vmkernel#nover+0xaf stack: 0x453882c9f100
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfe0:[0x420017828eca]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1c000:[0x4200174d788b]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2023-07-15T10:42:52.191Z cpu0:2097242)base fs=0x0 gs=0x420040000000 Kgs=0x0
2023-07-15T10:42:52.116Z cpu0:2097242)Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout
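
When comparing PSODs across hosts, it can help to reduce a dump like the one above to its call chain. A minimal Python sketch, assuming frame lines follow the `[0xADDR]Symbol@module#version+0xOFFSET` shape seen in this backtrace; `extract_frames` is a hypothetical helper name, not a VMware tool:

```python
import re

# Matches the "[0xADDR]Symbol@module#version+0xOFFSET" part of a frame line.
FRAME_RE = re.compile(r"\[0x[0-9a-f]+\](\w+)@(\w+)#\w+\+0x[0-9a-f]+")

def extract_frames(dump):
    """Return the function names of a backtrace, innermost frame first."""
    return [m.group(1) for m in FRAME_RE.finditer(dump)]

# Three frames copied from the backtrace in this post.
sample = """\
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x4310dc600000
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bea0:[0x420017547251]WorldGroupCleanup@vmkernel#nover+0xe6 stack: 0x453882d1bef0
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bee0:[0x42001754cd46]World_TryReap@vmkernel#nover+0x3d3 stack: 0x45389e01f000
"""

frames = extract_frames(sample)
print(" <- ".join(frames))
```

Two hosts whose dumps reduce to the same chain (here World_TryReap, WorldGroupCleanup, World_DestroyHeap) are likely hitting the same panic path, which makes "exact same PSOD" reports easier to confirm.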

Besides that, sometimes we get notifications (errors?) from the lsi_mr3 driver installed on the HBA controlling our local array of disks:

2023-07-15T10:28:04.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18714 to 18714. Count 1
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18715 to 18715. Count 1
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
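
To see how often these notifications occur, the hotplug events can be tallied per device and event code. A minimal Python sketch, assuming the line format shown above; `count_events` is a hypothetical helper, and the sketch does not interpret what event code 0x5e means:

```python
import re
from collections import Counter

# Captures the device prefix and the event code of megasas_hotplug_work lines.
EVENT_RE = re.compile(
    r"(lsi_mr3_[0-9a-f:.]+): megasas_hotplug_work: \d+: event code: (0x[0-9a-f]+)"
)

def count_events(log):
    """Count hotplug events per (device, event code) pair."""
    return Counter((m.group(1), m.group(2)) for m in EVENT_RE.finditer(log))

# First three lines of the excerpt above.
sample = """\
2023-07-15T10:28:04.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18714 to 18714. Count 1
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
"""

counts = count_events(sample)
for (device, code), n in counts.items():
    print(device, code, n)
```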

I would be really grateful for any clue about what else I could try before opening a support request with VMware.

Thank you once again in advance!
2. RE: ESXi PSOD

SiddSalman
Posted Jul 24, 2023 09:35 AM

This issue is under investigation:
Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout
VMware and HPE are investigating the cause of the issue.
Please log a case with HPE and VMware.

I work for HPE
3. RE: ESXi PSOD

pashnal
Posted Jul 25, 2023 10:23 AM

Hi,

Can you try the workaround below and check whether it helps? Increase the value of storageMaxDevices to 1024, as this issue occurs because the devfs heap is full.

To increase the value using the vSphere Client, go to Software > Advanced settings > VMKernel > vmkernel.boot.StorageMaxDevices.

Thanks,
Pramod Ashnal
Please mark this comment as the solution and give it a thumbs up if it solved your problem!
4. RE: ESXi PSOD

croit55
Posted Jul 25, 2023 12:26 PM

Hi,
Yes, I have already tried that recommendation, but unfortunately it did not help.
5. RE: ESXi PSOD

Maks Roshchyna
Posted Aug 23, 2023 07:55 AM

A PSOD is hardware related in most cases.
So update iLO/iDRAC and the BIOS.
Patch ESXi to the latest version. Update the drivers and firmware of all PCI cards.
Check whether the CPU and RAM are OK.
6. RE: ESXi PSOD

GregoryCann
Posted Sep 13, 2023 05:54 AM

Ok, thanks. I will check it. I will go to Software > Advanced settings > VMKernel > vmkernel.boot.StorageMaxDevices and if I face any issue, I will ask.
7. RE: ESXi PSOD

croit55
Posted Sep 27, 2023 06:50 AM

Did this fix the issue in your case?
8. RE: ESXi PSOD

microy
Posted Nov 03, 2023 12:39 PM

We have the exact same PSOD.
ESXi, 8.0.1, 22088125
ProLiant DL325 Gen10 Plus, AMD EPYC 7542 32-Core Processors
It has happened 3 times now.
VMware Support points towards HPE.
9. RE: ESXi PSOD

croit55
Posted Nov 06, 2023 06:58 AM

Yes, we have also faced this PSOD multiple times. Sometimes it happens every 2-3 days, but now it has been fine for over 50 days. Could you please post an update if HPE has any useful information on this?
10. RE: ESXi PSOD

Norbertel
Posted Nov 08, 2023 07:08 AM

We have the exact same PSOD.
VMware ESXi, 8.0.2, 22380479
ProLiant DL385 Gen11, AMD EPYC 9474F 48-Core Processor
It has already happened a few times on our cluster of 4 servers.
I hope VMware and HPE find a solution together.
11. RE: ESXi PSOD

Sourcepass
Posted Nov 24, 2023 01:07 PM

Has anyone tried downgrading to ESXi 7? A client of mine is having the exact same issue, and neither VMware nor HPE seems to have any answers.
12. RE: ESXi PSOD

DanRobinsonHPE
Posted Dec 06, 2023 08:14 PM

Are all of you having this issue running MegaRAID cards?
HPE MR216 / MR416 / MR408 ?
One change we made from Gen10 to Gen10 Plus was switching the default RAID card vendor to LSI (Broadcom).
13. RE: ESXi PSOD

croit55
Posted Dec 07, 2023 08:36 AM

Yes, we are using MR416i cards.
14. RE: ESXi PSOD

TallonZek
Posted Dec 22, 2023 01:28 PM

Hello, we got the same PSOD yesterday and are also running an HPE ProLiant DL385 Gen10+ with an MR416i-a. All drivers are from the September SPP. Have you gotten anywhere with HPE support?
15. RE: ESXi PSOD

TallonZek
Posted Jan 04, 2024 04:09 PM

We got another PSOD on a different host with identical hardware. Here's the latest info from VMware support:
This issue is caused by an object being leaked in the world heap of the "smad" process; when we try to clean up this world, it results in a PSOD.

HPE servers are impacted by this and may crash with a PSOD with the backtrace mentioned.

Currently there is no resolution.
The HPE engineering team is working on a code fix in their iLO driver to resolve the issue.

At this time it would be best to contact HPE to see if they have an updated iLO driver; however, the info I have was just published internally today, so I would not expect them to have anything just yet.

We had already updated iLO to the latest version prior to the PSOD. HPE support gave me a cryptic promise that their developers are looking at it, with no ETA. Has anyone else gotten better info on this?
16. RE: ESXi PSOD

allan trambouze
Posted Jan 04, 2024 04:32 PM

Did you use the HPE custom ESXi ISO or the standard ESXi 8.0U1a? If you did not use the custom ISO, you should switch to it; you can download it from the HPE website.
Another possible cause is that the ESXi host has some incompatible or unsupported third-party software installed that interferes with the installation.
17. RE: ESXi PSOD

TallonZek
Posted Jan 08, 2024 03:44 PM

We're using HPE's version.
18. RE: ESXi PSOD

lamax1976
Posted Jan 24, 2024 01:10 PM

Exact same behaviour here: 4 hosts HPE DL385 Gen10 Plus + 2 hosts DL385 Gen10 Plus v2 | VMware ESXi, 8.0.1, 22088125.
One PSOD every 4 days since we upgraded to vSphere 8.
We opened cases with HPE and VMware support, and they said it is linked to a bug between vSphere and the iLO, and that we have to wait for vSphere 8.0 U3 to have it fixed.
19. RE: ESXi PSOD

Nathan Savolskis
Posted Jan 24, 2024 02:17 PM

I had the same issue with 6.5, years ago. It was fixed with the 6.5U3 upgrade. It was a cascading failure tied to how the database (SQL Enterprise at the time) was configured: a runaway bug led to what was essentially a SQL log buffer overflow... and a PSOD.
Good times.
I'm surprised that it is still an issue, though.
20. RE: ESXi PSOD

BC_Daniel
Posted Feb 23, 2024 08:41 AM

HPE DL385 Gen10 Plus with no controller (FC SAN connection).
We updated everything to the newest version, but it failed again afterwards.
A cluster with 3 hosts failed within 30 minutes, one host after the other.
Overall, 5 PSODs until today.

HPE and VMware do not have any solution.
21. RE: ESXi PSOD

lamax1976
Posted Feb 23, 2024 08:56 AM

We received this workaround, which so far seems to work. The goal is to disable the iLO-related modules:

The HPE Engineering Team is working on a fix in their "ilo" driver to mitigate the issue (They should release the driver in mid-April). The VMware Engineering Team worked on a fix that will prevent a PSOD and free leaked poll context objects at the end of each poll() syscall, even if a driver is not behaving correctly according to the expectations.

- About the Workaround section:

We must explain that there is NO general workaround, except for the known case of HPE software, which is used only on HPE servers. In that case, the VIB removal as described makes sense.

| To check if the SMAD service is running after the workaround has been applied, connect to the host via SSH and run:
|
| ps | grep -i SMAD
|
| If there is no output, the service is not running and the workaround has been applied successfully. Otherwise, something went wrong with the VIB removal (e.g. the wrong VIB was uninstalled or the removal failed etc.).

Steps:
Backup the ESXi configuration: https://kb.vmware.com/s/article/2042141
Place Host in Maintenance Mode
SSH to the Host
esxcli software vib remove --vibname=amsdv
esxcli software vib remove --vibname=amsd
esxcli software vib remove --vibname=sut
esxcli software vib remove --vibname=ilorest
esxcli software vib remove --vibname=ilo
Reboot the Host
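
After the reboot, the result can be checked from the output of `esxcli software vib list` and `ps`. A minimal Python sketch, meant to be run on an admin workstation against pasted command output; it assumes the VIB name is the first column of the listing, and `remaining_vibs` and `smad_running` are hypothetical helper names:

```python
# VIB names taken from the removal steps above.
WORKAROUND_VIBS = ("amsdv", "amsd", "sut", "ilorest", "ilo")

def remaining_vibs(vib_list_output):
    """Return the workaround VIBs still present in a VIB listing.

    Assumes the VIB name is the first whitespace-separated column.
    """
    installed = set()
    for line in vib_list_output.splitlines():
        fields = line.split()
        if fields:
            installed.add(fields[0].lower())
    return [name for name in WORKAROUND_VIBS if name in installed]

def smad_running(ps_output):
    """Same idea as `ps | grep -i SMAD`: True if any line mentions smad."""
    return any("smad" in line.lower() for line in ps_output.splitlines())

# Illustrative listing with placeholder columns, not output from a real host.
sample_listing = """\
Name      Version    Vendor
ilo       <version>  <vendor>
sut       <version>  <vendor>
esx-base  <version>  <vendor>
"""

leftover = remaining_vibs(sample_listing)
print("still installed:", leftover)
print("smad running:", smad_running(""))
```

An empty list from `remaining_vibs` together with `False` from `smad_running` corresponds to the "no output" case described in the quoted workaround.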
22. RE: ESXi PSOD

TallonZek
Posted Feb 23, 2024 01:14 PM

Yep, HPE eventually gave us the same recommendation (to remove the iLO VIBs). No PSOD since. Both HPE and VMware claim they are working on a bugfix.
23. RE: ESXi PSOD

BC_Daniel
Posted Feb 23, 2024 02:15 PM

What's the actual downside of removing these VIBs?
Removing sut will result in the firmware no longer being updateable.
24. RE: ESXi PSOD

GUTOM-IT
Posted Apr 02, 2024 09:06 AM

ProLiant DL385 Gen10 Plus v2
VMware ESXi 8.0.2 Build-22380479 Update 2
iLO Firmware Version: 2.72 Sep 04 2022

We updated from 7.0.3 to 8.0.2 two weeks ago and have already had this error twice:
first on the second host and, 4 days later, on the first host.

Should we try an update of the iLO or wait for a solution from HPE or VMware?
Sure, I can also try this one:
SSH to the Host
esxcli software vib remove --vibname=amsdv
esxcli software vib remove --vibname=amsd
esxcli software vib remove --vibname=sut
esxcli software vib remove --vibname=ilorest
esxcli software vib remove --vibname=ilo
Reboot the Host
What exactly does this change do?
Thanks for your answer.
25. RE: ESXi PSOD

BC_Daniel
Posted Apr 02, 2024 09:40 AM

The iLO firmware is not the problem. It's the driver integration within vSphere ESXi.
Removing the VIBs will result in:
no firmware update possible (sut)
no configuration of the iLO management interface from ESXi (ilo)
no communication between ESXi and VMware (ilorest)

I will not remove the VIBs, but it's a pain to have multiple customers a week with this problem.
VMware support states: HPE's fault.
HPE support: VMware is aware of this problem and will patch it soon.

So we are just waiting for them to release this fix...
26. RE: ESXi PSOD

DanRobinsonHPE
Posted Apr 02, 2024 04:13 PM

ams helps iLO get additional information from the OS. Not critical but helps. Might also be needed for vLCM to OV/COM.
sut is used to stage some updates and any drivers deployed through the iLO. If you remove this one, you can do offline FW updates instead; they will just take longer. It needs to be on for vLCM integration to work fully.
ilorest is just a tool; if you aren't using it, then removing it does nothing.
ilo is the driver itself that is needed for in-band communications with the iLO from the host.

Updated images and drivers are expected soon along with the release of the next SPP.
