VMware vSphere


ESXi PSOD


  • 1.  ESXi PSOD

    Posted Jul 17, 2023 01:04 PM

    Hi,

    For the past few days I have been troubleshooting an issue with VMware ESXi 8.0U1a installed on an HPE ProLiant DL385 Gen10+. The host is connected to a vCenter Server but is not part of a cluster.

    The main error is NOT_IMPLEMENTED. From what I have found (Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956)), it means that some component is requesting an operation from the vmkernel that it was not designed to perform. Other discussions of this error have not helped in my case, since I have already tried reinstalling and upgrading ESXi.

    The error traceback is as follows:

    2023-07-15T10:42:52.185Z cpu0:2097242)@BlueScreen: NOT_IMPLEMENTED bora/vmkernel/main/world.c:2294
    2023-07-15T10:42:52.185Z cpu0:2097242)Code start: 0x420017400000 VMK uptime: 11:13:33:15.324
    2023-07-15T10:42:52.185Z cpu0:2097242)0x453882d1bc00:[0x420017514d31]PanicvPanicInt@vmkernel#nover+0x1f5 stack: 0x100
    2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bcb0:[0x4200175153a0]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453882d1bd10
    2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd10:[0x4200175158ad]Panic_OnAssertAt@vmkernel#nover+0xba stack: 0x8f600000000
    2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd90:[0x42001756855f]Int6_UD2Assert@vmkernel#nover+0x260 stack: 0x0
    2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bdc0:[0x420017561067]gate_entry@vmkernel#nover+0x68 stack: 0x0
    2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x4310dc600000
    2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bea0:[0x420017547251]WorldGroupCleanup@vmkernel#nover+0xe6 stack: 0x453882d1bef0
    2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bec0:[0x4200174f1dee]InitTable_Cleanup@vmkernel#nover+0x27 stack: 0x430f4ec01220
    2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bee0:[0x42001754cd46]World_TryReap@vmkernel#nover+0x3d3 stack: 0x45389e01f000
    2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfa0:[0x420017517582]ReaperWorkerWorld@vmkernel#nover+0xaf stack: 0x453882c9f100
    2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfe0:[0x420017828eca]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
    2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1c000:[0x4200174d788b]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
    2023-07-15T10:42:52.191Z cpu0:2097242)base fs=0x0 gs=0x420040000000 Kgs=0x0
    2023-07-15T10:42:52.116Z cpu0:2097242)Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout
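
    For anyone comparing their own crash against this one, here is a minimal sketch that checks a saved log for the two lines that seem characteristic of this PSOD: the NOT_IMPLEMENTED assert in world.c and the world-group heap timeout. The function name has_psod_signature is made up for illustration, and the source line number (2294) is deliberately left out of the match on the assumption that it may differ between builds.

```shell
#!/bin/sh
# Sketch: succeed only if a log file contains both lines quoted in the
# traceback above. The strings are copied from this post; the function
# name is hypothetical.

has_psod_signature() {
    # $1: path to a saved log file (for example, a log extracted from the dump)
    log_file=$1
    grep -qF 'NOT_IMPLEMENTED bora/vmkernel/main/world.c' "$log_file" &&
        grep -qF 'Unable to complete wait for non-empty heap (worldGroup.' "$log_file"
}

# Usage (the file name is a placeholder):
#   has_psod_signature ./vmkernel-from-dump.log && echo "same signature"
```

    A match only says the backtrace looks alike; it does not prove the same root cause.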

     

    Besides that, sometimes we get notifications (errors?) from the lsi_mr3 driver installed on the HBA controlling our local array of disks:

     

    2023-07-15T10:28:04.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
    2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18714 to 18714. Count 1
    2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
    2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18715 to 18715. Count 1
    2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
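
    To see how often these controller events occur, a small sketch like the following could tally the event codes from lines in the format shown above. The function name count_mr3_events is hypothetical, and the pattern assumes the "megasas_hotplug_work: <n>: event code: 0x.." layout stays the same across driver versions.

```shell
#!/bin/sh
# Sketch: count lsi_mr3 hotplug events per event code.
# Reads log lines on stdin and prints "<count> <event code>" per distinct code.

count_mr3_events() {
    sed -n 's/.*megasas_hotplug_work: [0-9]*: event code: \(0x[0-9a-fA-F]*\).*/\1/p' |
        sort | uniq -c | awk '{print $1, $2}'
}

# Usage (the file name is a placeholder):
#   count_mr3_events < ./vmkernel.log
```

    This only counts the events; what a given code means would have to come from the driver or controller documentation.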

     

    I would be really grateful if any of you had an idea of what else I could try before opening a support request with VMware.

     

    Thank you in advance!



  • 2.  RE: ESXi PSOD

    Posted Jul 24, 2023 09:35 AM

    This issue is under investigation:

    Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout

    VMware and HPE are investigating the cause of the issue.

    Please log a case with both HPE and VMware.

     

    I work for HPE



  • 3.  RE: ESXi PSOD

    Posted Jul 25, 2023 10:23 AM

    Hi,

     

    Can you try the workaround below and check whether it helps? Increase the value of storageMaxDevices to 1024, as this issue occurs because the devfs heap is full.

    To increase the value using the vSphere Client, go to Software > Advanced settings > VMKernel > vmkernel.boot.StorageMaxDevices.

     

    Thanks,

    Pramod Ashnal

    Please mark this comment as the solution and give it a thumbs up if it solved your problem!

     



  • 4.  RE: ESXi PSOD

    Posted Jul 25, 2023 12:26 PM

    Hi,

    yes, I have already tried that recommendation but unfortunately it has not been helpful.



  • 5.  RE: ESXi PSOD

    Posted Aug 23, 2023 07:55 AM

    A PSOD is hardware related in most cases.

    So update iLO/iDRAC and the BIOS.

    Patch ESXi to the latest version. Update the drivers and firmware of all PCI cards.

    Check whether the CPU and RAM are OK.



  • 6.  RE: ESXi PSOD

    Posted Sep 13, 2023 05:54 AM

    Ok, thanks. I will check it. I will go to Software > Advanced settings > VMKernel > vmkernel.boot.StorageMaxDevices and if I face any issue, I will ask.



  • 7.  RE: ESXi PSOD

    Posted Sep 27, 2023 06:50 AM

    Did this fix the issue in your case?



  • 8.  RE: ESXi PSOD

    Posted Nov 03, 2023 12:39 PM

    We have the exact same PSOD. 
    ESXi, 8.0.1, 22088125
    ProLiant DL325 Gen10 Plus, AMD EPYC 7542 32-Core Processors

    It has happened 3 times now.

    VMware Support points towards HPE.



  • 9.  RE: ESXi PSOD

    Posted Nov 06, 2023 06:58 AM

    Yes, we have also faced this PSOD multiple times. Sometimes it happens every 2-3 days, but now it has been okay for over 50 days. Can you please post an update if HPE has any useful information on this?



  • 10.  RE: ESXi PSOD

    Posted Nov 08, 2023 07:08 AM

    We have the exact same PSOD.
    VMware ESXi, 8.0.2, 22380479
    ProLiant DL385 Gen11, AMD EPYC 9474F 48-Core Processor

    Happened already a few times on our cluster with 4 servers.
    I hope VMware and HPE find a solution together.



  • 11.  RE: ESXi PSOD

    Posted Nov 24, 2023 01:07 PM

    Has anyone tried downgrading to ESXi 7? A client of mine is having the exact same issue, and neither VMware nor HPE seems to have any answers.



  • 12.  RE: ESXi PSOD

    Posted Dec 06, 2023 08:14 PM

    Are all of you who have this issue running MegaRAID cards?
    HPE MR216 / MR416 / MR408?

    One change we made from Gen10 to Gen10 Plus was to switch the default RAID card vendor to LSI (Broadcom).



  • 13.  RE: ESXi PSOD

    Posted Dec 07, 2023 08:36 AM

    Yes, we are using MR416i cards.



  • 14.  RE: ESXi PSOD

    Posted Dec 22, 2023 01:28 PM

    Hello: we got the same PSOD yesterday, and are also running an HPE ProLiant DL385 Gen10+ with an MR416i-a.  All drivers are from the September SPP.  Have you got anywhere with HPE support?



  • 15.  RE: ESXi PSOD

    Posted Jan 04, 2024 04:09 PM

    We got another PSOD on a different host with identical hardware.  Here's our latest info from VMware support:

    This issue is caused due to object being leaked in the world heap of the "smad" process and when we try to cleanup this world it results in PSOD

    HPE Server's are impacted by this and may crash with PSOD with the Backtrace mentioned 

    Currently there is no Resolution
    HPE Engineering Team is working on a code Fix in their ILO Driver to Resolve the issue

    At this time it would be best to contact HPE to see if they have a updated ILO Driver, however the info I have was just published internally today so I would not expect they have anything just yet.

    We had already updated iLO to the latest version prior to the PSOD.  HPE support gave me a cryptic promise that their developers are looking at it, with no ETA.  Has anyone else got better info on this?



  • 16.  RE: ESXi PSOD

    Posted Jan 04, 2024 04:32 PM

    Did you use the HPE custom ESXi ISO or the standard ESXi 8.0U1a image? If you did not use the custom ISO, you should switch to it; you can download it from the HPE website.

    Another possible cause is that the ESXi host has some incompatible or unsupported third-party software installed that interferes with the installation.



  • 17.  RE: ESXi PSOD

    Posted Jan 08, 2024 03:44 PM

    We're using HPE's version.



  • 18.  RE: ESXi PSOD

    Posted Jan 24, 2024 01:10 PM

    Exact same behaviour here: 4 hosts HPE DL385 Gen10 Plus + 2 hosts DL385 Gen10 Plus v2 | VMware ESXi, 8.0.1, 22088125.

    One PSOD every 4 days since we upgraded to vSphere 8.

    We opened cases with HPE and VMware support, and they said it is linked to a bug between vSphere and the iLO, and that we have to wait for vSphere 8.0 U3 to have it fixed.



  • 19.  RE: ESXi PSOD

    Posted Jan 24, 2024 02:17 PM

    I had the same issue with 6.5, years ago.  It was fixed with the 6.5U3 upgrade, but it was a cascading failure tied to how the database (SQL Enterprise at the time) was configured: a runaway bug led to what was essentially a SQL log buffer overflow... and a PSOD.

    Good times.  

    I'm surprised that is still an issue though.  



  • 20.  RE: ESXi PSOD

    Posted Feb 23, 2024 08:41 AM

    HPE DL385 Gen10 Plus with no controller (FC SAN connection).

    We updated everything to the newest version, but the hosts failed again afterwards.

    A cluster with 3 hosts failed within 30 minutes, one host after the other.

    Overall, 5 PSODs so far.

     

    Neither HPE nor VMware has a solution.

     

     



  • 21.  RE: ESXi PSOD

    Posted Feb 23, 2024 08:56 AM

    We received this workaround, which seems to work so far. The goal is to disable the iLO-related modules:

     

    The HPE Engineering Team is working on a fix in their "ilo" driver to mitigate the issue (They should release the driver in mid-April). The VMware Engineering Team worked on a fix that will prevent a PSOD and free leaked poll context objects at the end of each poll() syscall, even if a driver is not behaving correctly according to the expectations.

     

     

    - About the Workaround section:

     

    We must explain that there is NO general workaround, except for the known case of HPE software, which is used only on HPE servers. In that case, the VIB removal as described makes sense.

     

     

    To check if the SMAD service is running after the workaround has been applied, connect to the host via SSH and run:

    ps | grep -i SMAD

    If there is no output, the service is not running and the workaround has been applied successfully. Otherwise, something went wrong with the VIB removal (e.g. the wrong VIB was uninstalled or the removal failed etc.).

     

    Steps:

    Backup the ESXi configuration: https://kb.vmware.com/s/article/2042141

    Place Host in Maintenance Mode

    SSH to the Host

    1. esxcli software vib remove --vibname=amsdv
    2. esxcli software vib remove --vibname=amsd
    3. esxcli software vib remove --vibname=sut
    4. esxcli software vib remove --vibname=ilorest
    5. esxcli software vib remove --vibname=ilo
    6. Reboot the Host
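
    Not every host has all five VIBs installed, so a dry run can help before removing anything. The sketch below reads the output of esxcli software vib list and prints only the removal commands for VIBs that are actually present, in the same order as the steps above; it does not execute them. The function names plan_vib_removal and smad_running are made up for illustration, and the sketch assumes the VIB name is the first column of the list output.

```shell
#!/bin/sh
# Sketch: print (not run) the removal commands from the steps above,
# limited to the VIBs that are actually installed.

# VIB names and order as given in the workaround steps.
WORKAROUND_VIBS="amsdv amsd sut ilorest ilo"

plan_vib_removal() {
    # stdin: output of `esxcli software vib list`
    # Assumption: the first whitespace-separated column is the VIB name.
    installed=$(awk '{print $1}')
    for vib in $WORKAROUND_VIBS; do
        # -x matches the whole line, so "ilo" does not match "ilorest".
        if printf '%s\n' "$installed" | grep -qx "$vib"; then
            echo "esxcli software vib remove --vibname=$vib"
        fi
    done
}

smad_running() {
    # stdin: output of `ps`; succeeds if any line mentions smad.
    grep -qi 'smad'
}

# Usage on the host, in maintenance mode and after backing up the config:
#   esxcli software vib list | plan_vib_removal
# After the reboot:
#   ps | smad_running && echo "smad still running" || echo "smad not running"
```

    Printing the commands instead of running them leaves the actual removal as a deliberate manual step.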


  • 22.  RE: ESXi PSOD

    Posted Feb 23, 2024 01:14 PM

    Yep: HPE eventually gave us the same recommendation (to remove the iLO VIBs).  No PSOD since.  Both HPE and VMware claim they are working on a bugfix.



  • 23.  RE: ESXi PSOD

    Posted Feb 23, 2024 02:15 PM

    What's the actual downside of removing these VIBs?

    Removing sut will result in the firmware not being updatable.



  • 24.  RE: ESXi PSOD

    Posted Apr 02, 2024 09:06 AM

    ProLiant DL385 Gen10 Plus v2
    VMware ESXi 8.0.2 Build-22380479 Update 2
    iLO Firmware Version:  2.72 Sep 04 2022

    We updated from 7.0.3 to 8.0.2 two weeks ago and have already had this error twice:

    first on host 2, and 4 days later on host 1.

     

    Should we try an update of the iLO, or wait for a solution from HPE or VMware?

    Sure, I can also try this one:

    SSH to the Host

    1. esxcli software vib remove --vibname=amsdv
    2. esxcli software vib remove --vibname=amsd
    3. esxcli software vib remove --vibname=sut
    4. esxcli software vib remove --vibname=ilorest
    5. esxcli software vib remove --vibname=ilo
    6. Reboot the Host

    What exactly does this change do?

    Thanks for any answers.



  • 25.  RE: ESXi PSOD

    Posted Apr 02, 2024 09:40 AM

    The iLO firmware is not the problem. It's the driver integration within vSphere ESXi.

    Removing the VIBs will result in:

    no firmware update possible (sut)

    no configuration of the iLO management interface from ESXi (ilo)

    no communication between ESXi and VMware (ilorest)

     

    I will not remove the VIBs, but it's a pain to have multiple customers a week with this problem.
    VMware Support states: HPE's fault
    HPE Support: VMware is aware of this problem and will patch it soon

    So we are just waiting for them to release this fix...



  • 26.  RE: ESXi PSOD

    Posted Apr 02, 2024 04:13 PM

    ams helps iLO get additional information from the OS. Not critical but helps. Might also be needed for vLCM to OV/COM.

    sut is used to stage some updates and any drivers deployed through the iLO.  If you remove this one, you can do offline FW updates instead, they will just take longer.  It needs to be on for vLCM integration to work fully.

    ilorest is just a tool, if you aren't using it, then removing it does nothing.

    ilo is the driver itself that is needed for in-band communications with the iLO from the host.

     

    Updated images and drivers are expected soon along with the release of the next SPP.