VMware vSphere

Strange Host Responsiveness Issues

  • 1.  Strange Host Responsiveness Issues

    Posted Jun 08, 2018 09:29 PM

    - Recently upgraded hosts to 6.7 and vCenter to 6.7a

    - Hosts are 'not responding' in vCenter Server

    - Can ping

    - Cannot access web interface or log in via SSH

    - Can access it via console, but after you enter login information and press enter it freezes (cursor is still blinking)

    - If you remove the host from the inventory and shut down a virtual machine on the host it brings everything back online and the host can be re-added to vCenter

    - Four identical hosts, has happened on three of the four (twice on one)

    - The host that had this issue twice now will not come back after trying the above method and is completely unresponsive at the console



  • 2.  RE: Strange Host Responsiveness Issues

    Posted Jun 09, 2018 01:53 PM

    And what do you see in vmkernel.log and hostd.log of the affected hosts?

    How did you perform the upgrade?



  • 3.  RE: Strange Host Responsiveness Issues

    Posted Jun 11, 2018 03:45 PM

    I updated the hosts via the Update Manager.

    I pulled the logs from one of the hosts I was able to get back online. Here's what was in hostd:

    --> [context]zKq7AVICAgAAAMKpfAAVaG9zdGQAAHyZNWxpYnZtYWNvcmUuc28AAADAGwBgsBcBWbxkaG9zdGQAAS5JzIKK4QABbGlidmltLXR5cGVzLnNvAANnIA9saWJ2bW9taS5zbwADTCwPA4pKHAMdmRwDAaIcAxlRHAPwZA0DbNoPA3SgHwH148EAJTAoAAM0KAA7DzYEa4AAbGlicHRocmVhZC5zby4wAAXtmg5saWJjLnNvLjYA[/context]

    count_events: starting communication with bmc over ipmi driver

    count_events: GET_SEL_REPO_INFO returned {version: 0x51, count 41, free 15728,add_stamp 1380738318, erase_stamp 1358956536 op_support 2}

    IPMI SEL sync took 0 seconds 0 sel records, last 41

    2018-05-29T09:29:30.273Z error hostd[2099052] [Originator@6876 sub=Cimsvc] IPMI SEL unavailable

    2018-05-29T09:29:30.274Z warning hostd[2099762] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea772f] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported

    2018-05-29T09:29:59.882Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.Tpm20Provider opID=e3ea776e user=root] Unable to retrieve TPM/TXT status. TPM functionality will be unavailable. Failure reason: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details..

    2018-05-29T09:29:59.918Z error hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] CheckLicense: vFlash is not licensed. error = [N5Vmomi9DataArrayINS_18LocalizableMessageEEE:0x000000b0b88b7180]

    2018-05-29T09:29:59.923Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.Tpm20Provider opID=e3ea776e user=root] Unable to retrieve TPM/TXT status. TPM functionality will be unavailable. Failure reason: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details..

    2018-05-29T09:29:59.964Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported

    2018-05-29T09:29:59.968Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported

    2018-05-29T09:30:00.032Z warning hostd[2099885] [Originator@6876 sub=Statssvc] Calculated write I/O size 589477 for scsi0:0 is out of range -- 589477,prevBytes = 27990022656 curBytes = 28010064896 prevCommands = 1280828curCommands = 1280862

    2018-05-29T09:30:00.565Z error hostd[2099053] [Originator@6876 sub=PropertyProvider opID=e3ea7773 user=root] Unexpected fault reading property: 000000b0622e1da0, IsSourceAvailable: N5Vmomi5Fault12NotSupported9ExceptionE(Fault cause: vmodl.fault.NotSupported

    --> )

    And here's what was in vmkernel:

    2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 477: uw.2386360 (827751) extraMin/extraFromParent: 33/33, sioc (809) childEmin/eMinLimit: 14066/14080

    2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 470: Admission failure in path: sioc/storageRM.2386360/uw.2386360

    2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 477: uw.2386360 (827751) extraMin/extraFromParent: 256/256, sioc (809) childEmin/eMinLimit: 14066/14080

    2018-06-01T21:51:50.940Z cpu1:2387625)ScsiVsi: 2899: Can't set the maxPathQueueDepth value to more than device advertised maxPathQueueDepth 128
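    To see which resource pools the admission failures in a log like the one above belong to, a small helper can tally them per path. This is only a sketch: the function name is made up for illustration, and it assumes the log lines keep the "Admission failure in path:" wording shown in the excerpt.

```shell
#!/bin/sh
# Tally MemSchedAdmit admission failures per resource-pool path.
# admit_failures_by_path is a hypothetical helper name, not a VMware tool.
admit_failures_by_path() {
    # $1: path to a vmkernel log file (for example /var/log/vmkernel.log)
    # Split each line on the fixed message text; whatever follows is the path.
    awk -F'Admission failure in path: ' \
        'NF > 1 { n[$2]++ } END { for (p in n) print n[p], p }' "$1" |
        sort -rn
}
```

    Running `admit_failures_by_path /var/log/vmkernel.log` on a host prints one "count path" line per pool, busiest first, which makes it easier to tell sioc failures apart from nicmgmtd ones.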



  • 4.  RE: Strange Host Responsiveness Issues

    Posted Jun 20, 2018 08:55 PM

    Bump. Two of the hosts have gone into an unresponsive state again.



  • 5.  RE: Strange Host Responsiveness Issues

    Posted Jun 20, 2018 09:08 PM

    At this point you should be opening a SR to have them investigate.



  • 6.  RE: Strange Host Responsiveness Issues

    Posted Jun 15, 2019 08:54 PM

    Hi Isaacwd,

    The error you are experiencing is a known issue in vSphere 6.7. The bug is present in ESXi 6.7 EP 07 and ESXi 6.7 EP 09 and results in the host becoming unresponsive.

    The root cause is SIOC running out of memory.

    Please wait for VMware to release the fix, which will be included in 6.7 U3. The estimated release date is around July/August 2019.

    Note: Please note that currently there is no workaround available for above-mentioned issue.

    Workaround:

    Work around the issue by restarting the SIOC service using the following commands on the affected ESXi hosts:

    1. Check the status of storageRM and sdrsInjector

    /etc/init.d/storageRM status
    /etc/init.d/sdrsInjector status


    2. Stop the service

    /etc/init.d/storageRM stop
    /etc/init.d/sdrsInjector stop


    3. Start the service

    /etc/init.d/storageRM start
    /etc/init.d/sdrsInjector start

    If the issue persists even after the SIOC service is restarted, users can temporarily disable SIOC by turning off the feature from VMware vCenter.

    Refer to VMware KB 67543.
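    The three steps above can be wrapped in one function so they always run in the same order. This is a rough sketch, not something taken from the KB: the function name and the INIT_DIR variable are made up here, and INIT_DIR exists only so the sequence can be tried against stub scripts before running it on a real host.

```shell
#!/bin/sh
# Directory holding the service scripts; /etc/init.d on an ESXi host.
# INIT_DIR is an illustrative override, not a VMware setting.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Check, stop, then start storageRM and sdrsInjector, as in the steps above.
restart_sioc_services() {
    for svc in storageRM sdrsInjector; do
        # A stopped service may report a non-zero status; that is not fatal.
        "$INIT_DIR/$svc" status || true
    done
    for svc in storageRM sdrsInjector; do
        "$INIT_DIR/$svc" stop || return 1
    done
    for svc in storageRM sdrsInjector; do
        "$INIT_DIR/$svc" start || return 1
    done
}
```

    On an affected host you would call `restart_sioc_services` from an SSH or console shell. It stops at the first stop or start command that fails, so a partial restart is visible in the exit status.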



  • 7.  RE: Strange Host Responsiveness Issues

    Posted Aug 14, 2018 06:28 PM

    Did you get a resolution to this problem?

    We opened a case with VMware last year and they were unable to find the root cause.

    We have been battling this for the past year, ever since our 6.5 upgrade. It is quite intermittent and has affected 6-7 hosts in total.

    Here is our current environment to compare.

    ESXi  6.5.0, 8935087

    Cisco UCS B200 M4 latest drivers and UCS blade package 3.2(3d)

    nenic - 1.0.16.0

    fnic - 1.6.0.37

    Backup software Veeam 9.5.0.1922

    Thank you,

    Phil



  • 8.  RE: Strange Host Responsiveness Issues

    Posted Aug 21, 2018 02:24 PM

    I have what sounds like the same issue.  Hosts are non-responsive, VMs seem ok.  1 host is locked up after entering the root password, still on password screen.  Alt-F# keys work but nothing else.  On another host, I got logged on, but once I got to the troubleshooting screen it then locked.  If I can get there, restarting the management agents works but getting there is the problem.  I have tried connecting with PowerCLI, but Connect-VIServer times out.

    Sometimes the lockup on the console will suddenly unfreeze on its own and I will then be able to get to the management agent restart and get the host back up.  No clue as to what triggers either the problem, or the console lockup.

    In my case, I just upgraded to the latest patches of 6.5u2 with the Hyperthreading Mitigation features.  I have set the flag and so far, problems have only happened on hosts that have had the flag set but have not yet rebooted.  It is still too early to tell if this is a coincidence.  I am pushing through the reboots as fast as I can so as to eliminate this as a factor; I still have 16 hosts to go.  I set the flag via script 3 days ago and am still doing reboots (a weekend intervened).



  • 9.  RE: Strange Host Responsiveness Issues

    Posted Sep 06, 2018 08:55 PM

    I'm seeing similar errors (thousands & thousands of them; 8 lines every 30 seconds) and I also have a 6.7 host upgraded from 6.5U2.

    The host works fine though (mostly). I do have some strange intermittent connectivity issues with a web application running on one of the VMs.

    This is an HP DL380 Gen9, and the similar errors I'm seeing are:

    "2018-09-06T15:41:52.976Z cpu10:2099148)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2099148/uw.2099148"

    "2018-09-06T15:41:52.976Z cpu10:2099148)MemSchedAdmit: 477: uw.2099148 (9114) extraMin/extraFromParent: 117/117, nicmgmtd (806) childEmin/eMinLimit: 2479/2560"

    Your post is the only thing I hit when searching.

    I disconnected one of the NIC cards that I was hoping was associated with the errors, and the errors stopped for several hours- but then started back up...
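    One pattern is visible in the "477" lines quoted in this thread: in each sample, extraMin plus childEmin is larger than eMinLimit (for example 117 + 2479 = 2596 against a limit of 2560). Whether that comparison is really how the VMkernel decides admission is an assumption drawn from these samples, not documented behaviour. The sketch below, with a made-up function name, pulls the three numbers out of each line and prints by how much the sum exceeds the limit.

```shell
#!/bin/sh
# Extract pool name, extraMin, childEmin and eMinLimit from MemSchedAdmit
# "477" lines and print extraMin + childEmin - eMinLimit as over_by.
# admit_headroom is a hypothetical helper; the meaning of the fields is
# inferred from the log samples, not from VMware documentation.
admit_headroom() {
    # $1: path to a vmkernel log file
    sed -n 's|.*extraMin/extraFromParent: \([0-9]*\)/[0-9]*, \([^ ]*\) .*childEmin/eMinLimit: \([0-9]*\)/\([0-9]*\).*|\2 \1 \3 \4|p' "$1" |
        awk '{ printf "%s extraMin=%d childEmin=%d eMinLimit=%d over_by=%d\n",
                      $1, $2, $3, $4, $2 + $3 - $4 }'
}
```

    A positive over_by on every line would suggest the pool is simply sitting at its limit, which matches the nicmgmtd and sioc samples posted so far.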



  • 10.  RE: Strange Host Responsiveness Issues

    Posted Sep 10, 2018 01:13 PM

    You are not alone. We have the same issue on newly installed Dell PowerEdge R640 VSAN Ready Nodes, with a clean 6.7 installed from scratch. Some of our CentOS 7 images, latest patches and open-vm-tools, suddenly just start dropping off. The guests and the hosts seem fine, but we have 0 connectivity on certain interfaces. For example, on some, management interfaces will work fine, but services/Internet interfaces drop off and have no connectivity.

    I've opened a SR, and hope VMware comes back with something soon.



  • 11.  RE: Strange Host Responsiveness Issues

    Posted Sep 18, 2018 02:22 PM

    Hello guys,

    The same issue on 10 of our ESXi 6.7 hosts on DL380 Gen10 with vSAN.

    2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2100568/uw.2100568

    2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 477: uw.2100568 (12331) extraMin/extraFromParent: 186/186, nicmgmtd (796) childEmin/eMinLimit: 2443/2560

    About 1-2 lines each second in /var/log/vmkernel.log
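    To put a number on the message rate rather than eyeballing it, the lines can be grouped by the minute part of their timestamp. A minimal sketch, assuming every log line starts with an ISO timestamp like the ones quoted above; the function name is made up for illustration.

```shell
#!/bin/sh
# Count MemSchedAdmit lines per minute in a vmkernel log.
# memsched_lines_per_minute is a hypothetical helper name.
memsched_lines_per_minute() {
    # $1: path to a vmkernel log file
    # The first 16 characters of a line are "YYYY-MM-DDTHH:MM".
    grep 'MemSchedAdmit:' "$1" | cut -c1-16 | sort | uniq -c |
        awk '{ print $2, $1 }'
}
```

    Each output line is a minute followed by the number of matching messages in that minute, so a change in rate after disabling a NIC or a service shows up directly.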

    Any progress on the SR / any statements from VMware?

    Please share your info.

    Thx!

    Regards,

    JK



  • 12.  RE: Strange Host Responsiveness Issues

    Posted Sep 18, 2018 04:11 PM

    We were using the built-in Broadcom x4 GigT nics, but switched the traffic to an HP FLR 10GigT Intel-based card (simply due to a guess, considering Broadcom's driver track record).

    I haven't disabled the Broadcom cards entirely, just moved all the traffic to the other nics, but the errors have continued to fill the logs, and we still continue to have intermittent connectivity/responsiveness issues with one of the hosts...



  • 13.  RE: Strange Host Responsiveness Issues

    Posted Sep 18, 2018 02:54 PM

    Which vendor's HBA is in use?

    Try changing the queue depth to 64 and rebooting the host.

    You can follow this KB:

    VMware Knowledge Base



  • 14.  RE: Strange Host Responsiveness Issues

    Posted Sep 18, 2018 04:14 PM

    Thanks for the idea Rajeev,

    We're not currently using the NIC types mentioned in that KB article.



  • 15.  RE: Strange Host Responsiveness Issues

    Posted Sep 26, 2018 02:42 PM

    After disabling the embedded Broadcom quad Nic card last Saturday, the "admission failure" messages all stopped that day and have not returned, for what that's worth.

    I haven't collected any new feedback from users about the intermittent connectivity issues yet, so I don't know if that helped anything beyond getting rid of log bloat...



  • 16.  RE: Strange Host Responsiveness Issues

    Broadcom Employee
    Posted Oct 03, 2018 01:20 PM

    Hello,

    I found a similar case with "admission failure" messages reported. Can you try disabling NetQueue on the card?

    esxcli network nic queue loadbalancer set --rsslb=off -n vmnicX

    Thanks,

    James



  • 17.  RE: Strange Host Responsiveness Issues

    Posted Jan 11, 2019 04:50 PM

    Hi MightyGorilla,

    After we disabled all 4 onboard Broadcom NICs in the BIOS/RBSU, these warnings flooding the log have disappeared:

    2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2100568/uw.2100568

    2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 477

    Thx for info!

    Regards

    Cop



  • 18.  RE: Strange Host Responsiveness Issues

    Posted Oct 04, 2018 03:55 PM

    It looks like we had (have?) the same issue. Two ESXi hosts on two separate occasions have locked up in the way you have described. VMKERNEL is full of this error:

    ScsiVsi: 2899: Can't set the maxPathQueueDepth value to more than device advertised maxPathQueueDepth 128.

    We put in a ticket with VMware, but they were unable to resolve it. They suggested triggering an NMI if it locks up so that it generates a kernel dump.



  • 19.  RE: Strange Host Responsiveness Issues

    Posted Oct 22, 2018 12:34 PM

    Hello,

    This is a known issue with 6.7. We were able to collect an NMI vmkernel dump and shared it with the engineering team, who are currently working on it.

    We will share updates as soon as we hear anything.

    Thanks,

    MS



  • 20.  RE: Strange Host Responsiveness Issues

    Posted Oct 25, 2018 04:55 AM

    Did they ever respond?  I just updated three of my hosts to 6.7 and am experiencing the same issues.  Occasionally my hosts flip to 'not responding' in vCenter and I'm able to correct it by restarting the host agents.  I'm wondering if it has to do with the HTAware Mitigation, so I will disable it for now and see if the situation improves.

    Hardware is Cisco UCS B200M3's Intel E5-2660 v2 and FW 4.0(1b).  Storage is all iSCSI over the VIC 1240 NICs (VNX and Nimble arrays).  Hardware, FW and Drivers all match up to the VMware HCL.



  • 21.  RE: Strange Host Responsiveness Issues

    Posted Oct 25, 2018 05:11 AM

    No, that is a different issue. In this case you cannot restart the management agents, and the DCUI also hangs. You might be encountering a different issue, I guess. Better to open a support request.



  • 22.  RE: Strange Host Responsiveness Issues

    Posted Nov 02, 2018 05:52 PM

    Hi, Did you get a resolution to this issue from VMware yet? I'm having the same issue on 2 hosts. We have over 100 other hosts that are OK.



  • 23.  RE: Strange Host Responsiveness Issues

    Posted Nov 05, 2018 02:57 PM

    We are still waiting for a fix.



  • 24.  RE: Strange Host Responsiveness Issues

    Posted Nov 26, 2018 06:53 PM

    We've had exactly the same problem with a fresh install of VMware ESXi 6.7 Update 1 on a Dell R740 server.

    Still waiting for VMware's support, but any update on this subject will be appreciated.

    Thanks!



  • 25.  RE: Strange Host Responsiveness Issues

    Posted Dec 04, 2018 03:29 PM

    We were told the fix would be made available in Update 2 which is scheduled for Q1 of 2019.



  • 26.  RE: Strange Host Responsiveness Issues

    Posted Jan 08, 2019 07:46 PM

    We were also told that the fix will be available in Update 2. The workaround in our case is to disable the SIOC on all Datastores.



  • 27.  RE: Strange Host Responsiveness Issues

    Posted Jan 23, 2019 05:27 PM

    Is there a bug number for this issue that anyone took note of?



  • 28.  RE: Strange Host Responsiveness Issues

    Posted Feb 26, 2019 02:52 PM

    We're having the identical issue with Dell PowerEdge 740xd hosts and Dell Compellent SC5020 iSCSI storage.  Multiple hosts go into this zombie state and some or all of the VMs lose connection to the storage.  HA doesn't move the VMs to healthy hosts.  Initially we were power cycling the hosts because we could not access their Web UI or console (DCUI).  Once power cycled, HA will move the VMs to other healthy hosts. We've since learned that if SSH is enabled, you can connect to them with SSH and run "services.sh restart". After several minutes the command will complete and the host will return to "normal".  We can then vMotion the VMs and gracefully restart the host.
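    When several hosts are in this state at once, the SSH step described above can be looped over a list of hosts. A rough sketch under stated assumptions: SSH is already enabled on each host, login is as root, and the function name, the SSH variable and the host names in the usage note are all made up for illustration.

```shell
#!/bin/sh
# SSH is overridable so the loop can be dry-run first (for example SSH=echo).
SSH="${SSH:-ssh}"

# Run "services.sh restart" on each host passed as an argument.
# restart_mgmt_agents is a hypothetical helper; root login is an assumption.
restart_mgmt_agents() {
    for host in "$@"; do
        echo "== $host =="
        # The restart can take several minutes per host, as noted in the post.
        $SSH "root@$host" 'services.sh restart' ||
            echo "restart failed on $host" >&2
    done
}
```

    For example, `SSH=echo; restart_mgmt_agents esx01 esx02` only prints the commands that would run, while leaving SSH unset executes them for real.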

    Support confirmed yesterday there's a "firmware / driver / vmkernel" bug in 6.7 update 1 and the fix would be included in 6.7 update 2.  They indicated update 2 was not expected out for another 2 months.  They're very tight lipped so far about the details of the bug, triggers and any possible workarounds.

    I cannot wait 60 days for an Update 2 that has already been delayed from Q1 to Q2/Q3.  I'm really hoping VMware does the right thing and makes the details of this critical defect public, so we and other customers can make informed decisions about updating.



  • 29.  RE: Strange Host Responsiveness Issues

    Posted Mar 09, 2019 02:24 AM

    I currently have the same issue, slight variation in our infrastructure.

    We have four ESXi 6.7 hosts, fully patched. All hosts are connected to an iSCSI Nimble SAN. Two hosts share the majority of the workload and iSCSI connections. These are the only two hosts that experience the issue. I have managed to find a workaround, at least in our setup. We have been restarting the vpxa and hostd services at least once a day, if not twice, and have not had the issue for 5 days. The services are restarted regardless of whether the host state is healthy.

    Some of the logs reported “out of memory” errors and storage disconnects.



  • 30.  RE: Strange Host Responsiveness Issues

    Posted Mar 12, 2019 06:50 PM

    Support confirmed there is a bug in SIOC that causes it to consume large amounts of memory and CPU when handling storage I/O anomalies. Anomalies appear to include normal latency increases during backup periods and Windows updates.  The workaround from engineering was to disable the SIOC on all datastores and stop the StorageRM service on the host (/etc/init.d/storageRM stop).

    Apparently this workaround is not 100% effective because we had another outage last night.  Support suggested downgrading to 6.5 until 6.7 update 2 is ready.  A downgrade will be painful because we'll need to also downgrade the hardware version on a large number of VMs.  If I downgrade I'm not likely to go back to 6.7 knowing its history of instability.

    I'm ready to dump VMware for a more reliable virtualization platform.



  • 31.  RE: Strange Host Responsiveness Issues

    Posted Mar 21, 2019 12:54 PM

    Took another outage with SIOC and the StorageRM service disabled. We're now working on downgrading the hosts to 6.5u2. Migrating the VMs to the 6.5 hosts requires a reboot due to the lower EVC support level and downgrading the hardware version on them.

    Engineering conceded they have been working on this for many months and have not found a root cause, and it's affecting multiple customers. Update 2 is pushed until early April and will not include a fix. They're now hoping to have a fix by the time Update 3 comes out this summer.



  • 32.  RE: Strange Host Responsiveness Issues

    Posted Mar 09, 2019 06:28 PM

    Upgrade the hosts to 6.5u2



  • 33.  RE: Strange Host Responsiveness Issues

    Posted Mar 28, 2019 02:08 PM

    We are seeing the same issue on some of our 6.7EP6 hosts. The recommendations we have gotten from VMware support are:

    1. Disable ATS heartbeat

    2. Upgrade drivers/fw on HBA (Emulex)

    3. Migrate from VMFS5 to VMFS6 datastores

    4. Upgrade NIC drivers to latest

    They also told me yesterday (27th of March) that a fix would be in place in 6.7U2 and that it will most likely be released within 4 weeks.

    EDIT: We are now trying with EP7 on the affected hosts to see if that helps. No specific fixes for this are mentioned in the release notes, though.



  • 34.  RE: Strange Host Responsiveness Issues

    Posted Apr 04, 2019 05:52 PM

    VMware finally published a KB for this defect.  https://kb.vmware.com/s/article/67543

    The fix will NOT be in 6.7 update 2.



  • 35.  RE: Strange Host Responsiveness Issues

    Posted Apr 09, 2019 03:12 PM

    I came across the below KB that talks about a different defect in 6.7 affecting Dell EMC SC Storage. We use Dell EMC SC storage, so this may be part of the equation.

    ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios

    https://kb.vmware.com/s/article/67006



  • 36.  RE: Strange Host Responsiveness Issues

    Posted Apr 24, 2019 11:25 AM

    We also have SC storage. But after implementing all the mentioned changes we haven't had this problem anymore...



  • 37.  RE: Strange Host Responsiveness Issues

    Posted May 08, 2019 12:01 PM

    The VMware engineering team is working on this issue, and hopefully a permanent fix will be included in upcoming versions.

    For a temporary fix, please follow the workaround steps mentioned in VMware KB - https://kb.vmware.com/s/article/67543

    Regards

    Jitendra Singh



  • 38.  RE: Strange Host Responsiveness Issues

    Posted Aug 20, 2019 06:48 PM

    I experienced the same issues in 6.7u2 and found this post. I reverted back to 6.5 and all the issues disappeared. Has anyone tried update 3 to see if it's fixed?



  • 39.  RE: Strange Host Responsiveness Issues

    Posted Aug 27, 2019 08:17 PM

    It isn't fixed.  I'm running 10 hosts and all have the same issues as of 6.7 U3, so the problem persists even though the fix was supposed to be in U3.  I'm not looking forward to downgrading all my hosts; I AM looking forward to dumping VMware and going with a Microsoft virtual environment.  I've had enough of losing VMs and having zero access to the ESXi hosts when it comes to trying to recover them.  I wish I had never upgraded to 6.7, it's a POS.

    RJB



  • 40.  RE: Strange Host Responsiveness Issues

    Posted Oct 08, 2019 08:18 PM

    Hey all - Does anyone know if this error is still occurring in 6.7.3 or has it been hotfixed since?

    Cheers!



  • 41.  RE: Strange Host Responsiveness Issues

    Posted Oct 09, 2019 03:22 PM

    This issue is fixed in 6.7 Update 2

    VMware ESXi 6.7 Update 2 Release Notes

    PR 2235031: An ESXi host becomes unresponsive and you see warnings for reached maximum heap size in the vmkernel.log

    Due to a timing issue in the VMkernel, buffers might not be flushed, and the heap gets exhausted. As a result, services such as hostd, vpxa and vmsyslogd might not be able to write logs on the ESXi host, and the host becomes unresponsive. In the /var/log/vmkernel.log, you might see a similar warning: WARNING: Heap: 3571: Heap vfat already at its maximum size. Cannot expand.

    This issue is resolved in this release.
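    To check whether a host is hitting this particular heap warning rather than the SIOC problem discussed earlier in the thread, the log can be searched for the message quoted in the release notes. A minimal sketch with a made-up function name; it assumes the warning keeps the exact wording shown above.

```shell
#!/bin/sh
# List heaps that logged "already at its maximum size", with a count each.
# exhausted_heaps is a hypothetical helper name, not a VMware tool.
exhausted_heaps() {
    # $1: path to a vmkernel log file (for example /var/log/vmkernel.log)
    sed -n 's/.*WARNING: Heap: [0-9]*: Heap \([^ ]*\) already at its maximum size.*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

    No output means the warning is absent from that log file, which would point away from PR 2235031.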

    A host can become unresponsive for multiple reasons. If you experience this issue in 6.7 U2 and above, GSS needs to validate whether you are hitting the same issue. Most likely it is a different issue, I believe.

    Thanks,

    MS



  • 42.  RE: Strange Host Responsiveness Issues

    Posted Oct 09, 2019 09:35 AM

    Please provide the vmkernel.log and hostd.log files,

    and the exact date/time when the server stopped responding.

    What is the back-end storage (boot from SAN or local)?

    What is the hardware type?