ESXi

  • 1.  Host going into not responding state

    Posted Aug 13, 2014 08:01 AM

    Hi

    Can somebody help me find a solution for the 'Host is not responding' state in my VMware ESX 4.1 cluster?

    We have 3 hosts, and this Saturday one of them, ESX03, went into this state for roughly 30-40 seconds. I have HA monitoring enabled, but the VMs didn't migrate; they only went into a disconnected state.

    _________________________________________________________________________________________________________________-

    vpxd.log

    [2014-08-10 12:46:32.500 05356 info 'App' opID=HB-host-59@108335] [VpxdHostSync] Synchronizing host: esx03.xxx.local (xxx.xxx.xxx.xxx)

    [2014-08-10 12:46:53.503 05320 error 'App'] [VpxdVmomi] Got vmacore exception: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

    [2014-08-10 12:46:53.507 05320 error 'App'] [VpxdVmomi] Backtrace:

    backtrace[00] rip 000000018010a1aa Vmacore::System::Stacktrace::CaptureWork

    backtrace[01] rip 00000001800e8018 Vmacore::System::SystemFactoryImpl::CreateFileWriter

    backtrace[02] rip 00000001800e850e Vmacore::System::SystemFactoryImpl::CreateQuickBacktrace

    backtrace[03] rip 0000000180077ed8 Vmacore::Throwable::Throwable

    backtrace[04] rip 0000000180108603 Vmacore::SystemException::SystemException

    backtrace[05] rip 000000018011a96a Vmacore::System::GetThisThread

    backtrace[06] rip 00000001801237ac Vmacore::System::IsEnlisted

    backtrace[07] rip 0000000180119915 Vmacore::System::GetThreadId

    backtrace[08] rip 0000000074042fdf endthreadex

    backtrace[09] rip 0000000074043080 endthreadex

    backtrace[10] rip 0000000077b059ed BaseThreadInitThunk

    backtrace[11] rip 0000000077c3c541 RtlUserThreadStart

    [2014-08-10 12:46:53.594 05356 warning 'VpxProfiler' opID=HB-host-59@108335] [VpxdHostSync] GetChanges host:esx03.xxx.local (xxx.xx.xx.xxx) took 21093 ms

    [2014-08-10 12:46:53.595 05356 warning 'VpxProfiler' opID=HB-host-59@108335] [VpxdHostSync] DoHostSync:0000000007F03830 took 21095 ms

    [2014-08-10 12:46:53.650 05356 warning 'VpxProfiler' opID=HB-host-59@108335] InvtHostSyncLRO::StartWork took 21150 ms

    [2014-08-10 12:46:53.650 05356 info 'App' opID=HB-host-59@108335] [VpxLRO] -- FINISH task-internal-70893 -- host-59 -- VpxdInvtHostSyncHostLRO.Synchronize --

    [2014-08-10 12:46:53.745 05360 info 'App' opID=task-internal-70894-2fb7cf1b] [VpxLRO] -- BEGIN task-internal-70894 --  -- ScheduledTaskLRO --

    [2014-08-10 12:46:53.751 05376 info 'App' opID=task-internal-70895-872363d3] [VpxLRO] -- BEGIN task-internal-70895 --  -- ScheduledTaskLRO --

    [2014-08-10 12:46:53.751 05320 info 'App' opID=task-internal-70896-c4f3efd3] [VpxLRO] -- BEGIN task-internal-70896 --  -- ScheduledTaskLRO --

    [2014-08-10 12:46:53.754 05376 info 'App' opID=task-internal-70895-872363d3] [VpxLRO] -- FINISH task-internal-70895 --  -- ScheduledTaskLRO --

    [2014-08-10 12:46:53.831 05360 info 'App' opID=task-internal-70894-2fb7cf1b] [VpxLRO] -- FINISH task-internal-70894 --  -- ScheduledTaskLRO --

    [2014-08-10 12:46:53.837 05320 info 'App' opID=task-internal-70896-c4f3efd3] [VpxLRO] -- FINISH task-internal-70896 --  -- ScheduledTaskLRO --

    [2014-08-10 12:46:54.228 05364 error 'App'] SSLStreamImpl::DoServerHandshake (0000000011502ac0) SSL_accept failed. Dumping SSL error queue:

    [2014-08-10 12:46:54.228 05364 error 'App'] [0] error:14094416:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate unknown

    [2014-08-10 12:46:54.228 05364 warning 'ProxySvc'] SSL Handshake failed for stream TCPStreamWin32(socket=TCP(fd=3600) local=xxx.xxx.xxx.xxx:443,  peer=172.16.254.150:56357), error=SSL Exception: error:14094416:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate unknown

    [2014-08-10 12:47:00.936 05396 info 'StackTracer'] [5396] Enter DAS_PROFILE UpdateDasStatus

    [2014-08-10 12:47:00.936 05396 info 'App'] [VpxdDas] Bad primary esx03: connected=0, dasState=running, vpxaDasState=running

    [2014-08-10 12:47:00.936 05396 info 'StackTracer'] [5396] Exit DAS_PROFILE UpdateDasStatus (0 ms)

    ____________________________________________________________________________________________________________________

    Today I had the same situation, this time with ESX01: it also lost communication with vCenter for about 30 seconds. The logs again show that vCenter was synchronizing host ESX03 and got an SSL error, and after 30 s it started to connect the VMs again.

    _____________________________________________________________________________________________________________________

    vpxd.log:

    [2014-08-13 04:35:23.616 05356 info 'App' opID=HB-host-9@71365] [VpxdHostSync] Synchronizing host: esx01.xxx.local (xxx.xxx.xxx.xxx)

    [2014-08-13 04:35:25.374 05352 error 'App'] SSLStreamImpl::DoServerHandshake (00000000122fdbe0) SSL_accept failed. Dumping SSL error queue:

    [2014-08-13 04:35:25.374 05352 error 'App'] [0] error:14094416:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate unknown

    [2014-08-13 04:35:25.374 05352 warning 'ProxySvc'] SSL Handshake failed for stream TCPStreamWin32(socket=TCP(fd=5332) local=xxx.xxx.xxx.xxx:443,  peer=172.16.254.150:65346), error=SSL Exception: error:14094416:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate unknown

    [2014-08-13 04:35:37.969 06120 info 'App' opID=e21cb95e] [VpxLRO] -- BEGIN task-internal-113983 --  -- vim.ServiceInstance.retrieveContent -- 6F48A6D2-2178-40B9-9C45-696910E711D6(D797B092-0367-4925-A153-E5D84D976F36)

    [2014-08-13 04:35:37.969 06120 info 'App' opID=e21cb95e] [VpxLRO] -- FINISH task-internal-113983 --  -- vim.ServiceInstance.retrieveContent -- 6F48A6D2-2178-40B9-9C45-696910E711D6(D797B092-0367-4925-A153-E5D84D976F36)

    [2014-08-13 04:35:37.985 04708 info 'App' opID=c75bfd1f] [VpxLRO] -- BEGIN task-internal-113984 --  -- vim.host.DiskManager.renewAllLeases -- 8173FFB3-A3DA-44FF-AEF1-428A0C3E485D(D8F649D6-5925-4909-92BE-602B0BDB58C4)

    [2014-08-13 04:35:37.995 04708 info 'App' opID=c75bfd1f] [VpxLRO] -- FINISH task-internal-113984 --  -- vim.host.DiskManager.renewAllLeases -- 8173FFB3-A3DA-44FF-AEF1-428A0C3E485D(D8F649D6-5925-4909-92BE-602B0BDB58C4)

    [2014-08-13 04:35:44.617 04984 error 'App'] [VpxdVmomi] Got vmacore exception: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

    [2014-08-13 04:35:44.621 04984 error 'App'] [VpxdVmomi] Backtrace:

    backtrace[00] rip 000000018010a1aa Vmacore::System::Stacktrace::CaptureWork

    backtrace[01] rip 00000001800e8018 Vmacore::System::SystemFactoryImpl::CreateFileWriter

    backtrace[02] rip 00000001800e850e Vmacore::System::SystemFactoryImpl::CreateQuickBacktrace

    backtrace[03] rip 0000000180077ed8 Vmacore::Throwable::Throwable

    backtrace[04] rip 0000000180108603 Vmacore::SystemException::SystemException

    backtrace[05] rip 000000018011a96a Vmacore::System::GetThisThread

    backtrace[06] rip 00000001801237ac Vmacore::System::IsEnlisted

    backtrace[07] rip 0000000180119915 Vmacore::System::GetThreadId

    backtrace[08] rip 0000000074042fdf endthreadex

    backtrace[09] rip 0000000074043080 endthreadex

    backtrace[10] rip 0000000077b059ed BaseThreadInitThunk

    backtrace[11] rip 0000000077c3c541 RtlUserThreadStart

    [2014-08-13 04:35:44.631 05356 warning 'VpxProfiler' opID=HB-host-9@71365] [VpxdHostSync] GetChanges host:esx01.xxx.local (xxx.xxx.xxx.xxx) took 21014 ms

    [2014-08-13 04:35:44.631 05356 warning 'VpxProfiler' opID=HB-host-9@71365] [VpxdHostSync] DoHostSync:0000000007F01930 took 21015 ms

    [2014-08-13 04:35:44.659 05356 warning 'VpxProfiler' opID=HB-host-9@71365] InvtHostSyncLRO::StartWork took 21043 ms

    [2014-08-13 04:35:44.659 05356 info 'App' opID=HB-host-9@71365] [VpxLRO] -- FINISH task-internal-113982 -- host-9 -- VpxdInvtHostSyncHostLRO.Synchronize --

    [2014-08-13 04:35:45.215 05396 info 'App' opID=task-internal-113985-a845ed4b] [VpxLRO] -- BEGIN task-internal-113985 --  -- ScheduledTaskLRO --

    [2014-08-13 04:35:45.307 05396 info 'App' opID=task-internal-113985-a845ed4b] [VpxLRO] -- FINISH task-internal-113985 --  -- ScheduledTaskLRO --

    [2014-08-13 04:35:46.229 05396 info 'App' opID=task-internal-113986-174c2d26] [VpxLRO] -- BEGIN task-internal-113986 --  -- ScheduledTaskLRO --

    [2014-08-13 04:35:46.229 05376 info 'App' opID=task-internal-113987-269d9b7] [VpxLRO] -- BEGIN task-internal-113987 --  -- ScheduledTaskLRO --

    [2014-08-13 04:35:46.232 05396 info 'App' opID=task-internal-113986-174c2d26] [VpxLRO] -- FINISH task-internal-113986 --  -- ScheduledTaskLRO --

    [2014-08-13 04:35:46.293 05376 info 'App' opID=task-internal-113987-269d9b7] [VpxLRO] -- FINISH task-internal-113987 --  -- ScheduledTaskLRO --

    [2014-08-13 04:35:46.449 05376 info 'StackTracer'] [5376] Enter DAS_PROFILE UpdateDasStatus

    [2014-08-13 04:35:46.449 05376 info 'App'] [VpxdDas] Bad primary esx01: connected=0, dasState=running, vpxaDasState=running

    [2014-08-13 04:35:46.449 05376 info 'StackTracer'] [5376] Exit DAS_PROFILE UpdateDasStatus (0 ms)

    [2014-08-13 04:35:53.646 05376 info 'App' opID=HB-host-9@71365] [VpxLRO] -- BEGIN task-internal-113988 -- host-9 -- VpxdInvtHostSyncHostLRO.Synchronize --

    [2014-08-13 04:35:53.647 05376 info 'App' opID=HB-host-9@71365] [VpxdHostSync] Synchronizing host: esx01.xxx.local (xxx.xxx.xxx.xxx)

    [2014-08-13 04:35:54.333 05316 info 'App' opID=HB-host-59@110899] [VpxLRO] -- BEGIN task-internal-113989 -- host-59 -- VpxdInvtHostSyncHostLRO.Synchronize --

    [2014-08-13 04:35:54.335 05316 info 'App' opID=HB-host-59@110899] [VpxdHostSync] Synchronizing host: esx03.ssm.local (xxx.xxx.xxx.xxx)

    [2014-08-13 04:35:55.355 05316 info 'App' opID=HB-host-59@110899] [VpxdHostSync] Retrieved host update to 110899

    [2014-08-13 04:35:55.356 05316 info 'App' opID=HB-host-59@110899] [VpxdHostSync] Completed host synchronization

    [2014-08-13 04:35:55.357 05316 info 'App' opID=HB-host-59@110899] [VpxdHostSpecSync] saving host update to vpxaMasterSpecGenNo = 3604, vpxdMasterSpecGenNo = 3605

    [2014-08-13 04:35:55.357 05316 info 'App' opID=HB-host-59@110899] [VpxdHostSpecSync] Updating vpxa for esx03.xxx.local (spec. genno 3605)

    [2014-08-13 04:35:55.366 05316 info 'App' opID=HB-host-59@110899] [VpxLRO] -- FINISH task-internal-113989 -- host-59 -- VpxdInvtHostSyncHostLRO.Synchronize --

    [2014-08-13 04:35:55.537 05376 info 'App' opID=HB-host-9@71365] [VpxdHostSync] Retrieved host update to 71365

    [2014-08-13 04:35:55.711 05376 info 'App' opID=HB-host-9@71365] [VpxdMoHost::UpdateConfigInt] Marked esx01.xxx.local as dirty.

    I don't know what the cause might be. I have checked the firewall ports, pings, and DNS. The interesting thing is that in both cases synchronization took about 21000 ms (see the VpxProfiler warnings in the logs above), whereas normally it takes less than 1000 ms. Is there a time limit on synchronization?
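
    To see how often synchronization is slow, the VpxProfiler "took N ms" warnings can be pulled out of vpxd.log. The following is a minimal Python sketch, assuming the log lines have the same layout as the excerpts above; the function name `slow_operations` and the 1000 ms default threshold are illustrative choices, not anything defined by vCenter.

```python
import re

# Matches VpxProfiler warnings of the form seen in the excerpts above, e.g.
# [2014-08-10 12:46:53.650 05356 warning 'VpxProfiler' opID=...] InvtHostSyncLRO::StartWork took 21150 ms
SLOW_OP = re.compile(
    r"^\s*\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*?'VpxProfiler'.*?\]\s*"
    r"(?P<op>.*?) took (?P<ms>\d+) ms\s*$"
)


def slow_operations(lines, threshold_ms=1000):
    """Return (timestamp, operation, milliseconds) for each profiler warning
    whose duration is at least threshold_ms (threshold is an assumption)."""
    hits = []
    for line in lines:
        match = SLOW_OP.match(line)
        if match and int(match.group("ms")) >= threshold_ms:
            hits.append(
                (match.group("ts"), match.group("op").strip(), int(match.group("ms")))
            )
    return hits
```

    It can be run over an open log file, for example `slow_operations(open(path_to_vpxd_log, errors="replace"))`, where `path_to_vpxd_log` is wherever vpxd.log lives on the vCenter Server. Comparing the timestamps of the hits with the times the hosts went Not Responding shows whether the two always coincide.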



  • 2.  RE: Host going into not responding state

    Posted Aug 13, 2014 12:44 PM

    Hi,

    Try checking the following:

    - whether the time on your host is synchronized with the vCenter Server

    - the /etc/hosts file entries

    - which vCenter release you are using

    - whether you are using self-signed SSL certificates

    - that no other software is interfering with port 902, which is used by vpxa

    - host connectivity with ping, watching for any packet loss

    - the host firewall rules for port 902

    - free disk space on the host partitions, with this command:

    # df -h
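
    The port checks in the list above can be partly scripted. The following is a minimal Python sketch that tests whether a TCP connection to a given port succeeds; the function name `tcp_port_open` is hypothetical, and the host name you pass in is your own. It only covers TCP, so it cannot confirm that UDP packets on port 902 get through.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This checks TCP reachability only; it says nothing about UDP traffic
    on the same port number.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

    For example, `tcp_port_open("esx03.xxx.local", 902)` run from the vCenter Server would show whether TCP port 902 on that host is reachable, with the placeholder host name replaced by the real one.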



  • 3.  RE: Host going into not responding state

    Posted Aug 13, 2014 05:36 PM

    An ESXi/ESX host connects to vCenter Server, but after several seconds, it becomes disconnected or changes to a Not Responding state.
    Trying to reconnect fails and the host changes to an unresponsive state.
    Direct connection to the ESXi/ESX host with the vSphere Client works normally.
    Ping and other network connectivity testing works without any issues.
    There are no abusive session timeout configurations on the firewall.
    hostd and vpxa processes and logs do not show any responsiveness issues.
    Solution
    This issue may occur in these scenarios:

    A firewall prevents port 902 UDP heartbeats from the ESXi/ESX host from reaching the vCenter Server.

    This issue occurs when the ESXi/ESX host firewall or another firewall between the ESXi/ESX host and vCenter Server is configured to filter the UDP packets.

    Note: ESXi automatically opens its firewall when the vCenter Server agent is installed.

    To resolve this issue:


    For an ESX firewall:


    Run this command to temporarily disable the ESX firewall and determine if it is blocking the port 902 UDP heartbeat packets:

    service firewall stop


    Try connecting the ESX host to vCenter Server. If it stays connected, there is a misconfiguration/corruption of the firewall. For more information, see 1001081.


    If the ESX host continues to appear Disconnected or Not Responding, run Ethereal (since renamed Wireshark) on the vCenter Server NIC to confirm whether the port 902 UDP packets arrive.


    For a Windows or third-party firewall:


    Disable the firewall.
    Try connecting the ESX host to vCenter Server. If it stays connected, there is a misconfiguration/corruption of the firewall. Ensure that the firewall is not blocking UDP port 902.

    Network Address Translation (NAT) between vCenter Server and the ESXi/ESX host is preventing port 902 UDP heartbeats from arriving at vCenter Server.

    Note: Network Address Translation (NAT) is not a supported network configuration. For more information, see the vCenter Server Prerequisites section in the VMware vSphere Installation Guide for your version.


    Starting with VirtualCenter 2.0.1 patch 2, an explicit heartbeat IP address can be advertised by VirtualCenter. To do so:


    Add a managedip entry to the vpxd.cfg configuration file and include the IP address of vCenter Server.

    Note: By default, the vpxd.cfg file is located at:

    C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter.

    Example:

    <vpxd>
    ...
    <managedip>IP_address</managedip>
    ...
    </vpxd>

    Where IP_address is the actual IP address of the VirtualCenter server.


    Restart the VMware VirtualCenter Server service after adding the entry.
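
    Before restarting the service, it can be worth confirming that the entry parses and holds the intended address. The following is a minimal Python sketch, assuming vpxd.cfg is well-formed XML containing a managedip element as in the example above; the function name `read_managed_ip` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_managed_ip(cfg_text):
    """Return the text of the first <managedip> element, or None if absent.

    The element is searched for at any depth, so this works whether the
    <vpxd> block is the document root or nested inside another element.
    """
    root = ET.fromstring(cfg_text)
    node = next(root.iter("managedip"), None)
    if node is None or node.text is None:
        return None
    return node.text.strip()
```

    Reading the file contents and passing them in, for example `read_managed_ip(open(cfg_path).read())` with `cfg_path` pointing at your vpxd.cfg, should return the IP address you entered. A parse error here means the file is no longer valid XML and should be corrected before the restart.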


    For VirtualCenter 2.5 and later, the IP address to use for heartbeats can be specified in the VMware Infrastructure Client. To do so:


    Log in to the VirtualCenter server using the VI or vSphere Client.
    Click Administration > VirtualCenter Management Server Configuration.
    Click Runtime Settings.
    Set Managed IP Address to the IP address to which the ESX host should send the heartbeats.



  • 4.  RE: Host going into not responding state

    Posted Sep 23, 2014 07:36 PM

    I had this problem, and it turned out to be fixed by a Dell iDRAC firmware update. We installed the latest version and were good to go.