ESXi

  • 1.  ESX host unresponsive

    Posted Jan 27, 2015 09:41 PM

    Hi,

    I have a problem.

    A customer has 3 IBM HS23 blades.

    Over the past 7 months, all ESX hosts have become offline/unresponsive 4 times.

    All VMs keep running, so production keeps running.

    The ESX hosts don't respond to SSH or telnet connections.

    I can log in after pressing F2 and F12, but then the host just hangs.

    I can't log in to the ESX hosts with the VI client either.

    All hosts are installed on USB disks from IBM.

    The only way to resolve it is to power off the blades and start them again.

    It's also random: the hosts have run fine for stretches of 3 months, 1 day, and 3 weeks.

    The vpxd.log file only shows timeouts communicating with the ESX hosts, and then the offline error appears.

    I have already updated all firmware (chassis, blades, storage, Brocade fiber and IBM switches).

    I started with vSphere 5.5 U1 and ESXi 5.5 U1 from IBM, and have already updated to vSphere 5.5 U2 and ESXi 5.5 U2 from VMware (since IBM doesn't have a custom ESXi 5.5 U2).

    I'm a little stuck. The last event was on 16/01/2015 (and IBM won't give support, since there is only a 3-year subscription and no software support contract).

    One last point: I'm running Veeam Backup & Replication 7.x, but I don't see a problem there.

    Thanks in advance.



  • 2.  RE: ESX host unresponsive

    Posted Jan 29, 2015 05:31 PM

    Hi vervoort jurgen,

    This reminds me of a similar issue I faced in the past on vSphere 5.0. Here are the details from 2 years ago:

    Exhausting inodes + Disconnected Host

    Re: Free INODES and % free RAMDISK

    I eventually got a hotfix from VMware to address this, but in the interim I tweaked the script to monitor inodes and ramdisk and email me every time thresholds were reached (so I could react before there was an outage). I can dig this out if you need it.

    I also remember the alerts being generated after a support bundle was created which filled up the TMP volume.
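
    A minimal sketch of that kind of threshold check, for illustration only. This is not the original script: the `RamdiskUsage` record, the `threshold_alerts` function, and the 80% defaults are all assumptions, and collecting the numbers from the host and sending the email are left out.

```python
from dataclasses import dataclass


@dataclass
class RamdiskUsage:
    """Hypothetical record for one ramdisk; fill it from whatever the host reports."""
    name: str
    used_bytes: int
    max_bytes: int
    used_inodes: int
    max_inodes: int


def threshold_alerts(disks, space_pct=80.0, inode_pct=80.0):
    """Return one alert string per ramdisk metric at or above its threshold."""
    alerts = []
    for d in disks:
        # Guard against zero maximums so an empty record never divides by zero.
        space = 100.0 * d.used_bytes / d.max_bytes if d.max_bytes else 0.0
        inodes = 100.0 * d.used_inodes / d.max_inodes if d.max_inodes else 0.0
        if space >= space_pct:
            alerts.append(f"{d.name}: ramdisk {space:.0f}% full")
        if inodes >= inode_pct:
            alerts.append(f"{d.name}: inodes {inodes:.0f}% used")
    return alerts


if __name__ == "__main__":
    # Made-up sample values, not taken from any real host.
    sample = [RamdiskUsage("tmp", 90, 100, 10, 100)]
    for alert in threshold_alerts(sample):
        print(alert)
```

    Run on a schedule, a check like this would warn before a full ramdisk or exhausted inodes takes the management agents down.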

    It would be interesting to know whether you have the same issue.

    Cheers,

    Jon



  • 3.  RE: ESX host unresponsive

    Posted Jan 30, 2015 01:15 AM

    Is the microcode up-to-date?



  • 4.  RE: ESX host unresponsive

    Posted Jan 30, 2015 02:47 PM

    We could probably start with the hostd and vmkernel logs. Could you please upload them?

    An APD (All Paths Down) condition caused by storage device loss can also cause this behaviour. To confirm, check whether the vmkernel logs contain messages such as "PERM LOSS" or "failed with status Device is permanently unavailable".
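
    A rough way to search an exported vmkernel log for those messages is sketched below. The marker strings are only the ones quoted in this post, and the exact wording may differ between ESXi builds; the function name and the command-line usage are assumptions.

```python
import re
import sys

# Marker strings quoted in this thread; real vmkernel wording may vary by build.
MARKERS = ("PERM LOSS", "Device is permanently unavailable")
PATTERN = re.compile("|".join(re.escape(m) for m in MARKERS), re.IGNORECASE)


def find_storage_loss_lines(lines):
    """Return (line_number, text) for every line that contains one of the markers."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if PATTERN.search(line)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python scan_vmkernel.py /path/to/vmkernel.log
    with open(sys.argv[1], errors="replace") as log:
        for number, text in find_storage_loss_lines(log):
            print(f"{number}: {text}")
```

    No hits does not rule out a storage problem; it only means these particular messages were not logged in the file that was searched.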



  • 5.  RE: ESX host unresponsive

    Posted Jan 30, 2015 03:17 PM

    Yes. Sounds like a hostd hang.

    Check /var/log/syslog.log, vobd.log, vmkwarning.log and hostd.log for entries from just before the hang occurred.

    Could be a memory leak, full ramdisk, no free inodes (like the ones mentioned above), or PDL/APD (storage not available).

    Perhaps stop any 3rd party SW on the host (like HW monitoring etc.).



  • 6.  RE: ESX host unresponsive

    Posted Jan 30, 2015 08:47 PM

    Yes, all microcode is up to date.

    IBM confirms there are no errors on the hardware.

    I'm attaching the log files.

    The last error occurred on 27-01-2015 at 20:20. I had to reboot the hosts because the customer needs to be able to manage them.

    Now I'm thinking it could be the CPU load.

    In production I have 70% CPU load.

    Also, I think the CPU is underpowered: an E5-2609 at 2.4 GHz.

    The monitoring software is stopped; only the Veeam backups run, from 18:00 until 24:00.

    McAfee MOVE is also active.

    I don't see the errors you all mention.

    Any suggestions?



  • 7.  RE: ESX host unresponsive

    Posted Jan 31, 2015 01:09 AM

    Are you able to connect via the console and use esxtop to view the network stats for dropped packets? Can you also check the config on the management ports? I had this with a faulty NIC that negotiated down from 1Gb to 100Mb.



  • 8.  RE: ESX host unresponsive

    Posted Feb 08, 2015 09:28 PM

    Hello,

    An update:

    I now have 2 vSphere environments with this problem,

    so I compared the logs.

    I think it's the iSCSI datastore that makes my hosts unresponsive.

    I've been reading on the internet, and a lot of people seem to have had problems since Update 2?

    A lot of iSCSI storage devices aren't supported anymore?

    Anyway, I have a case open with Veeam, because I use my iSCSI storage for replicas most of the time.

    Hopefully they can confirm my findings.



  • 9.  RE: ESX host unresponsive

    Posted Feb 09, 2015 07:43 AM

    You mentioned a reboot on 27-01-2015 at 20:20.

    But vmksummary.log does not show any reboot on 27 Jan:

    2015-01-29T19:03:36Z bootstop: Host has booted

    2015-01-29T21:56:21Z bootstop: Host is rebooting

    2015-01-29T22:02:46Z bootstop: Host has booted

    There are no errors in the logs around this time. Some logs have already been rotated, and the older ones are in /var/run/log.
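
    To double-check a claimed reboot date against vmksummary.log, the bootstop entries can be listed per day. This is a small sketch that assumes the line format of the three entries quoted above; the function names are made up for illustration.

```python
from datetime import date, datetime

# The three bootstop entries quoted in this post.
SAMPLE = [
    "2015-01-29T19:03:36Z bootstop: Host has booted",
    "2015-01-29T21:56:21Z bootstop: Host is rebooting",
    "2015-01-29T22:02:46Z bootstop: Host has booted",
]


def parse_bootstop(lines):
    """Return (timestamp, message) for every 'bootstop:' line; other lines are skipped."""
    events = []
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[1] == "bootstop:":
            stamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%SZ")
            events.append((stamp, parts[2]))
    return events


def events_on(events, day):
    """Keep only the events whose timestamp falls on the given calendar day."""
    return [event for event in events if event[0].date() == day]


if __name__ == "__main__":
    events = parse_bootstop(SAMPLE)
    for day in (date(2015, 1, 27), date(2015, 1, 29)):
        print(day, len(events_on(events, day)), "bootstop event(s)")
```

    The timestamps in the log end in Z (UTC), so an event late in the evening local time can land on a different calendar day than expected.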

    The only thing is this in vmkwarning.log:

    2015-01-29T13:36:00.706Z cpu1:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac7 (refITT=0x68fcd) timed out.

    2015-01-29T13:36:12.709Z cpu2:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac8 (refITT=0x68fcd) timed out.

    2015-01-29T13:36:24.712Z cpu3:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac9 (refITT=0x68fcd) timed out.

    This repeats every 12 seconds: the vmkernel tries to abort a SCSI task and the abort times out.

    Yes, you should investigate the storage and the iSCSI setup to check that everything is correct there.

    Check if those lines come up every time the servers got stuck.
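
    One way to check this across several log files is to pull out the "Abort Task" timeouts and look at their spacing. The sketch below assumes the line format of the three warnings quoted above; the regular expression and function names are illustrative, not an official parser.

```python
import re
from datetime import datetime

# Matches lines shaped like the iscsi_vmk warnings quoted in this post.
ABORT_RE = re.compile(
    r'^\s*(\S+Z) .*iscsivmk_TaskMgmtIssue: (\S+) .*"Abort Task".*timed out'
)

SAMPLE = [
    '2015-01-29T13:36:00.706Z cpu1:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac7 (refITT=0x68fcd) timed out.',
    '2015-01-29T13:36:12.709Z cpu2:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac8 (refITT=0x68fcd) timed out.',
    '2015-01-29T13:36:24.712Z cpu3:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac9 (refITT=0x68fcd) timed out.',
]


def abort_timeouts(lines):
    """Return (timestamp, adapter) for every 'Abort Task ... timed out' warning."""
    hits = []
    for line in lines:
        match = ABORT_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
            hits.append((stamp, match.group(2)))
    return hits


def intervals_seconds(hits):
    """Seconds between consecutive hits, in the order they appear in the log."""
    times = [stamp for stamp, _ in hits]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    hits = abort_timeouts(SAMPLE)
    print(len(hits), "abort timeouts, intervals in seconds:", intervals_seconds(hits))
```

    If the same pattern shows up shortly before each hang, that points at the iSCSI target rather than at the hosts.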

    Besides that there is nothing else in the logs pointing to any problem.

    Check /var/run/log as well, since some log files in /var/log have already been rotated.



  • 10.  RE: ESX host unresponsive
    Best Answer

    Posted Feb 11, 2015 05:59 PM

    I found the problem.

    I restarted only the iSCSI datastore, and all the ESXi hosts became responsive again.

    After searching the QNAP forum, I noticed that they released an update on 29/01/2015.

    I'm testing the firmware now to see what happens.

    If this fails, I'm guessing the QNAP TS-469L isn't supported anymore for vSphere 5.5.

    Thanks, all, for the suggestions.