ESXi

  • 1.  ESXi host crashes random; no dump or log entries

    Posted Jan 12, 2019 11:51 AM

    Hi,

    My home lab host ran ESXi 5.5 for almost 6 years without a single crash. No hardware was changed; only the location changed, in August 2018. A couple of weeks ago I noticed that the host was rebooting regularly. Initially I thought it had something to do with an IPv6 rollout at the location, since IPv6 was known to be buggy in the version I was running. After first disabling IPv6, I eventually updated to ESXi 6.5. The host is still crashing, as you can see from the Xorg.log below. From the network switch log, I can tell that the host sometimes crashes multiple times before completing the reboot.

    2019-01-10T10:48:40Z mark: storage-path-claim-completed

    2019-01-10T14:10:44Z mark: storage-path-claim-completed

    2019-01-10T15:46:38Z mark: storage-path-claim-completed

    2019-01-10T17:03:56Z mark: storage-path-claim-completed

    2019-01-10T17:46:53Z mark: storage-path-claim-completed

    2019-01-10T18:12:50Z mark: storage-path-claim-completed

    2019-01-10T19:24:31Z mark: storage-path-claim-completed

    2019-01-10T19:51:06Z mark: storage-path-claim-completed

    2019-01-10T21:23:26Z mark: storage-path-claim-completed

    2019-01-10T23:30:24Z mark: storage-path-claim-completed

    2019-01-10T23:59:18Z mark: storage-path-claim-completed

    2019-01-11T00:29:38Z mark: storage-path-claim-completed

    2019-01-11T01:32:12Z mark: storage-path-claim-completed

    2019-01-11T02:19:19Z mark: storage-path-claim-completed

    2019-01-11T04:09:31Z mark: storage-path-claim-completed

    2019-01-11T05:35:51Z mark: storage-path-claim-completed

    2019-01-11T06:51:12Z mark: storage-path-claim-completed

    2019-01-11T07:17:11Z mark: storage-path-claim-completed

    2019-01-11T07:57:42Z mark: storage-path-claim-completed

    2019-01-11T08:30:11Z mark: storage-path-claim-completed

    2019-01-11T14:59:32Z mark: storage-path-claim-completed

    2019-01-11T15:37:45Z mark: storage-path-claim-completed

    2019-01-11T16:20:33Z mark: storage-path-claim-completed

    2019-01-11T16:49:22Z mark: storage-path-claim-completed

    2019-01-11T18:30:27Z mark: storage-path-claim-completed

    2019-01-11T20:23:01Z mark: storage-path-claim-completed

    2019-01-12T00:06:59Z mark: storage-path-claim-completed

    2019-01-12T00:34:22Z mark: storage-path-claim-completed

    2019-01-12T01:02:04Z mark: storage-path-claim-completed

    2019-01-12T09:10:54Z mark: storage-path-claim-completed

    2019-01-12T11:16:12Z mark: storage-path-claim-completed
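
    To judge whether the reboots follow a pattern, the boot markers above can be turned into gaps between consecutive boots. The following is only a minimal Python sketch, not something from the original post: the function names are hypothetical, and it assumes every line has the form "<UTC timestamp> mark: storage-path-claim-completed" as pasted above. Each gap is the time from one boot to the next, so it is an upper bound on the uptime before the crash.

```python
from datetime import datetime, timezone

# Marker text as it appears in the pasted log excerpt.
MARKER = "mark: storage-path-claim-completed"


def boot_times(lines):
    """Return the UTC timestamps of all boot-marker lines, in input order."""
    times = []
    for line in lines:
        line = line.strip()
        if not line.endswith(MARKER):
            continue  # skip blank lines and unrelated log entries
        stamp = line.split()[0]
        parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        times.append(parsed.replace(tzinfo=timezone.utc))
    return times


def boot_gaps(times):
    """Return the time between each boot marker and the next one."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# First three entries from the excerpt above.
sample = [
    "2019-01-10T10:48:40Z mark: storage-path-claim-completed",
    "",
    "2019-01-10T14:10:44Z mark: storage-path-claim-completed",
    "",
    "2019-01-10T15:46:38Z mark: storage-path-claim-completed",
]

gaps = boot_gaps(boot_times(sample))
for gap in gaps:
    print(gap)  # prints 3:22:04, then 1:35:54
```

    Feeding the complete list into boot_gaps would show whether the gaps cluster around a fixed interval (which would hint at something scheduled) or are spread out irregularly.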

    The crashes look as if someone had just unplugged the power cable. There is no purple screen, there are no dump files, and none of the logs contain any hint of what happened prior to the reboot.

    In vmkernel.log I noticed a few memory corrections, so to be sure I ran memtest86 for 72 hours. No errors were found.

    The server has a redundant power supply. I have currently disabled one module to see if it makes a difference; after 24 hours I will swap to the other module. It's a shot in the dark, especially since memtest could run for 72 hours. Other than this, I am running out of ideas, although I could remove some non-essential hardware from the host, such as the GFX card and the 2nd storage controller. But I would expect log entries if those were failing.

    Any input will be very much appreciated. Perhaps there are other logs that I am missing which contain info. Or can I enable more advanced logging?

    Thanks,

    Robbert

    Hardware:

    Motherboard: ASUS KGPE-D16 with 2x AMD Opteron 6134 / 8 core

    Memory: 12x 8GB DDR3 ECC Reg

    Main controller: IBM ServeRAID M5016 SAS/SATA -> LSI 9266-8i MegaRaid with SSD RAID10 and HDD RAID10

    2nd controller: Supermicro -USAS2-L8i 8-Port SAS/2 (pass through)



  • 2.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 12, 2019 12:25 PM

    Since there's no PSOD when these crashes occur, I'd suggest you start with creating a persistent scratch location (https://kb.vmware.com/s/article/1033696).

    Maybe the log files contain entries which help to determine what happened prior to the crash(es).


    André



  • 3.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 12, 2019 02:05 PM

    Hi André,

    Thanks for replying. I had already created a persistent scratch partition on the SSD RAID10 datastore. The logs in /var/log are being updated, but there are no dump files after the crashes.

    Via IPMI I can see the console remotely from where I am. I have watched several crashes on my screen, but they just look as if the power was toggled. I'm guessing the system does not get a chance to generate the dump files.

    Thanks,

    Robbert

    drwxr-xr-x    1 root     root           512 Jan 12 13:35 .

    drwxr-xr-x    1 root     root           512 Jan 12 13:35 ..

    -rw-------    1 root     root            13 Jan 12 13:35 .ash_history

    -r--r--r--    1 root     root            20 Jul  7  2017 .mtoolsrc

    lrwxrwxrwx    1 root     root            49 Jan 12 12:18 altbootbank -> /vmfs/volumes/715de0e4-ad245e89-1b34-6b0c39efb6a5

    drwxr-xr-x    1 root     root           512 Jan 12 12:18 bin

    lrwxrwxrwx    1 root     root            49 Jan 12 12:18 bootbank -> /vmfs/volumes/e46647ea-2f245bae-a60d-d5dc0ad390f0

    -r--r--r--    1 root     root        505736 Jul  7  2017 bootpart.gz

    drwxr-xr-x   13 root     root           512 Jan 12 13:35 dev

    drwxr-xr-x    1 root     root           512 Jan 12 13:18 etc

    drwxr-xr-x    1 root     root           512 Jan 12 12:18 lib

    drwxr-xr-x    1 root     root           512 Jan 12 12:18 lib64

    -r-x------    1 root     root         21439 Jan 12 12:01 local.tgz

    lrwxrwxrwx    1 root     root             6 Jan 12 12:18 locker -> /store

    drwxr-xr-x    1 root     root           512 Jan 12 12:18 mbr

    drwxr-xr-x    1 root     root           512 Jan 12 12:18 opt

    drwxr-xr-x    1 root     root        131072 Jan 12 13:35 proc

    lrwxrwxrwx    1 root     root            23 Jan 12 12:18 productLocker -> /locker/packages/6.5.0/

    lrwxrwxrwx    1 root     root             4 Jul  7  2017 sbin -> /bin

    lrwxrwxrwx    1 root     root            57 Jan 12 12:18 scratch -> /vmfs/volumes/548cb1c4-30d22a56-b3a7-bcaec527ae9b/.locker

    lrwxrwxrwx    1 root     root            49 Jan 12 12:18 store -> /vmfs/volumes/527fd478-f2ea3747-127a-bcaec527ae9b

    drwxr-xr-x    1 root     root           512 Jan 12 12:17 tardisks

    drwxr-xr-x    1 root     root           512 Jan 12 12:17 tardisks.noauto

    drwxrwxrwt    1 root     root           512 Jan 12 13:01 tmp

    drwxr-xr-x    1 root     root           512 Jan 12 12:17 usr

    drwxr-xr-x    1 root     root           512 Jan 12 12:18 var

    drwxr-xr-x    1 root     root           512 Jan 12 12:17 vmfs

    drwxr-xr-x    1 root     root           512 Jan 12 12:17 vmimages

    lrwxrwxrwx    1 root     root            18 Jul  7  2017 vmupgrade -> /locker/vmupgrade/

    Filesystem         Bytes          Used    Available Use% Mounted on

    VMFS-5      997774589952  774669074432 223105515520  78% /vmfs/volumes/IBM_RAID10_4SSD

    VMFS-5     1997965099008 1619378831360 378586267648  81% /vmfs/volumes/IBM_RAID10_4HDD

    VMFS-5     3000571527168 2138127204352 862444322816  71% /vmfs/volumes/SM_1HDD

    vfat           299712512        131072    299581440   0% /vmfs/volumes/527fd478-f2ea3747-127a-bcaec527ae9b

    vfat           261853184     172625920     89227264  66% /vmfs/volumes/715de0e4-ad245e89-1b34-6b0c39efb6a5

    vfat           261853184     162902016     98951168  62% /vmfs/volumes/e46647ea-2f245bae-a60d-d5dc0ad390f0

    [root@RJ-ESXi:/vmfs/volumes] ls -al

    total 3844

    drwxr-xr-x    1 root     root           512 Jan 12 13:59 .

    drwxr-xr-x    1 root     root           512 Jan 12 13:45 ..

    drwxr-xr-x    1 root     root             8 Jan  1  1970 527fd478-f2ea3747-127a-bcaec527ae9b

    drwxr-xr-t    1 root     root          2940 Aug 22 20:35 548cb1c4-30d22a56-b3a7-bcaec527ae9b

    drwxr-xr-t    1 root     root          1400 Nov 13  2016 54921312-3735e336-0618-bcaec527ae9b

    drwxr-xr-t    1 root     root          1680 Jan  5 20:12 54921344-dba16a0f-8aa6-bcaec527ae9b

    drwxr-xr-x    1 root     root             8 Jan  1  1970 715de0e4-ad245e89-1b34-6b0c39efb6a5

    lrwxr-xr-x    1 root     root            35 Jan 12 13:59 IBM_RAID10_4HDD -> 54921312-3735e336-0618-bcaec527ae9b

    lrwxr-xr-x    1 root     root            35 Jan 12 13:59 IBM_RAID10_4SSD -> 548cb1c4-30d22a56-b3a7-bcaec527ae9b

    lrwxr-xr-x    1 root     root            35 Jan 12 13:59 SM_1HDD -> 54921344-dba16a0f-8aa6-bcaec527ae9b

    drwxr-xr-x    1 root     root             8 Jan  1  1970 e46647ea-2f245bae-a60d-d5dc0ad390f0
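
    The two listings above can be cross-checked to confirm where /scratch actually lives. The following is only a minimal Python sketch under stated assumptions: the helper names are hypothetical, and it works purely on the pasted "ls -al" text (lines containing "name -> target"), not on a live host.

```python
import re

# Matches the "name -> target" tail of a symlink line in `ls -al` output.
LINK = re.compile(r"(\S+) -> (\S+)\s*$")


def parse_links(listing):
    """Map each symlink name to its target, ignoring non-symlink lines."""
    links = {}
    for line in listing.splitlines():
        match = LINK.search(line)
        if match:
            links[match.group(1)] = match.group(2)
    return links


def datastore_for(target, volume_links):
    """Return the datastore label whose UUID is a path component of target."""
    components = target.split("/")
    for label, uuid in volume_links.items():
        if uuid in components:
            return label
    return None  # target is not on a labelled datastore


# Lines copied from the root directory listing above.
root_listing = """
lrwxrwxrwx    1 root     root            57 Jan 12 12:18 scratch -> /vmfs/volumes/548cb1c4-30d22a56-b3a7-bcaec527ae9b/.locker
lrwxrwxrwx    1 root     root            49 Jan 12 12:18 store -> /vmfs/volumes/527fd478-f2ea3747-127a-bcaec527ae9b
"""

# Lines copied from the /vmfs/volumes listing above.
volumes_listing = """
lrwxr-xr-x    1 root     root            35 Jan 12 13:59 IBM_RAID10_4HDD -> 54921312-3735e336-0618-bcaec527ae9b
lrwxr-xr-x    1 root     root            35 Jan 12 13:59 IBM_RAID10_4SSD -> 548cb1c4-30d22a56-b3a7-bcaec527ae9b
lrwxr-xr-x    1 root     root            35 Jan 12 13:59 SM_1HDD -> 54921344-dba16a0f-8aa6-bcaec527ae9b
"""

root_links = parse_links(root_listing)
volume_links = parse_links(volumes_listing)

scratch_ds = datastore_for(root_links["scratch"], volume_links)
store_ds = datastore_for(root_links["store"], volume_links)
print(scratch_ds)  # prints IBM_RAID10_4SSD
print(store_ds)    # prints None (the vfat volume has no datastore label)
```

    For the output posted here, this agrees with the statement that the scratch location sits on the SSD RAID10 datastore.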



  • 4.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 12, 2019 02:11 PM

    You mentioned IPMI. Are there any entries in the IPMI logs regarding the reboots?

    André



  • 5.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 12, 2019 05:52 PM

    The IPMI log had only one memory correction message.



  • 6.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 12, 2019 03:46 PM

    This is exactly why I stopped building whiteboxes for clients decades ago.

    I would temporarily remove the drives, throw another single drive in there, install a simple OS like Win7, and run whatever diag or burn-in tools are needed to see where it happens.

    It's always easier to diagnose machines on a fresh install of anything.



  • 7.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 13, 2019 08:06 PM

    eRJe This is probably a RAM problem, based on what I have experienced, since you have 12x 8GB DDR3 ECC Reg. Remove all your RAM and try starting with one 8GB module on each processor side. If nothing happens, add a second module, and continue like that.

    I hope this solves the problem; if not, please reply.

    May the force be with you,
    Prime919



  • 8.  RE: ESXi host crashes random; no dump or log entries

    Posted Jan 25, 2019 08:30 AM

    Prime201110141 I took my time to do some testing before reporting back. I did what you suggested and removed all memory except for 1 DIMM for each CPU. The behavior was the same. I then replaced the memory with 2 other DIMMs. This time the server did not crash for 3+ days, but I noticed that vCenter was frozen. I restarted the vCenter VM, and within 1 hour the server started crashing again, continuously.

    I then replaced the memory again and also disabled the 2nd CPU in the BIOS. The server still crashed, but less than once a day. A big improvement, but still not satisfying. I changed the memory again and the crash frequency went up to 12-15 crashes a day. I continued swapping memory without any significant change.

    I cannot imagine that all 12 DIMMs are faulty, but there is definitely a noticeable change with certain memory configurations. I am now considering physically swapping the 2 CPUs. If this also doesn't make a difference, I think I have to consider the motherboard faulty and replace it.

    Any thoughts?

    Regards,

    Robbert