Hi,
My home lab host ran ESXi 5.5 for almost 6 years without a single crash. No hardware was changed; the only change was a move to a new location in August 2018. A couple of weeks ago I noticed that the host was rebooting regularly. Initially I thought it had something to do with an IPv6 rollout at the location, since IPv6 was known to be buggy in the version I was running. I first disabled IPv6 and eventually updated to ESXi 6.5. The host is still crashing, as you can see from the Xorg.log below. From the network switch log I can tell that the host sometimes crashes multiple times before completing the reboot.
2019-01-10T10:48:40Z mark: storage-path-claim-completed
2019-01-10T14:10:44Z mark: storage-path-claim-completed
2019-01-10T15:46:38Z mark: storage-path-claim-completed
2019-01-10T17:03:56Z mark: storage-path-claim-completed
2019-01-10T17:46:53Z mark: storage-path-claim-completed
2019-01-10T18:12:50Z mark: storage-path-claim-completed
2019-01-10T19:24:31Z mark: storage-path-claim-completed
2019-01-10T19:51:06Z mark: storage-path-claim-completed
2019-01-10T21:23:26Z mark: storage-path-claim-completed
2019-01-10T23:30:24Z mark: storage-path-claim-completed
2019-01-10T23:59:18Z mark: storage-path-claim-completed
2019-01-11T00:29:38Z mark: storage-path-claim-completed
2019-01-11T01:32:12Z mark: storage-path-claim-completed
2019-01-11T02:19:19Z mark: storage-path-claim-completed
2019-01-11T04:09:31Z mark: storage-path-claim-completed
2019-01-11T05:35:51Z mark: storage-path-claim-completed
2019-01-11T06:51:12Z mark: storage-path-claim-completed
2019-01-11T07:17:11Z mark: storage-path-claim-completed
2019-01-11T07:57:42Z mark: storage-path-claim-completed
2019-01-11T08:30:11Z mark: storage-path-claim-completed
2019-01-11T14:59:32Z mark: storage-path-claim-completed
2019-01-11T15:37:45Z mark: storage-path-claim-completed
2019-01-11T16:20:33Z mark: storage-path-claim-completed
2019-01-11T16:49:22Z mark: storage-path-claim-completed
2019-01-11T18:30:27Z mark: storage-path-claim-completed
2019-01-11T20:23:01Z mark: storage-path-claim-completed
2019-01-12T00:06:59Z mark: storage-path-claim-completed
2019-01-12T00:34:22Z mark: storage-path-claim-completed
2019-01-12T01:02:04Z mark: storage-path-claim-completed
2019-01-12T09:10:54Z mark: storage-path-claim-completed
2019-01-12T11:16:12Z mark: storage-path-claim-completed
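To make the pattern easier to see, the time between these markers can be computed with a small script. This is only a rough sketch: it assumes that each "storage-path-claim-completed" line corresponds to one boot, and the names boot_times and uptimes are just made up for illustration. The sample uses the first three lines from the log above.

```python
from datetime import datetime, timedelta

# Marker text as it appears in the log excerpt above.
MARKER = "mark: storage-path-claim-completed"


def boot_times(lines):
    """Return the timestamps of all marker lines, in the order they appear."""
    times = []
    for line in lines:
        line = line.strip()
        if not line.endswith(MARKER):
            continue  # skip anything that is not a marker line
        stamp = line.split()[0]
        times.append(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"))
    return times


def uptimes(times):
    """Return the gaps between consecutive markers as timedelta objects."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    sample = [
        "2019-01-10T10:48:40Z mark: storage-path-claim-completed",
        "2019-01-10T14:10:44Z mark: storage-path-claim-completed",
        "2019-01-10T15:46:38Z mark: storage-path-claim-completed",
    ]
    for gap in uptimes(boot_times(sample)):
        print(gap)  # prints 3:22:04, then 1:35:54
```

Feeding it the whole log instead of the sample would show whether the gaps cluster around a fixed interval or are random, which might help tell a timed event apart from a hardware fault.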
The crashes look as if someone had simply unplugged the power cable. There is no purple screen and no dump file, and in none of the logs can I find any hint of what happened prior to the reboot.
In vmkernel.log I noticed a few memory corrections, so to be sure I ran memtest86 for 72 hours. No errors were found.
The server has a redundant power supply. I have currently disabled one module to see if it makes a difference; after 24 hours I will swap to the other module. It's a shot in the dark, especially since memtest was able to run for 72 hours. Other than this I am running out of ideas, although I can still remove some non-essential hardware from the host, such as a GFX card and the 2nd storage controller. But I would expect log entries if those were failing.
Any input will be very much appreciated. Perhaps there are other logs with useful information that I am missing? Or can I enable more advanced logging?
Thanks,
Robbert
Hardware:
Motherboard: ASUS KGPE-D16 with 2x AMD Opteron 6134 / 8 core
Memory: 12x 8GB DDR3 ECC Reg
Main controller: IBM ServeRAID M5016 SAS/SATA -> LSI 9266-8i MegaRAID with SSD RAID10 and HDD RAID10
2nd controller: Supermicro -USAS2-L8i 8-Port SAS/2 (pass through)