Hello all,
I have found myself in a situation where one of my two ESXi hosts becomes completely uncontrollable. It is still pingable, reachable on its web interface, and reachable through VCSA, but it won't do anything: it won't power off VMs, it won't show status. Nothing at all.
The story is quite long, but I will try to give as much background info as possible that might be related. I hope you can follow my thoughts.
When the host is uncontrollable it still allows SSH logins, but even through the CLI the VMs won't power off and cannot be killed.
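For reference, a typical force-kill attempt from the ESXi shell looks roughly like this (the world ID below is a placeholder, not one from my hosts):

```shell
# List running VMs to find the World ID of the stuck VM
esxcli vm process list

# Escalate from a graceful stop to a forced kill.
# 123456 is a placeholder World ID taken from the list above.
esxcli vm process kill --type=soft  --world-id=123456
esxcli vm process kill --type=hard  --world-id=123456
esxcli vm process kill --type=force --world-id=123456
```

Even the `--type=force` variant did nothing on the affected host.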
Rebooting does not work either; it just keeps saying that "a restart is in progress" and it never actually restarts. To be fair: I waited 15 minutes or longer before I decided to pull the plug.
The only thing that works is to do a hard reset through IPMI.
This situation has happened 3 times in a row over the course of 6 days.
My setup is as follows:
2 ESXi hosts, both on ESXi 7.0.2 (the hardware vendor supports this version)
1 VCSA on version 7.0.2 U2
2 iSCSI hosts on separate VLANs, both using MPIO and round robin.
The ESXi hosts have 2 NICs: both NICs have a VMkernel for iSCSI, and the first NIC also has a VMkernel for management/vMotion.
Both ESXi hosts also have internal NVMe SSDs containing VMs, and these VMs are uncontrollable too. By which I mean: I can't power off, reboot, or shut them down. One of them even showed a blue screen while its host was uncontrollable.
There have been two changes in the last week, so I am now wondering whether one of them might be the cause, or whether I am just unlucky and experiencing a bug of some sort.
1) I updated from 7.0.1 (both ESXi and VCSA) to 7.0.2
2) I implemented more VLANs. In particular, the management VMkernel (vMotion and management) and most of the VMs are now bound to a DPortGroup on VLAN 10 instead of the default VLAN 1. I created this DPortGroup on a DSwitch that both hosts were already attached to.
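In case it helps, this is roughly how I sanity-checked the new networking from each host's shell (standard esxcli namespaces; nothing here is specific to my naming):

```shell
# Show the distributed switch(es) the host is attached to, with uplinks and VLAN info
esxcli network vswitch dvs vmware list

# List all VMkernel interfaces (management, vMotion, iSCSI) and their port groups
esxcli network ip interface list

# Show the IPv4 configuration of each VMkernel interface
esxcli network ip interface ipv4 get
```

All of this looked correct to me on both hosts, which is why I am unsure the VLAN change is to blame.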
After implementing the updates and the VLANs, everything worked fine for about 48 hours. Then most VMs became unresponsive/crashed, and one Windows domain controller even showed a blue screen: "HAL INITIALIZATION FAILED".
I had to hard reset the ESXi hosts to make them function again. After everything was up and running, the same thing happened about 48 hours later. I then started doubting myself and the VLAN configuration I had made, so I rebooted everything again: I quickly moved all VMs to host A and completely reinstalled host B, which had been hosting almost all VMs due to the prior error. I did not reinstall host A at that point, since it hadn't frozen/become uncontrollable and was now hosting all VMs. Again about 48 hours later, host A was the one that became uncontrollable and needed a hard reset to function again. After it came back up, I migrated all VMs to host B and reinstalled host A as well.
Both hosts are now reinstalled; VCSA is still as it was (upgraded from 6.x to 7.x and now to 7.0.2 U2).
I ran into a lot of trouble while updating both ESXi and VCSA through Lifecycle Manager. Ultimately both needed a manual update using the ISO files instead of the usual update procedure. I have never had to do it this way before.
I also found that after upgrading VCSA to 7.0.2 U2 I needed to upgrade the DSwitch, and somehow it kept saying "the update is still in progress" on one of the ESXi hosts. Perhaps something went wrong in this phase that might explain the problem I am having?
I am still worried that this problem might reoccur, and I have no clue what might be wrong.
It might be my own mistake in some configuration that I am not aware of. It might be some network/vlan setting. Or perhaps there is an issue with ESXi / VCSA 7.0.2?
Is there perhaps anyone with more experience and more insight into these problems? I also considered installing a new VCSA and migrating to it, but that didn't work because the versions are the same. The reason I am thinking about reinstalling VCSA is that ever since upgrading it from 6.x to 7.x, it has always shown a warning in vSphere health about ESXi host connectivity.
I have verified that there is no such problem: I can see all UDP heartbeats on port 902 coming in and going out exactly every 10 seconds. I have also had a frequent/near-permanent low-RAM situation on VCSA since updating it to 7.x, so I have now added 2 GB of RAM for the second time. I use the tiny deployment size, so VCSA now has a total of 14 GB instead of the 10 GB I always used to dedicate to it.
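For completeness, this is roughly how I verified the heartbeats on the host side (tcpdump-uw ships with ESXi; vmk0 is my management VMkernel, so substitute yours):

```shell
# Capture the UDP heartbeat traffic between the ESXi host and VCSA.
# vmk0 is an assumption: use whichever VMkernel carries your management traffic.
tcpdump-uw -i vmk0 -n udp port 902
```

I see a heartbeat packet every 10 seconds in both directions, so I don't believe the vSphere health warning reflects a real connectivity issue.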
To make my long story short: I am lost. Am I doing something wrong, or is it likely that there is a bug, a broken driver, or something else going wrong?