Hi guys, I'm battling a tough problem with a new cluster and could use some feedback. We have been having issues where one of the hosts (at random) goes into a disconnected state. If you let it sit long enough, it comes back by itself. Restarting the vCenter services also seems to speed up the process.

Last night I went to put one of the hosts into maintenance mode (to apply the latest patches) and all sorts of bad things happened. It got to the point where 2 hosts became disconnected. The situation got so bad that one of the hosts could not get back into the cluster, and we had to shut machines down and re-register them on the other cluster hosts (fun at 3:00 AM). VMware was a little stumped last night by what was happening, so I have to re-engage them next week. Any help or ideas would be much appreciated.
Here are the hardware and other details:
-Running ESXi 5.5 Build 2302651 (Dell-specific ISO)
-3 x Dell PowerEdge R730 [Boot from Flash] (firmware completely up to date)
-1 x Dell PowerVault MD3420 with 12Gb SAS connectivity (dual controller) (firmware current)
-There is a Dell PowerEdge R730XD directly connected to the MD3420, running Windows Server 2012 R2 / Veeam 8.0 Update 1 for backups
-We are not sure if Veeam could be causing this to happen. Trying to get them involved. For now we have Veeam completely disabled.
-A PuTTY session to one of the disconnected hosts, which had locked up during a management agent restart, came back magically when a Veeam replication job was cancelled (the job was replicating a machine out of the backup repository, so I have no idea why that would matter)
-Currently have DRS automation and application monitoring disabled to mitigate risk
-Starting to move workloads to another cluster to reduce risk