  • 1.  Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 06, 2024 02:36 AM

    A host in my cluster failed and I could not recover it. A new host was deployed, but after removing the old host from the cluster and adding the new host, the datastores are no longer synced. I'm getting the alerts below and "Host cannot communicate with one or more other nodes in the vSAN enabled cluster". If I go to the vsanDatastore it no longer shows files like it did before. My VMs show as online, but they cannot be moved between hosts and I am unable to take snapshots of them either. I don't want to lose my VMs or data, so I'm hoping someone can point me in the right direction so I can fix the alerts without losing my data. Unfortunately, I do not have support.

     



  • 2.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 06, 2024 01:03 PM

    Hi,

    Did you re-install on the same hardware, or is this new hardware? Make sure your networking, especially the vSAN vmkernel adapter, is configured properly.



  • 3.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 06, 2024 11:33 PM

    Hi,

    I appreciate the reply. Below is what I did.

    I reinstalled the hypervisor on the same hardware. I matched all the VMkernel adapters against the other hosts, but I did not remove the failed host from the cluster before trying to sync the new one. After I received the initial error "Host cannot communicate with one or more other nodes in the vSAN enabled cluster", I placed the host in maintenance mode, removed it from the cluster, then added it again. That's when everything seems to have gotten worse. The vsanDatastore is empty. I powered off one VM and now it will not power back on. I removed the network adapter from the downed VM and re-added it, but received the error message "A general system error occurred: vDS host error: see fault cause."



  • 4.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 07, 2024 10:10 PM

    , "I'm getting the alerts below and "Host cannot communicate with one or more other nodes in the vSAN enabled cluster". If I go to vsandatastore it no longer shows files like it did before."

    This indicates (as it says) that the reinstalled node cannot communicate with the other nodes in the vSAN cluster and thus would be unable to access the vsanDatastore. Also, from the health alerts in your second screenshot, the node is possibly unable to communicate from vsanmgmtd on ESXi to vsan-health on vCenter (do you have a firewall between vCenter and the hosts?).
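
    If you want to quickly rule out the ESXi-side firewall (a rough check only, assuming the default rulesets are still in place), you can list the vSAN-related rulesets from an SSH session on the host and confirm they are enabled:
    # esxcli network firewall ruleset list | grep -i vsan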

     

    If you reinstall ESXi on a vSAN node then it needs the following to be able to successfully rejoin the vSAN cluster (a rough way of checking each of these is sketched after the list):
    - The same major ESXi version as the other nodes (minor deviation within an update version is fine, e.g. some nodes on ESXi 7.0 Update 3c and others on ESXi 7.0 U3 EP10 is okay; some nodes being on 7.0 U2 and others on 7.0 U3 is not okay).
    - A vmkernel adapter tagged for vsan-traffic, reachable from the other nodes' vsan-tagged vmks (e.g. in the same subnet, operating at the same MTU (including the vmnics and switches backing it) and in the correct VLAN).
    - That all nodes in the cluster can accept the vCenter-pushed change of the new node's UUID (and vSAN IP) to their unicastagent lists (e.g. that 'esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates' is NOT set to 1 on any node).
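
    As a rough sketch of how to check the above from an SSH session on the reinstalled host (the vmk name and IP below are placeholders - substitute your actual vsan-tagged vmkernel interface and another node's vSAN IP):
    # vmware -vl  (exact ESXi version and build)
    # esxcli vsan network list  (shows which vmk is vsan-tagged)
    # vmkping -I vmkX -d -s 1472 <other node's vSAN IP>  (reachability without fragmentation; use -s 8972 instead if you run MTU 9000 end-to-end)
    # esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates  (should return 0 on every node)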


    I am unaware of what vDS issue you have regarding your last point, but you should start here by confirming that there is basic vmkping connectivity between all nodes' vsan-tagged vmks and that the unicastagent lists are correct:
    https://kb.vmware.com/s/article/2144398
    https://kb.vmware.com/s/article/1003728
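
    For reference (a general example, not specific to your environment): on each host, the unicastagent list should contain an entry (node UUID + vSAN IP) for every *other* node in the cluster, so the reinstalled host's UUID/IP should appear on the other hosts and vice versa. You can compare these with:
    # esxcli vsan cluster get  (shows the local node UUID and current cluster membership)
    # esxcli vsan cluster unicastagent list  (should list the UUID and vSAN IP of every other node)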



  • 5.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 09, 2024 02:30 AM

     I'm showing "On-disk format upgrade is recommended" and am not sure how to upgrade. I reinstalled the ESXi image using the Dell EMC customized version 6.7, since I did not have a copy of the original image; I hope this is not an issue. I don't have a firewall. I removed the host from the cluster while in maintenance mode and added it back into the cluster, and that's when the other working hosts seemed to go haywire and no longer communicate with each other. Attached are a few screenshots for reference.



  • 6.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 10, 2024 08:38 PM

     

    "I'm showing On-Disks format upgrade is recommended. Not sure how to upgrade."
    Ignore this for now, it is not relevant to the current issue.

     

    "I reinstalled the ESXi image using DellEMC Customized version 6.7 since I did not have a copy of the original image. Hope this is not an issue."
    It can be - please provide the exact build numbers in use; '6.7' on its own is too vague. You can check the exact build in the UI, or via SSH to the hosts using 'vmware -vl'. Please check this on a non-problem host and on the reinstalled host.

     

    The object health state is starkly different in the screenshots 'vSAN Health 2.png' and 'vSAN object health.png' - were these checked from different nodes? If so, then everything looks accessible from the latter and you *should* be able to register and restart the VMs on that node.


    Please share the output of this command from a non-problem host and from the reinstalled host:
    # esxcli vsan debug object health summary get



  • 7.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 11, 2024 07:31 PM

     

    The screenshots were from a working host and the non-working host. Below are the builds and debug outputs requested. Before adding the reinstalled host, vMotion was working; it is no longer working, and machines fail to move between the two working hosts.

    non-working host
    ~] vmware -vl
    VMware ESXi 6.7.0 build-20497097
    VMware ESXi 6.7.0 Update 3

    working host
    ~] vmware -vl
    VMware ESXi 6.7.0 build-10302608
    VMware ESXi 6.7.0 Update 1


    non-working host
    ~] esxcli vsan debug object health summary get
    Health Status                                               Number Of Objects
    ----------------------------------------------------------  -----------------
    nonavailability-related-incompliance                                        0
    reduced-availability-with-active-rebuild                                    0
    nonavailabilityrelatedincompliancewithpolicypendingfailed                   0
    inaccessible                                                              259
    nonavailability-related-reconfig                                            0
    reduced-availability-with-no-rebuild                                        0
    reduced-availability-with-no-rebuild-delay-timer                            0
    reducedavailabilitywithpolicypending                                        0
    healthy                                                                     0
    nonavailabilityrelatedincompliancewithpausedrebuild                         0
    reducedavailabilitywithpausedrebuild                                        0
    reducedavailabilitywithpolicypendingfailed                                  0
    nonavailabilityrelatedincompliancewithpolicypending                         0
    data-move                                                                   0

     

    working host
    ~] esxcli vsan debug object health summary get
    Health Status                                     Number Of Objects
    ------------------------------------------------  -----------------
    reduced-availability-with-no-rebuild                            247
    nonavailability-related-reconfig                                  0
    inaccessible                                                      0
    data-move                                                         0
    nonavailability-related-incompliance                              0
    healthy                                                           0
    reduced-availability-with-active-rebuild                          0
    reduced-availability-with-no-rebuild-delay-timer                  0



  • 8.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 11, 2024 09:42 PM

     So, you have a mix of nodes on 6.7 U1 (a build from 2018(!!!) - why?!) and 6.7 U3. It is possible you ended up with different CMMDS versions even without updating on-disk format versions (even just attempting an upgrade can cause that) - this would result in the higher-version node being unable to rejoin the cluster even though the network configuration is fine (https://kb.vmware.com/s/article/76841).

     

    This can be confirmed by comparing the output of the following on a working host and the non-working host:

     

    # /usr/lib/vmware/vsan/bin/clom-tool stats | grep -i version

     

     

    If this returns different values on the nodes then you have the root cause. However, it may not be possible to simply reinstall the ESXi 6.7 U3 node as 6.7 U1, as this can be a 'one-way ticket'; in that case you would have to update all the other nodes to 6.7 U3 (and you really should be doing that anyway), which will temporarily make data availability worse before making it better (e.g. all VMs should be powered off, as the data is already in a reduced-availability state and updating the 6.7 U1 nodes will mean these objects temporarily become inaccessible).
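
    As an aside, you can also check whether the reinstalled node's disks were created at a newer on-disk format than the others (a rough check - the exact field name varies slightly between releases, hence the broad grep):
    # esxcli vsan storage list | grep -i version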



  • 9.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 23, 2024 10:09 PM

    I appreciate the help  

    Here are the outputs 

    Newly installed version 

    "minNodeMajorVersion": 10,

    The other two hosts that were working are on "minNodeMajorVersion": 7.

    The issue I'm having with upgrading is that when I power off a machine, I do not see the machine's files in the datastore. I'm able to export to OVF and will attempt to migrate to the newly installed ESXi host to see if that works. I would prefer rolling back the newly installed host with the higher version to match the others, but per the link provided, it appears that may just cause more issues.




  • 10.  RE: Cluster Issues due to failed host / Hosts no longer communicate

    Posted Jan 27, 2024 10:54 PM

    Happy to be able to help.

    That the nodes have different versions of CMMDS in use confirms why the reinstalled node is isolated from the cluster (despite vmkping communication and the network configuration presumably being fine, etc.).


    Unless something has changed here (or you have some other problem on one of the other nodes which you haven't noted yet), all data should be accessible from just the 2 lower build version nodes.
    You can confirm this by powering off the higher-build node and checking that all VMs can be registered and powered on. If that is the case, then you should reinstall the higher-build node with ESXi 6.7 U1, then (if the disks in it did get updated to on-disk format version 10 and are no longer on ODF v7) recreate the Disk-Group on it, and then repair all data back to a redundant state.
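
    For reference, recreating the Disk-Group can be done from the vSphere UI (Cluster > Configure > vSAN > Disk Management) or via esxcli - the below is only a rough sketch with placeholder device names, and removing a Disk-Group destroys the data on it, so only do this once all objects are confirmed accessible from the other two nodes:
    # esxcli vsan storage list  (identify the cache-tier device of the Disk-Group)
    # esxcli vsan storage remove -s <cache device>  (removes the whole Disk-Group backed by that cache device)
    # esxcli vsan storage add -s <cache device> -d <capacity device>  (recreates it)
    After that, let the repair/resync complete (or use 'Repair Objects Immediately' in the vSAN health view) to get the objects back to a redundant state.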

     

    Then please update all nodes to a modern version of ESXi.