vSAN1

View Only

Back to discussions

Expand all | Collapse all

vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Jump to Best Answer

1. vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
davemuench
Posted Oct 26, 2021 06:56 PM

Reply Reply Privately
Hi,
I used the new vSAN cluster shutdown wizard yesterday for the first time when I had to shut a lab down for a power outage. The cluster consists of three Dell R730xd nodes, with vCenter residing on a different non-vSAN node. The shutdown was clean, and the hosts were shut down properly before the power was lost. On bootup, all the vsan VMs are listed as Inaccessible, and aren't visible if browsed to in the datastore (via GUI or command line).
The button to restart the cluster was not present, so I followed the instructions to manually restart the cluster via the command line. The recover script however times out:
[root@esx01:/tmp] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover Begin to recover the cluster... Time among connected hosts are synchronized. Scheduled vSAN cluster restore task. Waiting for the scheduled task...(18s left) Checking network status... Recovery is not ready, retry after 10s... Recovery is not ready, retry after 10s... Recovery is not ready, retry after 10s... Timeout, please try again later

I have been digging since then with no success. The cluster looks to have reformed properly:
[root@esx01:/tmp] esxcli vsan cluster get Cluster Information Enabled: true Current Local Time: 2021-10-26T18:22:51Z Local Node UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c Local Node Type: NORMAL Local Node State: MASTER Local Node Health State: HEALTHY Sub-Cluster Master UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c Sub-Cluster Backup UUID: 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0 Sub-Cluster UUID: 528157e4-4935-2809-ab88-5d161aec89a5 Sub-Cluster Membership Entry Revision: 4 Sub-Cluster Member Count: 3 Sub-Cluster Member UUIDs: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c, 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0, 614b991e-f0f6-8762-c918-801844e56f42 Sub-Cluster Member HostNames: esx01, esx02, esx03 Sub-Cluster Membership UUID: fd197861-36c6-b896-868a-a0369f59e56c Unicast Mode Enabled: true Maintenance Mode State: OFF Config Generation: a01749d0-4f70-4911-aeb0-919cfdc176bb 31 2021-10-26T18:12:08.28 Mode: REGULAR

esx01 being the master, esx02 showing as backup, and esx03 as agent. The unicast list looks correct, and the vsan vmk's were re-tagged properly.
In RVC, this is typical of what I see for each VM:
/localhost/Lab/vms/wan-test> vsan.vm_object_info . VM wan-test: Disk backing: [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmx DOM Object: 65d06061-65f9-6456-0383-a0369f59e56c (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS) RAID_1 Component: 65d06061-6ce4-8057-b167-a0369f59e56c (state: ACTIVE (5), host: esx01, capacity: naa.5002538e30820eb4, cache: naa.55cd2e404b795ca0, votes: 1, usage: 0.2 GB, proxy component: false) Component: 65d06061-7415-8457-02b2-a0369f59e56c (state: ABSENT (6), csn: STALE (109!=160), host: esx03, capacity: naa.5002538ec110af17, cache: naa.55cd2e404b796059, dataToSync: 0.21 GB, votes: 1, usage: 0.2 GB, proxy component: false) Witness: 65d06061-92d5-8757-6ca8-a0369f59e56c (state: ACTIVE (5), host: esx02, capacity: naa.5002538e4066db91, cache: naa.55cd2e404b78c07a, votes: 1, usage: 0.0 GB, proxy component: false) [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS) RAID_1 Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059, dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false) Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a, votes: 1, usage: 1.6 GB, proxy component: false) Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0, votes: 1, usage: 0.0 GB, proxy component: false)

The two things that stand out for me is the no policy found in CMMDS, and that in every VM's case the component that resides in esx03 is the absent component - whether it is a component or a witness.
I have tried various things to try to manipulate the storage policy and reduce the ftt to 0, but none of them take as the new policy can't be defined due to the invalid state of the VM.
Any help would be greatly appreciated. I'd love to open a support ticket but we don't have support on this small lab environment, and I'd rather not have to rebuild from backups if I can avoid it. I'm also trying to learn why this happened in the first place and if the shutdown cluster functionality can be relied upon or not.
2. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
davemuench
Posted Oct 26, 2021 10:16 PM

Reply Reply Privately
Thanks to a suggestion on reddit, I am making some progress.
vsish -e set /vmkModules/vsan/dom/ownerAbdicate <uuid>
Is getting the components back from absent to active. Still working out my next step though.
3. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
Broadcom Employee

Jeff Hunter
Posted Oct 28, 2021 12:21 PM

Reply Reply Privately
Please open a support request (SR) if you are able to. 1. We would like to help troubleshoot and 2. It would great to gather logs and such to determine what went wrong and fix the issue in the code in an upcoming release.
4. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
davemuench
Posted Oct 28, 2021 01:15 PM

Reply Reply Privately
I very much wish I could, but despite this being on enterprise hardware it's a VMUG homelab.
Beyond getting the components all into an active state by abdicating ownership (which does not survive a node reboot or cluster reboot) I appear to be at a dead end, the cluster looks to be a total loss. All the objects and VMs stored on vSAN are inaccessible, and newly created VMs also go inaccessible a short time after creation.
More than anything I'd also like to know what happened here as my main concern is if it were to happen again, in this environment or more critical environments.
5. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster
Best Answer

Recommend
davemuench
Posted Nov 03, 2021 03:54 PM

Reply Reply Privately
Here's the post-mortem.. I reinstalled the cluster with 7.0u3a. The same behavior started again almost immediately after configuring vSAN - objects/VMs going inaccessible, I/O errors listing the datastore, etc. I even zeroed out the vsan disks ahead of time to make sure no old metadata was picked up by the new install.
Reinstall again with ESXi 7.0u2d (but the same vCenter 7.0u3a). Works great now, no problems at all. The non-wizard cluster shutdown and startup also works great.
7.0u3/3a has some serious issues with vSAN, at least on my R730xds. I wish I could open a support case to help diagnose the problem, but it's a homelab.
6. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
llb1
Posted Dec 23, 2021 05:05 AM

Reply Reply Privately
Hello, I have the same problem. Final installation 7.0U2D running?
Is the VM machine recovered？
7. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
TheBobkin
Posted Dec 23, 2021 02:13 PM

Reply Reply Privately
, I would advise being wary of making assertions such as "I have the same problem" when it is very unclear what OPs problem was here.

From looking at the data, OP clearly had issues on node esx03 for example you can see that it was missing hundreds of data-updates to components e.g. 'STALE (67!=202)' means the current data is on revision 202 and this component on node esx03 is on data revision 67 (e.g. async and way behind):
[vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
RAID_1
Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
votes: 1, usage: 1.6 GB, proxy component: false)
Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
votes: 1, usage: 0.0 GB, proxy component: false)

, please provide more information e.g. what did you do, what happened, what is the data-state and cluster health?
8. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
llb1
Posted Dec 24, 2021 08:03 AM

Reply Reply Privately
It is not clear what the cause of the problem is, but the same problem has occurred. All virtual machines appear orphaned and browse storage is not visible. Operation process: shut down all virtual machines by pressing the shut down cluster button.
9. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
TheBobkin
Posted Dec 24, 2021 03:49 PM

Reply Reply Privately
, Was it after using cluster shutdown feature and then attempting to revert this like indicated occurred in their case?

Asking as the cluster shutdown (either using vSphere-function or older ESXi reboot_helper script) is expected to make all VM data in the cluster inaccessible as it isolates all the nodes from one another by untagging vsan-network on the vmk configured for this.

Have you attempted the recover/restore part of this workflow and/or if not possible (e.g. mentioned there was no button visible for reverting from the cluster shutdown option) checked are all the vsan-vmk (and witness-vmk if stretched cluster) untagged for these traffic types and tried re-tagging them?

Is this a production cluster with S&S or a homelab? If the former and what I suggested above doesn't help then please open a P1 Support Request so that my colleagues can check this properly (feel free to PM me the SR number for awareness but I am on PTO most of the next few weeks so unlikely to be actively looking into it myself).
10. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
davemuench
Posted Dec 23, 2021 02:54 PM

Reply Reply Privately
Hi , my setup has been stable (and through several cluster shutdowns without issue) since I reloaded and rebuilt on the same hardware with 7.0u2d. I had to restore all data from backups.
If you do have the same problem I wish you the best of luck. For me the obvious variable was 7.0u3 and I will not be upgrading to it in the future.
11. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
TheBobkin
Posted Dec 23, 2021 02:59 PM

Reply Reply Privately
, Bit of an oddly specific question but do you by any chance have these nodes 'daisy-chained' e.g. directly connected and not using a switch?
12. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
davemuench
Posted Dec 23, 2021 03:01 PM

Reply Reply Privately
No, they are attached to 10gb switches.
13. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
heky777
Posted Jan 18, 2022 07:21 AM

Reply Reply Privately
We have had the same strange behaviour, absolutely identical.. Opened SR with VMware and they came up with solution, which is change status of the cluster to ClusterpoweredOff. And that's it. Gui button to poweron VSAN appeared and we were able to start it up without a problem. Just like that. Here are the steps
if you call log into the vsan mob you can run through the following
-Change the status of the VsanClusterPowerSystem.
#vsanmob -> VsanClusterPowerSystem -> UpdateClusterPowerStatus -> apply from "clusterPoweredOn" to "clusterPoweredOff"

Hope it helps smbody.
14. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
TheBobkin
Posted Jan 18, 2022 08:28 PM

Reply Reply Privately
, If this was in EMEA then I may have been the engineer-behind-the-engineer determining this solution - I also found an alternative solution outside of vCenter that just requires 2 esxcfg-advcfg settings to be reverted - I have written a KB article covering both the vCenter and ESXi workarounds but am awaiting approval from engineering to make it publicly-available (and found that a colleague did the same yesterday!), I will link the KB here once available.
15. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
heky777
Posted Jan 25, 2022 11:58 AM

Reply Reply Privately
yes it was EMEA and thank you for your effort. This kind of problem is quite a real showstopper (reeeeaally glad to have test enviro and not to be under pressure of stucked production when dealing with such dead end problem) and speaking frankly we were left breathless no way from this. And I was aware of this article here (and couple reddit based with similar story) and your suggestion to open SR and voila, all turned to the solution not only for us but all. Thank you&your colleagues for helping us all really appreciate your work!
16. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
TheBobkin
Posted Jan 25, 2022 12:56 PM

Reply Reply Privately
Happy to just understand how this issue occurs to be honest - this was annoying me in back of head since mentioned it and it irked me as if cluster is formed then data should be okay but this is basically not the case if updates are still paused.

KB documenting this issue and various ways of resolving it is now public and accessible here:
https://kb.vmware.com/s/article/87350
17. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
davemuench
Posted Jan 25, 2022 04:00 PM

Reply Reply Privately
Thank you very much for your diligence on this - it gives me some confidence to move forward on the newer updates of vCenter in the future.
18. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
senwebtek
Posted Mar 15, 2022 12:14 AM

Reply Reply Privately
I've been experiencing the exact same issue running on my homelab on VMWare Workstation 16.2.3. Here's the situation: I've set up a vSphere/vCenter 7.03 update 3 (c I think) environment on VMworkstation 16.2.3 using vSAN, HA, and DRS with a 3 node cluster. Each esxi host has 4cpus and 24gigs of memory. Everything seems to be working fine...vMotion works, running vms off vSAN works fine, etc. My problem is that each time I shutdown the cluster(using the 'Shutdown Cluster' function) and bring the cluster back up, the vSAN is hosed. It either has no capacity or only the capacity of the disks from one host and the contents are gone. The vCLSs(inaccessible and can't remove them) won't start because the datastore they were on (vSAN) is now empty. I've recreated my environment at least about 10 times now and the same results. My host system is a Ryzen 9 5900x with 128gigs of memory so I'm not running short on resources. I've tried using the Quickstart and manually creating the vSAN and get the same results. I have 2 virtual nics in each esxi host and have vMotion and vSAN running on their own DSwitch on the 2nd nic(the 1st nic is acutally connected to my physical nic in my host PC). I'm thinking of adding another esxi host and make a 4 node cluster to see if that helps. Any ideas would be greatly appreciated. Also, I've run Skyline Health Diagnostics and it didn't really find anything except nics losing connectivity once.
19. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
TheBobkin
Posted Mar 15, 2022 09:30 AM

Reply Reply Privately
, when it is in that state, are you using the restore function (reverse of shutdown)?

Can you check if this returns 1 (enabled) on any host:
# esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs

If so then this is the same issue I linked kb for above.
20. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
senwebtek
Posted Mar 15, 2022 05:30 PM

Reply Reply Privately
The cluster 'restart' command only appeared for me once but I can't remember the specific circumstances. I'm in the process of recreating my cluster(again) and I'll run that command and post back what I get. At this point, I was thinking about rolling back my esxi hosts to 7.0u2? as mentioned earlier in this thread but I would rather be running the current version if I can get it to work properly.
21. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
senwebtek
Posted Mar 15, 2022 06:59 PM

Reply Reply Privately
Replied at the wrong place. Hopefully move reply to the right place.
22. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Recommend
senwebtek
Posted Mar 15, 2022 07:04 PM

Reply Reply Privately
Trying to reply to at 03-15-2022 02:29 AM OK, I've rebuilt my cluster as before. Before shutting down cluster, value of command is 0 (which I guess is expected at this point). Next executed cluster shutdown command. At this point, the 'Restart Cluster' command is present. I powered up all 3 hosts and the 'Restart Cluster' is still present. Next, ran the command you gave me and the value is '1' at this point. After that, ran the 'Restart Cluster' command now that the esxi hosts were back up. After the 'restart cluster' command finished, the vSAN storage was only showing the capacity equal to on of the esxi hosts and it was empty. Running the command you gave returned '1' again. So, I'm assuming that my solution would be for scenario 3 in the knowledge base article you referenced. I'm very new to this so I'll try to get thru step 2 of the solution for scenario 2. Thanks so much for your help.

vSAN1

vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

davemuenchOct 26, 2021 06:56 PM

davemuenchOct 26, 2021 10:16 PM

Jeff HunterOct 28, 2021 12:21 PM

davemuenchOct 28, 2021 01:15 PM

davemuenchNov 03, 2021 03:54 PMBest Answer

llb1Dec 23, 2021 05:05 AM

TheBobkinDec 23, 2021 02:13 PM

llb1Dec 24, 2021 08:03 AM

TheBobkinDec 24, 2021 03:49 PM

davemuenchDec 23, 2021 02:54 PM

TheBobkinDec 23, 2021 02:59 PM

davemuenchDec 23, 2021 03:01 PM

heky777Jan 18, 2022 07:21 AM

TheBobkinJan 18, 2022 08:28 PM

heky777Jan 25, 2022 11:58 AM

TheBobkinJan 25, 2022 12:56 PM

davemuenchJan 25, 2022 04:00 PM

senwebtekMar 15, 2022 12:14 AM

TheBobkinMar 15, 2022 09:30 AM

senwebtekMar 15, 2022 05:30 PM

senwebtekMar 15, 2022 06:59 PM

senwebtekMar 15, 2022 07:04 PM

1. vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

2. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

3. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

4. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

5. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster Best Answer

6. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

7. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

8. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

9. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

10. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

11. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

12. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

13. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

14. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

15. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

16. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

17. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

18. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

19. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

20. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

21. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

22. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

5. RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster
Best Answer