
vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

  • 1.  vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Oct 26, 2021 06:56 PM

    Hi,

    I used the new vSAN cluster shutdown wizard yesterday for the first time when I had to shut a lab down for a power outage. The cluster consists of three Dell R730xd nodes, with vCenter residing on a separate non-vSAN node. The shutdown was clean, and the hosts were shut down properly before the power was lost. On bootup, all the vSAN VMs are listed as Inaccessible, and their files aren't visible when browsing the datastore (via GUI or command line).

    The button to restart the cluster was not present, so I followed the instructions to manually restart the cluster via the command line. The recovery script, however, times out:

    [root@esx01:/tmp] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
    Begin to recover the cluster...
    Time among connected hosts are synchronized.
    Scheduled vSAN cluster restore task.
    Waiting for the scheduled task...(18s left)
    Checking network status...
    Recovery is not ready, retry after 10s...
    Recovery is not ready, retry after 10s...
    Recovery is not ready, retry after 10s...
    Timeout, please try again later

     

    I have been digging since then with no success. The cluster looks to have reformed properly:

    [root@esx01:/tmp] esxcli vsan cluster get
    Cluster Information
       Enabled: true
       Current Local Time: 2021-10-26T18:22:51Z
       Local Node UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
       Local Node Type: NORMAL
       Local Node State: MASTER
       Local Node Health State: HEALTHY
       Sub-Cluster Master UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
       Sub-Cluster Backup UUID: 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0
       Sub-Cluster UUID: 528157e4-4935-2809-ab88-5d161aec89a5
       Sub-Cluster Membership Entry Revision: 4
       Sub-Cluster Member Count: 3
       Sub-Cluster Member UUIDs: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c, 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0, 614b991e-f0f6-8762-c918-801844e56f42
       Sub-Cluster Member HostNames: esx01, esx02, esx03
       Sub-Cluster Membership UUID: fd197861-36c6-b896-868a-a0369f59e56c
       Unicast Mode Enabled: true
       Maintenance Mode State: OFF
       Config Generation: a01749d0-4f70-4911-aeb0-919cfdc176bb 31 2021-10-26T18:12:08.28
       Mode: REGULAR

     

    esx01 is the master, esx02 shows as backup, and esx03 as agent. The unicast list looks correct, and the vSAN vmks were re-tagged properly.
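
    (For anyone checking the same thing, the unicast configuration can be inspected on each host with the standard command below; it lists the other cluster members each host knows about:)

    esxcli vsan cluster unicastagent list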

    In RVC, this is typical of what I see for each VM:

    /localhost/Lab/vms/wan-test> vsan.vm_object_info .
    VM wan-test:
      Disk backing:
        [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
    
      [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmx
        DOM Object: 65d06061-65f9-6456-0383-a0369f59e56c (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
          RAID_1
            Component: 65d06061-6ce4-8057-b167-a0369f59e56c (state: ACTIVE (5), host: esx01, capacity: naa.5002538e30820eb4, cache: naa.55cd2e404b795ca0,
                                                             votes: 1, usage: 0.2 GB, proxy component: false)
            Component: 65d06061-7415-8457-02b2-a0369f59e56c (state: ABSENT (6), csn: STALE (109!=160), host: esx03, capacity: naa.5002538ec110af17, cache: naa.55cd2e404b796059,
                                                             dataToSync: 0.21 GB, votes: 1, usage: 0.2 GB, proxy component: false)
          Witness: 65d06061-92d5-8757-6ca8-a0369f59e56c (state: ACTIVE (5), host: esx02, capacity: naa.5002538e4066db91, cache: naa.55cd2e404b78c07a,
                                                         votes: 1, usage: 0.0 GB, proxy component: false)
    
      [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
        DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
          RAID_1
            Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
                                                             dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
            Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
                                                             votes: 1, usage: 1.6 GB, proxy component: false)
          Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
                                                         votes: 1, usage: 0.0 GB, proxy component: false)

     

    The two things that stand out for me are the "No POLICY entry found in CMMDS" message, and that in every VM's case the component that resides on esx03 is the absent one - whether it is a data component or a witness.

    I have tried various things to manipulate the storage policy and reduce the FTT to 0, but none of them take, as the new policy can't be applied due to the invalid state of the VM.
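
    (As an example of the kind of thing I've been attempting - a rough sketch from RVC against a single object UUID; the exact argument order should be checked with "help vsan.object_reconfigure":)

    vsan.object_reconfigure <cluster> <object_uuid> -p '(("hostFailuresToTolerate" i0) ("forceProvisioning" i1))'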

    Any help would be greatly appreciated. I'd love to open a support ticket, but we don't have support on this small lab environment, and I'd rather not have to rebuild from backups if I can avoid it. I'm also trying to understand why this happened in the first place, and whether the cluster shutdown functionality can be relied upon.



  • 2.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Oct 26, 2021 10:16 PM

    Thanks to a suggestion on Reddit, I am making some progress. Running

    vsish -e set /vmkModules/vsan/dom/ownerAbdicate <uuid>

    is getting the components back from absent to active. Still working out my next step though.
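
    In case it's useful to anyone, this is roughly how I'm applying it across objects (a quick-and-dirty sketch - the cmmds-tool query and the UUID parsing may well need adjusting for your environment):

    # sketch: list DOM object UUIDs from CMMDS and abdicate ownership on each
    for uuid in $(cmmds-tool find -t DOM_OBJECT -f json | sed -n 's/.*"uuid": "\([a-f0-9-]*\)".*/\1/p' | sort -u); do
        echo "abdicating owner for ${uuid}"
        vsish -e set /vmkModules/vsan/dom/ownerAbdicate "${uuid}"
    done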



  • 3.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Broadcom Employee
    Posted Oct 28, 2021 12:21 PM

    Please open a support request (SR) if you are able to. 1. We would like to help troubleshoot, and 2. It would be great to gather logs and such to determine what went wrong and fix the issue in the code in an upcoming release.



  • 4.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Oct 28, 2021 01:15 PM

    I very much wish I could, but despite this being on enterprise hardware, it's a VMUG homelab.

    Beyond getting the components all into an active state by abdicating ownership (which does not survive a node reboot or cluster reboot), I appear to be at a dead end; the cluster looks to be a total loss. All the objects and VMs stored on vSAN are inaccessible, and newly created VMs also go inaccessible a short time after creation.

    More than anything I'd also like to know what happened here as my main concern is if it were to happen again, in this environment or more critical environments.



  • 5.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster
    Best Answer

    Posted Nov 03, 2021 03:54 PM

    Here's the post-mortem. I reinstalled the cluster with 7.0u3a. The same behavior started again almost immediately after configuring vSAN - objects/VMs going inaccessible, I/O errors listing the datastore, etc. I even zeroed out the vSAN disks ahead of time to make sure no old metadata was picked up by the new install.

    I then reinstalled again with ESXi 7.0u2d (but the same vCenter 7.0u3a). It works great now, no problems at all. The non-wizard cluster shutdown and startup also works great.

    7.0u3/3a has some serious issues with vSAN, at least on my R730xds. I wish I could open a support case to help diagnose the problem, but it's a homelab.



  • 6.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 23, 2021 05:05 AM

    Hello, I have the same problem. Is your final installation running on 7.0U2d?

    Were the VMs recovered?



  • 7.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 23, 2021 02:13 PM

    I would advise being wary of making assertions such as "I have the same problem" when it is very unclear what OP's problem was here.


    From looking at the data, OP clearly had issues on node esx03 - for example, you can see that it was missing hundreds of data updates to components. "STALE (67!=202)" means the current data is on revision 202 while this component on node esx03 is on revision 67 (i.e. out of sync and far behind):
    [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
    DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
    RAID_1
    Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
    dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
    Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
    votes: 1, usage: 1.6 GB, proxy component: false)
    Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
    votes: 1, usage: 0.0 GB, proxy component: false)



    Please provide more information, e.g. what did you do, what happened, and what are the data state and cluster health?



  • 8.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 24, 2021 08:03 AM

    It is not clear what the cause of the problem is, but the same problem has occurred. All virtual machines appear orphaned, and nothing is visible when browsing the datastore. The operation performed: all virtual machines were shut down by pressing the 'Shutdown Cluster' button.



  • 9.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 24, 2021 03:49 PM

    Was it after using the cluster shutdown feature and then attempting to revert it, as OP indicated occurred in their case?

     

    Asking because the cluster shutdown (using either the vSphere function or the older ESXi reboot_helper script) is expected to make all VM data in the cluster inaccessible, as it isolates the nodes from one another by untagging the vSAN network on the vmk configured for it.

     

    Have you attempted the recover/restore part of this workflow? If that is not possible (e.g. OP mentioned there was no button visible for reverting the cluster shutdown), have you checked whether all the vSAN vmks (and witness vmks, if this is a stretched cluster) are untagged for these traffic types, and tried re-tagging them?
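
    (For reference, the check and re-tag can be done per host along these lines - vmk1 below is just an example interface name:)

    # list vmknics currently tagged for vSAN traffic
    esxcli vsan network list
    # re-tag a vmknic for vSAN traffic if it is missing
    esxcli vsan network ip add -i vmk1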

     

    Is this a production cluster with S&S, or a homelab? If the former, and what I suggested above doesn't help, then please open a P1 Support Request so that my colleagues can check this properly (feel free to PM me the SR number for awareness, but I am on PTO for most of the next few weeks, so I'm unlikely to be actively looking into it myself).



  • 10.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 23, 2021 02:54 PM

    Hi, my setup has been stable (and through several cluster shutdowns without issue) since I reloaded and rebuilt on the same hardware with 7.0u2d. I had to restore all data from backups.

    If you do have the same problem I wish you the best of luck. For me the obvious variable was 7.0u3 and I will not be upgrading to it in the future.



  • 11.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 23, 2021 02:59 PM

    Bit of an oddly specific question, but do you by any chance have these nodes 'daisy-chained', i.e. directly connected to one another rather than to a switch?



  • 12.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Dec 23, 2021 03:01 PM

    No, they are attached to 10 Gb switches.



  • 13.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Jan 18, 2022 07:21 AM

    We have had the same strange behaviour, absolutely identical. We opened an SR with VMware and they came up with a solution, which is to change the status of the cluster to clusterPoweredOff. And that's it - the GUI button to power on vSAN appeared and we were able to start it up without a problem. Here are the steps:

    If you can log into the vSAN MOB, you can run through the following:
    - Change the status of the VsanClusterPowerSystem:
      vSAN MOB -> VsanClusterPowerSystem -> UpdateClusterPowerStatus -> change from "clusterPoweredOn" to "clusterPoweredOff"

     

    Hope it helps somebody.

     



  • 14.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Jan 18, 2022 08:28 PM

    If this was in EMEA, then I may have been the engineer-behind-the-engineer who determined this solution. I also found an alternative solution outside of vCenter that just requires two esxcfg-advcfg settings to be reverted. I have written a KB article covering both the vCenter and ESXi workarounds, but am awaiting approval from engineering to make it publicly available (and found that a colleague did the same yesterday!). I will link the KB here once it is available.



  • 15.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Jan 25, 2022 11:58 AM

    Yes, it was EMEA, and thank you for your effort. This kind of problem is a real showstopper (really glad to have a test environment and not to be under the pressure of a stuck production cluster when hitting such a dead-end problem), and frankly speaking we were left breathless, with no way out of it. I was aware of this thread here (and a couple of similar stories on Reddit) and your suggestion to open an SR, and voila, it turned into a solution not only for us but for everyone. Thank you and your colleagues for helping us all - I really appreciate your work!



  • 16.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Jan 25, 2022 12:56 PM

    Happy just to understand how this issue occurs, to be honest - it had been nagging at the back of my head since it was first mentioned, and it irked me: if the cluster is formed then the data should be okay, but that is basically not the case if updates are still paused.

     

    KB documenting this issue and various ways of resolving it is now public and accessible here:

    https://kb.vmware.com/s/article/87350



  • 17.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Jan 25, 2022 04:00 PM

    Thank you very much for your diligence on this - it gives me some confidence to move forward on the newer updates of vCenter in the future.



  • 18.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Mar 15, 2022 12:14 AM

    I've been experiencing the exact same issue running my homelab on VMware Workstation 16.2.3. Here's the situation: I've set up a vSphere/vCenter 7.0 Update 3 (3c, I think) environment on Workstation 16.2.3 using vSAN, HA, and DRS with a 3-node cluster. Each ESXi host has 4 CPUs and 24 GB of memory. Everything seems to be working fine - vMotion works, running VMs off vSAN works fine, etc.

    My problem is that each time I shut down the cluster (using the 'Shutdown Cluster' function) and bring it back up, the vSAN is hosed. It either has no capacity or only the capacity of the disks from one host, and the contents are gone. The vCLS VMs (inaccessible, and I can't remove them) won't start because the datastore they were on (vSAN) is now empty. I've recreated my environment at least 10 times now with the same results.

    My host system is a Ryzen 9 5900X with 128 GB of memory, so I'm not running short on resources. I've tried using Quickstart and manually creating the vSAN, with the same results. I have two virtual NICs in each ESXi host, with vMotion and vSAN running on their own DSwitch on the second NIC (the first NIC is actually connected to the physical NIC in my host PC). I'm thinking of adding another ESXi host to make a 4-node cluster to see if that helps. Any ideas would be greatly appreciated. Also, I've run Skyline Health Diagnostics and it didn't really find anything except NICs losing connectivity once.



  • 19.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Mar 15, 2022 09:30 AM

    When it is in that state, are you using the restore function (the reverse of shutdown)?

     

    Can you check if this returns 1 (enabled) on any host:

    # esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs

     

    If so, then this is the same issue I linked the KB for above.
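
    (If it does return 1, the host-side workaround essentially comes down to reverting that setting on each host, roughly as below - see the KB for the full procedure and the other setting involved:)

    # unpause vSAN DOM component updates on this host (back to the default of 0)
    esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs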

     



  • 20.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Mar 15, 2022 05:30 PM

    The cluster 'Restart' command only appeared for me once, but I can't remember the specific circumstances. I'm in the process of recreating my cluster (again), and I'll run that command and post back what I get. At this point I was thinking about rolling back my ESXi hosts to 7.0u2, as mentioned earlier in this thread, but I would rather be running the current version if I can get it to work properly.



  • 21.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Mar 15, 2022 06:59 PM

    Replied in the wrong place - hopefully this moves the reply to the right place.



  • 22.  RE: vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

    Posted Mar 15, 2022 07:04 PM

    Trying to reply to the post at 03-15-2022 02:29 AM. OK, I've rebuilt my cluster as before. Before shutting down the cluster, the value returned by that command is 0 (which I guess is expected at this point). Next, I executed the cluster shutdown command; at this point the 'Restart Cluster' command is present. I powered up all 3 hosts and 'Restart Cluster' is still present. Next, I ran the command you gave me, and the value is now '1'. After that, I ran the 'Restart Cluster' command now that the ESXi hosts were back up. After it finished, the vSAN storage was only showing capacity equal to one of the ESXi hosts, and it was empty. Running the command you gave returned '1' again. So I'm assuming my solution would be for scenario 3 in the knowledge base article you referenced. I'm very new to this, so I'll try to get through step 2 of the solution for scenario 2. Thanks so much for your help.