Hi,
I used the new vSAN cluster shutdown wizard yesterday for the first time when I had to shut a lab down for a power outage. The cluster consists of three Dell R730xd nodes, with vCenter residing on a separate non-vSAN node. The shutdown was clean, and the hosts were powered off properly before the power was lost. On bootup, all the vSAN VMs are listed as Inaccessible, and their files aren't visible when browsing the datastore (via GUI or command line).
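For reference, these are the checks I've been using to confirm that from the ESXi shell - nothing exotic, and I'm assuming the default vsanDatastore name and that the esxcli vsan debug namespace is present on this build:

# Browse the vSAN datastore root directly
ls /vmfs/volumes/vsanDatastore/
# Summarise object accessibility (inaccessible / reduced-availability counts)
esxcli vsan debug object health summary get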
The button to restart the cluster was not present, so I followed the instructions to restart the cluster manually via the command line. The recover script, however, times out:
[root@esx01:/tmp] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
Begin to recover the cluster...
Time among connected hosts are synchronized.
Scheduled vSAN cluster restore task.
Waiting for the scheduled task...(18s left)
Checking network status...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Timeout, please try again later
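Since the script appears to stall on the network check, I also sanity-checked the vSAN network manually on each host (vmk1 is just my vSAN vmknic and the peer IP is a placeholder - substitute yours):

# Confirm the vSAN-tagged vmknic on this host
esxcli vsan network list
# Confirm vSAN traffic can reach the other nodes
vmkping -I vmk1 <vsan-ip-of-other-host>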
I have been digging since then with no success. The cluster looks to have reformed properly:
[root@esx01:/tmp] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-10-26T18:22:51Z
Local Node UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
Sub-Cluster Backup UUID: 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0
Sub-Cluster UUID: 528157e4-4935-2809-ab88-5d161aec89a5
Sub-Cluster Membership Entry Revision: 4
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c, 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0, 614b991e-f0f6-8762-c918-801844e56f42
Sub-Cluster Member HostNames: esx01, esx02, esx03
Sub-Cluster Membership UUID: fd197861-36c6-b896-868a-a0369f59e56c
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: a01749d0-4f70-4911-aeb0-919cfdc176bb 31 2021-10-26T18:12:08.28
Mode: REGULAR
esx01 is the master, esx02 shows as backup, and esx03 as agent. The unicast agent list looks correct, and the vSAN vmknics were re-tagged properly.
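For completeness, this is how I checked the unicast list on each host (output trimmed, but all three node UUIDs and vSAN IPs appear on every host):

esxcli vsan cluster unicastagent list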
In RVC, this is typical of what I see for each VM:
/localhost/Lab/vms/wan-test> vsan.vm_object_info .
VM wan-test:
Disk backing:
[vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
[vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmx
DOM Object: 65d06061-65f9-6456-0383-a0369f59e56c (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
RAID_1
Component: 65d06061-6ce4-8057-b167-a0369f59e56c (state: ACTIVE (5), host: esx01, capacity: naa.5002538e30820eb4, cache: naa.55cd2e404b795ca0,
votes: 1, usage: 0.2 GB, proxy component: false)
Component: 65d06061-7415-8457-02b2-a0369f59e56c (state: ABSENT (6), csn: STALE (109!=160), host: esx03, capacity: naa.5002538ec110af17, cache: naa.55cd2e404b796059,
dataToSync: 0.21 GB, votes: 1, usage: 0.2 GB, proxy component: false)
Witness: 65d06061-92d5-8757-6ca8-a0369f59e56c (state: ACTIVE (5), host: esx02, capacity: naa.5002538e4066db91, cache: naa.55cd2e404b78c07a,
votes: 1, usage: 0.0 GB, proxy component: false)
[vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
RAID_1
Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
votes: 1, usage: 1.6 GB, proxy component: false)
Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
votes: 1, usage: 0.0 GB, proxy component: false)
The two things that stand out to me are the "No POLICY entry found in CMMDS" message, and that in every VM's case the piece residing on esx03 is the one marked ABSENT - whether it's a data component or a witness.
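On the policy message: my (possibly wrong) understanding is that clomd publishes policy entries into CMMDS and that the shutdown workflow pauses DOM via an advanced setting, so I have been checking both on each host - if these aren't the right knobs on this build, please correct me:

# Is the Cluster Level Object Manager daemon running?
/etc/init.d/clomd status
# Advanced settings I believe the shutdown/recover workflow toggles (expecting 0 after recovery)
esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates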
I have tried various ways of manipulating the storage policy to reduce FTT to 0, but none of them take, as the new policy can't be applied due to the invalid state of the VM objects.
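In case the exact attempt matters, the RVC route I tried was along these lines (cluster path and policy-string syntax from memory, so I may have it slightly wrong; the UUID is the namespace object from the vsan.vm_object_info output above):

vsan.object_reconfigure <cluster> 65d06061-65f9-6456-0383-a0369f59e56c -p '(("hostFailuresToTolerate" i0))'

It errors out rather than reconfiguring, which is what I mean by the policy not taking.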
Any help would be greatly appreciated. I'd love to open a support ticket, but we don't have support on this small lab environment, and I'd rather not rebuild from backups if I can avoid it. I'm also trying to understand why this happened in the first place and whether the cluster shutdown functionality can be relied upon.