 Blew up my VSAN again, need help

dbutch1976 posted Jul 30, 2024 01:58 PM

Luckily this is a lab or I would have been fired by now, lol.

So I was having an issue where I couldn't perform pre-checks for patching or enable/disable HA for two hosts in a four-node cluster. While troubleshooting I found an article which indicated this could happen if the VSAN shutdown wizard had been started but had not completed successfully.

I have never consciously used this wizard before, but with all the issues I've had with my VSAN I believed that this could possibly be related to what I was seeing. I decided to try using the wizard to shut down and then restart my cluster and see if that resolved my issues with the HA agent. The shutdown worked flawlessly, then it powered everything back on and things looked good. However, when I attempted to restart the cluster, the two nodes failed to install the HA agent, which in turn left the wizard unable to finish bringing the VSAN back up fully, although everything was working except for the HA error.

At this point I made a mistake. I believe I must have turned VSAN off entirely. Since this is a lab I decided I would just accept the data loss and re-form the VSAN from scratch, however I can't do this because it detects that there is already data on the disks and says the disks are ineligible.

Perhaps I can still recover the VSAN and my disks. I found this article on manually restarting the VSAN cluster, is this the process I would need to use to recover my VSAN and get my data back?

Manually Shut Down and Restart the vSAN Cluster (vmware.com)

TheBobkin  Best Answer

@Mohammed Viquar Ahmed - Sorry but you are misinformed, that is completely untrue, please don't spread such misinformation.

When you disable vSAN on a cluster all it does is: leave cluster, disable the vSAN modules and remove the unicastagent list on the nodes - it DOES NOT remove partitions from disks.

@dbutch1976, it should just be a case of re-enabling vSAN (unless perhaps you scrapped the cluster or reinstalled ESXi). If it cannot detect the vSAN cluster UUID and re-form the 'old' cluster, enabling vSAN may leave you with a 'new' cluster, which has the negative impact of a new vsanDatastore (and associated issues). This is quite easily fixed: you just need the original cluster UUID (which can also be inferred from the 'old' vsanDatastore path). Join the nodes to that cluster UUID from the command line (e.g. esxcli vsan cluster join -u <UUID>), then when you enable vSAN on the cluster it should detect that UUID and use it.
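For anyone wondering how the UUID can be read off the datastore path, here is a minimal sketch. It assumes the old vsanDatastore shows up under a path of the form vsan:<16 hex digits>-<16 hex digits>, and that the cluster UUID is those same 32 hex digits regrouped 8-4-4-4-12. The function name vsan_path_to_uuid is hypothetical, and the example path is reconstructed from the cluster UUID used later in this thread, not copied from a real host, so check the result against what your own nodes report before joining anything.

```shell
#!/bin/sh
# Hypothetical helper: turn a vsanDatastore path into a cluster UUID.
# Assumption: the path ends in "vsan:<16 hex>-<16 hex>" and the UUID is
# the same 32 hex digits regrouped as 8-4-4-4-12.
vsan_path_to_uuid() {
    printf '%s\n' "$1" | sed -E \
        -e 's:/*$::' \
        -e 's/.*vsan://' \
        -e 's/-//g' \
        -e 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/'
}

# Example with a path reconstructed from the UUID in this thread.
vsan_path_to_uuid "/vmfs/volumes/vsan:523fee61ca759436-d5b80e248d50647a"
```

The output can then be passed to esxcli vsan cluster join -u on each node.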

Broadcom Employee Mohammed Viquar Ahmed

Correction 

Disabling vSAN will stop the distributed storage functionality of vSAN. Your cluster will no longer use vSAN for storage, so it won't be able to provide features like distributed storage, replication, and resiliency.

If you later decide to re-enable vSAN, you might have to reconfigure it and ensure that your data and configurations are restored or migrated appropriately.

To avoid potential issues, it's essential to plan the migration of your data and VMs to another storage solution before disabling vSAN and ensure you have backups of critical data.

Sebastian Ulatowski

@Mohammed Viquar Ahmed Are you really a Broadcom employee? I am asking because what you wrote is not true.

When you disable the vSAN service and enable it again, everything works: the VMs are in inventory and the vSAN datastore is the same.

@dbutch1976 Did you disable HA and enable it again? What version of vSAN do you have? I am asking because some old versions had an error when restarting the vSAN service.

Regards,

Sebastian

Broadcom Employee Mohammed Viquar Ahmed

@Sebastian Ulatowski ,

Correction: when you disable vSAN (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-planning.doc/GUID-BA802B7B-1014-4F45-8237-E205E6B0A573.html).

Sebastian Ulatowski

@Mohammed Viquar Ahmed Have you read and understood the linked document?

There is not a word about data loss in this document; it only describes unavailability. I tested this situation in my lab by turning the service off and then turning it back on, so I'm sure how it works.

Regards,

Sebastian

dbutch1976

Hello,

Thanks for all the feedback, @TheBobkin I was able to get my data back with your assistance. From within the vCenter I have a single node (dessloch) which I cannot add to the cluster, because from the vCenter view it appears that VSAN has not been configured for the cluster, however, with the steps below I was able to get the VSAN working for each of my 4 nodes, and recover all my data. Here are the exact steps for anyone finding themselves in a similar situation:

#blue
esxcli vsan network list                           # -> vmk1
esxcli network ip interface ipv4 get | grep vmk1   # -> 192.168.0.134
# mac 00:50:56:64:c7:73
cmmds-tool whoami                                  # -> 64f8e854-6bb4-5163-ba94-48210b505bbb
#grey
esxcli vsan network list                           # -> vmk1
esxcli network ip interface ipv4 get | grep vmk1   # -> 192.168.0.172
# mac 00:50:56:6b:bf:a1
cmmds-tool whoami                                  # -> 64f8e301-42b9-9a23-96dc-48210b3f450b
#pink
esxcli vsan network list                           # -> vmk1
esxcli network ip interface ipv4 get | grep vmk1   # -> 192.168.0.189
# mac 00:50:56:68:5b:3c
cmmds-tool whoami                                  # -> 64f8ed26-acd7-5c09-3e1e-48210b506c23
#dessloch
esxcli vsan network list                           # -> vmk1
esxcli network ip interface ipv4 get | grep vmk1   # -> 192.168.0.190
# mac 00:50:56:63:20:a4
cmmds-tool whoami                                  # -> 6087f705-b8d8-e534-1b48-000acd39de18
#dessloch
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
esxcli vsan cluster join -u 523fee61-ca75-9436-d5b8-0e248d50647a
esxcli vsan cluster unicastagent add -t node -u 64f8e854-6bb4-5163-ba94-48210b505bbb -U true -a 192.168.0.134  -p 12321
esxcli vsan cluster unicastagent add -t node -u 64f8e301-42b9-9a23-96dc-48210b3f450b -U true -a 192.168.0.172  -p 12321
esxcli vsan cluster unicastagent add -t node -u 64f8ed26-acd7-5c09-3e1e-48210b506c23 -U true -a 192.168.0.189  -p 12321
esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
#blue
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
esxcli vsan cluster unicastagent list
esxcli vsan cluster join -u 523fee61-ca75-9436-d5b8-0e248d50647a
esxcli vsan cluster unicastagent add -t node -u 64f8e301-42b9-9a23-96dc-48210b3f450b -U true -a 192.168.0.172  -p 12321
esxcli vsan cluster unicastagent add -t node -u 64f8ed26-acd7-5c09-3e1e-48210b506c23 -U true -a 192.168.0.189  -p 12321
esxcli vsan cluster unicastagent add -t node -u 6087f705-b8d8-e534-1b48-000acd39de18 -U true -a 192.168.0.190  -p 12321
esxcli vsan cluster unicastagent list
#grey
esxcli vsan cluster unicastagent list
esxcli vsan cluster join -u 523fee61-ca75-9436-d5b8-0e248d50647a
esxcli vsan cluster unicastagent add -t node -u 64f8ed26-acd7-5c09-3e1e-48210b506c23 -U true -a 192.168.0.189  -p 12321
esxcli vsan cluster unicastagent add -t node -u 64f8e854-6bb4-5163-ba94-48210b505bbb -U true -a 192.168.0.134  -p 12321
esxcli vsan cluster unicastagent add -t node -u 6087f705-b8d8-e534-1b48-000acd39de18 -U true -a 192.168.0.190  -p 12321
esxcli vsan cluster unicastagent list
#pink
esxcli vsan cluster unicastagent list
esxcli vsan cluster join -u 523fee61-ca75-9436-d5b8-0e248d50647a
esxcli vsan cluster unicastagent add -t node -u 64f8e854-6bb4-5163-ba94-48210b505bbb -U true -a 192.168.0.134  -p 12321
esxcli vsan cluster unicastagent add -t node -u 64f8e301-42b9-9a23-96dc-48210b3f450b -U true -a 192.168.0.172  -p 12321
esxcli vsan cluster unicastagent add -t node -u 6087f705-b8d8-e534-1b48-000acd39de18 -U true -a 192.168.0.190  -p 12321
esxcli vsan cluster unicastagent list
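Since the per-node blocks above are easy to get wrong by hand (each node needs every peer except itself), here is a rough sketch that prints the join and unicastagent commands for a given node from a single table. The node names, UUIDs, IPs, port and cluster UUID are the ones from the steps above; the function name print_rejoin_commands and the NODES table layout are hypothetical. It only prints text and runs nothing, and it leaves out the IgnoreClusterMemberListUpdates setting, so review the output before pasting it into a host.

```shell
#!/bin/sh
# Node table from the steps above: name, node UUID, vSAN vmk IP.
NODES='blue 64f8e854-6bb4-5163-ba94-48210b505bbb 192.168.0.134
grey 64f8e301-42b9-9a23-96dc-48210b3f450b 192.168.0.172
pink 64f8ed26-acd7-5c09-3e1e-48210b506c23 192.168.0.189
dessloch 6087f705-b8d8-e534-1b48-000acd39de18 192.168.0.190'
CLUSTER_UUID=523fee61-ca75-9436-d5b8-0e248d50647a

# Hypothetical helper: print (not run) the commands for one node.
# $1 is the name of the node the commands will be pasted into.
print_rejoin_commands() {
    local_name=$1
    echo "esxcli vsan cluster join -u $CLUSTER_UUID"
    printf '%s\n' "$NODES" | while read -r name uuid ip; do
        # A node must not list itself as a unicast peer.
        [ "$name" = "$local_name" ] && continue
        echo "esxcli vsan cluster unicastagent add -t node -u $uuid -U true -a $ip -p 12321"
    done
}

# Example: commands to paste on the host called pink.
print_rejoin_commands pink
```

Keeping the table in one place means a node added or re-addressed later only needs changing once, not in every block.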




TheBobkin

Hello @dbutch1976,

Happy to help you get your cluster back together and glad to see it.

Just for awareness - while your method of piecing the cluster back together via esxcli is perfectly okay (I have done the same more times than let's say 'most'), this is going to be more error-prone and possibly tangled for those not familiar with esxcli and/or vSAN.

That is why I mentioned what I did in my last comment: if you just join the cluster (a single esxcli command per node; the nodes will still be isolated from one another) and then re-enable vSAN on the vSphere cluster, then, assuming the vsan-tagged vmks are correctly configured, vCenter will do 90% of the rest of the work to get it properly configured and all nodes clustered.