vSAN1


I have lost my vSAN. Could you help?

  • 1.  I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 01:26 PM

    Is there somebody who can help me retrieve my vSAN setup?

    That day was sooo bad; I was making one bad decision after another. I cannot believe myself. Literally, I cannot believe myself :smileysad:

    Let me describe my setup briefly: 3 Dell R720xd nodes with 3 SSDs each. vCenter runs on the vSAN.

    1) I lost my UPS, an APC. I sent it to be fixed, and meanwhile I used an old one with kind of old batteries. They never returned the APC to me, so I kept using the old one.

    - I never changed it. I said OK, it works for a few minutes, we are fine.

    2) I lost one SSD, but the system kept working fine.

    - I said OK, it works, I am too busy, let's do something else.

    Now things go reallyyyy bad! That day arrived, and due to strong winds we had power outages during the night. I arrived at the site in the morning and could not locate any VMs. I said OK, let's look, maybe I have to turn them on. I logged in to ESXi. It seemed vCenter was not working; I tried to restart the server from ESXi, but the task kept loading forever. And I could not open any VM or do anything.

    so...

    3) I restarted all three servers simultaneously. (It is bad, I know; don't ask me why I did that.) After that restart, every VM appeared with Status: "Invalid". But I could see the vSAN, including its capacity. I could not register any VM at that point, since when I opened the explorer I could not see any files.

    4) I said OK, let me shut them down one by one, open them, and get the dust out, because over the years they had collected a lot of dust, and then see if something gets fixed! I opened them and cleaned them carefully, but now one of the three servers didn't want to power on at all! LOL, it went from bad to worse. At this point vSAN was appearing as 0 from the other two servers!

    5) I was so sad and mad, and without a clear mind I said OK, let's restore from an older backup and build everything from the beginning. So, of the 2 running servers, I removed one from the vSAN cluster and tried to erase one of its disks in order to start rebuilding everything from scratch. At that point I remembered that I have some really important files that I have not backed up.

    Long story short: I have one server that cannot be powered on (I have tried changing power supplies, no luck), and one of the two working servers has one erased SSD (out of 3 SSDs) and has also been removed from the vSAN cluster.

    Since I have some really important files that I want to retrieve, I would appreciate it if somebody could help me with this situation.

    I think even God would find this difficult. Or maybe a real vSAN and server expert could help me. I am really sad about my bad decisions. I do not have any support at this time, so I would be glad to discuss payment in PM if somebody could help me!

    Thank you in advance,

    Frank



  • 2.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:04 PM

    Did you try contacting VMware Support? If recovery is possible at all, which sounds unlikely, they would be your best option!



  • 3.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:09 PM

    Okay, I have a crazy idea, which may work...

    You have 2 working servers, or at least 2 servers that can be powered on. Considering the host with the failed SSD was the first failure in your environment, the below may just work. I provide no guarantee, and can't be held liable in ANY SHAPE OR FORM!

    I would try the following:

    • Physically mark my SSDs
    • Remove ALL the SSDs from the host in which an SSD failed, and don't touch them again
    • Remove the SSDs from the host which cannot be powered on any longer
    • Place the SSDs from the host that could not be powered on into the host where the SSD failed, in the same order.
    • Power on the host with the "new" SSDs
    • Wait and hope the objects become accessible again.


  • 4.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:14 PM

    This is a crazy idea indeed haha. I will try it tomorrow, although it is highly unlikely to work. In any case, if it works, I will ask for your PayPal account to give you a donation :smileysilly:



  • 5.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:19 PM

    Nah, I work for VMware, I can't take any money. But theoretically it has a chance of success, as the host that died last should have the latest data, just like the host which is 100% healthy. So if you combine the good disks with the good server you should be able to power on VMs again.



  • 6.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:10 PM

    No, as I do not have any support subscription... This is why I am asking whether a vSAN expert could help me, for payment of course!



  • 7.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:13 PM

    See the above, it may just work. That is what I would do to be honest.



  • 8.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:21 PM

    Hello Frank,

    I was going to suggest, as Duncan said, removing the disks from the host that you cannot power on and putting them in a server that you can power on but that doesn't have functional/current data.

    However, before you do this - can you add the host you removed from the cluster back (either via the UI or via the CLI if on an older build) and share the output of:

    # esxcli vsan debug object list

    If this is an older build which doesn't have this command then similar data can be generated using:

    # python /usr/lib/vmware/vsan/bin/vsan-health-status.pyc > /tmp/healthOut.txt

    If it cannot tell the state of the data due to everything being inaccessible, then the CMMDS output should tell us what the state of the components is:

    # cmmds-tool find -t DOM_OBJECT -f json > /tmp/DOMOut.txt

    # cmmds-tool find -t HOSTNAME -f json > /tmp/HOSTNAMEOut.txt

    # cmmds-tool find -t NODE_DECOM_STATE -f json > /tmp/DECOMOut.txt
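
Once a file such as /tmp/HOSTNAMEOut.txt exists, a short script can summarise which hosts CMMDS knows about and how it rates their health. The following is only a minimal sketch, not a VMware tool: it assumes the output has the `{"entries": [...]}` shape with `uuid`, `health` and `content.hostname` fields seen in the HOSTNAME output posted in this thread, and `summarize_hostnames` is a hypothetical helper name. The sample is trimmed to the fields the script reads.

```python
import json

# Sample shaped like `cmmds-tool find -t HOSTNAME -f json` output.
# Values are taken from the esxi10 output in this thread; other fields are omitted.
SAMPLE = """
{
"entries":
[
{
   "uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",
   "health": "Healthy",
   "type": "HOSTNAME",
   "content": {"hostname": "esxi10.virtual.store"}
}
,{
   "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
   "health": "Unhealthy",
   "type": "HOSTNAME",
   "content": {"hostname": "esxi30.virtual.store"}
}
]
}
"""


def summarize_hostnames(text):
    """Return (hostname, uuid, health) tuples for every HOSTNAME entry."""
    entries = json.loads(text).get("entries", [])
    return [
        (entry["content"]["hostname"], entry["uuid"], entry["health"])
        for entry in entries
        if entry.get("type") == "HOSTNAME"
    ]


if __name__ == "__main__":
    # To read a saved file instead, pass open("/tmp/HOSTNAMEOut.txt").read().
    for hostname, uuid, health in summarize_hostnames(SAMPLE):
        print(hostname.ljust(24), uuid, health)
```

The script only reads the saved text, so it can be run on a workstation against a copy of the file rather than on the host itself.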

    Note that VMware offer pay-per-incident support that you may be able to avail of here.

    Anyone that knows enough about vSAN to help here probably works for VMware GSS/PSO and thus won't be able to accept payment as this would likely violate the terms of our contracts.

    Bob

    Edit: added a command.



  • 9.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:47 PM

    Hi,

    I didn't know this. OK I will see the "pay-per-incident" as my last option then! I really appreciate your help because I am in a difficult position!!!!!

    Because I am really curious, and with the help of you guys, I found the keys and I am alone here with the servers around me! I cannot wait until tomorrow :smileyhappy:

    So what I was thinking is to change the power supplies and then check whether I have faulty cards. This is to start the second server, the one that doesn't power on, and then put the working SSDs in it. I am not sure how to add the working server to the vSAN cluster without vCenter... So I will try to power on the second server.



  • 10.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 02:54 PM

    Hello Frank,

    Sorry, you already said you don't have vCenter, so CLI it is.

    What build of ESXi is in use? This will determine whether you may have to manually add entries to the unicastagent lists.

    If it is 6.0/<6.5P01 and not a stretched cluster then it is just a case of validating that you have the vSAN vmk configured in the same subnet as the other hosts vSAN vmk and join the cluster:

    On the node that never left cluster:

    # esxcli vsan cluster get

    Note the 'Sub-Cluster UUID'

    On the node that left the cluster:

    # esxcli vsan cluster join -u <sub-Cluster UUID>
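
Because `esxcli vsan cluster get` prints several similar-looking UUIDs (Local Node UUID, Sub-Cluster Master UUID, Sub-Cluster UUID), it is easy to copy the wrong one by hand. The sketch below is one hedged way to avoid that: it parses saved output and builds the join command from the `Sub-Cluster UUID` field only. The helper names `parse_cluster_get` and `build_join_command` are hypothetical, and the sample is a trimmed copy of the esxi10 output posted in this thread.

```python
import re

# Trimmed `esxcli vsan cluster get` output from the node that never left the cluster.
SAMPLE = """\
Cluster Information
   Enabled: true
   Current Local Time: 2020-04-21T18:29:09Z
   Local Node UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
   Local Node State: MASTER
   Sub-Cluster Master UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
   Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada
   Sub-Cluster Member Count: 1
"""

UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def parse_cluster_get(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def build_join_command(text):
    """Build the join command from the exact 'Sub-Cluster UUID' field."""
    cluster_uuid = parse_cluster_get(text)["Sub-Cluster UUID"]
    if not UUID_PATTERN.match(cluster_uuid):
        raise ValueError("unexpected Sub-Cluster UUID value: " + repr(cluster_uuid))
    return "esxcli vsan cluster join -u " + cluster_uuid


if __name__ == "__main__":
    print(build_join_command(SAMPLE))
```

Looking the field up by its exact name is the point of the sketch: the Local Node UUID and Sub-Cluster Master UUID are never candidates for the join command.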

    As I said above, I think it is worthwhile checking the state of the data components (and where the active/absent/degraded ones reside) with what you have now, before you start switching hardware components (so that we can validate whether that will help).

    Bob



  • 11.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 03:17 PM

    Hi,

    Yes, that worked. Now, with the server that rejoined, I have some info, as you can see in the attachment. But the commands above do not show anything on the newly joined server. On the server that never left, I get some output from the commands above. What do you advise me to do now?



  • 12.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 03:34 PM

    Please clarify some things for me. Would it be best to:

    A) Try to power on the second server using parts from the one that left the cluster, and put the working SSDs in it?

    B) Try to put the working SSDs in the newly joined server?

    Also, I would like to know: suppose I have three SSDs (one 480 GB, one 1.6 TB and one 1.92 TB) in each of the 3 servers, same setup.

    If the second server has a failed 1.6 TB disk and the third server a failed 1.92 TB disk, can I mix the working disks in the third server? Or does it have to be all 3 from the same server?

    I am asking because I do not remember whether I have one failed disk or two...



  • 13.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 04:04 PM

    Hello Frank,

    That likely won't return any useful data when you still have just one node partitioned by itself.

    Have you attempted rejoining the cluster on the node where you erased an SSD?

    "if from second server I have a failed disk 1.6TB and from third server 1.92TB. Can I mix the working disks in third server? Or it has to be straight 3 from same server?

    I am saying this because I do not remember if I have on failed disk or two..."

    Do you have failed disks or disk(s) that you wiped?

    Do you recall which disk(s) failed or were wiped, e.g. Capacity-tier or Cache-tier? If you wipe a Cache-tier device, the Disk-Group is gone, and no, you cannot add Capacity-tier devices with data on them to another existing Disk-Group.
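
To see which device is the Cache-tier of each Disk-Group, the output of `esxcli vsan storage list` can be grouped by its `VSAN Disk Group UUID` field. This is a minimal sketch under stated assumptions: it relies on the layout seen in the output posted in this thread (an unindented device name followed by indented `Key: value` lines, with `Is Capacity Tier: false` marking the Cache-tier device), and `parse_storage_list` and `group_by_disk_group` are hypothetical helper names.

```python
# Trimmed `esxcli vsan storage list` output, values from esxi10 in this thread.
SAMPLE = """\
naa.5000c5003017925b
   Device: naa.5000c5003017925b
   VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8
   Is Capacity Tier: false
   Creation Time: Tue May  8 11:33:35 2018
naa.5000c5003018c6d3
   Device: naa.5000c5003018c6d3
   VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8
   Is Capacity Tier: true
   Creation Time: Tue May  8 11:33:35 2018
"""


def parse_storage_list(text):
    """Return one dict per device; an unindented line starts a new device."""
    devices = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            current = {"Name": line.strip()}
            devices.append(current)
        elif current is not None:
            key, sep, value = line.partition(":")
            if sep:
                current[key.strip()] = value.strip()
    return devices


def group_by_disk_group(devices):
    """Map each Disk-Group UUID to its cache device and capacity devices."""
    groups = {}
    for dev in devices:
        group = groups.setdefault(
            dev["VSAN Disk Group UUID"], {"cache": None, "capacity": []}
        )
        if dev["Is Capacity Tier"] == "true":
            group["capacity"].append(dev["Name"])
        else:
            group["cache"] = dev["Name"]
    return groups


if __name__ == "__main__":
    for uuid, group in group_by_disk_group(parse_storage_list(SAMPLE)).items():
        print(uuid, "cache:", group["cache"], "capacity:", group["capacity"])
```

A group whose `cache` entry is missing after a wipe is the case described above: its Capacity-tier devices cannot simply be attached to a different Disk-Group.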

    Bob



  • 14.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 04:28 PM

    Hi!

    It looks like I am getting somewhere!!!!!!!!!!!!!

    Let me explain to clear things up, as it is important. Suppose I have three servers; let's name them:

    esxi10: This is the working server that has never left the vSAN cluster (vCenter lives there also) and has no faulty SSDs.

    esxi20: This is the server that does not power on; it has never left vSAN. I do not remember if it has faulty SSDs.

    esxi30: This is the server I joined back to the cluster; it has one wiped SSD and one SSD that does not work at all.

    OK, I put esxi30 in maintenance mode and moved all the disks from esxi20 (which does not power on) into esxi30. It looks like all three SSDs are working fine (Phewww). And esxi30 can now see the vSAN datastore :smileygrin:

    esxi10 continues to see vSAN as 0. How should I continue?!



  • 15.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 06:19 PM

    Hello Frank,

    " And esxi30 can now see the vSAN datastore

    esxi10 continues to see vSAN as 0. How should I continue?!"

    Please share the output from both nodes of:

    # df -h

    # esxcli vsan cluster get

    # vdq -Hi

    # esxcli vsan storage list

    # cmmds-tool find -t HOSTNAME -f json

    # cmmds-tool find -t NODE_DECOM_STATE -f json

    If a node is properly clustered with other nodes, then it should see the size of the vsanDatastore as the total of its own storage and the other clustered nodes' storage (unless they are a) in Maintenance Mode or b) have their local storage unmounted/inaccessible).

    Bob



  • 16.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 06:28 PM

    Did you see my last post? Now both servers see the vSAN



  • 17.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 06:31 PM

    For esxi10:

    [root@esxi10:~] df -h

    Filesystem   Size   Used Available Use% Mounted on

    VMFS-6      68.2G   5.3G     63.0G   8% /vmfs/volumes/DatastoreHP 1

    VMFS-6       3.6T 151.3G      3.4T   4% /vmfs/volumes/NAS BackUP

    vfat       285.8M 209.1M     76.8M  73% /vmfs/volumes/5aef06c6-480da194-cbd0-a0369f1fd368

    vfat       249.7M 159.2M     90.5M  64% /vmfs/volumes/7e39d0ef-9c1fd3eb-730f-029c611a571f

    vfat       249.7M 151.5M     98.3M  61% /vmfs/volumes/dc7018e8-ba3651d8-949c-d20a095b30e1

    vsan         1.7T 964.3G    824.2G  54% /vmfs/volumes/vsanDatastore

    [root@esxi10:~] esxcli vsan cluster get

    Cluster Information

       Enabled: true

       Current Local Time: 2020-04-21T18:29:09Z

       Local Node UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

       Local Node Type: NORMAL

       Local Node State: MASTER

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada

       Sub-Cluster Membership Entry Revision: 14

       Sub-Cluster Member Count: 1

       Sub-Cluster Member UUIDs: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

       Sub-Cluster Membership UUID: 95ee975e-2188-ae76-a2f7-a0369f1fd368

       Unicast Mode Enabled: true

       Maintenance Mode State: OFF

       Config Generation: 80535089-0092-41b3-93dc-3df97c24b6b0 4 2020-03-27T10:40:06.555

    [root@esxi10:~] vdq -Hi

    Mappings:

       DiskMapping[0]:

               SSD:  naa.5000c5003017925b

                MD:  naa.5000c5003018c6d3

    [root@esxi10:~] esxcli vsan storage list

    naa.5000c5003017925b

       Device: naa.5000c5003017925b

       Display Name: naa.5000c5003017925b

       Is SSD: true

       VSAN UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8

       VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8

       VSAN Disk Group Name: naa.5000c5003017925b

       Used by this host: true

       In CMMDS: true

       On-disk format version: 5

       Deduplication: false

       Compression: false

       Checksum: 7053597502770896794

       Checksum OK: true

       Is Capacity Tier: false

       Encryption: false

       DiskKeyLoaded: false

       Creation Time: Tue May  8 11:33:35 2018

    naa.5000c5003018c6d3

       Device: naa.5000c5003018c6d3

       Display Name: naa.5000c5003018c6d3

       Is SSD: true

       VSAN UUID: 52a1d26c-9f75-4646-9e1b-203690ad4d57

       VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8

       VSAN Disk Group Name: naa.5000c5003017925b

       Used by this host: true

       In CMMDS: true

       On-disk format version: 5

       Deduplication: false

       Compression: false

       Checksum: 13265426673954443271

       Checksum OK: true

       Is Capacity Tier: true

       Encryption: false

       DiskKeyLoaded: false

       Creation Time: Tue May  8 11:33:35 2018

    [root@esxi10:~] cmmds-tool find -t HOSTNAME -f json

    {

    "entries":

    [

    {

       "uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

       "owner": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

       "health": "Healthy",

       "revision": "0",

       "type": "HOSTNAME",

       "flag": "2",

       "minHostVersion": "0",

       "md5sum": "148ef7e719a8a60fe2691226efc28b1b",

       "valueLen": "32",

       "content": {"hostname": "esxi10.virtual.store"},

       "errorStr": "(null)"

    }

    ,{

       "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "health": "Unhealthy",

       "revision": "0",

       "type": "HOSTNAME",

       "flag": "0",

       "minHostVersion": "0",

       "md5sum": "2215d3424456e0aaf039f1e639e0014d",

       "valueLen": "32",

       "content": {"hostname": "esxi30.virtual.store"},

       "errorStr": "(null)"

    }

    ]

    }

    [root@esxi10:~] cmmds-tool find -t NODE_DECOM_STATE -f json

    {

    "entries":

    [

    {

       "uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

       "owner": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

       "health": "Healthy",

       "revision": "10",

       "type": "NODE_DECOM_STATE",

       "flag": "2",

       "minHostVersion": "0",

       "md5sum": "3c2593056659ee3c9e97039a3eefea8e",

       "valueLen": "80",

       "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},

       "errorStr": "(null)"

    }

    ,{

       "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "health": "Unhealthy",

       "revision": "0",

       "type": "NODE_DECOM_STATE",

       "flag": "0",

       "minHostVersion": "0",

       "md5sum": "ccd967c4aed2b1781c86c6e18e5d8348",

       "valueLen": "80",

       "content": {"decomState": 6, "decomJobType": 1, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 5},

       "errorStr": "(null)"

    }

    ]

    }

    [root@esxi10:~]



  • 18.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 06:35 PM

    For esxi30:

    [root@esxi30:~] df -h

    Filesystem   Size   Used Available Use% Mounted on

    VMFS-6      68.2G   5.3G     63.0G   8% /vmfs/volumes/DatastoreHP 2

    VMFS-6       3.6T 151.3G      3.4T   4% /vmfs/volumes/NAS BackUP

    vfat       249.7M 159.3M     90.5M  64% /vmfs/volumes/38700cee-efb20e12-8f17-ea7cb3da94b7

    vfat       285.8M 209.1M     76.8M  73% /vmfs/volumes/5aef35d7-7f17cad8-5f16-a0369f1fd36c

    vfat       249.7M 151.5M     98.3M  61% /vmfs/volumes/b7cdaaec-14e48f24-4c35-859dabb8d5b9

    vsan         1.7T 768.6G   1019.9G  43% /vmfs/volumes/vsanDatastore

    [root@esxi30:~] esxcli vsan cluster get

    Cluster Information

       Enabled: true

       Current Local Time: 2020-04-21T18:33:22Z

       Local Node UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c

       Local Node Type: NORMAL

       Local Node State: MASTER

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

       Sub-Cluster Membership Entry Revision: 0

       Sub-Cluster Member Count: 1

       Sub-Cluster Member UUIDs: 5aef3508-100f-974c-ba2e-a0369f1fd36c

       Sub-Cluster Membership UUID: 6d1a9f5e-7c21-2810-3f16-a0369f1fd36c

       Unicast Mode Enabled: true

       Maintenance Mode State: OFF

       Config Generation: None 0 0.0

    [root@esxi30:~] vdq -Hi

    Mappings:

       DiskMapping[0]:

               SSD:  naa.5000c50030176437

                MD:  naa.5000c5003015fc67

    [root@esxi30:~] esxcli vsan storage list

    naa.5000c50030176437

       Device: naa.5000c50030176437

       Display Name: naa.5000c50030176437

       Is SSD: true

       VSAN UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389

       VSAN Disk Group UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389

       VSAN Disk Group Name: naa.5000c50030176437

       Used by this host: true

       In CMMDS: true

       On-disk format version: 5

       Deduplication: false

       Compression: false

       Checksum: 584334970149048049

       Checksum OK: true

       Is Capacity Tier: false

       Encryption: false

       DiskKeyLoaded: false

       Creation Time: Tue May  8 14:33:49 2018

    naa.5000c5003015fc67

       Device: naa.5000c5003015fc67

       Display Name: naa.5000c5003015fc67

       Is SSD: true

       VSAN UUID: 52c7e41d-69a6-bf50-ff6f-0988952f2379

       VSAN Disk Group UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389

       VSAN Disk Group Name: naa.5000c50030176437

       Used by this host: true

       In CMMDS: true

       On-disk format version: 5

       Deduplication: false

       Compression: false

       Checksum: 3089451523050927556

       Checksum OK: true

       Is Capacity Tier: true

       Encryption: false

       DiskKeyLoaded: false

       Creation Time: Tue May  8 14:33:49 2018

    [root@esxi30:~] cmmds-tool find -t HOSTNAME -f json

    {

    "entries":

    [

    {

       "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "health": "Healthy",

       "revision": "0",

       "type": "HOSTNAME",

       "flag": "2",

       "minHostVersion": "0",

       "md5sum": "2215d3424456e0aaf039f1e639e0014d",

       "valueLen": "32",

       "content": {"hostname": "esxi30.virtual.store"},

       "errorStr": "(null)"

    }

    ]

    }

    [root@esxi30:~] cmmds-tool find -t NODE_DECOM_STATE -f json

    {

    "entries":

    [

    {

       "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

       "health": "Healthy",

       "revision": "7",

       "type": "NODE_DECOM_STATE",

       "flag": "2",

       "minHostVersion": "0",

       "md5sum": "3c2593056659ee3c9e97039a3eefea8e",

       "valueLen": "80",

       "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},

       "errorStr": "(null)"

    }

    ]

    }

    [root@esxi30:~]



  • 19.  RE: I have lost my vSAN. Could you help?
    Best Answer

    Posted Apr 21, 2020 06:50 PM

    Hello Frank,

    "Sub-Cluster Member Count: 1"

    They are not members of the same cluster, because you used node esxi10's UUID instead of the cluster UUID:

    "[root@esxi30:~] esxcli vsan cluster get

    ...

       Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368"

    So basically you created a new cluster with that UUID.
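
The same diagnosis can be sketched as a small check over the `esxcli vsan cluster get` output of both nodes. This is an illustrative sketch only: `parse_fields` and `diagnose` are hypothetical helper names, the samples are trimmed copies of the esxi10 and esxi30 output posted above, and the assumption that multiple member UUIDs are separated by commas and/or whitespace is mine.

```python
import re

# Trimmed `esxcli vsan cluster get` output from both nodes in this thread.
ESXI10 = """\
   Local Node UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
   Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
"""
ESXI30 = """\
   Local Node UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c
   Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5aef3508-100f-974c-ba2e-a0369f1fd36c
"""


def parse_fields(text):
    """Turn 'Key: value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diagnose(node_a, node_b):
    """Return a list of findings explaining why two nodes are not clustered."""
    a, b = parse_fields(node_a), parse_fields(node_b)
    findings = []
    if a["Sub-Cluster UUID"] != b["Sub-Cluster UUID"]:
        findings.append("different Sub-Cluster UUIDs")
    for this, other in ((a, b), (b, a)):
        # The mistake from this thread: joining with a node UUID as the cluster UUID.
        if this["Sub-Cluster UUID"] == other["Local Node UUID"]:
            findings.append("a node UUID was used as the cluster UUID")
        # Assumption: member UUIDs are separated by commas and/or whitespace.
        members = re.split(r"[,\s]+", this["Sub-Cluster Member UUIDs"])
        if other["Local Node UUID"] not in members:
            findings.append(
                this["Local Node UUID"] + " does not list the other node as a member"
            )
    return findings


if __name__ == "__main__":
    for finding in diagnose(ESXI10, ESXI30):
        print("-", finding)
```

An empty list from `diagnose` would mean both nodes report the same Sub-Cluster UUID and list each other as members; it says nothing about the state of the data itself.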

    Leave and rejoin the cluster correctly by running the following on esxi30:

    # esxcli vsan cluster leave

    # esxcli vsan cluster join -u 529f57d0-a063-c30e-191f-8c9dab9faada

    You may need to manually repopulate the unicastagent lists:

    VMware Knowledge Base

    Bob



  • 20.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 08:32 PM

    I am not sure how to express my feelings! God bless you!!! You saved a life somewhere in the world (Crete). I never thought it could work from that initial situation!

    Thank you everybody!

    I wish the very best to you and all VMware team for bringing such awesome software alive!

    :smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin::smileygrin:



  • 21.  RE: I have lost my vSAN. Could you help?

    Posted Apr 22, 2020 06:18 AM

    This is awesome to hear Frank!

    But please, please... Back up everything ASAP! Also, please have the discussion around a support contract, as in situations like these you don't want to be dependent on the help of random people on the VMware communities.

    Hope you enjoy the rest of the week.



  • 22.  RE: I have lost my vSAN. Could you help?

    Posted Apr 22, 2020 08:37 AM

    Hello Frank,

    Glad to hear you got your data back and happy that we could help get you there.

    But please, as Duncan said, if you are running/storing anything that you even remotely care about, then ensure you have a robust back-up plan in place for situations like this, where the importance of the data is only realised after disks/hosts are wiped etc.

    Bob



  • 23.  RE: I have lost my vSAN. Could you help?

    Posted Apr 21, 2020 06:17 PM

    OK I have tried to restart services for esxi10 with:

    /etc/init.d/hostd restart

    /etc/init.d/vpxa restart

    Now it looks like esxi10 sees the vSAN also! Now I am back to my earlier state (step "3"), where I could not access the VMs after the restart but could see the vSAN datastore.

    Please check attachments.

    1) I cannot unregister and register VMs, as vSAN appears empty in the explorer.

    2) I cannot power on any machine (the option is grayed out), as they appear invalid.