vSAN1

  • 1.  Full vSAN Cluster Power-Off for Site Migration

    Posted 6 days ago

    Hello Community,

    We're planning a full migration of all hosts in one of our vSAN clusters from one physical site to another. This will require a complete shutdown of the vSAN cluster for approximately 12 hours. The plan is to power it back on after the relocation.

    We've reviewed Broadcom's documentation, but we'd really appreciate hearing from anyone who has gone through a similar process. Any tips, lessons learned, or potential pitfalls to watch out for would be very helpful.

    Thanks in advance!



  • 2.  RE: Full vSAN Cluster Power-Off for Site Migration

    Posted 6 days ago

    Hi Cristian,

    We recently went through the same exercise and also followed Broadcom's documentation. The one thing to make sure of: when you shut down the vSAN cluster, make a note of which host in the cluster is the "orchestration host".

    When you start up the vSAN cluster - start up the "orchestration host" first! We didn't do that, and it caused issues.
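
    A minimal sketch of that ordering, assuming you recorded the orchestration host before shutdown. The host names and the `power_on_order` helper are hypothetical and only illustrate the idea of powering on that host before the rest:

```python
def power_on_order(hosts, orchestration_host):
    """Return the hosts with the recorded orchestration host moved to the front."""
    if orchestration_host not in hosts:
        raise ValueError(f"{orchestration_host!r} is not in the host list")
    rest = [h for h in hosts if h != orchestration_host]
    return [orchestration_host] + rest


# Hypothetical host names, for illustration only.
hosts = ["esx01", "esx02", "esx03", "esx04"]
for name in power_on_order(hosts, "esx03"):
    print("power on:", name)
```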

    Good luck!




  • 3.  RE: Full vSAN Cluster Power-Off for Site Migration

    Posted 5 days ago

    Hello,

    I have tested the vSAN shutdown in a lab many times, with simulated failures, and I can say that the "shutdown button" in VCSA works well when nothing goes wrong and all hosts boot successfully. With a damaged disk group or a host that will not boot, it is hard to bring the cluster up. It is possible, but I would not want to do it in a production environment, especially when there is no official documentation. I opened a VMware case to clarify how this functionality behaves when something goes wrong, and no information was provided. Simply, in case of some issue, please contact support. Horrible way.

    I prefer the manual shutdown described in Broadcom's documentation:

    Manually Shut Down and Restart the vSAN Cluster


    Simply because if some host does not boot again or a disk group is damaged, you can still bring the vSAN cluster up without the broken host (as long as the vSAN storage policy is still fulfilled). The last host can be started later and added to the cluster in two simple steps:

    1.  "esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates"

    2.  enable the vSAN VMkernel port on the host
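
    The two steps above could be sketched as the commands to run on the late host. This is only a sketch: the first command is quoted from step 1, while `esxcli vsan network ip add -i <vmk>` is my reading of "enable the vSAN VMkernel port", and the interface name `vmk1` is a placeholder. Check both against your own environment and the Broadcom documentation before running anything.

```python
def rejoin_commands(vsan_vmk="vmk1"):
    """Build the commands for bringing a late host back into the cluster.

    vsan_vmk is a placeholder; use the VMkernel interface that carried
    vSAN traffic on that host before the shutdown.
    """
    return [
        # Step 1: let the host accept cluster membership updates again.
        "esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates",
        # Step 2 (assumed mapping): tag the VMkernel interface for vSAN traffic.
        f"esxcli vsan network ip add -i {vsan_vmk}",
    ]


# Dry run: print the commands instead of executing them.
for cmd in rejoin_commands():
    print(cmd)
```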

    The difference is that this manual shutdown simply disables the vSAN VMkernel port on all hosts at the same time, so it "is easy" to fix. The automatic vSAN shutdown does something similar, but in the background, and starting the cluster is "officially impossible" without VMware support, as I already mentioned. There are manual steps that can fix the problem, but in a production environment it is a huge, unnecessary risk.

    I would also recommend moving VCSA and one DNS server to non-vSAN storage, to make startup easier after the outage.

    Maybe the best option is to shut down the VM workload and cut power to the entire rack, which stops all hosts at the same time, just like an unexpected power outage. :D

    After an unexpected power failure I have never seen a vSAN issue in the case where some host will not boot and the storage policy is still fulfilled.

    In any case, good luck, and hopefully no hardware issues show up after the move.




  • 4.  RE: Full vSAN Cluster Power-Off for Site Migration

    Posted 3 days ago
     
    "Simply, in case of some issue, please contact support. Horrible way."
     
    Sorry if that response seems too boiler-plate (and if you had asked me personally, I might have elaborated), but this is likely because the state of things needs to be checked properly first, as opposed to you trying things on your own and calling us afterwards, when the situation is worse and harder to fix.
     
    "The automatic vSAN shutdown does something similar, but in the background, and starting the cluster is "officially impossible" without VMware support, as I already mentioned."
     
    No, it isn't really, and it is not undocumented. Both reboot_helper.py and the 'Cluster Shutdown Wizard' enable /VSAN/IgnoreClusterMemberListUpdates so that vCenter cannot push any cluster membership changes, and both methods then prevent any subset of nodes from updating data in an FTT=0 manner. The former does this by isolating all nodes at exactly the same time by untagging the vsan-vmk; the latter does it by stopping all nodes from publishing data updates by enabling /VSAN/DOMPauseAllCCPs (and by also flagging in vCenter that the cluster is in a shutdown state, which is also very easily reverted manually).
     
    "After an unexpected power failure I have never seen a vSAN issue in the case where some host will not boot and the storage policy is still fulfilled."
     
    That assumes the outage is clean and all nodes went down at exactly the same time. The problem arises if 1. some subset of nodes is still actively updating data and 2. you have a hardware failure where that last-updated data (updated in an FTT=0 state) resides. This possibility is literally the exact reason either of these shutdown methods exists in the first place.
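
    The two conditions above can be illustrated with a toy model. This is not how vSAN tracks writes internally; the per-host "last write sequence number", the host names, and the `newest_data_survives` helper are all hypothetical and exist only to show why a staggered stop plus one hardware failure can lose the newest data while a simultaneous stop does not.

```python
def newest_data_survives(last_write_seq, failed_hosts):
    """Toy check: does any surviving replica hold the newest write?

    last_write_seq maps host name -> sequence number of the last write
    applied to that host's replica before the host stopped.
    """
    newest = max(last_write_seq.values())
    return any(
        seq == newest
        for host, seq in last_write_seq.items()
        if host not in failed_hosts
    )


# All hosts stopped at the same moment: both replicas are identical,
# so losing either one still leaves the newest data available.
clean = {"esx01": 100, "esx02": 100}
print(newest_data_survives(clean, failed_hosts={"esx02"}))

# Staggered stop: esx02 kept accepting writes alone (effectively FTT=0)
# after esx01 went down, and then esx02 suffers the hardware failure.
staggered = {"esx01": 100, "esx02": 140}
print(newest_data_survives(staggered, failed_hosts={"esx02"}))
```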
     
    Please evaluate which of the available options for shutting down the cluster you prefer: the reboot_helper.py method and the vSAN Cluster Shutdown Wizard method are both valid and supported means of gracefully shutting down the cluster. If in any doubt, either before or after doing this, please do contact us folks in vSAN Global Support by opening a case.
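
    For anyone who wants to see which of the two advanced settings mentioned in this post is still set on a host, here is a rough sketch. It builds `esxcfg-advcfg -g` queries and parses their output. The output format ("Value of <name> is <n>") is an assumption, and treating any non-zero value as "still set" is a simplification; the helper names are hypothetical.

```python
import re

# The two advanced settings named in this post.
SETTINGS = (
    "/VSAN/IgnoreClusterMemberListUpdates",
    "/VSAN/DOMPauseAllCCPs",
)

# Assumed output format of `esxcfg-advcfg -g`: "Value of <name> is <n>".
_VALUE_RE = re.compile(r"Value of (\S+) is (-?\d+)")


def get_commands():
    """Build one read-only query per setting, to be run on an ESXi host."""
    return [f"esxcfg-advcfg -g {setting}" for setting in SETTINGS]


def parse_value(output):
    """Extract (name, value) from one line of assumed esxcfg-advcfg output."""
    match = _VALUE_RE.search(output)
    if match is None:
        raise ValueError(f"unexpected output: {output!r}")
    return match.group(1), int(match.group(2))


def shutdown_flags_set(values):
    """Return the names of settings that are still non-zero."""
    return [name for name, value in values.items() if value != 0]


for cmd in get_commands():
    print(cmd)
```

    Feeding the captured output of each query through `parse_value` and collecting the results into a dict gives `shutdown_flags_set` what it needs to list the settings that have not been reverted yet.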




  • 5.  RE: Full vSAN Cluster Power-Off for Site Migration

    Posted 3 days ago

    Hello,

    @TheBobkin

    I wrote that the "reboot_helper.py method" is okay. But the "vSAN Cluster Shutdown Wizard method" is the road to ****. It works only if nothing goes wrong. With a single failed disk you have to open a case, and depending on your support level your environment may be operational again in several hours or several days. I got no answers when I opened the case for a test environment. The official statement was: when this happens in production, just open a case again and wait. If a disk fails during an unexpected power outage, the environment will be okay in more than 99 % of cases, I guess. If not, you still need to open a case, so there is no difference.

    When I tested this a year ago, there was no clear documentation on how to fix some common issues (such as a host that does not boot or a failed disk group) with the Shutdown Wizard method. The only guidance was that you have to remove the non-responding host from the cluster to be able to finish the wizard recovery. That still did not work, because we first had to run "esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates" on all ESXi hosts so that they could see that the node had been removed from the cluster. After that, the wizard recovery was able to finish. That's the reason why I don't want to do these steps in a production environment, even though it worked in the test environment.
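
    A small sketch of the "run it on all ESXi hosts" step, as a dry run that only prints what would be executed. The command itself is the one quoted above; running it over SSH as root, the host names, and the `reset_commands` helper are assumptions for illustration.

```python
import shlex

# Command quoted in the post above.
RESET_CMD = "esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates"


def reset_commands(hosts, user="root"):
    """Build one ssh invocation per surviving host (nothing is executed)."""
    return [f"ssh {user}@{host} {shlex.quote(RESET_CMD)}" for host in hosts]


# Hypothetical names of the hosts that did boot.
for line in reset_commands(["esx01", "esx02", "esx03"]):
    print(line)
```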

    I dug up the missing fix steps for very common possible issues (a damaged disk group and a non-responding host). If an IT provider cannot restore service because it must wait for official support after following the official vSAN shutdown process, that is pure buck-passing, because VMware bears no responsibility and pays nothing for the service delay. That is the reason why I called it a "Horrible way". VMware users are punished for using official steps, which work only if everything goes smoothly. The same is true for unofficial ways: if they go smoothly, no problem; if they do not, open a support case. :-)