vSAN1

  • 1.  vSAN Health Checks

    Posted Nov 13, 2020 06:56 PM

    I am a bit new to using vSAN and have a couple questions on the health checks.

    First, I am getting a warning on "vSAN Disk Balance": 2 of my 5 servers (3 disks on one and 1 on the other) show "Proactive rebalance is needed". If I use "Configure Automatic Rebalance", will this affect any of the current data or cause any issues for the vSAN?

    Another health check in an error state is "vCenter state is authoritative", with all 5 hosts out of sync. What exactly will happen if I click "Update ESXi Configuration"? Is there any risk of losing data?

    Thank you!



  • 2.  RE: vSAN Health Checks

    Posted Nov 14, 2020 12:36 PM

    Welcome to Communities and vSAN.

    The vSAN Health alert for Disk Balance is basically an informational alert that there is >30% variance between the highest and lowest used Capacity-Tier disks in the cluster. vSAN doesn't automatically move data to newly added/blank disks for multiple reasons, which is why Proactive Rebalance exists. If all nodes in the cluster are 6.7 U3 or higher then you can configure this to rebalance automatically; if lower, you can push the button from the Health UI to start the task. This is intentionally very low priority IO and thus should have no impact on performance, and zero impact on data-state, as it just moves data and only removes the original copy once the move has completed.
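To make the 30% figure concrete, here is a minimal sketch of the arithmetic only. It assumes you have already collected one used-capacity percentage per Capacity-Tier disk by some other means. The function names `disk_spread` and `rebalance_suggested` and the one-number-per-line input are hypothetical and not part of any vSAN tooling, and the real health check may compute its variance differently.

```shell
#!/bin/sh
# Illustrative only: reads one "used percent" value per line on stdin
# (hypothetical input, not the output of any vSAN command) and prints
# the difference between the fullest and the emptiest disk.
disk_spread() {
    awk 'NR == 1 { min = max = $1 + 0 }
         { v = $1 + 0; if (v < min) min = v; if (v > max) max = v }
         END { if (NR > 0) print max - min }'
}

# Succeeds (exit 0) when the spread exceeds the 30% threshold
# mentioned in the post; fails on empty input.
rebalance_suggested() {
    spread=$(disk_spread)
    [ -n "$spread" ] || return 1
    awk -v s="$spread" 'BEGIN { exit !(s + 0 > 30) }'
}

# Usage (not executed here):
#   printf '72\n35\n41\n' | rebalance_suggested && echo "rebalance suggested"
```

With the sample values in the usage comment, the fullest disk is at 72% and the emptiest at 35%, so the spread is above the threshold.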


    "vCenter state is authoritative" is a little more complex. It triggers if vCenter has not pushed, and/or is not in sync with, the unicast agent lists on the nodes (basically the lists the nodes use to know who is in the cluster). Did you add this cluster to a new or restored-from-backup vCenter, and/or were there any manual changes to the unicast agent entries via the CLI?
    The first step should be to validate that no nodes are set to ignore vCenter membership updates:

    # esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListupdates
    If this returns 1 then the node is set to ignore vCenter updates. The default value is 0, which can be restored with:
    # esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListupdates
    Next, validate that the vSphere cluster members match the vSAN cluster members: compare the nodes shown in the vSphere UI against the members reported by 'esxcli vsan cluster get'. If they match, and the above checks and remediation (where necessary) have been done, then you can go ahead and click "Update ESXi Configuration". Either way there is no risk to the data, but the above checks need to be performed to avoid any possible cluster partition.
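When checking several hosts, the flag check above can be wrapped in a small helper. This is only a sketch: it assumes the `-g` output is a single line of the form "Value of <OptionName> is <number>", and the function names `advcfg_value` and `check_ignore_flag` are hypothetical.

```shell
#!/bin/sh
# Reads the output of "esxcfg-advcfg -g ..." on stdin and prints the value.
# Assumes a line of the form "Value of <OptionName> is <number>".
advcfg_value() {
    awk '/^Value of / { print $NF }'
}

# Interprets the value: 0 means vCenter updates are accepted (the default),
# 1 means this node ignores them and should be reviewed before remediating.
check_ignore_flag() {
    value=$(advcfg_value)
    case "$value" in
        0) echo "ok: node accepts vCenter membership updates" ;;
        1) echo "warning: node ignores vCenter membership updates"; return 1 ;;
        *) echo "unknown: could not parse the option value"; return 2 ;;
    esac
}

# Usage on a host (not executed here):
#   esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListupdates | check_ignore_flag
```

The helper only reports the state; changing the value is still done with the `-s 0` command shown above.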



  • 3.  RE: vSAN Health Checks

    Posted Jan 21, 2021 01:37 PM

    For the "vCenter state is authoritative" check, we had to add the hosts to a new vCenter, but we disassociated the hosts from the old vCenter server before adding them to the new one.

    If I use the "Update ESXi Configuration" option in the Skyline Health section is there any risk of losing data?

    I will run the commands below against the ESXi servers to see.



  • 4.  RE: vSAN Health Checks

    Posted Jan 21, 2021 06:32 PM

    Actually, if you add an ESXi host to a new vCenter it should automatically get disconnected from the old vCenter (or at least it did when I last checked, likely around vSphere 6.5).

    "If I use the "Update ESXi Configuration" option in the Skyline Health section is there any risk of losing data?"
    No, the worst-case scenario is that the cluster becomes partitioned, which shouldn't occur provided you perform the necessary checks, and which is easily remediated by repopulating the unicast lists on the nodes. One more thing to validate is that you created the cluster with the same configuration as the previous vSphere cluster (e.g. if Deduplication or Encryption were enabled then they should be enabled here too, and if it is a Stretched Cluster then the Fault Domains should be configured, etc.).
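A quick way to spot a partition after remediation is to compare the member count each node reports against the number of hosts you expect. The sketch below assumes that 'esxcli vsan cluster get' prints a "Sub-Cluster Member Count:" line; the function names `member_count` and `sees_all_members` are hypothetical, and in a Stretched Cluster the expected number may also need to include the witness.

```shell
#!/bin/sh
# Prints the number after "Sub-Cluster Member Count:" from the output of
# "esxcli vsan cluster get" supplied on stdin (field name assumed).
member_count() {
    awk -F':' '/Sub-Cluster Member Count/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Succeeds when the node sees exactly the expected number of members.
# $1 is the number of members you expect in the cluster (e.g. 5).
sees_all_members() {
    expected=$1
    count=$(member_count)
    [ -n "$count" ] && [ "$count" -eq "$expected" ]
}

# Usage on a host (not executed here):
#   esxcli vsan cluster get | sees_all_members 5 || echo "possible partition"
```

Running this on every node matters, because in a partition each side only reports the members it can still see.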



  • 5.  RE: vSAN Health Checks

    Posted Jan 26, 2021 04:46 PM

    So I ran the command "esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListupdates" against all of the ESXi hosts and only one host out of the 5 was set to 1. If I set this option to 0, will there be any effect on data?

    Also, I apologize for constantly asking about data. We currently have no backup plan in place for these servers (still working on it), so I am trying to keep the network up and running. I am also new to all the vSAN workings.

    I also noticed that one of the hosts has a different host name. Do I have to disconnect the host, put it into maintenance mode and rename it (https://kb.vmware.com/s/article/1010821), or can I just change the hostname?



  • 6.  RE: vSAN Health Checks

    Posted Jan 26, 2021 10:32 PM

    "I ran the command "esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListupdates" against all of the ESXi hosts and only one host out of the 5 were set to 1"
    This can explain why all nodes are reported as requiring remediation due to vCenter not being authoritative: one node is set to ignore vC membership updates, and thus vC can't validate and/or remediate this node's settings/configuration (even if they don't differ from vC's view of the cluster).

    "If setting this option to 0 there will be no effect on data. Also I apologize about constantly asking about data."
    Absolutely no need for apologies whatsoever. My colleague (Hi Kam!) has a statement (maybe more of a mantra) that I completely believe in when dealing with any situation where there is any question about the outcome and/or approach:
    'Our first priority here is the data. Our second priority here is the data. Our third priority here is the data. All other priorities fall after these.'

    Setting this to 0 (the default setting) on this node won't change anything by itself. However, it will allow vC (when the remediate button is pushed) to check the settings/configuration on this node and push any necessary changes to this and the other nodes. Typically this is limited to unicastagent updates, e.g. if the vsan IP that the nodes have recorded for a node differs from what it is currently set to from vC's perspective. The worst possible outcome from such a change would be that node getting isolated from the cluster (which doesn't permanently impair data and is easily remediated), but this shouldn't be possible provided the information on the nodes and the information from vC's perspective (e.g. as checked in the UI) match and are correct.

    This can be easily validated by checking the following:
    Check the unicastagent list on each node. It should contain the correct UUID and vsan-IP entry for every node in the cluster except the node it is being run on (e.g. a node in a 6-node cluster with a single vsan-enabled vmk will have 5 entries):
    # esxcli vsan cluster unicastagent list
    Check that the IPs from the above output match each host's vsan-enabled vmk as shown in the UI.
    Host to UUID information can be determined via the following (run on each node, it only returns information about the node itself):
    # cmmds-tool whoami
    or from any node:
    # cmmds-tool find -t HOSTNAME | grep -iE 'uuid|health|content'
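The "cluster size minus one" rule for the unicastagent list can be checked with a short helper. This is a sketch under stated assumptions: each entry row in the command output starts with a UUID-like token, there is one vsan-enabled vmk per node, and there is no witness. The function names `unicast_entry_count` and `unicast_list_complete` are hypothetical.

```shell
#!/bin/sh
# Counts rows that begin with a UUID-like token in the output of
# "esxcli vsan cluster unicastagent list" supplied on stdin.
# Header and separator lines are skipped because they do not match.
unicast_entry_count() {
    grep -cE '^[[:space:]]*[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-' || true
}

# Succeeds when the count equals cluster size minus one (the local node
# is not listed). Assumes one vsan-enabled vmk per node and no witness.
# $1 is the total number of nodes in the cluster.
unicast_list_complete() {
    expected=$(( $1 - 1 ))
    count=$(unicast_entry_count)
    [ "$count" -eq "$expected" ]
}

# Usage on a host in a 5-node cluster (not executed here):
#   esxcli vsan cluster unicastagent list | unicast_list_complete 5 \
#       || echo "unicastagent list is missing or has extra entries"
```

A matching count does not prove the entries are correct, so the UUIDs and IPs still need to be compared against the UI as described above.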

    "Do I have to disconnect the host, put it into maintenance mode and rename it (https://kb.vmware.com/s/article/1010821) or can I just change the hostname?"
    Yes, host name changes require the steps in that KB: the host has to be put in Maintenance Mode (with Ensure Accessibility to keep all VMs accessible), removed from the cluster (this does NOT remove the vSAN Disk-Groups, it just kicks the host out of the cluster), disconnected, and removed from inventory. Then make the changes, reconnect the host, move it back into the cluster, and exit Maintenance Mode.
    While there are slightly 'hacky' means of keeping a node in a vSAN cluster while it is removed from vC inventory, these are not advisable and thus I won't be advising on them.



  • 7.  RE: vSAN Health Checks

    Posted Feb 02, 2021 04:26 PM

    All of this has worked and our vSAN is back to normal health. I apologize for the time between responses, but thank you greatly for the help!



  • 8.  RE: vSAN Health Checks

    Posted Nov 13, 2024 05:29 PM

    Hello!

    I'm in a very similar situation. I have a 3-node vSphere cluster that had a network partition, which I was able to fix manually. Now the vCenter server reports the "vCenter state is authoritative" health check (I'm going to double check the UUIDs and IPs, but everything seems correct). However, the vCenter server version is 6.5U1g, and I can only use the vSphere Client (HTML5), in which vSAN Monitoring and Configuration is not implemented yet, so I cannot click on "Update ESXi Configuration".

    I want to upgrade the vCenter to 6.5U3, then 7.0U3. In 6.5U3 the vSAN-related features are already implemented. So, based on the discussion above, my impression is that I can safely do the patch upgrade on the vCenter and would then be able to do the remediation. My concern was that if I update the vCenter while this error is not fixed, it might update the configuration of the ESXi hosts and the VMs might become inaccessible. But if I understand the posts correctly, the worst-case scenario is that the unicast agents get into an inconsistent state and I have to fix them again manually?
    So, can I safely do the upgrade even if this error is still there?