  • 1.  2 node cluster -split brain

    Posted Jun 14, 2017 10:51 PM

    hi there,

    In our test environment, vSAN (version 6.5) is configured as a 2-node direct-connect cluster with a witness appliance. While working on the servers' firmware upgrade (most likely a combination of network isolation and a cold reboot), I think I've caused a split brain in our vSAN cluster.

    If both host1 and host2 are up and running (the witness as well), then all the VMs appear inaccessible (orphaned) in the hosts' inventory, and no objects/VM folders can be seen on the vSAN datastore; it is just empty.

    If host2 and the witness appliance are up and running, then the VMs are accessible and can be powered on. Even in this state, if host1 is restarted it causes the same problem: the VMs become inaccessible.

    If host1 and the witness appliance are up and running, then the VMs are inaccessible and the vSAN datastore appears empty.

    "esxcli vsan cluster get" output (below) on both hosts +witness appliance, that shows the members' UUID and Master UUID *witness sees host2 as a master.

    "esxcli vsan cluster" leave and join on host1 might work but I'd like to hear and appreciate whoever has same or similar experience before taking an action.

    host1 -

    Cluster Information

       Enabled: true

       Current Local Time: 2017-06-14T18:17:26Z

       Local Node UUID: 58d5c198-bb69-73e6-d154-005056851dd8

       Local Node Type: NORMAL

       Local Node State: MASTER

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 58d5c198-bb69-73e6-d154-005056851dd8

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 522cc8bf-c136-0570-5a09-7b6ca3738dc2

       Sub-Cluster Membership Entry Revision: 1

       Sub-Cluster Member Count: 2

       Sub-Cluster Member UUIDs: 58d5c198-bb69-73e6-d154-005056851dd8, 5935f59b-107f-e956-55f0-005056b12d1b

       Sub-Cluster Membership UUID: ce774159-22c1-ab1c-cbe9-180373f813f2

       Unicast Mode Enabled: true

       Maintenance Mode State: OFF

    host2 -

    Cluster Information

       Enabled: true

       Current Local Time: 2017-06-14T18:11:57Z

       Local Node UUID: 58d35025-4d1f-1362-375d-000c298a9ad5

       Local Node Type: NORMAL

       Local Node State: MASTER

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 58d35025-4d1f-1362-375d-000c298a9ad5

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 522cc8bf-c136-0570-5a09-7b6ca3738dc2

       Sub-Cluster Membership Entry Revision: 2

       Sub-Cluster Member Count: 1

       Sub-Cluster Member UUIDs: 58d35025-4d1f-1362-375d-000c298a9ad5

       Sub-Cluster Membership UUID: d5774159-9511-463a-4f15-180373f8131a

       Unicast Mode Enabled: true

       Maintenance Mode State: ON

    witness appliance

    Cluster Information

       Enabled: true

       Current Local Time: 2017-06-14T22:36:40Z

       Local Node UUID: 5935f59b-107f-e956-55f0-005056b12d1b

       Local Node Type: WITNESS

       Local Node State: AGENT

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 58d35025-4d1f-1362-375d-000c298a9ad5

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 522cc8bf-c136-0570-5a09-7b6ca3738dc2

       Sub-Cluster Membership Entry Revision: 1

       Sub-Cluster Member Count: 2

       Sub-Cluster Member UUIDs: 58d35025-4d1f-1362-375d-000c298a9ad5, 5935f59b-107f-e956-55f0-005056b12d1b

       Sub-Cluster Membership UUID: 1f8a4159-6900-7633-902e-180373f8131a

       Unicast Mode Enabled: true

       Maintenance Mode State: OFF



  • 2.  RE: 2 node cluster -split brain

    Posted Jun 15, 2017 12:12 AM

    Hello,

    Using vsan cluster leave/join isn't going to wipe data off the disks or anything like that, if that is your concern, but it is best to try/check/establish a few other things first:

    Are all of your hosts (including the Witness) currently on the same ESXi build? (You can check via the GUI, or via SSH with # vmware -vl)
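
    If it is easier to collect over SSH, roughly the same build information is also available from esxcli on each host (just an equivalent check, not required):

    # esxcli system version get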

    Is the vCenter that these hosts are registered to of a version equal to or higher than the ESXi build?

    The Sub-Cluster Member Count shows host2 (58d35025-4d1f-1362-375d-000c298a9ad5) as being isolated (but it is in vSAN Maintenance Mode):

    host2 -

    ...

    Maintenance Mode State: ON

    Is this host currently in MM? (Unfortunately, being in MM in vCenter does not always mean the host is in MM at the vSAN level.)
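
    A quick way to cross-check the host-level state from the shell (and to clear it if needed) would be roughly the following; the vSAN-level state is the 'Maintenance Mode State' line in the esxcli vsan cluster get output above:

    # esxcli system maintenanceMode get

    # esxcli system maintenanceMode set -e false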

    Stranger here (and alluding to the notion of 'split-brain') is that the Witness Appliance thinks the 2 cluster members are host2 (58d35025-4d1f-1362-375d-000c298a9ad5) and itself (5935f59b-107f-e956-55f0-005056b12d1b) while host1 thinks the cluster members are itself (58d5c198-bb69-73e6-d154-005056851dd8) and the Witness Appliance (5935f59b-107f-e956-55f0-005056b12d1b).

    Can you double-check whether everything is functional with host1+Witness or host2+Witness?

    If host2, taken out of MM, shows as being in a cluster with the Witness Appliance and has enough data components to result in functional VMs, then after verifying this, check the component states using RVC/cmmds-tool/the Health GUI (a couple of possible commands are sketched below). If nothing is inaccessible/unrecoverable, then try leave/join cluster (522cc8bf-c136-0570-5a09-7b6ca3738dc2) on host1.
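
    A couple of possible ways to run that component check (a rough sketch; the RVC cluster path is only a placeholder for your own environment):

    # esxcli vsan debug object health summary get

    or from RVC on the vCenter:

    > vsan.check_state ~/computers/<ClusterName>

    > vsan.obj_status_report ~/computers/<ClusterName>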

    Bob

    -o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-



  • 3.  RE: 2 node cluster -split brain

    Posted Jun 15, 2017 01:15 AM

    hi,

    6.5 build 5310538 is running on all the hosts including the witness appliance, and build 5318154 is running on vCenter (vCenter + PSC, more precisely).

    VMs were working fine on the host2 + witness appliance combination while host1 was turned off. After restarting host1, all the running VMs on host2 became inaccessible/orphaned and unresponsive (no ping reply, as the VMs were being shut down). So I decided to disable HA to see if both hosts could run at the same time without causing any issue for the running VMs. However, this puts the VMs into an APD/PDL scenario in which I can ping the VMs but can't do anything with them.

    I've put both hosts in maintenance mode several times, but I'm just providing the last output below (host1 is turned off):

    host2

    Cluster Information

       Enabled: true

       Current Local Time: 2017-06-15T01:11:12Z

       Local Node UUID: 58d35025-4d1f-1362-375d-000c298a9ad5

       Local Node Type: NORMAL

       Local Node State: MASTER

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 58d35025-4d1f-1362-375d-000c298a9ad5

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 522cc8bf-c136-0570-5a09-7b6ca3738dc2

       Sub-Cluster Membership Entry Revision: 9

       Sub-Cluster Member Count: 2

       Sub-Cluster Member UUIDs: 58d35025-4d1f-1362-375d-000c298a9ad5, 5935f59b-107f-e956-55f0-005056b12d1b

       Sub-Cluster Membership UUID: 1f8a4159-6900-7633-902e-180373f8131a

       Unicast Mode Enabled: true

       Maintenance Mode State: OFF

    witness app

    Cluster Information

       Enabled: true

       Current Local Time: 2017-06-15T01:11:59Z

       Local Node UUID: 5935f59b-107f-e956-55f0-005056b12d1b

       Local Node Type: WITNESS

       Local Node State: AGENT

       Local Node Health State: HEALTHY

       Sub-Cluster Master UUID: 58d35025-4d1f-1362-375d-000c298a9ad5

       Sub-Cluster Backup UUID:

       Sub-Cluster UUID: 522cc8bf-c136-0570-5a09-7b6ca3738dc2

       Sub-Cluster Membership Entry Revision: 9

       Sub-Cluster Member Count: 2

       Sub-Cluster Member UUIDs: 58d35025-4d1f-1362-375d-000c298a9ad5, 5935f59b-107f-e956-55f0-005056b12d1b

       Sub-Cluster Membership UUID: 1f8a4159-6900-7633-902e-180373f8131a

       Unicast Mode Enabled: true

       Maintenance Mode State: OFF



  • 4.  RE: 2 node cluster -split brain
    Best Answer

    Posted Jun 15, 2017 06:36 PM

    Hello ciscen,

    It's likely not going to help, but with all 3 nodes up have you tried leaving and rejoining the cluster on host1?

    I reckon it is likely a communication issue between host1 and host2.

    Can you check if host1 and host2 are able to communicate with one another via Unicast and over their vsan-configured interfaces?

    Try vmkping from the vSAN vmk configured on one host to the other host's, e.g.:

    # vmkping -I vmkX 192.168.X.X

    (get the IP and vmk# of the vSAN interface from GUI or #esxcfg-vmknic -l)
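
    If jumbo frames are configured on the vSAN network, it may also be worth testing with a large, non-fragmented packet (an extra check beyond the above, assuming an MTU of 9000):

    # vmkping -I vmkX -d -s 8972 192.168.X.X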

    Then check if all hosts have each other listed in their Unicast Agent list:

    # esxcli vsan cluster unicastagent list

    If they do not have all other nodes in cluster listed then add them:

    # esxcli vsan cluster unicastagent add -a <IP_of_vSAN_vmk_on_hosts>

    http://pubs.vmware.com/vsphere-65/index.jsp?topic=%2Fcom.vmware.vcli.ref.doc%2Fesxcli_vsan.html
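
    (In 6.5 the unicastagent add command can also take the remote node's UUID, type and port; a fuller form, sketched from memory rather than taken from the link above, would be roughly:)

    # esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <IP_of_vSAN_vmk_on_hosts> -p 12321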

    If they can communicate, isolation is not the issue, and something else is wrong here, you could always rebuild the cluster from host2 + witness, but you need to ensure that they can communicate first, as otherwise this would be pointless.

    Bob

    -o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-



  • 5.  RE: 2 node cluster -split brain

    Posted Jun 17, 2017 01:02 AM

    Hi,

    As you recommended, the cluster leave and join commands worked well in my case.

    Thanks,