
Update VSAN configuration taking a long time

jkasalzuma, Mar 16, 2014 11:16 PM

  • 1.  Update VSAN configuration taking a long time

    Posted Mar 15, 2014 05:07 AM

    Has anyone else seen the "Update VSAN configuration" task that runs after enabling vSAN on the hosts take a really long time?

    My tasks are running against 3 hosts and have been going for 45 minutes so far with no sign of finishing. I don't remember it running this long in the beta. I am running the 5.5 U1 bits on a fresh build (I wiped the disks prior to rebuilding), and this is a brand new 5.5 U1 vCenter Appliance. There is nothing left of the beta.

    It hasn't timed out after 45 minutes so I'm guessing that's a good thing.

    I see no activity on the disks either, so I'm not sure if they are being zeroed or something along those lines.

    Any ideas? I'll update this post if something changes...



  • 2.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 05:31 AM

    So it appears the data in my Web Client was stale despite clicking refresh. The C# client shows the tasks timed out after exactly 30 minutes. Tried it again in manual mode (I tried Auto before) and the management agents stopped responding. Had to restart them via SSH.

    I'm guessing something in my lab isn't right if it's not working correctly. But the odd thing is it's the same symptoms across all 3 hosts. Maybe it's vCenter?



  • 3.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 06:02 AM

    OK, I may have found the solution for my issue.

    I was attempting to enable vSAN on an existing cluster with hosts that had load. In the beta I am pretty sure I was able to enable vSAN on an existing cluster containing hosts with load and NOT in maintenance mode, but maybe my memory is failing me.

    Either way, the solution to my issue was this...

    1. Restart the management agents (hostd & vpxa) on all 3 hosts in my cluster, as they were not responding (SSH still was). The commands are sketched after this list.

    2. Create a new cluster with DRS, HA and vSAN enabled.

    3. Place one host into maintenance mode.

    4. Move her to the new cluster; the cluster config completed just fine in under 30 seconds.

    5. Remove her from maintenance mode.

    6. Move some VMs to the first host in the new cluster.

    7. Put the second host into maintenance mode.

    8. Move her to the new cluster

    9. Rinse, lather, repeat until all hosts and VMs are in the new vSAN-enabled cluster.
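
    For step 1, restarting the agents from the ESXi shell looks roughly like this; a quick sketch, with hostd and vpxa being the usual suspects (bouncing them does not touch running VMs):

      /etc/init.d/hostd restart   # restart the host agent
      /etc/init.d/vpxa restart    # restart the vCenter agent

    services.sh restart would bounce every management service if the two above aren't enough.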

    Now I'm getting an error on all host disk groups and they show "Unhealthy" with "Dead or Error" on all magnetic disks in the details.

    The mag disks don't even show under any of the 3 hosts' storage devices. Weird stuff for sure...

    I remember this being much easier in the beta for some reason...



  • 4.  RE: Update VSAN configuration taking a long time

    Posted Sep 21, 2014 02:49 PM

    I ran into the same situation... the problem was that between HA trying to upgrade the vCenter agent, configure HA, and update the vSAN cluster, all the resource utilization was killing the hostd process. When the hostd process is not able to maintain the ESXi OS/management services because it's "hung", you'll see that very same behavior. The fix was exactly what you said... moving the ESXi hosts out of the vSAN cluster and restarting them.



  • 5.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 06:32 PM

    Hi,

    Thanks for this. I can confirm exactly the same symptoms from my side.

    I have followed your instructions above and can confirm I have ended up in the same scenario as you.


    I'm getting an error on all host disk groups and they show "Unhealthy" with "Dead or Error".


    I'm using the AHCI driver in my setup, and the exact same hardware worked fine through the beta period. I'm using the vCSA.

    Have you had any luck in getting your hosts to recognize your magnetic drives? I'm going to test a different number of magnetic drives and a different drive to see if I can get any other behaviour.

    Not really sure what's going on here, as this worked very well in the beta, and since rolling out the GA bits I have had nothing but issues in the few days since its release.

    Any ideas would be greatly appreciated!



  • 6.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 06:49 PM

    Hey SolidCactus,

    No luck thus far, zero. It's actually gotten worse.

    I tried rebooting a single host to see if the HDD would come back, and now the host can't even be managed via SSH. It's pingable, but no management of any kind. My lab is at my office so I haven't been able to check the monitor for a PSOD.

    So don't reboot your hosts!

    I'm starting to follow the lead that I hit earlier, where things began to work when I had zero VM load on the hosts. I'm dusting off my old ML110 G6 to move all VMs off these vSAN hosts (or hopefully future vSAN hosts) and trying again. I may rebuild the hosts yet again for a vanilla build (gparting all the disks too).

    If your hosts are all still communicating with vCenter, I would recommend disabling vSAN and seeing whether you hit the same issues I did with hosts going management-dark. I'd be interested in knowing whether you have any issues after disabling vSAN and rebooting a host.
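
    If the Web Client won't cooperate, the per-host vSAN state can also be checked (and left) from the ESXi shell; a rough sketch using the 5.5 esxcli vsan namespace:

      esxcli vsan cluster get     # show this host's vSAN cluster membership and state
      esxcli vsan cluster leave   # pull this host out of the vSAN cluster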

    FYI, I am rolling a similar lab to Erik Bussink's, but with i5s and no mSATA. I am booting from USB.

    Homelab with vSphere 5.5 and VSAN | Erik Bussink

    Jkasal



  • 7.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 07:08 PM

    Hi Jkasal,

    Thanks for the quick response! For now I have disabled vSAN and I'm back to running over an iSCSI setup. I would really like to get vSAN running and test out the GA build.

    When I first tried to set up vSAN on the GA bits I lost all connectivity to the hosts, as you described. I then tried rebooting the hosts, as I was unsure of which management agents to reset. None of the hosts managed to boot successfully within an hour and a half. I hopped onto the remote management of the hosts and they were all stuck on "usbarbitrator start".


    I was unable to do anything with the hosts other than rebuild them with the GA bits again. I would be interested to know whether you have the same issue when you are able to see your hosts again.


    I thought I had messed up with the vSAN configuration as everything worked as expected in the beta setup. It's good to know that I'm not alone!

    From the link you pinged across, it looks like you are running vSAN on the AHCI driver as well.


    Is anyone else having luck with running vSAN GA build on the AHCI driver? If so, any tricks or tips?

    Thanks,

    SolidCactus!



  • 8.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 07:43 PM

    Yeah, I'll take a look at what the screen shows a bit later today and let you know.

    Out of curiosity, did you have any VMs running on your hosts when you enabled vSAN?



  • 9.  RE: Update VSAN configuration taking a long time

    Posted Mar 15, 2014 10:33 PM

    Looking at the monitor of my ESXi host that didn't come back up, it appears it never shut down completely.

    It is stuck at "Shutting down VSAN IO layer...", "Running vsantraced stop".

    She would not respond to any keyboard commands. Had to hard power her down.

    I will be rebuilding my hosts and trying to enable vSAN all over again with no load at all on my hosts. I'll see where that gets me.

    Update:

    Prior to testing my luck with the vSAN setup again, I looked into what SolidCactus was talking about with AHCI.

    I did a little investigation and found this gentleman's thread. VMware Front Experience: How to make your unsupported SATA AHCI Controller work with ESXi 5.5

    After researching my AHCI controller using Mr. Peetz's command, I found I was using an "Intel Cougar Point 6 port SATA AHCI" controller. Class 0106: 8086:1c02.

    I searched the ahci.map file referenced in his article and found my controller to be listed.
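
    For anyone following along, the check from the ESXi shell was roughly this (treat the map file path as an assumption on my part; the article has the exact details):

      vmkchdev -l | grep vmhba                       # list storage adapters with their PCI vendor:device IDs
      grep 1c02 /etc/vmware/driver.map.d/ahci.map    # see whether the device ID is claimed by the ahci driver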

    Not sure what that means but I hope it's a positive!



  • 10.  RE: Update VSAN configuration taking a long time

    Posted Mar 16, 2014 01:21 PM

    Thanks for the response.

    My AHCI controller is supported out of the box in the GA build, so I don't think I really need to do anything further in terms of the article listed, but at least it shows that your controller is recognized and ready for use.

    When creating the disk groups I only have one host with any virtual machines running on it. Unfortunately, this is exactly the same setup as I had in the beta builds, and it worked without error.

    Any other ideas?



  • 11.  RE: Update VSAN configuration taking a long time

    Posted Mar 16, 2014 10:19 PM

    Well last night I tried again. Here is the path I took and the conclusion. Hint: it didn't go well...

    I started with freshly built ESXi 5.5 U1 hosts.

    1. Created a cluster with the following enabled

      DRS - Full Auto

      HA - Admission Control Disabled

      vSAN - Manual

    2. Checked all 3 hosts for required settings and networking

    3. Placed all 3 hosts into maintenance mode

    4. Added hosts to the vSAN cluster one at a time, waiting for the "Update VSAN configuration" task to fully complete.

    5. Verified all 3 hosts still saw both their SSD and HDD (the shell-side check I used is sketched after this list).

    6. Exited maintenance mode one host at a time, waiting for the process to complete before taking the next host out of maintenance mode.

    7. Double checked all settings once again. (Figured treating this like a rocket launch would help).

    8. Before adding any disks to vSAN I reviewed the Cluster Props > vSAN > General page and it showed:

      3 hosts,

      0 of 3 Eligible SSDs,

      0 of 3 Data disks,

      Total Cap 0.00 B,

      Free Cap 0.00 B

      Network Normal

      Looks good!

    9. Selected esx01 and clicked "Create a new disk group".

    10. Selected the one SSD and one HDD, clicked OK, and waited for the "Create a new disk group" task to complete...

      Viewed the C# client and it stated "Initialize disks to be used by VSAN".

      I will probably put in a feature request for the Web Client to be more specific.

    11. The task timed out after 30 minutes and the HDD had disappeared. There was a spike in traffic to the HDD but then it quickly died out. (See attached image)
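
    For reference, this is the shell-side sanity check I keep going back to on each host; a sketch, and the device names are placeholders:

      vdq -q                     # ask the host which local disks vSAN considers eligible, and why not if ineligible
      esxcli vsan storage list   # list disks already claimed by vSAN on this host
      # claiming a disk group by hand (the SSD plus its HDD) looks roughly like:
      esxcli vsan storage add -s naa.SSD_DEVICE_ID -d naa.HDD_DEVICE_ID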

    I think the vCenter vpxd task timeout might need to be increased, but I still don't think it's going to solve anything.

      http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017253
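
    If I do try it, that KB has you edit vpxd.cfg on the vCenter server and restart the vCenter service; something along these lines, though take the exact element names from the KB rather than from my memory:

      <config>
        <task>
          <!-- assumed element names; task timeout in seconds (the default is the 30 minutes my tasks keep hitting) -->
          <timeout>3600</timeout>
        </task>
      </config>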

    It doesn't appear the GA vSAN is lab/enthusiast friendly at this point. If you are going to test for prod, you will probably need to pony up the cash for HW on the HCL.

    Still, I am going to export logs and open a case with VMware and see if they can lend a hand from a purely software POV.
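
    For the log export itself, running vm-support from the shell on each host generates the bundle to attach to the case:

      vm-support   # writes a compressed support/log bundle; note the output path it prints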

    This is obviously not 100% hardware related; there is probably a bug or misconfiguration somewhere. The fact that the host loses its disk (a fully functional disk and controller without vSAN enabled) and has issues during reboots ONLY when vSAN is enabled means there is more going on under the hood.

    So for now it appears vSAN is a no go for me unless VMware support is willing to lend a hand on unsupported HW and has some ideas.

    Maybe someone reading this has other ideas.

    Thanks, JKasal



  • 12.  RE: Update VSAN configuration taking a long time

    Posted Mar 16, 2014 10:25 PM

    Hi,

    Wow, thanks for the update and the great detail you have supplied. Exactly the same scenario on my end, I'm afraid.

    Please open the case and loop me in, as I'm happy to provide logs etc. to help get this resolved.

    My AHCI controller is supported out of the box with 5.5 U1, so maybe I can lend a hand in getting this looked at?

    Did the beta builds work for you at all? Do you know where the logs for vSAN are located?

    Anyway, let me know; I'm happy to help out however possible!



  • 13.  RE: Update VSAN configuration taking a long time