vSAN1

  • 1.  Remount disk taken offline with latency

    Posted Feb 28, 2016 10:09 AM

    We've had an issue on a lab setup where VSAN has taken disks offline because of latency (I'm assuming this is the behaviour described in the article "VSAN 6.1 New Feature - Handling of Problematic Disks" on CormacHogan.com).

    It looks like multiple disks have been taken offline and now some of the VMs are no longer accessible (including vCenter). I can't find any commands to force VSAN to bring the disks back online - is this possible?

    I've now run the commands in the article above to avoid this in future, but would like to avoid rebuilding the lab from scratch (again) if I can help it.

    I also find it quite strange that VSAN would, as a worst case, take disks offline and make VMs unavailable over some increased latency.

    Thanks.



  • 2.  RE: Remount disk taken offline with latency

    Posted Feb 29, 2016 11:19 PM

    Hello,

    I think collecting logs would be helpful. Here is a command for generating an HTML-formatted bundle:

    vsan.observer <cluster> --run-webserver --force --generate-html-bundle /tmp --interval 30 --max-runtime 1

    This command creates the entire set of required HTML files and then stores them in a tar.gz offline bundle in the /tmp directory (in this example). The name will be similar to /tmp/vsan-observer-<date-time-stamp>.tar.gz.

    To review the offline bundle, extract the tar.gz in an appropriate location that can be navigated to from a web browser.
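    As a rough illustration of the extraction step, here is a minimal Python sketch. The function name extract_bundle and the destination directory are assumptions made for this example, not part of vsan.observer; the only thing taken from the post is that the bundle is a tar.gz file.

```python
import os
import tarfile


def extract_bundle(bundle_path, dest_dir):
    """Extract a .tar.gz bundle into dest_dir and return the member names.

    Hypothetical helper: bundle_path would be something like the
    /tmp/vsan-observer-<date-time-stamp>.tar.gz file mentioned above.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(bundle_path, "r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse entries that would land outside dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest_root, members=members)
    return [m.name for m in members]
```

    Running "python -m http.server" from the extracted directory is one way to make the files reachable from a browser.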

    Collect some data and let's see what's in there.



  • 3.  RE: Remount disk taken offline with latency

    Posted Mar 01, 2016 04:22 AM

    That would certainly be possible if the disks going offline hadn't rendered my vCenter server useless (it will not boot, as it complains about missing disks). Every VM has actually been broken since VSAN took multiple disks offline (it's a lab, so nothing major), but I'm curious whether I can force the individual disks online, and also why VSAN would mark them offline when that causes data loss.

    I see this on the offline disks when I check from each host:

    esxcli vsan storage list

    naa.6b82a720d70583001dcd3fd31e99c463

       Device: naa.6b82a720d70583001dcd3fd31e99c463

       Display Name: naa.6b82a720d70583001dcd3fd31e99c463

       Is SSD: false

       VSAN UUID: 52e4b5b6-17d9-5b5c-7a71-602788ae693e

       VSAN Disk Group UUID: 52aa1f89-f9e5-5c33-8132-28b589fca7d8

       VSAN Disk Group Name: naa.6b82a720d70583001b60aa790ce49961

       Used by this host: false

       In CMMDS: false

       Checksum: 6120321646211848238

       Checksum OK: true

       Emulated DIX/DIF Enabled: false
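    When checking this across several hosts, a small script can pick out the disks that have dropped out of CMMDS. The following is only a sketch: it assumes the text layout shown above (a bare device name followed by "Key: value" lines), and the function names are made up for this example.

```python
SAMPLE = """\
naa.6b82a720d70583001dcd3fd31e99c463
   Device: naa.6b82a720d70583001dcd3fd31e99c463
   Display Name: naa.6b82a720d70583001dcd3fd31e99c463
   Is SSD: false
   VSAN UUID: 52e4b5b6-17d9-5b5c-7a71-602788ae693e
   VSAN Disk Group UUID: 52aa1f89-f9e5-5c33-8132-28b589fca7d8
   VSAN Disk Group Name: naa.6b82a720d70583001b60aa790ce49961
   Used by this host: false
   In CMMDS: false
   Checksum: 6120321646211848238
   Checksum OK: true
   Emulated DIX/DIF Enabled: false
"""


def parse_vsan_storage_list(text):
    """Split the listing into one dict per device (assumed layout)."""
    disks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_field = ": " in line or line.endswith(":")
        if not is_field:
            # A bare device name starts a new record.
            current = {"name": line}
            disks.append(current)
            continue
        if current is None:
            current = {}
            disks.append(current)
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return disks


def disks_not_in_cmmds(disks):
    """Return the device names whose 'In CMMDS' field is false."""
    return [
        d.get("Device", d.get("name"))
        for d in disks
        if d.get("In CMMDS", "").lower() == "false"
    ]


if __name__ == "__main__":
    for device in disks_not_in_cmmds(parse_vsan_storage_list(SAMPLE)):
        print(device)
```

    The sample text is the output pasted above with the blank lines removed; the parser skips blank lines either way.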



  • 4.  RE: Remount disk taken offline with latency

    Broadcom Employee
    Posted Mar 01, 2016 10:15 AM

    Do you see the messages described in the blog post? That would verify that this was a result of the Problematic Disk Handling feature rather than an underlying error.

    If you had FTT=1, and only one disk group was unmounted, then the VMs should still be accessible.

    IIRC, there should be an "esxcli vsan disk mount" command - sorry, I'm in transit now so cannot double check the syntax.

    If you are absolutely sure this was problematic disk handling, and the latency issue no longer exists, you can try remounting using that esxcli command.

    Also, you can contact support for more assistance with verifying root cause, and resolution.

    HTH

    Cormac



  • 5.  RE: Remount disk taken offline with latency

    Posted Mar 02, 2016 07:08 AM

    Hi Cormac,

    Yes - the messages I saw were the ones described in the blog post. I ended up rebuilding the lab and have run the commands at the end of your blog post.

    Worth noting that all the hardware I have is on the HCL (the lab is a few older servers), so I'm unsure why it runs into this issue.

    I have 3 servers, with 3 capacity disks in each - the issue affected 5 of those disks, so I'm pretty sure the data was lost.

    Thanks.
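    The arithmetic behind that conclusion can be sketched as follows. This is a simplified model, not how VSAN evaluates object health: it only computes the smallest number of hosts that 5 offline disks can be spread across when each host has 3 capacity disks, and compares that with FTT=1. Actual accessibility depends on where each object's components and witness were placed, and unmounted disks are offline rather than necessarily lost. The function names are made up for this example.

```python
import math


def min_hosts_affected(offline_disks, disks_per_host):
    """Fewest hosts that can contain this many offline disks."""
    return math.ceil(offline_disks / disks_per_host)


def must_exceed_ftt(offline_disks, disks_per_host, ftt):
    """True if even the best-case spread touches more hosts than FTT allows."""
    return min_hosts_affected(offline_disks, disks_per_host) > ftt


if __name__ == "__main__":
    # Numbers from the post: 3 capacity disks per host, 5 disks offline, FTT=1.
    print(min_hosts_affected(5, 3))
    print(must_exceed_ftt(5, 3, ftt=1))
```

    With these numbers at least two of the three hosts had offline disks, which is more than a single-failure policy is designed to absorb.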



  • 6.  RE: Remount disk taken offline with latency

    Broadcom Employee
    Posted Mar 07, 2016 03:19 PM

    OK - that's a shame. The feature only unmounts the disk groups - you should have been able to mount them again.