vSAN1


vSan disk errors when migrate

  • 1.  vSan disk errors when migrate

    Posted Mar 26, 2017 08:26 PM

    I'm migrating guests to a new vSAN solution. I have started migrating 10 guests to the vSAN datastore and now receive "congestion" and "overall disk health" warnings and failures on 3 of 6 SSD disks (see pic). I have been moving the 10 guests over the course of one hour. The warnings and failures come and go on the host and will probably disappear once the last guest is done migrating. Is it normal to get these alarms with this kind of action, or what could be wrong?

    Environment:

    3 x host (DL380 G9)

    2 disk groups per host, each with: 1x SSD (200GB) + 4x SAS (1TB)



  • 2.  RE: vSan disk errors when migrate

    Posted Mar 26, 2017 10:20 PM

    Hello Ivan,

    First off, the thresholds for this alarm are very sensitive (30 for Warning, 60 for Failed). Congestion only really becomes an issue if it starts getting near or past 200 and/or is not transient. As you said, these come and go, so I don't think you have anything to worry about here. Try migrating fewer guests at a time if you are still concerned, though.
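
    To make the thresholds above concrete, here is a minimal Python sketch that labels a congestion value and checks whether a series of samples looks transient. Only the numbers 30, 60 and 200 come from the post; the function names, the `>=` comparisons and the 25% cutoff for "transient" are assumptions made for illustration, not anything vSAN itself exposes.

```python
# Thresholds quoted in the post; the inclusive ">=" comparisons are an assumption.
WARNING_AT = 30
FAILED_AT = 60
REAL_CONCERN_AT = 200


def classify_congestion(value):
    """Map a single congestion value to a rough severity label."""
    if value >= REAL_CONCERN_AT:
        return "investigate"
    if value >= FAILED_AT:
        return "failed"
    if value >= WARNING_AT:
        return "warning"
    return "ok"


def is_transient(samples, threshold=WARNING_AT, max_fraction=0.25):
    """Return True if only a small share of samples reach the threshold.

    max_fraction is a hypothetical cutoff chosen for this sketch.
    """
    if not samples:
        return True
    above = sum(1 for s in samples if s >= threshold)
    return above / len(samples) <= max_fraction
```

    With samples collected by hand from the health view, `is_transient([0, 0, 35, 0, 0, 0, 0, 0])` would treat a single spike as transient, while a long run of values above 30 would not.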

    Also note that the congestion type specified is 'ssd congestion':

    I will mention that you are using very small SSDs here and that is likely the bottleneck. Best practice advises a cache-to-capacity ratio of at least 10% (you have 5% here, though realistically the guidance means 10% of USED capacity). Either way, if you can acquire larger cache-tier SSDs you will likely see better performance.

    You can investigate this further by running vSAN Observer during these actions to look for potential bottlenecks/issues:
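
    The 5% figure follows from the disk group layout in the first post (1x 200GB SSD in front of 4x 1TB SAS). A small Python sketch of that arithmetic, using nominal drive sizes and hypothetical function names:

```python
def cache_ratio_percent(cache_gb, capacity_drive_gb, capacity_drives):
    """Cache size as a percentage of raw capacity in one disk group."""
    return 100.0 * cache_gb / (capacity_drive_gb * capacity_drives)


def cache_needed_gb(capacity_drive_gb, capacity_drives, target_percent=10):
    """Cache size needed to reach the target percentage of raw capacity."""
    return capacity_drive_gb * capacity_drives * target_percent / 100


# One disk group from the original post: 1x 200GB SSD + 4x 1TB SAS (nominal sizes).
print(cache_ratio_percent(200, 1000, 4))  # 5.0
print(cache_needed_gb(1000, 4))           # 400.0
```

    This measures cache against raw capacity. If the 10% is taken against used capacity instead, as noted above, the requirement is smaller.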

    https://kb.vmware.com/kb/2064240

    Bob

    -o- If you found this comment useful or answer please select as 'Answer' and/or click the 'Helpful' button, please ask follow-up questions if you have any -o-



  • 3.  RE: vSan disk errors when migrate

    Posted Mar 27, 2017 08:13 AM

    Thanks for the fast reply and good answer!

    Hmm, sad that the KB seems to be deleted; the link is broken.

    Now it feels better! :)

    I was also confused about the sizing of the SSDs and asked the seller about this. The answer was that we are running FTT=1 (the default vSAN policy, so yes we are), which means we should get around 11TB of usable space out of a total of 22TB of disk. It's the 11TB that I should have 10% SSD for; I hope this is correct? :)
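
    Taking the seller's reading at face value, the numbers can be checked with a short Python sketch. It assumes FTT=1 is implemented as mirroring (two copies of the data), ignores metadata and slack space, and uses nominal drive sizes; the function names are hypothetical.

```python
def usable_tb(raw_tb, ftt):
    """Usable capacity under mirroring: FTT failures tolerated means FTT + 1 copies."""
    return raw_tb / (ftt + 1)


def cache_target_tb(usable, percent=10):
    """Cache target as a percentage of usable capacity."""
    return usable * percent / 100


raw_tb = 22                       # total disk as reported in the post
usable = usable_tb(raw_tb, ftt=1)  # about 11 TB
target = cache_target_tb(usable)   # about 1.1 TB

# Installed cache: 3 hosts x 2 disk groups x 1 SSD of 200GB (nominal).
installed_tb = 3 * 2 * 200 / 1000  # 1.2 TB

print(usable, target, installed_tb, installed_tb >= target)
```

    Under those assumptions the six 200GB SSDs land just above 10% of the 11TB figure, which matches what the seller described.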

    I ran the vCenter built-in stress test for vSAN for 10 minutes and received the results in the attached Excel document.

    The multicast performance test also passed with:

    • Receive: 81MB/s
    • Max Achieved: 125MB/s


  • 4.  RE: vSan disk errors when migrate

    Broadcom Employee
    Posted Mar 27, 2017 12:44 PM

    The issue may have different sources. Although your cache tier is not the recommended size, your workload is very small in this case. I would recommend verifying that you are using the latest firmware and drivers for all your components (controllers, drives, NICs, etc.). The type of disks you use also has an impact (queue depth): SATA SSDs have a small queue depth and are not optimal.

    Also note that many enhancements have been made in the latest versions of vSAN (6.0 U3 & 6.5).

    You can download HCIBench (a VMware Fling) to run performance tests. HCIBench will also launch vSAN Observer for you, and you can review its output after the test is done.

    HCIBench



  • 5.  RE: vSan disk errors when migrate

    Posted Mar 27, 2017 10:16 PM

    How is your network set up? Is your vSAN on a 1Gb network? Is vMotion on a different NIC?

    If they're on the same NIC, you'll probably see congestion on 1Gb. If your vSAN NIC is 1Gb, you'll probably see congestion more often than you'd like.



  • 6.  RE: vSan disk errors when migrate



  • 7.  RE: vSan disk errors when migrate

    Posted Mar 28, 2017 05:38 AM

    The network is separated: vMotion and management run over 1GbE, and vSAN runs over 10GbE on dedicated switches.

    Here's the result from HCIBench. During the test there was no alarm and the highest congestion value was 12. It looks like I only get congestion when I migrate guests over to the vsanDatastore.

    Datastore: vsanDatastore

    VMs        = 6

    IOPS       = 133098.05 IO/s

    THROUGHPUT = 519.93 MB/s

    LATENCY    = 2.8987 ms

    R_LATENCY  = 2.3227 ms

    W_LATENCY  = 4.2437 ms

    =============================

    Datastore: vsanDatastore

    95th Percentile Latency = 3.3978333333333333

    IOPS associated with 95th Percentile Latency = 115146.0

    =============================

    Resource Usage:

    CPU USAGE  = 38.75%

    RAM USAGE  = 20.46%

    VSAN PCPU USAGE = 14.8447%
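
    As a rough sanity check on the figures above, the implied I/O size and read share can be derived from the reported values. This Python sketch assumes that "MB/s" means MiB/s and that the overall latency is an I/O-count-weighted average of read and write latency; neither assumption is confirmed by the output itself.

```python
# Values copied from the HCIBench summary above.
iops = 133098.05
throughput_mb_s = 519.93
latency_ms = 2.8987
read_latency_ms = 2.3227
write_latency_ms = 4.2437


def implied_block_kib(iops, throughput_mb_s):
    """Average I/O size in KiB, assuming throughput is reported in MiB/s."""
    return throughput_mb_s * 1024 / iops


def implied_read_fraction(avg_ms, read_ms, write_ms):
    """Solve avg = r * read + (1 - r) * write for the read fraction r."""
    return (write_ms - avg_ms) / (write_ms - read_ms)


block_kib = implied_block_kib(iops, throughput_mb_s)
read_fraction = implied_read_fraction(latency_ms, read_latency_ms, write_latency_ms)
print(round(block_kib, 2), round(read_fraction, 2))
```

    The result is about 4 KiB per I/O at roughly 70% reads, that is, a small-block mixed test. A migration, by contrast, is dominated by large writes into the datastore, so this benchmark does not exercise the cache tier the same way.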

    And here is the chart for the last 3 hours. During the night, with 20 VMs running in the cluster, the congestion was always at 0. It looks like it only rises when I migrate VMs into the vsanDatastore: as you can see in the chart below, congestion is 0 until the migration begins, then it goes up to 14. Whether this is normal or not is the question. It seems like Bob has the same experience?



  • 8.  RE: vSan disk errors when migrate

    Broadcom Employee
    Posted Mar 28, 2017 01:41 PM

    I was under the impression you were all-flash, but now it is making more sense. When you are migrating VMs to the vsanDatastore, you are doing sequential writes, which is probably different from your normal workloads. Launch vSAN Observer prior to migrating VMs, and then you should be able to see what is going on.



  • 9.  RE: vSan disk errors when migrate

    Posted Mar 30, 2017 08:35 AM

    Yep, it looks like there's no problem at all under normal workload. I have migrated 50 VMs now and congestion is zero when I'm not migrating, so everything looks fine.



  • 10.  RE: vSan disk errors when migrate

    Posted Mar 27, 2017 03:25 PM

    Hello Ivan,

    That link works for me. Maybe the site was in maintenance when you checked, so try again or go to kb.vmware.com and put 2064240 in the 'View by Article ID' field.

    I find generating an offline bundle covering a period when the issue is being observed to be the most useful for analysis.

    Out of curiosity, what are the make + model of the SSDs and capacity drives in use here?

    Regarding the 'used' % cache:capacity ratio with FTT=1 worksets and whether this really affects the realistic 'used', that is something I will have to look into further.

    Bob




  • 11.  RE: vSan disk errors when migrate

    Posted Mar 27, 2017 06:48 PM

    GreatWhiteTec: What should the recommended cache tier size for the SSDs be? I don't really get how it could be wrong; the setup is HP's Ready Nodes HY6, an HP package built for vSphere with vSAN. The built-in driver check in vCenter first said that the storage controller driver was not OK, so I upgraded to the compatible version. After that everything was compatible, but that was before the errors started.

    I'm running the latest 6.5. I will try the HCIBench.

    Bob, you're right, the site works now!

    Capacity as shown in vSAN physical disks:

    SAS: 931GB (HP 1TB 6G SAS 7.2K 2.5in SC MDL - 832514-B21)

    SSD: 189GB (HP 200GB 12G SAS ME 2.5in EM SC H2 - 779164-B21)



  • 12.  RE: vSan disk errors when migrate

    Broadcom Employee
    Posted Mar 27, 2017 07:38 PM

    The previous recommendation was to have cache equal to at least 10% of the workload size. Recently the recommendation has shifted more toward the workload I/O mix (100% WR vs. 70/30 RD/WR).

    See this VMware blog post for more info: Designing vSAN Disk groups - All Flash Cache Ratio Update - Virtual Blocks



  • 13.  RE: vSan disk errors when migrate

    Posted Mar 27, 2017 09:00 PM

    Looks like that recommendation is only for all-flash; I'm running hybrid. From the blog:

    I’ve deployed or am going to deploy a Hybrid vSAN cluster.  Am I impacted?

    No, the revised sizing guidelines are applicable only to All Flash

    Hybrid will continue to use 1:10 caching to usable capacity ratio