vSAN1


vSAN Outstanding IO and Latency

  • 1.  vSAN Outstanding IO and Latency

    Posted Sep 13, 2023 08:51 AM

    Hi All, 

    If you saw the graphs below, would you be concerned, or does this look OK?

    • Outstanding IO Peak = 168
    • Write Latency Peak = 10.4ms (Spike lasted 1hr 20 mins from 2am to 3.20am)
    • The captures below cover the last 24 hours.
    • As can be seen, there is no congestion at all, which is good.
    • This is a SQL Cluster with batch jobs that run

    Screenshot1.png

    Screenshot3 - Copy.png
    Screenshot2 - Copy.png



  • 2.  RE: vSAN Outstanding IO and Latency

    Broadcom Employee
    Posted Sep 13, 2023 09:10 AM

    Hi  ,

    I would not worry much based on these graphs alone: there was a single peak in outstanding IOs and latency, after which things settled down. Throughput on the cluster looks good.

    I would definitely look for a trend or pattern in this behavior and ask myself a few questions:

    - Is this behavior observed every day?

    - What exactly is running in the guest OS or application when the peak is observed?

    - Has this been an issue since day 1, or did it start happening recently on a daily basis?

    - As it is a SQL cluster, which is a very IO-intensive application, I would validate the vSAN storage policy (write vs. read) and the configuration set on it (https://blogs.vmware.com/virtualblocks/2019/03/26/considerations-for-running-microsoft-sql-server-workloads-on-vmware-vsan/)

    - Other than this batch job, is there any other intensive task (backup, antivirus scan, security scan) running in the environment that could add to this?

     

    A good read: https://blogs.vmware.com/virtualblocks/2021/01/21/stripe-width-improvements-in-vsan-7-u1/



  • 3.  RE: vSAN Outstanding IO and Latency

    Posted Sep 13, 2023 09:27 AM

        



  • 4.  RE: vSAN Outstanding IO and Latency

    Broadcom Employee
    Posted Sep 13, 2023 09:59 AM

    Hi  ,

    Cluster-wise, the hardware components look great to me and should definitely be able to serve a huge number of IOPS.

    The latency in the last screenshot you shared looks high to me, so it is worth looking into the best-practice configuration.

    SQL is a highly IO-intensive application and should run on RAID10. vSAN's storage policy based management can help here: place IO-intensive applications on RAID10 and less IO-intensive VMs on RAID5.

    Regarding the number of VMs and the IOPS expected with all 100 SQL servers, I would recommend using the vSAN sizer tool.

    https://vsansizer.vmware.com/

    This tool will help you plan your migration from a vSAN perspective.

    I would also recommend taking a test SQL server and validating its performance with different vSAN storage policies.



  • 5.  RE: vSAN Outstanding IO and Latency

    Posted Sep 13, 2023 10:20 AM

    Hi 

    Yes, so from the analysis, I see two possibilities:

    1. Is there a problem/concern with the vSAN Cluster, spec and performance (by viewing vSAN Backend and outstanding IO)?
    2. Is there a problem/concern with the VM Layer (by viewing the Cluster level metrics for VM Latency and Top Contributors)?

    The last screenshot was of the VMs related to point 2 above, which show high metrics. These VMs each have 3 paravirtual controllers and follow the other SQL recommendations. The only issue I can see is that they are on R5, so I would recommend R1.

    So from the graphs you have seen, you believe there is no concern about the cluster (point 1 above) and this is more of a VM-related performance issue?



  • 6.  RE: vSAN Outstanding IO and Latency

    Broadcom Employee
    Posted Sep 13, 2023 10:31 AM

    Yes, correct. It looks to be more a matter of VM and storage-layer configuration parameters than of the entire cluster.



  • 7.  RE: vSAN Outstanding IO and Latency

    Posted Sep 13, 2023 02:55 PM

    Hi,

     

    It is a good decision that your VMs already use multiple disks attached to multiple pvSCSI controllers.

    As your environment is built from 7 nodes with 3 disk groups (DGs) per node, you could also increase the number of stripes (if not already done) in the SPBM policy used by the SQL VMs.

    To figure out whether a single vmdk is responsible for the higher latency values, you could use esxtop, as it allows a breakdown on a per-vmdk basis.
    Here's a snippet taken from our vCSA; as you can see, each vmdk is listed.

    kastlr_0-1694616406796.png

    Simply open esxtop and press v to see all VMs on that node.
    Then use the num pad: <2> and <8> let you move up and down, and finally <6> gives you the extended view.

    If your investigation finds a hot spot or bottleneck, you could think about using the guest OS LVM to create a striped volume inside the VM, built from multiple smaller vmdks.

    We have done this multiple times, and it usually distributed the IOPS more evenly across all vmdks.

    And finally, you still have the option to upgrade to vSphere 8 and use the new Large Write Buffer feature.
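
    As a rough illustration of the per-vmdk check described above, here is a minimal Python sketch. It assumes you have already noted down an average write latency per vmdk (for example from the esxtop view); the function name, the sample values, and both thresholds are made up for illustration and are not VMware guidance.

```python
from statistics import median


def find_hot_vmdks(latency_ms, factor=3.0, floor_ms=5.0):
    """Return vmdk labels whose latency stands out from the VM's median.

    latency_ms maps a vmdk label to its average write latency in ms.
    factor and floor_ms are illustrative thresholds (assumptions).
    """
    typical = median(latency_ms.values())
    return sorted(
        name
        for name, value in latency_ms.items()
        if value >= floor_ms and value > factor * typical
    )


# Hypothetical per-vmdk averages for one SQL VM.
samples = {
    "scsi0:0": 1.2,
    "scsi1:0": 1.5,
    "scsi1:1": 1.4,
    "scsi2:0": 18.7,
    "scsi3:0": 1.1,
}
hot = find_hot_vmdks(samples)
print(hot)
```

    A vmdk that shows up here would be a candidate for the guest-level striping mentioned above.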



  • 8.  RE: vSAN Outstanding IO and Latency

    Posted Sep 14, 2023 03:01 PM

    Hi, what figures would cause concern for the Outstanding IO and Latency metrics?

    Are there any guidelines from VMware whereby, if you see Outstanding IO of 1000 for over 1 hour, that would be an immediate indication of an issue?

    I have another cluster that's showing an Outstanding IO spike of up to 2,930 lasting 1 hr 20 minutes, with write latency of 50ms for that same duration.

    It would be good to know a guideline such as:

    • 90% of the time Outstanding IO should be below 200
    •  of the time Latency should be below 2ms and ideally sub 1 ms
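
    A guideline of this shape can be checked against exported chart data. The sketch below is only an assumption-laden example: it presumes 5-minute samples over one day, uses the 2,930 OIO / 50 ms spike quoted above for an 80-minute window, and invents the quiet-period values (40 OIO, 0.8 ms). The 200 and 2 ms thresholds are the ones proposed in this post, not official figures.

```python
def share_below(samples, threshold):
    """Fraction (0.0 to 1.0) of samples strictly below threshold."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s < threshold) / len(samples)


# Hypothetical day of 5-minute samples: 288 points in total,
# 16 of them (80 minutes) inside the batch window.
oio = [40] * 272 + [2930] * 16
latency_ms = [0.8] * 272 + [50.0] * 16

print(f"OIO below 200:      {share_below(oio, 200):.1%}")
print(f"Latency below 2 ms: {share_below(latency_ms, 2.0):.1%}")
```

    With these made-up numbers the cluster would still pass a "90% of the time" rule despite the spike, which is one reason a percentage alone may not be enough.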

    Thanks



  • 9.  RE: vSAN Outstanding IO and Latency

    Posted Sep 14, 2023 08:14 PM

    Hi,

    You said that you plan to migrate an additional 50 SQL VMs from an existing cluster to this vSAN cluster.

    So how does that cluster look from an IO performance point of view?
    And do those SQL VMs run their jobs in the same timeframe?

    Keep in mind that while vSAN is able to handle really high IOPS rates, it does have (like every technical system) sweet spots and areas where it could lag behind "classic" storage.

    As an example, classic arrays usually use RAM as cache while vSAN uses SSDs.
    If a single VM produces a high-IOPS workload with a small working set while using only a few outstanding IOs, vSAN might not be as fast as a classic array using DRAM as cache.
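
    The reason behind this is Little's Law: with a fixed number of outstanding IOs, the achievable IOPS is bounded by per-IO latency. A small Python sketch follows; the two latency values are purely illustrative assumptions, not measurements of any array or of vSAN.

```python
def max_iops(outstanding_io, latency_ms):
    """Little's Law: throughput = concurrency / latency."""
    return outstanding_io * 1000.0 / latency_ms


# Illustrative latencies only (assumptions): 0.2 ms per IO for a
# DRAM-backed cache and 1.0 ms per IO for an SSD-backed cache.
for label, latency in (("DRAM cache", 0.2), ("SSD cache", 1.0)):
    print(f"{label}: ~{round(max_iops(4, latency))} IOPS at 4 outstanding IOs")
```

    The gap narrows as the workload issues more IOs in parallel, which is where multiple vmdks and controllers help.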

    As writes are always handled by the cache SSDs, you should check whether those OIOs are distributed evenly over your disk groups.
    And you should figure out whether only a few vmdks are responsible for the load.

    I've seen SQL VMs with 10+ vmdks where the majority of vmdks reported low latency values, but because one or two vmdks hit the ceiling, the performance of the whole VM was negatively impacted.

     

    ---Edited---

    I'm pretty sure you're already aware of this article, but just in case I'm wrong you could use the following tweak on your SQL VMs.

    Large-scale workloads with intensive I/O patterns might require queue depths significantly greater than Paravirtual SCSI default values



  • 10.  RE: vSAN Outstanding IO and Latency

    Broadcom Employee
    Posted Sep 15, 2023 07:29 AM

    Hi  ,

    There is no specific number, as all environments are different.

    The document that was shared should also be looked at.

    Changing the storage policy to RAID10 should be a good start for observing the behavior and any change. The queue depth on the paravirtual controller should also be considered for highly IO-intensive environments.

    The hardware specs that you shared are premium. So, one step at a time:

    - Is this behavior observed every day?

    - What exactly is running in the guest OS or application when the peak is observed?

    - Has this been an issue since day 1, or did it start happening recently on a daily basis?

    - As it is a SQL cluster, which is a very IO-intensive application, I would validate the vSAN storage policy (write vs. read) and the configuration set on it (https://blogs.vmware.com/virtualblocks/2019/03/26/considerations-for-running-microsoft-sql-server-wo...)

    - Other than this batch job, is there any other intensive task (backup, antivirus scan, security scan) running in the environment that could add to this?

    https://vsansizer.vmware.com/

    https://kb.vmware.com/s/article/2053145

     

    One quick question that should also be validated: is there actually a performance problem with the application and batch job, or is it just the graph that is worrying you?

     



  • 11.  RE: vSAN Outstanding IO and Latency

    Posted Sep 15, 2023 10:08 AM

    Hi  

    Thanks. Yes, I can see there are some top contributor VMs that definitely need to be looked at, both to align them with a more suitable storage policy like RAID1 and to look at optimizations for the PVSCSI adapters, HBA, etc. That is a more focused task that can be looked at separately.

    Regarding the bigger picture, my only concern is with the graphs for OIO and backend latency. No performance issues have been reported by application owners, and the OIO and backend figures are seen at the same time every day, always have been to my knowledge, and correlate with the batch jobs that run. I am investigating because I am considering doubling the number of SQL VMs. From a storage/CPU/memory perspective, the cluster can onboard the additional 50 SQL VMs fine; the final check is performance, and this is what I am considering now.

    Your point about arrays is a good one and another consideration of mine, as these other 50 SQL VMs currently reside on a high-spec HPE 3PAR array. A LiveOptics capture has shown these 50 VMs hitting 69k IOPS at peak and 31k at 95%. The current vSAN SQL VMs are hitting 34k at peak and 22k at 95%. So at peak, these other 50 SQL VMs are hitting roughly double the IOPS of the vSAN SQL VMs, and if they are all combined into vSAN, the peak IOPS could be just over 100k. The 3PAR VMs are in an SDRS cluster with SIOC, where I can see a maximum read latency of 8ms, a maximum write latency of 35ms, and the highest IOPS for a datastore at 35k.
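
    The arithmetic behind the "just over 100k" estimate, as a short Python sketch using only the figures quoted above. Note the assumptions: adding the two peaks is a worst case that presumes both groups peak at the same moment, and adding two 95th-percentile values is only a rough approximation of the combined 95th percentile.

```python
# IOPS figures quoted in this post.
threepar = {"peak": 69_000, "p95": 31_000}
vsan = {"peak": 34_000, "p95": 22_000}

# Worst case: both groups of VMs hit their peak simultaneously.
combined_peak = threepar["peak"] + vsan["peak"]
# Rough approximation only; percentiles do not add exactly.
combined_p95 = threepar["p95"] + vsan["p95"]
# Growth relative to what the vSAN cluster serves at peak today.
growth = combined_peak / vsan["peak"]

print(f"combined peak: {combined_peak} IOPS")
print(f"combined 95%:  {combined_p95} IOPS")
print(f"peak growth:   {growth:.2f}x")
```

    Whether the batch windows of the two groups actually overlap decides how close the real peak gets to this upper bound.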



  • 12.  RE: vSAN Outstanding IO and Latency

    Posted Sep 15, 2023 10:25 AM

    Hi,

    If you already have a LiveOptics capture, you should definitely check the VMware Excel output.

    kastlr_0-1694773404964.png

    It contains information about all of your VMs, including (basic) performance metrics for each of them.

    This would allow you to figure out whether all VMs are responsible for the load or only a subset.
    Simply performing the storage and performance calculation by, e.g., dividing peak and average IOPS by the number of VMs/vmdks isn't a good idea.
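
    To show what "only a subset" could look like, here is a minimal Python sketch that ranks VMs by peak IOPS and returns the smallest group covering a given share of the total. The column names and all values are hypothetical; the real LiveOptics export will use different headers and would first need to be saved as CSV. Per-VM peaks may also occur at different times, so their sum is not the cluster peak.

```python
import csv
import io

# Hypothetical extract; real column names and values will differ.
EXPORT = """vm_name,peak_iops
sql01,21000
sql02,18500
sql03,9000
sql04,1200
sql05,800
sql06,500
"""


def top_contributors(rows, share=0.8):
    """Smallest set of VMs, largest first, covering `share` of summed peak IOPS."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    total = sum(iops for _, iops in ranked)
    picked, running = [], 0
    for name, iops in ranked:
        if running >= share * total:
            break
        picked.append(name)
        running += iops
    return picked


reader = csv.DictReader(io.StringIO(EXPORT))
rows = [(r["vm_name"], int(r["peak_iops"])) for r in reader]
print(top_contributors(rows))
```

    In this made-up data set, three of the six VMs account for more than 80% of the load, so an average per VM would badly misrepresent them.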