  • 1.  vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted Apr 10, 2024 01:46 PM

    Hi All, 

    Issue: 

    [Screenshot: vSAN performance diagnostics warning "The vSAN cache may not be sized correctly"]

    We have a 3-node vSAN hybrid 7.0.3 cluster. Each 2U host has 18 disks fitted in 3 disk groups per host, each group with 1 x 745 GB SAS SSD cache drive and 5 x 2.4 TB SAS HDD capacity drives. 

    The focus is to determine whether the cache is sufficient. The cluster consists of 3 nodes with three disk groups each, totalling 9 x 745 GB SAS SSD cache drives (6.7 TB cache) and 45 x 2.4 TB SAS HDD capacity drives. This equates to an approximate 12% cache-to-used-capacity ratio (56.5 TB used) against the guidance of a 10% ratio.
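The ratio quoted above can be verified with a quick calculation (all figures are the ones stated in this post):

```python
# Cache-to-used-capacity ratio check for the hybrid cluster described above.
# vSAN hybrid guidance is roughly 10% of anticipated used capacity as cache.
cache_devices = 9           # 3 hosts x 3 disk groups, one cache SSD each
cache_size_tb = 0.745       # 745 GB SAS SSD per disk group
used_capacity_tb = 56.5     # used capacity reported by the cluster

total_cache_tb = cache_devices * cache_size_tb   # ~6.7 TB
ratio = total_cache_tb / used_capacity_tb        # ~0.119

print(f"Total cache: {total_cache_tb:.2f} TB")
print(f"Cache-to-used ratio: {ratio:.1%}")       # ~11.9%, above the 10% guidance
```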

    Performance charts are showing a read latency spike, as shown below. 

    [Screenshot: performance chart showing a read latency spike]

    [Screenshots: read cache hit rate charts for the 9 disk groups]

    The goal of vSAN is to achieve a 90% cache hit rate. A cache hit is when a read request is found in the read cache; a cache miss is when the block has to be retrieved from the capacity tier. Since the capacity tier uses magnetic disks, each missed read incurs latency. Looking at the read cache hit rate for the 9 disk groups above, it does not look like it is reaching 90% very often, so there are a lot of cache misses?
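The hit rate itself is simple arithmetic over the read counters; a minimal sketch (the counter names and numbers here are illustrative, not actual vSAN API fields):

```python
def read_cache_hit_rate(cache_hits: int, cache_misses: int) -> float:
    """Fraction of read requests served from the read cache."""
    total = cache_hits + cache_misses
    return cache_hits / total if total else 1.0

# Example: 850 of every 1000 reads served from cache -> 85%, below the
# 90% target, meaning 150 reads per 1000 go to the magnetic capacity tier.
rate = read_cache_hit_rate(850, 150)
print(f"hit rate: {rate:.0%}, target met: {rate >= 0.90}")
```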

    So would the cluster benefit from an additional disk group per host, and therefore more cache? 

    Or would an SPBM cache reservation be advised for the affected VMs? (There are 2 VMs that run the main LOB applications and complete batch jobs daily; these jobs used to take 10 hrs to complete and now take 12 hrs.) So this is a review to see if the cache is struggling. 

    Thanks



  • 2.  RE: vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted Apr 11, 2024 09:16 AM

    I would just create a different policy for the impacted VMs and give it a read cache reservation, but only if the performance is lower than the customer/app owner expects.
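For context, in a hybrid cluster the SPBM flash read cache reservation is expressed as a percentage of the logical size of the vmdk, so it is worth sanity-checking how much cache a given reservation would pin before applying the policy. A rough sketch (the vmdk sizes are made-up examples):

```python
def reservation_gb(vmdk_size_gb: float, reservation_pct: float) -> float:
    """Cache pinned by a flash read cache reservation
    (expressed as a percent of the logical vmdk size)."""
    return vmdk_size_gb * reservation_pct / 100.0

# Example: a 1% reservation on a 2 TB vmdk pins ~20 GB of a 745 GB cache
# device, so even small percentages add up quickly across several VMs.
print(reservation_gb(2048, 1.0))
```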



  • 3.  RE: vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted Apr 11, 2024 01:19 PM

    Thanks, yes, this was one of my plans; however, I first want to determine whether the cache in the cluster is sufficient or not.

    From what I can see, "Evictions" have been renamed to "Removals" in version 7.x onwards. 

    I can't find anything definitive on:

    1. How to determine whether the cache is sufficient or not. I know the guidance is that vSAN will aim for a 90% hit rate on cache; however, the graph I am seeing is difficult to interpret as it goes up and down, so is that OK or not? The "Removals" also look to be very active. 

    2. Also, there is no guidance I can see on removals, such as: if you see XX removals more than 5 times in a 24 hr period, add more cache?
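There is indeed no published removals threshold in the vSAN documentation that I am aware of, but the rule of thumb in point 2 could easily be scripted against exported metrics once you pick a threshold yourself. A sketch (the threshold and hourly counts are placeholders, not VMware guidance):

```python
def removal_spikes(samples: list[int], threshold: int) -> int:
    """Count samples in a window where removals exceed the threshold."""
    return sum(1 for s in samples if s > threshold)

# Hourly removal counts over 24 h (made-up numbers for illustration).
hourly_removals = [120, 80, 300, 90, 310, 70, 60, 305, 50, 40, 320, 30,
                   20, 10, 315, 25, 35, 45, 330, 55, 65, 75, 85, 95]
spikes = removal_spikes(hourly_removals, threshold=250)
if spikes > 5:
    print(f"{spikes} spikes in 24 h - consider adding cache")
```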

    [Screenshot: cache removals chart]



  • 4.  RE: vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted Apr 17, 2024 08:47 AM

    The way I look at these things is fairly basic: are people complaining about the performance? If not, then it is good enough.



  • 5.  RE: vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted Apr 17, 2024 08:54 AM

    Morning, 

    Yes, they are complaining that the processes are taking longer and longer to complete; from the initial baseline when the cluster was deployed, the process now takes 2 hours longer to complete. 

    Of course, this could be the individual VM's software, an application issue, the database having grown larger, etc., so we can look into that. However, for now I want to start with the basis of all this, the vSAN cluster, cache, etc., and see how that is getting on.



  • 6.  RE: vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted Apr 22, 2024 12:47 PM

    You could, as discussed, increase the read cache (reservation) for those VMs, if it is a limited number of VMs that would be able to benefit from this. Considering the read cache hit rate is relatively low, using this policy capability could improve performance; that is, if the performance is lagging because of slower reads, of course.



  • 7.  RE: vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

    Posted May 22, 2024 02:57 AM
    Edited by kastlr May 22, 2024 02:59 AM

    Hi,

    maybe a different design would do the trick, as IMHO the two other messages also belong to the problem.

    I would recommend the following changes for the two main LOB VMs.

    • make sure your VMs use either pvSCSI or NVMe controllers
    • use multiple SCSI controllers (up to 4)
    • create 3-4 smaller vmdks per SCSI controller
      • using a value between 2-4 for "number of disk stripes per object" 
      • size depends on the total amount of capacity needed
    • use the Guest OS LVM to create one or multiple striped volumes with a block size that aligns to the IO size used by your main LOB VMs
    • format the volumes using the same block size
    • increase the IO queue depth for the pvSCSI controllers (described here)
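As a sketch of the layout those steps describe (controller and vmdk counts from the bullets; the 8 TB total and the `scsiX:Y` labels are made-up examples, and real placements would be done in the vSphere UI or via PowerCLI):

```python
def plan_vmdks(total_capacity_gb: float, controllers: int = 4,
               vmdks_per_controller: int = 4):
    """Split the needed capacity evenly into vmdks spread across controllers."""
    n = controllers * vmdks_per_controller
    size = total_capacity_gb / n
    return [(f"scsi{c}:{v}", size)
            for c in range(controllers)
            for v in range(vmdks_per_controller)]

# Example: 8 TB spread over 4 pvSCSI controllers x 4 vmdks = 16 x 512 GB.
layout = plan_vmdks(8192)
print(layout[0], len(layout))
```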

    Such a design change allows the IO load injected by those two main LOB VMs to be handled by more vmdks, resulting in:

    • far more IO command slots available for those VMs
    • a higher chance that vSAN distributes the "objects" belonging to the vmdks across all nodes/DGs/HDDs, minimizing hot spots
    • a higher chance that the working set (i.e. the hot data of those two VMs) is distributed over all available vmdks
    • due to striping, even sequential read misses tend to land on different backend HDDs, preventing a single HDD from maxing out

    While this approach doesn't increase the read cache hit ratio, it should reduce the reported IO latency.
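The striping point can be illustrated with the round-robin arithmetic behind it: consecutive stripe-sized chunks map to different stripe components, so a sequential run of read misses rotates across backend HDDs instead of hammering one. (The 1 MB chunk size below is purely illustrative; the real vSAN component placement is more involved.)

```python
STRIPE_CHUNK = 1 << 20   # 1 MB chunk size, purely illustrative

def stripe_component(offset_bytes: int, stripe_width: int) -> int:
    """Which stripe component a given byte offset maps to (round-robin)."""
    return (offset_bytes // STRIPE_CHUNK) % stripe_width

# A sequential 4 MB read with stripe width 4 touches all four components,
# so cache misses are spread over four HDDs instead of one.
touched = {stripe_component(off, 4)
           for off in range(0, 4 * STRIPE_CHUNK, STRIPE_CHUNK)}
print(touched)
```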