
Storage latency on large environment

  • 1.  Storage latency on large environment

    Posted Sep 15, 2015 03:25 PM


    I just want to ask the community: does anybody see behavior similar to my environment's, and how do you deal with it?

    I have a lot of clusters in one datacenter. Some of the clusters run on Cisco UCS, some on HP BL460 blades. Some of the clusters are connected to a NetApp SAN, some to an EMC SAN.

    All of these devices are connected through 4 to 6 EMC SAN switches, and everything sits in the same physical data center.

    Each cluster has about 15-25 hosts, with an average of 15 VMs per host.

    I always see odd storage latency on all farms. The vmkernel logs report LUN timeouts, lost datastore heartbeats, degrading performance, etc.

    It looks as if too many VMs are running on shared storage that cannot provide enough IOPS to all of them.

    Do you see a problem similar to mine? How did you fix it?

    My understanding is that virtual disks share the same datastore, the datastore of a cluster resides on a physical LUN, the LUN shares a physical storage pool with other LUNs, and the storage pool consists of many physical disks.

    So if some virtual disks are busy, the LUN slows down, which slows the storage pool, which impacts the other LUNs in the same pool, and the virtual disks on those LUNs are slowed in turn.

    Is that correct?
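    That chain can be illustrated with a back-of-envelope calculation. Every number below is hypothetical (a common rule of thumb is roughly 150 IOPS per 10k SAS spindle): the point is just that a shared pool's IOPS ceiling is divided among everything that lives on it, so a few busy virtual disks can starve LUNs they don't even belong to.

```shell
# Back-of-envelope illustration only; every number here is hypothetical.
disks=100                    # spindles in the shared pool
iops_per_disk=150            # rough rule of thumb for 10k SAS
vms=300                      # VMs whose virtual disks land in this pool
total=$((disks * iops_per_disk))
echo "pool ceiling: $total IOPS"
echo "fair share per VM: $((total / vms)) IOPS"
# → pool ceiling: 15000 IOPS
# → fair share per VM: 50 IOPS
```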

  • 2.  RE: Storage latency on large environment

    Posted Sep 15, 2015 03:34 PM

    Are you seeing performance problems on both storage systems, EMC and NetApp?

    If yes, there could also be a problem with the SAN switches; they may be overloaded.

    How much latency do you see in the disk performance charts on your ESXi hosts? If the values are too high, this could also be a storage system performance problem.
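    One way to put numbers on that (a sketch, not the only way) is to capture esxtop in batch mode on a host and flag devices whose device-average latency (DAVG) stays above roughly 10 ms. The CSV below is a simplified, invented extract; real esxtop batch output has many more columns.

```shell
# On the ESXi host itself you would capture samples with something like:
#   esxtop -b -d 10 -n 6 > esxtop.csv
# Hypothetical, simplified extract of the per-device latency columns:
cat > sample_latency.csv <<'EOF'
device,DAVG_ms,KAVG_ms
naa.6001,28.4,0.1
naa.6002,4.2,0.1
naa.6003,31.7,0.2
EOF
# Flag devices whose device-average latency exceeds 10 ms:
awk -F, 'NR>1 && $2+0 > 10 {print $1, $2" ms"}' sample_latency.csv
```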

  • 3.  RE: Storage latency on large environment

    Posted Sep 15, 2015 11:57 PM

    Yes, I see performance issues on both. Latency was over 30 ms.

  • 4.  RE: Storage latency on large environment

    Posted Sep 15, 2015 03:46 PM

    Your general idea of VMware clustering on shared storage is correct, although terminology differs from vendor to vendor.

    I'd suggest you start looking at your problem one LUN full of VMs at a time. Normally, multiple virtual machines reside together on the same iSCSI or FC LUN, or together on the same NFS export.    To do this, select a LUN and make note of the virtual machines that reside on it.  View the read and write latency values for the LUN, as well as for the virtual machines residing on the LUN.  If these values exceed an average of 10ms or stay above 10ms for a long period of time, take a look at the virtual machines on that LUN and try to determine whether any of them are known to incur lots of I/O (e.g. SQL Servers, Oracle, etc.)  Other antagonists can be concurrent virus scans running inside virtual machines, concurrent backup jobs running against the virtual machines, etc. Try to troubleshoot one LUN full of VMs and determine if there is a common denominator.  If nothing abnormal is found there, start zooming out and looking at IOPS on the ESXi HBAs to see if you're throwing more at the arrays or LUN than they are designed to handle.  For this, you'd need to know the performance characteristics or your array and how many IOPS your LUNs disk layout is capable of handling.  Since you have multiple arrays, you'll need to do some reverse engineering there.

    Most enterprise-class arrays can handle standard workloads, so if you are humming along at <10ms for the majority of the day, then performance drops, this is likely a scheduled or user-invoked application change causing the issue.  In one case, our SAN team placed a very high I/O non-virtualized database on the same disk aggregate as our VMware volumes, and the database team was running very inefficient queries at odd hours. As VMware admins, we couldn't find the culprit until we looked at other tools (in that case, NetApp Balance,) so you may need to look at performance from the side of your array after ruling out your virtualized workloads as the offender.
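    The "zoom out to the HBAs" step can be sketched like this. The per-adapter command rates below are invented; in practice they would come from esxtop's adapter view (the CMDS/s column), sampled across all hosts in the cluster.

```shell
# Hypothetical per-HBA command rates for one host, as you might read them
# off esxtop's adapter screen (CMDS/s); replace with real samples.
cat > hba_iops.csv <<'EOF'
adapter,cmds_per_sec
vmhba1,5200
vmhba2,4800
vmhba3,150
EOF
# Sum them to see how much this host alone is pushing at the array:
awk -F, 'NR>1 {sum += $2} END {print "total IOPS from this host:", sum}' hba_iops.csv
```

    Repeat per host and compare the cluster-wide total against what the array's disk layout can actually deliver.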

  • 5.  RE: Storage latency on large environment

    Posted Sep 16, 2015 12:01 AM

    You are absolutely correct. Actually, I did everything you described and more. The problem is that we, the virtualization team, have no visibility into the back-end storage. I also suspect something else is running on the same storage.

    One thing, though: you are talking about one host, but if you have 200+ VMs in the same cluster and 20 hosts sharing 100 LUNs, it's hard to focus on only a single host. Even if one host sees latency, that doesn't mean the workload comes from that particular host; it can be any host on the shared storage.
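    When no single host tells the whole story, one approach (a sketch) is to gather vmkernel.log from every host and count which devices the cluster complains about most. The "performance has deteriorated" message is what ESXi logs when device latency jumps; the sample lines below are invented for illustration.

```shell
# Hypothetical vmkernel.log excerpts collected from several hosts and
# prefixed with the host name; the sample lines are invented.
cat > all_hosts_vmkernel.log <<'EOF'
host01 Device naa.6001 performance has deteriorated. I/O latency increased from 1200 to 28900 microseconds.
host07 Device naa.6001 performance has deteriorated. I/O latency increased from 1100 to 30100 microseconds.
host12 Device naa.6002 performance has deteriorated. I/O latency increased from 900 to 12500 microseconds.
EOF
# Count complaints per device across the whole cluster:
grep "performance has deteriorated" all_hosts_vmkernel.log \
  | awk '{print $3}' | sort | uniq -c | sort -rn
```

    A device that every host complains about points at the LUN (or the pool behind it), not at any one host's workload.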

  • 6.  RE: Storage latency on large environment

    Posted Sep 15, 2015 05:58 PM


    Besides checking vmkernel.log, you should also check vobd.log, as that log focuses on storage issues.

    If you see frequent ATS miscompare messages in vmkernel.log, you should follow the steps documented here:

    VMware KB: Enabling or disabling VAAI ATS heartbeat

    That could solve many of the issues you reported.

    Even though the article was written for an IBM Storwize array, I can confirm that the workaround also works with EMC VMAX, VPLEX, and VNX arrays.
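    A quick way to check whether that KB applies: search the log for the "ATS miscompare" phrase. The sample log line below is invented, and the esxcli setting shown in the comment is the workaround the KB describes (verify it against the KB for your ESXi version before applying).

```shell
# Hypothetical vmkernel.log sample; "ATS miscompare" is the phrase to look for.
cat > vmkernel_sample.log <<'EOF'
2015-09-15T10:12:01Z cpu4: HBX: ATS miscompare detected between test and set HB images at offset 3723264 on vol 'DS01'
2015-09-15T10:12:05Z cpu2: ScsiDeviceIO: Cmd 0x2a failed H:0x0 D:0x2 P:0x0
EOF
grep -c "ATS miscompare" vmkernel_sample.log
# If the count is high and climbing, the KB's workaround disables the VAAI
# ATS heartbeat (run on the ESXi host; check the KB first):
#   esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
```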



  • 7.  RE: Storage latency on large environment

    Posted Sep 16, 2015 12:03 AM

    Thanks for your reply.

    The KB clearly indicates it's for IBM storage and ESXi 5.5 U2.

    We have had this kind of problem since ESXi 5.0, maybe earlier.

    By the way, NetApp also had an ATS issue on early ESXi versions.

  • 8.  RE: Storage latency on large environment

    Posted Sep 16, 2015 09:02 AM


    When you're affected by these kinds of problems, you need to check either vmkernel.log or vobd.log.

    The relevant messages are listed in the VMware KB article.



  • 9.  RE: Storage latency on large environment

    Posted Sep 23, 2015 03:58 PM

    Thanks for the reply.

  • 10.  RE: Storage latency on large environment

    Posted Sep 18, 2015 02:45 PM

    To work on this storage latency issue, you have to find the exact cause of the latency. Ask yourself:

    • Is there a resource crunch in the environment?
    • Is storage overloaded with operations?
    • Are virtual machines misaligned across ESXi hosts?
    • Is there a lack of virtual machine management, e.g. large snapshots?
    • Are workloads balanced, e.g. backup jobs, database backups, and production-hour workloads managed per best practices?
    • Are virtual machines with large IOPS demands placed sensibly?

    Then:

    • List the datastores that are generating high IOPS.
    • Segregate the virtual machines that consume the most storage resources.
    • Offload workloads that run in parallel, e.g. reschedule backup jobs or any other task that may consume extra IOPS during business hours.
    • Ensure that multipathing (RR, Fixed, MRU) is configured per best practices.
    • Try changing the Round Robin IOPS value from the default of 1000 to 1 on applicable datastores, which may also help.

    If you still have the same issue, check whether enough resources are available for your environment.
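    The multipathing and IOPS=1 suggestions can be sketched as a dry run. The device identifiers below are made up; the echoed esxcli commands are the standard ones for switching a device to Round Robin and lowering its IOPS limit, but check them against your array vendor's guidance before running them for real.

```shell
# Hypothetical device identifiers; on a real host you would build this list
# from the output of: esxcli storage nmp device list
cat > devices.txt <<'EOF'
naa.60000970000192602566533030303031
naa.60000970000192602566533030303032
EOF
# Dry run: the commands are echoed instead of executed. Remove the leading
# "echo" on each line to actually apply the change on an ESXi host.
while read -r dev; do
  echo esxcli storage nmp device set --device "$dev" --psp VMW_PSP_RR
  echo esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device "$dev"
done < devices.txt
```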

    Pranay K Jha


  • 11.  RE: Storage latency on large environment

    Posted Oct 12, 2015 05:25 PM

    Hi Wilber,

    Below is a link to troubleshooting steps for storage latency.

    Troubleshooting Storage Performance in vSphere – Part 1 - The Basics - VMware vSphere Blog - VMware Blogs

  • 12.  RE: Storage latency on large environment

    Posted Oct 13, 2015 03:04 AM

    I have faced the same kind of issue and escalated it to EMC. The EMC engineer collected the performance logs of the storage box and the targets, and found that all the targets were at 90%. We added another engine to the VMAX and shared the load. So I would suggest you log a case with your storage vendor to do a performance check and get solutions from them.

    One more thing: we are using PowerPath software for multipathing so that the load is shared across all targets. In your case, if you are using a native path policy, I would suggest Round Robin with IOPS set to 1 (based on what your storage vendor supports).