VMware vSphere


High IO latency from simple file copy?

  • 1.  High IO latency from simple file copy?

    Posted May 23, 2012 08:39 AM

    A simple file copy on the local C: disk, from one folder to another, on a Windows 2008 R2 virtual machine causes disk latency (DAVG/wr) to go up to 300 ms. If I give the virtual machine another drive, D:, that is on another LUN, a file copy from D:\ to C:\ even makes DAVG/wr latency go up to 1500ms. The high write latency is measurable on other virtual machines on the same LUN.

    The same file copying activity on a Windows 7 virtual machine on the same LUN leaves disk latency (DAVG/wr) below 20ms.

    Latency is measured with esxtop on the host and iometer inside guests. During my tests there were no other virtual machines running.

    Is the disk latency *supposed* to go so high from a simple file copy? It would make me uncomfortable if somebody copying a large file to another folder on the file server could blow up write latency for all other virtual machines too. Or is it not supposed to, and I have misconfigured something?

    Our setup:

    Server is HP ProLiant DL165 G7

    SAN is HP MSA P2000i G3

    ESXi 5.0 Driver Rollup 2

    The server has 4 gigabit Ethernet ports. vmnic0 is connected to a switch that carries the management and virtual machine networks. vmnic1 is not used. vmnic2 goes to port A0 on the SAN and vmnic3 goes to port A1. Controller B on the MSA has been shut down. The SAN has two LUNs. Using "Manage Paths" I have disabled path vmnic2-A1 for Lun0 and vmnic3-A0 for Lun1, so each LUN has a dedicated Cat6 cable. Both the Windows 2008 R2 and the Windows 7 virtual machines were installed on Lun0.



  • 2.  RE: High IO latency from simple file copy?

    Posted May 23, 2012 11:04 AM

    erietveld wrote:

    Is the disk latency *supposed* to go so high from a simple file copy?

    No, those are very high numbers. Do you observe this only on the Windows 2008 R2 server and not on any other VMs?



  • 3.  RE: High IO latency from simple file copy?

    Posted May 23, 2012 11:20 AM

    Yes, it happens this way only on the Windows 2008 R2 machine, and not on the Windows 7 machine. I have deleted the VMs and reinstalled them from scratch a number of times.

    I can also create very high latency in esxtop on Linux virtual machines by writing directly to the disk (note that this overwrites the disk contents) like so:

    dd if=/dev/zero of=/dev/sda bs=1M count=5000

    This causes DAVG/wr to go to 300 ms or higher in esxtop. A simple file copy on Linux does not cause high latency.



  • 4.  RE: High IO latency from simple file copy?

    Posted May 23, 2012 11:35 AM

    I am not familiar with the specific SAN you have, but could there be issues with write caching? Does it have a battery-backed cache enabled?

    A missing or unconfigured write cache could cause very slow write times.



  • 5.  RE: High IO latency from simple file copy?

    Posted May 23, 2012 12:40 PM

    The SAN has write caching enabled. "Battery-free cache backup with super capacitors and compact flash"

    The console of the SAN shows no warnings or errors. I had the vendor (HP) check for hardware issues. We even replaced the controller. This did not help.

    I see no write latency issues when attaching the LUNs to a physical machine (i.e., running something other than ESXi).

    Copying a file from one LUN to another in the vSphere Client also shows no issue.



  • 6.  RE: High IO latency from simple file copy?

    Posted May 23, 2012 12:44 PM

    Do you see anything else strange while doing heavy disk activity - like high CPU on the specific guest or on the host?

    What kind of network usage do you see on the vmnics?

    Which scsi controller type are you using in the Windows 2008 machine?



  • 7.  RE: High IO latency from simple file copy?

    Posted May 23, 2012 03:40 PM

    I haven't noticed anything out of the ordinary, but that doesn't mean nothing is. During the file copy, "explorer.exe" has 25% CPU usage on the guest, which is unusually high for a copy operation but does not seem to indicate a bottleneck. The Windows2008R2 server has only one virtual CPU assigned.

    On the host, during the file copy the CPU usage spikes up to about 8% from 1.5% average.

    During file copy from C: to D: (Lun0 to Lun1), the data receive on vmnic2 goes to about 50 MB/s and the data transmit on vmnic3 goes to about 50 MB/s. There seems to be no other noticeable activity.

    On the Windows 2008 R2 virtual machine, in "Device Management" -> "Storage Controller" the following device is listed:

    LSI Adapter, SAS 3000 series, 8-port with 1068

    If that is not what you meant with "Which scsi controller type" please tell me how to find that out.

    It's a fresh install from DVD, I haven't installed anything or made any changes except configuring the network, and installing iometer.



  • 8.  RE: High IO latency from simple file copy?

    Posted May 24, 2012 01:55 PM

    erietveld wrote:

    During file copy from C: to D: (Lun0 to Lun1), the data receive on vmnic2 goes to about 50 MB/s and the data transmit on vmnic3 goes to about 50 MB/s. There seems to be no other noticeable activity.

    CPU usage is low, so that should not be the issue. Is the 50 MB/s you see really MB (as in megabytes), or is it megabits? If it is megabytes, the throughput is still acceptable, but not if it is megabits.

    On the Windows 2008 R2 virtual machine, in "Device Management" -> "Storage Controller" the following device is listed:

    LSI Adapter, SAS 3000 series, 8-port with 1068

    If that is not what you meant with "Which scsi controller type" please tell me how to find that out.

    You can see it in the vSphere Client: open the VM's settings and check the SCSI controller type. However, it is almost certainly "LSI Logic SAS", which is good and should not be the issue either.



  • 9.  RE: High IO latency from simple file copy?

    Posted May 24, 2012 03:24 PM

    What is the value of Disk.SchedNumReqOutstanding in the host advanced settings?



  • 10.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 07:44 AM

    Indeed, the SCSI controller is LSI Logic SAS.

    For the data transmit/receive, the unit listed is KBps (in that capitalization) and the value is above 50000.

    Windows 2008R2's file copy dialog reports 45 MB/second transfer speed (in that capitalization).

    Since it is a cat6 gigabit link, and the array can easily handle more IOPS, I would have expected it to be capable of twice that, but I am much more concerned about the high latency than the throughput.

    The value of Disk.SchedNumReqOutstanding is shown as 32. (I have not changed any advanced setting after installing ESXi Driver Rollup 2)
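
    As a quick sanity check on the units discussed above, the sketch below converts the reported 50000 KBps into megabits per second and compares it with a 1 Gbit/s link. It assumes that KBps means kilobytes per second with 1 KB = 1000 bytes, and it ignores iSCSI/TCP overhead. The function name is made up for illustration.

```python
def kbps_to_megabits(kilobytes_per_second):
    """Convert kilobytes/s to megabits/s (assumes 1 KB = 1000 bytes)."""
    return kilobytes_per_second * 1000 * 8 / 1_000_000


LINK_MBIT = 1000.0  # nominal gigabit Ethernet rate, protocol overhead ignored

rate_mbit = kbps_to_megabits(50_000)
print(f"{rate_mbit:.0f} Mbit/s, {rate_mbit / LINK_MBIT:.0%} of the link")
```

    Under those assumptions the copy uses roughly 40% of one gigabit link, which fits the expectation that the link could carry about twice as much.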



  • 11.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 08:47 AM

    erietveld wrote:

    For the data transmit/receive, the unit listed is KBps (in that capitalization) and the value is above 50000.

    Windows 2008R2's file copy dialog reports 45 MB/second transfer speed (in that capitalization).

    Since it is a cat6 gigabit link, and the array can easily handle more IOPS, I would have expected it to be capable of twice that, but I am much more concerned about the high latency than the throughput.

    That is decent throughput, but as you say the latency values are far too high and will likely affect performance a lot.

    Could you take some esxtop screenshots during a file copy? The d, u and v screens would be useful.



  • 12.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 08:56 AM

    As mentioned on another thread, latency is just the product of queue depth and transaction time against the number of drives. Can you provide some info about the array? For whatever reason, this 2k8R2 VM is just saturating its controller queue.

    Basic file handling does seem to be a problem with 2k8 and R2, though. Only last week I came across a situation where Win2k8 (not R2 in that case) would aggressively cache file data to the exclusion of quite literally everything else (this is demonstrable on both physical and virtual installs).



  • 13.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 10:22 AM

    The top one is the output of the show disk-statistics command on the array.

    The other three are esxtop in d, u, and v mode, respectively.

    If you need more sampling points, please let me know. This screen was taken towards the end of the file copy (in the last minute), but the latency was consistently above 300 ms, and often as high as 1000 ms. The three esxtop screens are not exactly in sync, but they are within 1 second of each other.

    @J1mbo: unfortunately, I am not experienced enough to know what would be interesting for you to know about the array. Can you be more specific about what I should tell you?

    It's an HP MSA P2000i G3, with 12 15k rpm 600 GB SAS drives

    8 are Hitachi HUS156060VLS600

    4 are Seagate ST3600057SS



  • 14.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 11:15 AM

    Thanks for the esxtop screens.

    Just a few questions. I guess vmhba36 is the software iSCSI adapter?

    Could you also provide an esxtop "n" screenshot taken during a file copy?

    As for the SAN, do you know how the two datastores are physically configured? That is, how many disks and what RAID level?



  • 15.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 12:03 PM

    Yes, vmhba36 is the software iSCSI adapter. Attached is the esxtop n screen during a file copy.

    Lun0 and Lun1 are both 6-disk RAID 6 arrays. Earlier, I tested with a 12-disk RAID 0 array and still got latency above 300 ms, but I cannot reproduce that currently as I don't have free disks. The SAN vendor (HP) has walked me through a long troubleshooting procedure and has insisted that the problem is not in the SAN array or its current configuration.

    Copying on the same partition gives latency levels above 300 ms and the same transfer speed (45 MB/second).

    Both LUNs perform the same.

    There are currently no other hosts connected to the SAN.



  • 16.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 12:08 PM

    Everything looks quite good on the network view too. I see that both iSCSI vmnics are used and none of them have any real load either. No dropped packets.

    RAID 5 and RAID 6 do have some write penalty, but nothing like what you are seeing. Are you sure that the cache settings are OK? I am sure you have verified this, but could something be incorrect in the write-through/write-back settings?

    By the way, could you run a disk performance tool such as IOmeter, doing only reads or only writes, and see what the result is?



  • 17.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 12:29 PM

    As far as I can verify, the caching settings are OK. The interface tells me caching is enabled, there are no warnings or errors, and an HP support engineer has assured me that the caching settings are correct. Also, when connecting the SAN to another server, such as a Linux host instead of ESXi, we can write with 100 MB/s throughput and low latency. The latency problem does not reproduce when we copy a file from one LUN to another in the vSphere Client, nor when we copy a file inside a Windows 7 virtual machine.

    How should I configure IOMeter to do a proper test?

    When I configure it with one worker doing 16K writes (0% read, 0% random) and allow 8 outstanding IOs, it writes with 45 MB/s throughput and 3 ms latency on the Windows 2008 R2 server. iometer reports the same figures that I see in esxtop. If I allow 32 outstanding IOs, the reported throughput is 55 MB/s and latency goes up to 9 ms. Again iometer and esxtop agree. Doing only 16K reads with 32 outstanding IOs, it reads with 110 MB/s and 4 ms DAVG/rd.
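
    These IOmeter figures are roughly what Little's Law predicts: average latency is about the number of outstanding IOs divided by IOPS. The sketch below is only a back-of-the-envelope check. It assumes 1 MB = 1024 KB and that the queue stays full for the whole test; the function name is hypothetical.

```python
def expected_latency_ms(outstanding_ios, throughput_mb_s, io_size_kb):
    """Little's Law estimate: latency = outstanding IOs / IOPS."""
    iops = throughput_mb_s * 1024 / io_size_kb  # assumes 1 MB = 1024 KB
    return outstanding_ios / iops * 1000


# (outstanding IOs, MB/s, IO size in KB) taken from the three tests above
tests = [(8, 45, 16), (32, 55, 16), (32, 110, 16)]
for outstanding, mb_s, size_kb in tests:
    estimate = expected_latency_ms(outstanding, mb_s, size_kb)
    print(f"{outstanding} outstanding at {mb_s} MB/s: {estimate:.1f} ms")
```

    The estimates come out at about 2.8 ms, 9.1 ms and 4.5 ms, close to the 3 ms, 9 ms and 4 ms measured, which suggests that at small IO sizes the latency is ordinary queueing and not a fault.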



  • 18.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 12:42 PM

    I wonder if the Windows 2008 R2 server is using some really large IOs, which could cause these extreme latencies. We saw only 50 commands per second while around 50 MB per second was being moved.

    Could you check the IO size for both read and write while doing transfer:

    Avg. Disk Bytes/Read

    Avg. Disk Bytes/Write

    in the PhysicalDisk section of perfmon. (http://rickardnobel.se/archives/220)
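
    The reasoning about IO size above is simple arithmetic: average IO size is throughput divided by commands per second. A minimal sketch, assuming 1 MB = 1024 KB and using the rounded esxtop figures quoted in this post (the function name is illustrative only):

```python
def avg_io_size_kb(throughput_mb_s, commands_per_s):
    """Average size of one IO in KB, given throughput and command rate."""
    return throughput_mb_s * 1024 / commands_per_s  # assumes 1 MB = 1024 KB


# About 50 MB/s moved with only about 50 commands per second
print(f"{avg_io_size_kb(50, 50):.0f} KB per command")
```

    That works out to about 1 MB per command as seen by the host, far larger than the 16K IOs used in the IOmeter test.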



  • 19.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 12:59 PM

    Also, is the 2k8r2 VM a domain controller?



  • 20.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 01:09 PM

    You are on to something here. When I tell IOMeter to do 1MB writes, it also gives me high latency.

    During file copy:

    Perfmon average disk bytes/write is around 25,000,000

    Perfmon average disk bytes/read is around 1,000,000

    The virtual machine is not (yet) a domain controller.

    Now how do I solve this problem?

    A) Is it normal for a 2008R2 server to do really large IOs?

    B) Is it normal to get really high latency on all disk IO if a server does that?

    C) What is the best way to prevent the high latency?



  • 21.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 01:17 PM

    erietveld wrote:

    Perfmon average disk bytes/write is around 25,000,000

    Perfmon average disk bytes/read is around 1,000,000

    Was the write size really 25 MB? That is very large and unusual.

    A) Is it normal for a 2008R2 server to do really large IOs?

    B) Is it normal to get really high latency on all disk IO if a server does that?

    C) What is the best way to prevent the high latency?

    A. I have not seen anything that large before. However, it can be good for a server to be able to do large IOs, since it means the data is laid out contiguously on the disk and is ready for sequential access. If there really are 25 MB of data lying next to each other, I guess it is better to do 1 IO with a very large size than to send 1000 smaller IOs.

    B. Latency is typically higher the larger the IO size is, for the natural reason that it takes longer for the disk system to transfer it.

    C. Perhaps there is no problem. Other kinds of disk access might have to use smaller IOs simply because the files are smaller. You could try copying, say, the Program Files folder to the other partition and note the result.



  • 22.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 01:20 PM

    Yes, but a queue depth of 32 IOs with 50MB transfers will yield massive latency with pretty much any storage system.

    Out of interest, what is the value of Disk.DiskMaxIOSize on the host?



  • 23.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 01:34 PM

    I did some quick tests on a Windows 2008 R2 server with large file copies.

    I found that it does reads (if possible) in IOs as large as 1 MB.

    Writes can be issued in chunks of exactly 32 MB. Those are really large write IOs, but I do not think there is anything wrong with that. A bit cool to see. :)
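
    To put a 32 MB write in perspective, here is a rough lower bound on how long a single IO of a given size occupies a gigabit link. This is only a sketch: it assumes 1 MB = 1024 * 1024 bytes, a raw link rate of 10^9 bits/s, and no iSCSI/TCP overhead or array service time. The function name is hypothetical.

```python
def wire_time_ms(io_size_mb, link_gbit_s=1.0):
    """Minimum time in ms to push one IO over the link, overhead ignored."""
    bits = io_size_mb * 1024 * 1024 * 8
    return bits / (link_gbit_s * 1e9) * 1000


for size_mb in (1, 32):
    print(f"{size_mb} MB IO: at least {wire_time_ms(size_mb):.1f} ms on the wire")
```

    Under these assumptions a single 32 MB write needs about 268 ms on the wire alone, so write latencies of several hundred milliseconds are plausible on gigabit iSCSI even when nothing is broken.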



  • 24.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 01:44 PM

    Surprising that there don't seem to be any tunables for this (that I can find, anyway). Very clearly that profile will cause latency in competing workloads during sustained transfers if IOs of that size are making it through the ESX layer (hence the interest in the DiskMaxIOSize setting there).

    BTW, this was a great spot, credit where it's due :)



  • 25.  RE: High IO latency from simple file copy?
    Best Answer

    Posted May 26, 2012 06:56 AM

    Thanks. It is strange that this behavior has not received any attention. It does look like it could cause some odd results on shared storage if many VMs are competing for the same disk systems.

    It might be that in reality IOs this large are often not possible, since there must really be 32 MB of contiguous free disk space available.



  • 26.  RE: High IO latency from simple file copy?

    Posted May 30, 2012 11:27 AM

    DiskMaxIOSize was 32768 KB. Setting it to 64 KB seems to have had no effect other than increasing CPU usage on the host. I can still bring the latency well above 300 ms by telling IOMeter to do 4 MB writes, even with that value set to 64 KB.

    However, I have tried to copy the windows folder instead of a few multigigabyte test files, and the latency stays well below 20ms. It might indeed be that these kinds of large IOs do not happen in practice. In this company, I suspect it is not likely to cause any serious performance issues.
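
    One possible reading of why lowering DiskMaxIOSize did not help is sketched below. It assumes that ESXi splits guest IOs larger than the limit into smaller commands of at most that size, so the total amount of data sent to the array stays the same. The helper is hypothetical and only models the splitting arithmetic, not ESXi itself.

```python
def split_io(io_size_kb, max_io_size_kb):
    """Split one IO into pieces no larger than max_io_size_kb (sizes in KB)."""
    full, rest = divmod(io_size_kb, max_io_size_kb)
    return [max_io_size_kb] * full + ([rest] if rest else [])


pieces = split_io(4 * 1024, 64)  # a 4 MB IOmeter write with a 64 KB limit
print(len(pieces), "commands,", sum(pieces), "KB in total")
```

    A 4 MB write becomes 64 commands of 64 KB, but the array still has to absorb the same 4 MB, which may be why latency stayed high while host CPU usage went up.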



  • 27.  RE: High IO latency from simple file copy?

    Posted May 30, 2012 01:45 PM

    erietveld wrote:

    It might indeed be that these kinds of large IOs do not happen in practice.

    I think that is very likely, as you say. To be able to do this kind of very large IO, the VM must have a bit of "luck", that is, have large amounts of data lying next to each other on the disk. Except when copying very large files such as ISO images, that is probably uncommon.

    Fragmentation of the VM's disk will also make this less likely to happen.



  • 28.  RE: High IO latency from simple file copy?

    Posted May 25, 2012 11:30 AM

    Some comments on the esxtop data so far:

    The "d" screen:

    Around 90 IOs per second, half reads and half writes. There is no kernel latency for the IOs, only device latency. Read times are almost decent at around 25 ms, but write latency is very high: 438 ms.

    The "u" screen:

    52 read commands per second, yet still around 52 MB read per second, which is a bit strange. Very large IOs?
    43 writes/s to the other LUN and 20 active commands, that is, commands in flight. This also indicates that the writes are slow, since you have 20 commands outstanding and each takes some 400 ms to complete.

    The "v" screen:

    Only the fitw02 VM is doing any disk activity, so there should be no disturbance from other VMs. Are there any other ESXi hosts connected to the same SAN?

    Have you tried reading something and writing it back to the same Windows partition? That is, involve only the first LUN and throw both reads and writes at it? And then try the same on the second LUN? It would be interesting to see whether they perform the same.


  • 29.  RE: High IO latency from simple file copy?

    Posted Sep 12, 2012 03:07 AM

    I am having exactly the same issue. Did you get a resolution?



  • 30.  RE: High IO latency from simple file copy?

    Posted Oct 26, 2012 09:07 PM

    I am having a similar issue with iSCSI targets on a Nexenta storage system equipped with 16 Seagate Constellation ES.2 drives, 2 STEC ZeusRAM for ZIL and 2 Intel 520 180GB for L2ARC. Large file copies bump the latency very high.