vSphere Storage Appliance


Disk Latency Issues

  • 1.  Disk Latency Issues

    Posted Nov 03, 2010 06:06 PM

    Hello -- I'm battling some disk latency issues on my SAN. I have 3 HP DL380 G7 ESX servers with dual Brocade 8Gb HBAs hooked to dual HP (Brocade) 8Gb FC switches. I have an HP P2000 G3 FC 8Gb SAN, and both Controller A and B are connected to both HP switches.

    It seems to me like when one VM is backed up (high disk queue length), it maxes out and drives disk latency up on all LUNs, on all ESX hosts. I can see how it might max out the LUN or array it's working on, but why LUNs on different disk arrays on different ESX hosts?

    I've been watching for the past hour while one of our VMs has been doing a big SQL Update. It seems like it just kills everything when the disk queue length on the VM goes above 30-50.

    I have 2 Disk Arrays (13 disk raid 6, 2 LUNs) and (5 disk raid 5, 2 LUNs) that the VMs are split up on. Those 2 disk arrays are the only arrays/LUNs owned by Controller B in this SAN.

    Here are a few screenshots during slowdowns: esxtop, SSMS, and the OS view.

    Sometimes it doesn't even seem like it's doing a lot (16MB/sec read, 2MB/sec write), and then sometimes it's been at 7000 cmds/sec and 50-100MB/sec read/write and not delayed at all.

    I hate this SAN; don't ever buy one if you're considering it. I've never heard of a SAN that doesn't come with any tool whatsoever to monitor its performance, according to HP anyway.

  • 2.  RE: Disk Latency Issues

    Posted Nov 03, 2010 09:53 PM


    I can feel your pain. One thing I have noticed with your layout is that you have all LUNs owned by the same controller. Are you able to split these across the controllers, or even put the LUN that is creating the high IO on its own controller? Let us know if this improves things.

    Also what Path Selection policy are you using for these datastores?

    Kind regards.

  • 3.  RE: Disk Latency Issues

    Posted Nov 03, 2010 09:58 PM

    The path policy is MRU, and it says storage array type - VMW_SATP_ALUA

    The LUNs used to be split up. On this horrible SAN you can only assign a whole array to a controller; you can't split the LUNs in that array between different controllers.

    I have a big SQL database server with a LUN off of Controller A and all of my VMware stuff off of Controller B.

  • 4.  RE: Disk Latency Issues

    Posted Nov 03, 2010 10:28 PM

    Okay, let's get some of our terminology sorted. Sorry for this, but it removes assumptions.

    If I'm understanding you correctly, you have a total of 18 disks in your P2000 G3 array. These are configured as two vdisks (let's call them vdisk_01 and vdisk_02), with vdisk_01 having 13 disks (RAID6) and vdisk_02 having 5 disks (RAID5). From here, vdisk_01 is managed by controller A and vdisk_02 is managed by controller B.

    Within vdisk_01 you have configured two volumes (vdisk_01_vol_01 and vdisk_01_vol_02) and again on vdisk_02 there are two volumes (vdisk_02_vol_01 and vdisk_02_vol_02).

    The volumes on vdisk_01 are being accessed by a SQL server (VM or physical?) and the volumes on vdisk_02 are being used by VMs hosted on ESX servers.

    Is this correct so far?

    Message was edited by: ThompsG

  • 5.  RE: Disk Latency Issues

    Posted Nov 03, 2010 10:59 PM

    Close. I didn't mention the entire SAN in my original message, just the VMware arrays.

    Total config:

    vdisk 1 - 10 disks, RAID 5 - Controller A

      lun 1 - 5TB, assigned to physical SQL server

    vdisk 2 - 13 disks, RAID 6 - Controller B

      lun 1 - 1.9TB, assigned to VMware cluster

      lun 2 - 1.9TB, assigned to VMware cluster

    vdisk 3 - 5 disks, RAID 5 - Controller B

      lun 1 - 1.9TB, assigned to VMware cluster

      lun 2 - 447GB, assigned to VMware cluster

    I know this isn't the best configuration in the world, but most of this disk was pre-existing when I started at this company and I have to work with it.

  • 6.  RE: Disk Latency Issues

    Posted Nov 06, 2010 03:16 PM

    I'm not much help here but...

    Are you by chance able to get a maintenance window that allows you to take your VMs down and run Iometer tests against the LUNs to test their performance? If you don't get the results you expect, then you might have some kind of hardware problem going on.

    Are you positive you don't have a failed drive in the array(s)? Failed paths? Is the firmware updated (and identical) on both arrays? Any insight from HP? 10K or 15K disks?

    Just thinking out loud here...


  • 7.  RE: Disk Latency Issues

    Posted Nov 06, 2010 04:34 PM

    All disks are 15K 6Gb SAS, with 8Gb FC on the host side. All paths and everything are fine. I'm actually going to update the controller/enclosure firmware later today to see if that helps; it's a few levels behind.

    Going to do that before I call HP, since that's all they will tell me to do anyway.

  • 8.  RE: Disk Latency Issues

    Posted Nov 07, 2010 11:00 PM

    Okay, thanks for that and sorry for taking so long to respond.

    Which of the LUNs, when workload is created on it, increases the latency of the other LUNs (i.e. vdisk1_lun1, vdisk2_lun1, etc.)? And what is the workload, i.e. backups (from where to where), SQL transactions, etc.?

    Kind regards.

  • 9.  RE: Disk Latency Issues

    Posted Nov 08, 2010 01:25 AM

    I updated the firmware in the SAN last night (which turned into a 7 hour ordeal). During the update one of the enclosures dropped off completely and two of the vdisks went off-line and HP spent a good part of the night troubleshooting and getting the vdisks back on-line.

    I'm hoping that the firmware update itself will help, but now I also know a few basic commands in the CLI to see IOPS, bytes/sec, and controller CPU, so during these events, if they still happen, I can get some more data on total IOPS at the SAN level and whether the controller is maybe CPU-bound or something.

    I'll mess around with it some tomorrow during some heavy load times and see what happens --

  • 10.  RE: Disk Latency Issues

    Posted Nov 08, 2010 02:06 AM

    Looking forward to an update after the firmware to see if you still have the issue.

    My feeling would be that it's one of two things: a) not enough spindles, b) the controllers are the bottleneck.

    We actually have the latter, in that we have an EVA8400 with 176 FC and 100 FATA spindles, and when trying to do D2D backups we actually bring the array to its knees. Not because of running out of IOs, but because the controllers hit 100%, and then it is all downhill from there :smileyhappy:

    This has forced us to use local storage in the backup server in order to do VCB image backups. This may change with the vStorage APIs, but that is the case at the moment.

    Kind regards.

  • 11.  RE: Disk Latency Issues

    Posted Nov 08, 2010 03:30 PM

    Well, I ran a few benchmarks this morning and I'm able to duplicate the issue. It wasn't even a super intense benchmark: I ran two at the same time, one on a VM on esx-array-ra5 and one on a VM on raid6_array. I saw disk latency reach about 50-100ms in esxtop, and at the same time I was checking the limited performance information from the SAN.

    I saw a few times, like the one below, where the controller CPU was very high. I'm guessing we are saturating the controllers and that's what's causing all this. It's pretty pathetic that 1000 IOPS or so can max out a controller on a new SAN.

    Well, I'm not sure if there is anything else that can be done at this point. The vovo_datastore has way more usage, as you can see by the counters, than both of my VMware arrays, so I still think it makes sense to have vovo_datastore on Controller A and the two VMware vdisks on Controller B.

    # show controller-statistics

    Durable ID CPU Load Power On Time (Secs) Bytes per second Number of Reads Number of Writes Data Read Data Written


    controller_A 14 115907 61.6MB 38324668 9889529 4761.3GB 478.7GB

    controller_B 91 115908 597.3MB 32384077 7593574 1227.5GB 181.5GB


    Success: Command completed successfully.

    # show vdisk-statistics

    Name Serial Number Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written


    esx-array-ra5 00c0ffdabdf7000049fd7b4c00000000 187.5MB 540 2993945 768228 614.6GB 71.2GB

    raid6_array 00c0ffdabe8b0000f93dc64b00000000 338.7MB 490 6102960 724217 1877.0GB 113.7GB

    vovo_datastore 00c0ffdabdf7000015c7bb4b00000000 63.0MB 112 10227911 8052385 4765.7GB 415.8GB


  • 12.  RE: Disk Latency Issues

    Posted Nov 08, 2010 11:10 PM


    Thanks for all this. The first thing I would like to get out of the way is that we must remember the P2000 is an array aimed at the SME market and, to be honest, getting 1000 IOPS is pretty impressive on this configuration. If you look at the specs you can see several limiting factors, e.g. 2GB of cache per controller.

    Now that that is over, there are some things we could do to improve performance, or at least try to. These are only recommendations and no warranty is implied or given :smileywink:

    The first change is non-destructive and easier to do: move esx-array-ra5 to controller_A. You would need to monitor this closely, but it would limit the issue. It would also load-balance the spindles (15 on controller A, 13 on controller B). I suggest esx-array-ra5 as it is the smaller vdisk and less likely to impact vovo_datastore.

    There are other things that we can do to improve performance but they are going to take a bit of work.

    1) raid6_array - there is an overhead with running RAID6, and unless you need the double parity (i.e. you are running large ATA drives, meaning rebuild times are slow), you can improve performance by converting to RAID5.

    2) Are your vmdk data disks aligned? At the top end this will reduce the number of IOs the controllers need to service and can buy you some headroom.

    3) LUN sizes - you have three LUNs of 1.9TB each. In the past (with the MSA1000s) I have found that presenting smaller LUNs is actually better for the controllers - read: less work.

    Trust this helps.

    Kind regards,
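    To put a number on the RAID6 overhead in suggestion 1), here is a sketch of the standard write-penalty arithmetic: a small random write costs 4 back-end IOs on RAID5 and 6 on RAID6. The per-disk IOPS figure and the 70/30 read/write mix are illustrative assumptions, not measurements from this array:

    ```python
    def effective_iops(spindles, read_fraction, write_penalty, disk_iops=180):
        """Front-end random IOPS an array can sustain, given that each
        random write fans out into `write_penalty` back-end IOs."""
        backend = spindles * disk_iops          # raw back-end IOPS
        write_fraction = 1 - read_fraction
        return backend / (read_fraction + write_fraction * write_penalty)

    # Same 13 spindles, assumed 70/30 read/write mix:
    r5 = effective_iops(13, 0.7, 4)   # if rebuilt as RAID5
    r6 = effective_iops(13, 0.7, 6)   # current RAID6 vdisk
    print(f"RAID5: {r5:.0f} IOPS, RAID6: {r6:.0f} IOPS")
    ```

    Under these assumptions the same 13 disks sustain roughly 30% more front-end IOPS as RAID5 than as RAID6, before even counting the extra parity computation the controller does per write.
    
    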


  • 13.  RE: Disk Latency Issues

    Posted Nov 08, 2010 11:21 PM

    I actually thought about the RAID 6 thing, and I'm doing some testing to confirm what you're talking about with the overhead. I'm running SQLIOSim on a VM that has a vmdk on the RAID 5 and one on the RAID 6, so testing is easy.

    I need to do some more testing, but it almost seems to me like I had better results on the RAID 5, even though it was 5 vs. 13 spindles.

    The 1000 IOPS earlier, though, was the whole controller. That doesn't seem like much to me, but I'm not really used to the SME market; I came from a medium-large company.

    Can you convert RAID levels on the fly with this SAN? I could always convert the RAID 6 to RAID 5 if it's even possible, but I'd also be worried about rebuild times on a 13-drive RAID 5 array.

    I wish I had the chance to build this myself; it would probably be all RAID 10, at least with a bigger SAN. The consultant didn't even set up the SAN properly; the drive channels were cabled wrong.

  • 14.  RE: Disk Latency Issues

    Posted Nov 08, 2010 11:44 PM

    I've run a couple of passes of SQLIOSim on both arrays and the results are kind of surprising. I also tested with both the LSI SAS controller and the Paravirtual controller, but those were similar.

    I'm no expert with SQLIOSim, but if I'm reading this correctly, the performance on the RAID 5 was much better, even with 5 vs. 13 spindles?

    raid 5 - 5 spindles

    data - Reads = 121291, Scatter Reads = 91855, Writes = 4584, Gather Writes = 70065, Total IO Time (ms) = 51199964

    log - Reads = 0, Scatter Reads = 0, Writes = 134488, Gather Writes = 0, Total IO Time (ms) = 247837

    raid 6 - 13 spindles

    data - Reads = 43713, Scatter Reads = 63561, Writes = 1892, Gather Writes = 56642, Total IO Time (ms) = 49011338

    log - Reads = 0, Scatter Reads = 0, Writes = 46043, Gather Writes = 0, Total IO Time (ms) = 952367
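    One way to read those numbers is average time per IO, dividing the posted "Total IO Time" by the operation counts (assuming that total covers all the listed operations, which is my reading of the SQLIOSim output, not something the tool documents in the thread):

    ```python
    def avg_ms(total_ms, *io_counts):
        """Average milliseconds per IO across the listed operation counts."""
        return total_ms / sum(io_counts)

    # Figures pasted from the SQLIOSim runs above.
    raid5_log  = avg_ms(247837, 134488)                         # ~1.8 ms/write
    raid6_log  = avg_ms(952367, 46043)                          # ~20.7 ms/write
    raid5_data = avg_ms(51199964, 121291, 91855, 4584, 70065)
    raid6_data = avg_ms(49011338, 43713, 63561, 1892, 56642)

    print(f"log writes: raid5 {raid5_log:.1f} ms vs raid6 {raid6_log:.1f} ms")
    print(f"data IOs:   raid5 {raid5_data:.0f} ms vs raid6 {raid6_data:.0f} ms")
    ```

    By that reading the RAID 5 vdisk averaged under 2 ms per log write versus over 20 ms on RAID 6, and also came out ahead on data IOs, despite having 5 spindles against 13.
    
    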

  • 15.  RE: Disk Latency Issues

    Posted Nov 09, 2010 03:06 PM

    Hmm, I had a similar issue last weekend when upgrading our environment from 3.5 to 4.1. Latency went high and performance dropped on our SAN.

    Our answer was that we needed to install certain NetApp tools on the ESX hosts, enable the ALUA protocol on the LUNs on our SAN, change the array type to ALUA_PP, and set the ESX hosts to use Round Robin pathing rather than direct.

    After this, performance was as in ESX 3.5. Not sure if that applies to you, but there are changes in pathing from ESX 3.5 to 4.1 that really affected our virtual environment.

  • 16.  RE: Disk Latency Issues

    Posted Nov 09, 2010 05:27 PM

    I've been doing some more testing, and it looks like whenever 300-500 IOPS hit my RAID 6 array, that is enough to bury the CPU on the SAN controller (B) it's attached to.

    Yet the RAID 5 arrays can take 1000-1500 from what I've seen and still not max the CPU out. So basically I need to dump this RAID 6 array. And of course on this little cheapie SAN you can't even change RAID levels on the fly. Pretty sad, since most local RAID controllers these days, and even DAS units, can do that.

    Pretty bad when the Dell MD3000 SAS array I use for disk backups can do all of these things and the production SAN can't.

  • 17.  RE: Disk Latency Issues
    Best Answer

    Posted Nov 10, 2010 03:23 AM


    Sounds about right in regards to performance with RAID6 vs. RAID5. I would definitely ditch the RAID 6 unless you need the extra redundancy.

    I agree with you in regards to the P2000 and migrating RAID levels. Even the old MSA1000s could migrate RAID levels on the fly!!! Given these are meant to be their replacements, it seems a massive oversight. Fortunately you are running VMware, so it's not too painful, depending on space left in the array (and other datastores): create a new datastore and Storage vMotion your VMs there, delete the volumes and vdisk, recreate it as RAID5, then move everything back.

    Of course as I type this I can already hear your response and again feel your pain :smileyhappy:

    Pretty sure you are not going to have the space for my suggestion, so your only solution to limit the damage may be to migrate esx-array-ra5 to controller A, leaving raid6-array on its own controller :smileysad:

  • 18.  RE: Disk Latency Issues

    Posted Nov 10, 2010 03:44 AM

    Now if only I had Storage vMotion; too bad it doesn't come with Essentials Plus :smileysad:

    Most of my VMs are easy to take down after hours though, so what I just did was move anything that just needs disk space to the RAID 6, and move the more IO-intensive things to the RAID 5 array.

    Kind of a bandaid, but at least I know what the issue is. If SQLIOSim is any indication, this should improve performance on some of these machines by like 100%.

    Thanks for all your help

  • 19.  RE: Disk Latency Issues

    Posted Nov 10, 2010 04:27 AM


    Pretty good solution given what you have to work with. Hopefully you can improve this in the future, but for now that will do the trick.

    Glad to help.

  • 20.  RE: Disk Latency Issues

    Posted Nov 11, 2010 03:16 PM

    One of the VMs I moved from the RAID 6 to the RAID 5 array was a SQL server that runs a weekly batch update. That update ran yesterday in 1 hour; the week prior it ran in about 2 hours and 20 minutes. So the RAID 6 performance on the P2000 really is that poor, even with 13 vs. 5 spindles.

    My boss wants to make everything RAID 10 now and buy more drives.

    So the question is: if you had 24 drives to work with and needed a total usable capacity of about 5-5.5TB, which configuration would you use? If I do two 12-spindle RAID 10 arrays, that should give me enough space, but I'm wondering if there is a better option. I'm not sure what the spindle limit for a RAID 10 array on this SAN is; on my 6-year-old DS4500 it was 30, so I could just make one huge array.

    What about RAID-50, would that be a viable option?

    Most of the workload is SQL servers, but they aren't transactional. We do lots of bulk updates and things like that, and they usually don't even run at the same time, so there isn't a whole lot of contention.
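    The capacity side of the 24-drive question can be sketched with simple arithmetic. The drive size below is an assumption (the thread never states it); plug in whatever the actual 15K SAS drives are:

    ```python
    def raid10_usable(drives, drive_tb):
        """RAID 10: half the drives hold mirror copies."""
        return drives // 2 * drive_tb

    def raid50_usable(drives, groups, drive_tb):
        """RAID 50: stripe across `groups` RAID 5 sets, one parity disk each."""
        per_group = drives // groups
        return groups * (per_group - 1) * drive_tb

    DRIVE_TB = 0.45  # assumed ~450GB 15K SAS drives; adjust to match reality

    print(f"RAID 10, 24 drives:        {raid10_usable(24, DRIVE_TB):.2f} TB usable")
    print(f"RAID 50, 24 drives as 4x6: {raid50_usable(24, 4, DRIVE_TB):.2f} TB usable")
    ```

    With ~450GB drives, RAID 10 lands right at the 5-5.5TB target while RAID 50 gives considerably more space, but RAID 50 inherits the RAID 5 write penalty (4 back-end IOs per random write) within each group, whereas RAID 10's penalty is only 2, which is why it tends to win for write-heavy SQL work.
    
    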

  • 21.  RE: Disk Latency Issues

    Posted Nov 12, 2010 12:40 AM


    Again, not surprised. Glad to hear you have positively identified your bottleneck.

    The next questions get a little harder to answer.

    I noticed earlier that you mentioned volume controller ownership is tied to vdisk ownership. However, according to this link http://h20195.www2.hp.com/v2/GetDocument.aspx?docname=4AA0-8279ENW&doctype=white paper&doclang=EN_US&searchquery=keywords=(OR) array &cc=us&lc=en the array supports ULP (Unified LUN Presentation), so theoretically you can assign any controller as the owner of a volume no matter what vdisk it sits on. Look at page 38: it does mention that for performance you should go through the controller that owns the vdisk, but in your case the controller often becomes the bottleneck, so it makes sense to move a volume to another controller. In the SMU reference guide (http://h20000.www2.hp.com/bizsupport/TechSupport/CoreRedirect.jsp?redirectReason=DocIndexPDF&prodSeriesId=4118559&targetPage=http%3A%2F%2Fbizsupport1.austin.hp.com%2Fbc%2Fdocs%2Fsupport%2FSupportManual%2Fc02520791%2Fc02520791.pdf) it talks about changing the mapping of a volume to set default controller ports (page 67). Does this allow you to select a default controller for a volume (LUN)?

    Just like to get this out of the way first because it does change our design considerations. Let me know.

    Kind regards,


  • 22.  RE: Disk Latency Issues

    Posted Nov 12, 2010 03:06 PM

    Nope, the mappings have nothing to do with which controller it's owned by; that just lets you make the LUN visible out of any of the controller FC or iSCSI ports. Controller ownership is done by vdisk on this "SAN", as they call it.

    Going to do a test today on that Vovo server, since it can come down for most of the day: moving the files off the RA5, deleting the array, and re-creating it as RA10. If we see a large increase in performance (which I'm kind of suspecting), then my boss is ready to just make RAID 10 a company standard and buy more drives. Hey, that's fine with me.

  • 23.  RE: Disk Latency Issues

    Posted Mar 09, 2011 12:59 AM

    I don't want to hijack this thread, but if anybody has experience with both the P2000 and Iometer, I could use a bit of help.

    I seem to be getting way more IOPS than I should be, here is my thread:



  • 24.  RE: Disk Latency Issues

    Posted Nov 07, 2010 01:39 PM

    Have you checked the queue depth of your HBAs and the queue length on your ESX server? You should also check esxtop or vCenter to see what your kernel latencies are, to determine whether the issue is your storage system.
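    Queue depth and latency are linked by Little's Law (throughput equals outstanding IOs divided by per-IO latency), which gives a quick way to see whether a queue setting or the storage itself is the ceiling. A sketch with an assumed default queue depth of 32, not tied to any particular HBA:

    ```python
    def iops_at(queue_depth, latency_ms):
        """Little's Law: sustainable IOPS = outstanding IOs / per-IO latency."""
        return queue_depth / (latency_ms / 1000.0)

    # With 32 outstanding IOs (an assumed default HBA queue depth):
    print(f"{iops_at(32, 2):.0f}")   # at a healthy 2 ms: ~16000 IOPS ceiling
    print(f"{iops_at(32, 50):.0f}")  # at 50 ms, as seen in esxtop: ~640 IOPS
    # Once device latency climbs, the queue itself caps throughput, so high
    # kernel/device latency in esxtop usually implicates the array, not the HBA.
    ```
    
    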

  • 25.  RE: Disk Latency Issues

    Posted Nov 13, 2010 05:52 PM

    Looks like about a 15-45% improvement when I switched from RAID 5 to 10, and that's while a few background processes were running as well. Does anyone know how the drive channels work in this particular SAN?

    The back-end drives are 6Gb SAS, but is it possible to get more than 6Gb/sec if it accesses the drives through both controllers? I know on the IBM ones I worked on, for example, the even/odd drives were on different drive channels, so even if the back end was 4Gb FC you could technically get more than 4Gb of bandwidth from a single LUN/array.

    The most I'm able to get is about 641MB/sec, which sounds to me like it's maxing out the drive channel; not that that's a slow speed or anything.

    Name Serial Number Bytes per second IOPS


    vovo-ra10 00c0ffdabdf700009499dd4c00000000 641.4MB 1221
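    For what it's worth, a single 6Gb/s SAS lane uses 8b/10b line encoding (10 bits on the wire per data byte), so its usable payload rate is about 600MB/s. A sketch of that arithmetic against the observed number; how the P2000 cables its back-end lanes is an assumption here, not something stated in the thread:

    ```python
    # 6Gb/s SAS with 8b/10b encoding: 10 wire bits carry one data byte.
    LINE_RATE_GBPS = 6.0
    bytes_per_sec = LINE_RATE_GBPS * 1e9 / 10
    mb_per_sec = bytes_per_sec / 1e6
    print(f"one 6Gb lane: ~{mb_per_sec:.0f} MB/s usable")

    # The observed throughput slightly exceeds a single lane, consistent with
    # the array spreading IO over more than one back-end lane (SAS ports to
    # enclosures are typically wide ports of several lanes).
    observed = 641.4
    print(f"observed {observed} MB/s = {observed / mb_per_sec:.2f} lanes' worth")
    ```
    
    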