IBM HS22 - High latency to shared datastore over FibreChannel

  • 1.  IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 10, 2011 11:11 AM

    Hi,

    I've just installed and configured two new hosts for our environment, ESX13 & ESX14. They're identical IBM HS22 blade servers both with ESX 4.1 U1 installed.

    To make sure everything was working correctly, I powered off an existing VM and migrated it to one of the new hosts, initially ESX13. I powered on the VM, reached the Windows login and entered my credentials. It then sat at "applying user profile" with the blue circle spinning for a good 10 minutes. A little confused, I reset the VM and the same thing happened. After much digging around I discovered that the datastore latency from ESX13 to the datastore where this VM resides is huge.

    With the VM still powered on and trying to log in, I vMotioned it across to ESX14. Once the migration completed, the VM came to life and logged in to Windows. The latency can clearly be seen in the screenshot, where I have included VM disk latency and datastore latency for both hosts; the migration took place just before 11:30.

    The VM resides on a NetApp SAN which is connected via FC and accessible from both hosts.

    Is there anything I can check to find out what could be causing this 'host-specific' latency?

    EDIT:

    Just to add that both ESX hosts have two HBAs and are connected directly to the same fibre switches. Each switch then has two paths to the SAN, one to each controller, giving four paths to each host. The hosts are using the Round Robin path policy and have been optimised for the NetApp SAN using the NetApp Virtual Storage Console plugin for vCenter.

    EDIT2:

    I've just migrated another VM from our other SAN (IBM DS3400 connected through the same Fibre Switches) to this host and get exactly the same behaviour. I've checked the Fibre Switch config and there are no differences between the ports for ESX13 & ESX14, so I'm really at a loss here. It's got to be an issue with the host or HBAs?



  • 2.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 10, 2011 04:39 PM

    Sounds like an HBA or cable to me.

    On the weird host, try changing to the Fixed policy and see whether a specific path is responsible.



  • 3.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 10, 2011 08:02 PM

    It shouldn't be a cable issue as the HS22 is slotted directly into an IBM BladeCenter chassis and the HBA integrates directly with the fibre switch which all 14 blades share. There are then two fibre connections from the switch to the IBM DS3400 SAN and two to the NetApp FAS2040.

    I shall definitely play around with fixed paths though, to rule out the HBAs.



  • 4.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 11, 2011 10:11 AM

    Rather than changing the Path Policy I simply disabled the port on each fibre switch to force IO over each HBA in turn. I started by forcing over vmhba1 and then over vmhba2. As you can see from the screenshot the latency exists on both HBAs.



  • 5.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 13, 2011 12:45 PM

    I've been looking at the ESX host logs and they are full of the following NMP messages:

    Jun 13 13:19:00 esx14 vmkernel: 0:02:46:02.096 cpu7:4103)ScsiDeviceIO: 1672: Command 0x28 to device "naa.60a980005033693947342f3472436953" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    Jun 13 13:19:01 esx14 vmkernel: 0:02:46:02.400 cpu7:4110)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4102bfa50040) to NMP device "naa.60a980005033693947342f3472436953" failed on physical path "vmhba2:C0:T2:L1" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.


    Jun 13 13:19:01 esx14 vmkernel: 0:02:46:02.400 cpu7:4110)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.60a980005033693947342f3472436953" state in doubt; requested fast path state update...

    According to this VMware KB - http://kb.vmware.com/kb/1029039 - the "H: 0x2" output means the following:

    This status is returned when the HBA driver is unable to issue a command to the device. This status can occur due to dropped FCP frames in the environment.
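
    As a rough aid for reading entries like these, the status fields can be pulled out of a vmkernel line with a short script. This is only a sketch: `decode` and the lookup tables are illustrative names, and the host-status names are the ones commonly documented for ESX (assumed here and not exhaustive, so check them against the KB before relying on them). Opcodes 0x28 and 0x2a are the standard SCSI READ(10) and WRITE(10) commands.

```python
import re

# Host (H:) status names as commonly documented for ESX.
# Assumption: partial table, verify against the VMware KB.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIME_OUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Standard SCSI opcodes that appear in this thread.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Matches both the ScsiDeviceIO and the NMP variants of the message.
LINE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) .*?device "([^"]+)".*?'
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
)


def decode(line):
    """Return the decoded status fields of one vmkernel line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    opcode, device, host, dev, plugin = match.groups()
    opcode, host = int(opcode, 16), int(host, 16)
    return {
        "opcode": SCSI_OPCODES.get(opcode, hex(opcode)),
        "device": device,
        "host_status": HOST_STATUS.get(host, hex(host)),
        "device_status": int(dev, 16),
        "plugin_status": int(plugin, 16),
    }


# Message body taken from the log excerpt above.
SAMPLE = (
    'ScsiDeviceIO: 1672: Command 0x28 to device '
    '"naa.60a980005033693947342f3472436953" failed '
    'H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.'
)
print(decode(SAMPLE))
```

    For the entry above, this reports a READ(10) with host status BUS_BUSY and a device status of 0, which suggests the failure was raised on the host/HBA side rather than returned by the array.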

    I've checked the interfaces on both Cisco Fibre Switches within the BladeCenter Chassis and I'm not seeing any CRC errors or discards, so I don't believe it to be an issue with the fibre itself.

    I just don't know what else to do.



  • 6.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jul 01, 2011 11:26 AM

    Do you have a solution for your issue?

    Regards,

    Chris



  • 7.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jul 01, 2011 12:41 PM

    Nope, I'm still completely baffled. It doesn't help that the hardware is in a remote data centre. I'm actually going there on Monday to perform some other tasks, but I'm going to try a little hands on troubleshooting on this issue whilst I'm there, swap Blades into known working slots in the Chassis for example.



  • 8.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 06, 2011 02:32 PM

    Has anyone got any other suggestions for testing that I can do to nail this down?

    We've now moved the whole ESX environment into a data centre in our head office so I have hands-on access to the equipment which makes life a lot easier.

    My plan is to swap the Fibre Expansion cards between the two HS22s to see if the latency follows the adapter.



  • 9.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 08, 2011 03:21 AM

    Hi,

    I'm running a very similar setup to you and I'm experiencing an identical issue on 3 of my hosts. The only major difference is that I have QLogic switch modules.

    You mentioned that you don't see any CRC or decode errors on your Cisco switches. The only place I've been able to find any evidence of a problem is on the internal port on my QLogic modules, i.e. the interface internal to the BladeCenter between the blade and the switch module.

    In these interfaces I'm seeing very high CRC and decode errors, while other interfaces are ok.

    I'm about to try moving blades around to narrow down the cause, i.e. HBA, slot etc. I'll post any updates I have.

    Brent



  • 10.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 08, 2011 08:33 AM

    Hi Brent,

    Although I'm sorry that you're experiencing the same issue, it is really nice to know that I'm not alone!! As you can see, I've been dealing with this issue since mid-June!

    So, like me, you are not seeing any CRC or decode errors on the switch interface but you are seeing them on the HBA interface itself. This isn't something that I've checked as I don't know how to get this information from the QLogic HBAs within my HS22s. Perhaps you could explain and I'll check this out at my end too?

    A little more background on my environment:

    Slots1 to 12 of the BladeCenter Chassis are populated with HS21 XM Type: 7995 Blades.

    Slots13 & 14 contain the HS22 Type: 7870 Blades.

    I'm not seeing any issues with 1 to 12. With ESX13 in Slot13 I see latency. With ESX14 in Slot14 it works fine. If I swap these over, ESX13 in Slot14 & ESX14 in Slot13, they both see latency, weird!

    Are you seeing the same NMP errors in the vmkernel log for the affected host?

    I eagerly await your reply. :)

    ###### EDIT ######

    I've installed QLogic SANSurfer CLI on the ESX hosts and, monitoring the HBA Link Status, I am indeed seeing issues in the shape of CRCs, Link Failures, Sync Loss and Signal Loss. This is not reflected at all in the corresponding interface on my fibre switch. Please see attached text file.

    ###### EDIT 2 ######

    The CRCs on the port are incrementing, however the disk related values are static. I'm also seeing the same disk related values on hosts that aren't experiencing this latency issue, so I think we can ignore those Link Failures, Sync Loss and Signal Loss figures.
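
    One way to separate counters that are still climbing from ones that are merely historical is to sample them twice and keep only those that grew. The sketch below assumes the figures have been copied into dictionaries by hand; `incrementing_counters`, the counter names and the values are all hypothetical and are not SANSurfer output.

```python
def incrementing_counters(before, after):
    """Return {counter: increase} for counters that grew between two samples."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] > before[name]
    }


# Hypothetical samples taken some minutes apart, for illustration only.
first_sample = {"crc_errors": 120, "link_failures": 3, "sync_loss": 2, "signal_loss": 2}
second_sample = {"crc_errors": 185, "link_failures": 3, "sync_loss": 2, "signal_loss": 2}

print(incrementing_counters(first_sample, second_sample))
```

    With these made-up numbers only the CRC counter is reported, which mirrors the reasoning above: static counters that also appear on healthy hosts can be set aside.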



  • 11.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 09, 2011 12:15 AM

    I figured I'm more likely to find a solution if I contribute, so hopefully we're experiencing the same issue and can find a fix :)

    So to see the CRC errors I'm looking at the TH (internal facing) port on my QLogic switch module, see attached screenshot.

    I'm also seeing the same NMP errors in the logs:

    "Sep  8 01:01:36 vmkernel: 10:00:18:01.320 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4102bf1c0140) to NMP device "naa.60a9800057396d42694a655666784e76" failed on physical path "vmhba1:C0:T3:L4" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    Sep  8 01:01:36 vmkernel: 10:00:18:01.320 cpu3:4099)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.60a9800057396d42694a655666784e76" state in doubt; requested fast path state update..."

    My issue seems a little more random than yours: all of my blades are HS22Vs (Type 7871) with QLogic HBAs. I'm seeing some errors on a few blades but extremely high CRC errors on the 3 blades that are most affected by latency.

    An update since yesterday. I reseated one of the affected blades and the CRC error count has stopped increasing...still monitoring to see if this changes or not.

    Unfortunately these CRC errors are the only counter I can see that indicate a problem so I'm hoping I'm not focusing on the wrong area.

    Have you by any chance logged a job with IBM? I'll be logging one today.

    Brent



  • 12.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 10, 2011 08:01 AM

    It looks like our issues are very similar then, certainly throwing up the same error within the ESX logs.

    Here's what I found on Thursday. I swapped my two HS22s around in the chassis, so ESX13 was in Slot14 & ESX14 was in Slot13. In doing so I saw latency on BOTH Blades: suddenly ESX14 was also seeing the issue when it was previously OK. That led me to think that maybe one of the FC Expansion Cards was faulty, so I swapped the cards over, leaving the blades in the same location. This allowed ESX13 to work correctly in Slot14. On moving them back to their original locations but with the cards swapped, they both experienced latency again.

    So my conclusion was that I potentially have a faulty FC card and there's something going on with Slot13, so I have raised a support case with IBM.

    I was out of the office all yesterday so I couldn't chase it up or do any further testing. I'd like to swap one of the HS22s with one of the HS21s, placing it in any of Slot1 to 12, and see what happens. I'd also like to see whether a known working HS21 will perform OK within Slot13. That's all stuff for next week.

    Please keep me posted with your findings, I'll definitely do the same.



  • 13.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 18, 2011 07:53 AM

    OK, so here's an update on what I've found this week.

    I swapped the FC Card as I said previously, which allowed ESX13 to work in Slot14, which it didn't previously do with the FC Card that it shipped with. So IBM sent out an engineer with a replacement to swap this out, which I've done. It didn't make any difference, it still wouldn't work in Slot14 with the new FC Card.

    A little baffled, I started to swap Blades around in the chassis. It appears that my HS22s work in any slot apart from 13 & 14, explain that one! So I juggled them all around and ended up swapping the two HS21s in Slot11 & 12 for the HS22s in Slot13 & 14. I've not experienced any disk latency issues and I've now got live VMs happily running on them.



  • 14.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Sep 23, 2011 12:58 AM

    This is definitely a weird issue and nothing logical seems to have fixed it. After reseating the affected blades I'm no longer having any issues with latency.

    Doesn't really inspire much confidence in the IBM hardware.

    Brent



  • 15.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Nov 08, 2011 10:27 PM

    This is probably just a coincidence, but I have an issue on a specific blade in slot 13 on an IBM BladeCenter with an HS22. The issue is that disk latency goes up to 300ms on a specific HBA. I cannot reproduce this on any other blade or BladeCenter that I try (and we have 4 of them). The latency is triggered by running the HD Tach test and appears during the random access portion of the test. I know that it's not the SAN because we are seeing very low latency on the IBM SVC and V7000. I don't think it's a port on the SAN either, for the same reason. I don't think it's a switch issue, because another blade using the same Cisco Fibre Channel switch has no issue. Also, when I svMotion the VM to the EMC SAN, the issue persists. When I choose the path on the 2nd HBA, the issue goes away (latency less than 1ms).

    I am planning on opening a case with VMware on this in the hope that we can enable detailed logging on the affected blade and try to see what's going on. Strangely, the latency shoots up to 300ms and stays flat at 300ms for 60-300 seconds, depending on the size of the test I choose.



  • 16.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Nov 18, 2011 12:06 PM

    That's pretty interesting! Did you ever get anywhere with VMware support/IBM?

    My issue was never HBA specific: I could only replicate it with my HS22 in Slot13 and it occurred across both HBAs. I've ended up with my two HS22s in Slot11 & 12 and all has been well for a good few months now.



  • 17.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Dec 07, 2011 07:41 PM

    Hello,

    We are also currently having the same issue with an HS22 7870, QLogic QMI2572 and Cisco 4G FC switch on an IBM H Frame 8852, also running VMware 4.1 U1.

    For us, the problem is on Slot #9 with HBA1 only. With HBA2, no issue!

    I installed the latest firmware on my blade and the problem is still there.

    Has anyone found a solution since the last post?



  • 18.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Apr 15, 2012 02:08 AM

    I am seeing the error “device naa.xxxxx performance has deteriorated. I/O latency increased from average value of ....”.

    New hardware setup as below,

    PROD –

    IBM BladeCenter H with HS22V

    Brocade Enterprise 20-port 8 Gb SAN Switch Module

    QLogic Ethernet and 8Gb Fibre Channel Expansion Card (CFFh)

    IBM V7000 with 6.3.0 firmware

    ESXi 5 on the blades (the vanilla VMware image, not the IBM version)

    DR –

    HP servers with StorageWorks 81Q PCI-e FC HBA

    SAN24B-4 Express

    IBM V7000 with 6.3.0 firmware

    ESXi 5

    The error appears on all blades and HP servers. I am planning to update the firmware on the QLogic HBAs this week.

    Anyone found a solution to this issue?



  • 19.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted May 31, 2012 12:27 PM

    I am facing the same issue. I am using XIV for storage connected through FC. I have 2 hosts out of action, and a 3rd seems to be having the same issue. What was the result you found? I have the same errors on the hosts. Please update if anyone has found the solution.

    Thanks.



  • 20.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted May 31, 2012 06:23 PM

    Hi,

    I'm still running my HS22s in Slot11 & 12 of my BladeCenter chassis with no issues at all. Did any of you above ever get anywhere with VMware/IBM Support?

    I never actually found a solution to my problem, it just went away when I shuffled my Blades around. I'd certainly be interested to hear how others have gotten on as it's been a while now.



  • 21.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted May 31, 2012 10:41 PM

    If your problem went away by moving the blade to a new slot that indicates a problem with the slot itself.  If you call back and ask for the case to be escalated so that they can determine if the slot is bad and send you a replacement then you might be happier.  I see it happen a lot.



  • 22.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted May 31, 2012 10:45 PM

    This is a general message that just means the I/O latency has crossed a threshold that triggers the message. Basically, it means I/O is slow. It can be due to multiple factors, likely not just one thing. Here is more info to get started analyzing:

    http://kb.vmware.com/kb/2001676

    http://kb.vmware.com/kb/1008205
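
    To get a feel for how large the jumps are, the two latency figures can be extracted from each warning and compared. The sketch below assumes the full message ends with "from average value of N microseconds to M microseconds"; the message quoted earlier in this thread is truncated, so that wording is an assumption, and both `latency_jump` and the sample line are hypothetical.

```python
import re

# Assumed message layout; the thread only quotes it up to "average value of ...".
WARNING_RE = re.compile(
    r"[Dd]evice (\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (\d+) microseconds "
    r"to (\d+) microseconds"
)


def latency_jump(line):
    """Return (device, average_ms, current_ms, ratio) or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    device = match.group(1)
    average_us = int(match.group(2))
    current_us = int(match.group(3))
    # Avoid dividing by zero if the reported average is 0.
    ratio = current_us / average_us if average_us else float("inf")
    return device, average_us / 1000.0, current_us / 1000.0, ratio


# Hypothetical line with made-up numbers, for illustration only.
SAMPLE_WARNING = (
    "Device naa.example performance has deteriorated. "
    "I/O latency increased from average value of 1500 microseconds "
    "to 30000 microseconds."
)
print(latency_jump(SAMPLE_WARNING))
```

    Sorting the results by ratio or by device is one way to see whether the warnings cluster on particular LUNs or hosts before working through the KB articles.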



  • 23.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 01, 2012 06:12 PM

    Hello,

    We fixed our issue by upgrading the Cisco 4G FC switch in our Frame to the latest firmware provided by Cisco, not the one offered on the IBM website. Since this was done, everything has worked fine.

    Thanks


  • 24.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jun 01, 2012 06:31 PM

    That's interesting. I'm going to be doing this in a few weeks' time anyway, as the current version of Cisco SAN-OS running on my fibre switches doesn't support NPIV, which is required for our latest project. I'll definitely test this issue after the upgrade to see whether I can get my HS22s running in Slot13 without the previous latency issues.



  • 25.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Jul 06, 2012 07:13 PM

    Had the same problem described in the first post. HS22 Blades. It doesn't seem slot specific, except that it occurs for the blades in the last 2 occupied slots in the BladeCenter.

    The issue started with a bus error on one of the blades. After updating the firmware on the blade UEFI (Build P9E156C, Version 1.17, released 02/03/2012) and the AMM (Build BPET62J, File CNETCMUS.PKT, released 01/19/2012), that error cleared, but the latency remained.

    Turns out that the DS3512 SAN controller had a false problem with the battery:

    Event type: 210C
    Description: Controller cache battery failed
    Event specific codes: 0/0/0
    Event category: Internal
    Component type: Battery Pack
    Component location: Enclosure 0, Controller 2, Slot 2
    Logged by: Controller in slot B

    This happened because of a learn cycle instigated 3 days early for some unknown reason.  I didn't see this error until after the Blade firmware was updated (!). 

    Moved all of my LUNs over to controller A as preferred path temporarily.

    Reset the controller with Tools > Execute Script, enter: reset controller [b]; and then run Tools > Execute Script only, where b was the affected controller.

    That fixed the issue, but then we updated the Controller firmware and NVSRAM (Controller_Code_07734000). Still need to do the drives after we shut them down.



  • 26.  RE: IBM HS22 - High latency to shared datastore over FibreChannel

    Posted Oct 29, 2012 01:14 PM

    Hi six4rm,

    Did you do the upgrade and did it fix the issue? I too am seeing the issue.

    Thanks,