vSphere Storage Appliance

  • 1.  SCSI: 638: Queue for device *fixed*

    Posted Aug 07, 2009 02:33 PM

    Hey fellas. Ran into this error last night and finally resolved it. I've seen dozens of posts on this same issue with no answers, so feel free to repost this on any forum you like. Note that I'm an EMC storage guy; the ESX side of the house is not my realm.

    We had 59 guests freeze for 5-10 seconds at a time.

    VMkernel:

    Aug 6 09:56:43 pdvesx12 vmkernel: 79:18:38:56.771 cpu12:1229)SCSI: 638: Queue for device vml.02000000006006016093f02100ec41893fa943de11524149442035 is being blocked to check for hung SP.

    Aug 6 09:56:52 pdvesx12 vmkernel: 79:18:39:05.585 cpu15:1482)<4>lpfc0:0754:FPe:SCSI timeout Data: xc6aa280 x98 x29157be0 xec

    win2k3sp1 guests:

    The device, \Device\Scsi\symmpi1, is not ready for access yet.

    Linux Guests:

    Aug 5 15:44:30 pdlnetnag01 kernel: sd 0:0:0:0: SCSI error: return code = 0x00000008

    Aug 5 15:44:30 pdlnetnag01 nagios: Error: Unable to create temp file for writing status data!

    Aug 5 15:44:30 pdlnetnag01 kernel: ReiserFS: dm-1: warning: clm-6006: writing inode 110849 on readonly FS

    Three-node ESX cluster. After a massive log review, switch dumps, grabs, and a WebEx, the answer is......

    One fiber cable was not fully seated in the switch.

    I'm posting this so no one else has to spend 10 hours on something so simple, yet it trashed an entire three-node cluster. Switch port errors were still being generated even after the host went into maintenance mode, so that's a good way to check.
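
    For anyone chasing the same symptom, here's a minimal sketch of the sort of check that helps: it counts the blocked-queue messages per device so you can see which LUNs are affected and whether errors are still piling up. This is an illustration, not an official tool; it assumes the classic ESX service console with the vmkernel log at /var/log/vmkernel and the message format quoted above, so adjust the path and pattern for your build.

        #!/usr/bin/env python
        # Sketch: count "Queue for device ... is being blocked" events
        # per device in an ESX vmkernel log. The log path and message
        # format are assumptions based on the lines quoted above.
        import re
        import sys
        from collections import Counter

        PATTERN = re.compile(r"Queue for device (\S+) is being blocked")

        def count_blocked_devices(log_path):
            """Return a Counter mapping device id -> blocked-queue events."""
            counts = Counter()
            with open(log_path) as log:
                for line in log:
                    match = PATTERN.search(line)
                    if match:
                        counts[match.group(1)] += 1
            return counts

        if __name__ == "__main__":
            path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vmkernel"
            for device, hits in count_blocked_devices(path).most_common():
                print("%6d  %s" % (hits, device))

    If one device or path dominates the counts, that points at a single link rather than the whole array, which is exactly what a half-seated cable looks like.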



  • 2.  RE: SCSI: 638: Queue for device *fixed*

    Posted Aug 07, 2009 02:37 PM

    One fiber cable was not fully seated in the switch.

    We can appreciate the heads-up, but there is just no substitute for data integrity. In a proper data center with qualified people, things like this should never happen: cables routed properly with no snags, a 'tug' on every cable when you plug it in to ensure a sure fit, no sharp bends (especially for fiber). All of this means professionals installing the equipment. And never let one person sign off; ALWAYS double-check.

    So lesson learned.



  • 3.  RE: SCSI: 638: Queue for device *fixed*

    Posted Aug 07, 2009 03:06 PM

    It's not a mom-and-pop shop. Tens of thousands of cables, thousands of hosts, dozens of storage frames, multiple switches across multiple sites....

    One cable not properly seated should not call into question the competency of the staff, nor the properness of the data center(s). As a matter of fact, with one cable out of all the above, I'm loving my chances on both competency and properness.

    Anyone who doesn't take human error into account probably isn't being realistic.

    The cable was seated properly for over three months. Sometime during blade installs, cable runs, lease replacements, etc., it was clipped.

    Lesson learned: We are human.



  • 4.  RE: SCSI: 638: Queue for device *fixed*

    Posted Aug 07, 2009 03:14 PM

    Anyone who doesn't take human error into account probably isn't being realistic.

    Well, management will have a different view. To err is one thing; to overlook something as simple as a loosely connected fiber indicates someone wasn't paying attention.

    There are things we can dismiss as error, a mistyped IP and various other clerical mistakes... but a loose NIC/fiber? That's a little TOO forgiving. Suppose it was in a closet you couldn't get to? NOW how would you fix it, hmmmm? Look outside the box at the impact it has: in a data center that size, it's CRITICAL that things like that not happen. Double-checking that each data integrity point is secure is a critical function; that's why we have tools to check cables and verify end-to-end integrity, and why you test every cable after you plug it in.

    Not sure who did your cable installs, but if that happened in MY company there would have been MAJOR fallout, not a shrug at a 'whoopsie'. Some things are forgivable, but if you do THAT many installs and cable runs, you should know better.

    As a matter of fact, with one cable out of all the above, I'm loving my chances on both competency and properness.

    I am sure that in the grand scheme of things it isn't a big deal, but if machines had been down as a result, it wouldn't have been a simple apology to the CIO when he loses a million dollars a day. It all depends on impact..