ESXi

Lost access to volume - successfully restored access to volume

  • 1.  Lost access to volume - successfully restored access to volume

    Posted Jul 18, 2014 03:56 AM

    On my ESXi hosts, I see these messages under Events.

    The disconnect and recovery happen at exactly the same time.

    I can access the LUN just fine. Any idea?



  • 2.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 18, 2014 06:36 AM

    Is it iSCSI or FC? Does it happen on all ESX hosts in the farm or only on specific hosts? Is it common to all the LUNs or just a few? What type of storage array is it?



  • 3.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 18, 2014 06:52 AM

    Check your vmkernel log; you should find SCSI sense codes there that point to the storage connection problem.

    See this article:

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381

    VMware KB: Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x
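
    As a rough illustration of what to look for, the sketch below pulls the H:/D:/P: (host, device, plugin) status triple out of a vmkernel log line and names the host-side value. The lookup table is partial and based on the commonly documented host status names; treat the names as assumptions and confirm them against the KB article above. The function names are hypothetical helpers, not part of any VMware tool.

```python
import re

# Partial table of host-side (H:) status values. The names are an
# assumption based on commonly documented values; verify against the KB.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode_status(line):
    """Return (host, device, plugin) status codes as ints, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())


def host_status_name(code):
    """Map a host status code to a name, falling back to UNKNOWN(0x..)."""
    return HOST_STATUS.get(code, "UNKNOWN(0x%x)" % code)


if __name__ == "__main__":
    # Illustrative line in the usual vmkernel.log shape.
    line = 'ScsiDeviceIO: 2331: Cmd(0x0) 0x1a to dev "naa.x" failed H:0x5 D:0x0 P:0x0'
    host, device, plugin = decode_status(line)
    print(host_status_name(host), hex(device), hex(plugin))
```

    A non-zero H: value with D:0x0 suggests the failure was reported on the host or fabric side rather than by the array itself, which narrows down where to look next.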



  • 4.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 18, 2014 08:26 AM

    Follow VMware KB: Host Connectivity Degraded in ESX/ESXi.

    Check your logs /var/log/vmkernel.log and /var/log/vmkwarning.log.



  • 5.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 18, 2014 03:39 PM

    What version of ESXi are you running?  I was seeing this a lot more often when I was still running some 4.1 hosts, but don't seem to see that behavior as often now that we're running 5.x.



  • 6.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 18, 2014 10:41 PM

    This is running 5.1.



  • 7.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 19, 2014 09:59 PM

    I am seeing a lot of these:

    2014-07-18T05:27:07.782Z cpu6:8375)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:647: Path "vmhba1:C0:T1:L24" (UP) command 0xa3 failed with status Timeout. H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    2014-07-18T05:27:11.217Z cpu1:8229)HBX: 255: Reclaimed heartbeat for volume 50fa3f70-d7deffa5-86e4-0025b5110aff (cx4960-fc-r1-lun83): [Timeout] [HB state abcdef02 offset 4128768 gen 15069 stampUS 8295912102496 uuid 534a1994-26e71eaf-c0c4-0025b5110a7f jr$

    2014-07-18T05:27:11.219Z cpu1:8229)FS3Misc: 1465: Long VMFS rsv time on 'cx4960-fc-r1-lun83' (held for 5296 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors

    2014-07-18T10:39:52.782Z cpu14:11136)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 284 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

    2014-07-18T10:40:07.888Z cpu12:9295)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 29' (held for 244 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

    2014-07-18T12:10:56.719Z cpu14:1280799)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 471 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

    2014-07-18T17:45:49.864Z cpu19:8211)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a, CmdSN 0x8da9ad from world 0 to dev "naa.60060160de051b007e6f3f82048ce111" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    2014-07-18T17:46:00.025Z cpu16:14908860)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a, CmdSN 0x8daa96 from world 0 to dev "naa.60060160de051b00269e16130d21dd11" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    2014-07-18T17:46:03.336Z cpu2:1281301)Vol3: 705: Couldn't read volume header from control: Not supported

    2014-07-18T17:46:03.336Z cpu2:1281301)Vol3: 705: Couldn't read volume header from control: Not supported

    2014-07-18T17:46:03.336Z cpu2:1281301)FSS: 4972: No FS driver claimed device 'control': Not supported

    2014-07-18T17:46:05.972Z cpu11:1281301)VC: 1547: Device rescan time 20780 msec (total number of devices 54)

    2014-07-18T17:46:05.972Z cpu11:1281301)VC: 1550: Filesystem probe time 4704 msec (devices probed 35 of 54)

    2014-07-18T17:46:16.043Z cpu9:12133)ScsiDeviceIO: 2331: Cmd(0x41240e82c940) 0x1a, CmdSN 0x8dad77 from world 0 to dev "naa.60060160de051b007e6f3f82048ce111" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    Then I found this:

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033409

    But I am already running an even later version of the fnic and enic drivers for the Cisco hardware.

    Any idea?
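
    To see which datastores are affected most, the "Long VMFS rsv time" entries can be summarised per volume. The following is only a sketch: it assumes the message format shown in the log excerpt above, and the function name is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches the FS3Misc message format seen in the excerpt above.
RSV_RE = re.compile(r"Long VMFS rsv time on '([^']+)' \(held for (\d+) msecs\)")


def longest_reservations(lines):
    """Map each volume name to its longest reported reservation hold (ms)."""
    worst = defaultdict(int)
    for line in lines:
        match = RSV_RE.search(line)
        if match:
            volume = match.group(1)
            msecs = int(match.group(2))
            worst[volume] = max(worst[volume], msecs)
    return dict(worst)


if __name__ == "__main__":
    sample = [
        "2014-07-18T05:27:11.219Z cpu1:8229)FS3Misc: 1465: Long VMFS rsv time on 'cx4960-fc-r1-lun83' (held for 5296 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors",
        "2014-07-18T10:39:52.782Z cpu14:11136)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 284 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors",
        "2014-07-18T12:10:56.719Z cpu14:1280799)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 471 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors",
    ]
    for volume, msecs in longest_reservations(sample).items():
        print(volume, msecs)
```

    On a real host the lines would come from /var/log/vmkernel.log rather than an inline list.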



  • 8.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 21, 2014 02:10 PM

    Is the HBA on the Cisco blade up to the latest firmware? Often, when the Cisco UCS platform is updated and the blades get updated, some of the other firmware aside from the CMC and adapter is left at older versions. Could it be a driver/firmware mismatch? You could also open a TAC case with Cisco and bring this to their attention; they may already have something logged on it.


    Let us know,



  • 9.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 21, 2014 06:03 PM

    Do you think I need to play around with the queue depth and Disk.SchedNumReqOutstanding values?



  • 10.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 22, 2014 06:02 AM

    That depends on whether your array is maxing out its queue length, so that the HBA has to queue up the I/O requests. Otherwise there is not much use in tweaking the queue depth, I think.

    HTH,

    ~Sai Garimella



  • 11.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 22, 2014 02:55 PM

    Another possible cause: are you reaching the maximum number of paths on your hosts? I have seen larger UCS environments with 8 paths per LUN and 150+ LUNs max out the paths per host. This can cause a LUN to disappear or lose paths, or prevent new LUNs from being added.

    http://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf

    How many paths per host do you have?

    This maximum is also not per SAN; it is a cumulative total of all paths to that host. If you have 3 SANs connected to the host, it doesn't matter how many each one uses; only the total counts.
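
    The arithmetic is easy to check. The sketch below assumes a limit of 1024 total paths per host, which is my reading of the vSphere 5.5 configuration maximums document linked above; verify the figure for your own version. The function names are hypothetical.

```python
# Assumed limit: total paths per host as listed in the vSphere 5.5
# configuration maximums. Confirm this value for your release.
MAX_PATHS_PER_HOST = 1024


def total_paths(luns_per_array, paths_per_lun):
    """Sum paths over every array: the limit is cumulative, not per SAN."""
    return sum(luns * paths_per_lun for luns in luns_per_array)


def over_limit(luns_per_array, paths_per_lun, limit=MAX_PATHS_PER_HOST):
    """True if the host would need more paths than the assumed limit."""
    return total_paths(luns_per_array, paths_per_lun) > limit


if __name__ == "__main__":
    # One array with 150 LUNs at 8 paths each.
    print(total_paths([150], 8), over_limit([150], 8))
    # Three arrays with 50 LUNs each give the same total.
    print(total_paths([50, 50, 50], 8), over_limit([50, 50, 50], 8))
```

    With 8 paths per LUN, 150 LUNs already need 1200 paths, so the host runs out well before it reaches 150 LUNs, however they are split between arrays.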



  • 12.  RE: Lost access to volume - successfully restored access to volume

    Posted Jul 22, 2014 08:23 AM

    The screenshot in the first post does not show a time stamp.

    What is the corresponding entry in the logs?

    The log you supplied does not necessarily indicate any problem.

    A temporary loss of access is not necessarily a problem.

    If for example it is during the night when backup jobs are running or AV scans, I/O latency usually gets higher.

    Normally it is recommended to upgrade to latest drivers/FW (HBA, array).

    Also distribute load between the LUNs.

    Make sure the correct path policy is used.

    Do not play around with qdepth and other parameters. Not recommended.



  • 13.  RE: Lost access to volume - successfully restored access to volume

    Posted Mar 24, 2015 07:07 PM

    These are the reasons why this happens:

    • After a storage device has been unexpectedly unpresented from the storage array, you are unable to mount it again.
    • This issue occurs when there was a running virtual machine when the storage device went offline.
    • An ESXi 5.x host cannot mount the storage after the LUN is online again.

    Error codes -

    cpu34:5590)VC: 1449: Device rescan time 165 msec (total number of devices 75)

    cpu34:5590)VC: 1452: Filesystem probe time 504 msec (devices probed 48 of 75)

    cpu38:5590)ScsiDevice: 4592: naa.6006016058201700354179be0c6fdf11 device :Open count > 0, cannot be brought online

    cpu34:5590)Vol3: 647: Couldn't read volume header from control: Invalid handle

    cpu34:5590)FSS: 4333: No FS driver claimed device 'control': Not supported

    cpu38:5590)ScsiDeviceIO: 2316: Cmd(0x4124c0ea2e80) 0x28, CmdSN 0x70509 to dev "naa.6006016058201700354179be0c6fdf11" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

    Please follow the resolution below:

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2014155

    To resolve this issue:

    1. Run this command to see the world that has the device open for the LUN:

      #esxcli storage core device world list -d naa_id
      For example:

      #esxcli storage core device world list -d naa.6006016058201700354179be0c6fdf11
      You see output similar to:

      Device                                World ID  Open Count  World Name
      ------------------------------------  --------  ----------  ----------
      naa.6006016058201700354179be0c6fdf11      2060           1  idle0
      If a VMFS volume is using the device indirectly, the world name includes the string idle0. If a virtual machine uses the device as an RDM, the virtual machine World ID is displayed. If any other process is using the raw device, the corresponding information is displayed.

      Notes:
      • If the host is not responding, run the command esxcfg-scsidevs -m | grep naa.id to get the corresponding datastore name.
      • Ensure all virtual machines registered on the volume in a PDL state do not require any further steps. If you have a virtual machine in that state, attempting to Retry or Cancel an operation will not return the virtual machine world ID. Click Cancel as the Retry operation cannot succeed unless the volume is remounted.

    2. Run this command to list all virtual machines running on the ESXi 5.x host and identify the virtual machine registered on that LUN:

      #esxcli vm process list
    3. To kill the virtual machine World ID, run this command:

      #esxcli vm process kill --type=force --world-id=World ID
      For example:

      #esxcli vm process kill --type=force --world-id=12131
    4. Rescan the storage using this command:

      #esxcfg-rescan -u vmhba#
    5. Run this command to see the device state:

      #esxcli storage core device list -d naa_id
    6. If the issue persists, reboot the ESXi 5.x host where virtual machine was registered.
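
    If you want to script step 1, the table printed by the world list command can be parsed as sketched below. This only assumes the column layout shown in the sample output above; the function names are hypothetical helpers, and nothing here calls esxcli itself.

```python
def parse_world_list(output):
    """Parse the 'device world list' table text into a list of dicts."""
    rows = []
    for line in output.splitlines():
        # Split into at most four fields so a world name may contain spaces.
        fields = line.split(None, 3)
        # Data rows have a numeric World ID and Open Count; this skips
        # the header and the dashed separator line.
        if len(fields) == 4 and fields[1].isdigit() and fields[2].isdigit():
            rows.append({
                "device": fields[0],
                "world_id": int(fields[1]),
                "open_count": int(fields[2]),
                "world_name": fields[3].strip(),
            })
    return rows


def non_idle_worlds(rows):
    """Keep rows whose world name does not include the string idle0."""
    return [row for row in rows if "idle0" not in row["world_name"]]


if __name__ == "__main__":
    sample = """
      Device                                World ID  Open Count  World Name
      ------------------------------------  --------  ----------  ----------
      naa.6006016058201700354179be0c6fdf11      2060           1  idle0
    """
    for row in parse_world_list(sample):
        print(row["world_id"], row["open_count"], row["world_name"])
```

    Per the explanation in step 1, an idle0 entry means a VMFS volume holds the device open indirectly, so only the remaining worlds are candidates for step 3.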

    If you have any questions, please let me know and I will try my best to answer them.

    Thank you.