
KB Article: 1016106 and vSphere ESXi 5

  • 1.  KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 09, 2011 01:29 AM

Has anyone experienced the issue described in this KB with ESXi 5? With the HBA disabled, the ESXi 5 host completes the load/boot in minutes. But with the HBA enabled, which has a few RDMs and some LUNs that are not defined, it takes hours to load/boot. I have a case open with VMware on this and am waiting. Please share your experience; any input will be appreciated.



  • 2.  RE: KB Article: 1016106 and vSphere ESXi 5

    Broadcom Employee
    Posted Sep 09, 2011 05:36 AM

Yeah, that's been an issue forever; it has existed since ESX 3.

The resolution is in that KB: just lower the retries and timeout. It helps, but boot is still painful. In reality, though, you shouldn't be rebooting too often, so it doesn't matter much.

Being a large organisation, what I did was build a purpose-built MSCS ESX cluster so that only a few hosts were affected, and everything else sits on the main corporate cluster.

I don't know if having a support case open will accomplish anything for this problem.



  • 3.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 10, 2011 02:25 AM

    Still waiting to hear from VMware Support.

I know we don't need to reboot the ESX hosts very often once they are loaded and in service. But think about the time it takes to load if you have 500+ hosts to upgrade to ESXi 5.

I do have MSCS isolated to a few clusters only.



  • 4.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 10, 2011 06:57 AM

    Have you tried to set the parameter mentioned in the KB?



  • 5.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 10, 2011 11:52 AM

As AARCO mentioned above, those advanced options are not available in 5.0.



  • 6.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 13, 2011 03:12 AM

I see the same hang during boot under vSphere 5i on Cisco UCS blades, HP DL360 G7s, and nested vSphere 5i instances, all connecting to iSCSI devices. In my situation some of our iSCSI SANs are no longer on the HCL for vSphere 5, and consequently when I tried to raise the issue with VMware, support was quite limited other than to confirm that my iSCSI configuration was correct. I can reproduce the issue on a nested ESX 5i instance hooking up to a NexentaStor device. I suspect this is a generic issue with vSphere 5i. VMware, please can you look at this? We are running these iSCSI devices: HP MSA2000, HP MSA 2012i, NexentaStor. Boot time varies between 10 minutes and one hour depending on the configuration.



  • 7.  RE: KB Article: 1016106 and vSphere ESXi 5

    Broadcom Employee
    Posted Sep 13, 2011 08:37 AM

@ashleyw: it doesn't look like you are running MS Failover Clustering, are you? VMware doesn't support MSCS over iSCSI, so your slow boot problem looks unrelated to MSCS.

    Could you please run these on the ESX command shell:

    ~# cd /var/run/log

    ~# fgrep '0xb 0x24 0x' vmkernel.log

    ~# for i in vmkern*gz; do gzip -cd $i | fgrep '0xb 0x24 0x' ; done

If it turns up a bunch of matches, we know this issue exists with a bunch of iSCSI targets (a target bug, not ESX).

If not, please open an SR, or just give me the SR id if you already provided VMware with full support logs.
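Not part of kchowksey's post, but as an illustration: the fgrep above can be sanity-checked against a synthetic log. Everything below (the path, the device id, the log lines) is made up for the example; on a real host you would run the search in /var/run/log as shown.

```shell
# Build a throwaway sample vmkernel.log: one made-up line carrying the
# 0xb 0x24 0x0 sense data and one healthy line (hypothetical device id).
mkdir -p /tmp/kb1016106
cat > /tmp/kb1016106/vmkernel.log <<'EOF'
2011-09-14T04:50:40.503Z cpu0:4120)ScsiDeviceIO: Cmd 0x12 to dev "naa.example" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.
2011-09-14T04:50:41.000Z cpu0:2604)ScsiDevice: 3121: Successfully registered device "naa.example"
EOF
# Same search as suggested above; each matching line is a command the
# target rejected with that sense data.
fgrep '0xb 0x24 0x' /tmp/kb1016106/vmkernel.log
```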



  • 8.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 14, 2011 12:47 AM

    @kchowksey: no I'm not running MS Failover Clustering.

When I run the fgrep command it doesn't find anything. The vmkernel logs have not been gzipped yet, so there are no vmkern*gz files.

The case number I attached the log files to was 11096075809.

I've attached the log file from our nested ESX 5i host that shows the same "hang" at boot time connecting only to a NexentaStor box via iSCSI; the "hang" time in this situation is around 4 minutes. Interestingly, I see a lot of "Network is unreachable" and "iscsid: Login Failed" errors even though there are no issues with the connectivity. I see these same types of messages on our production farm as well.

Update on 14/09/2011 18:45: I have removed the log file to avoid confusion; see below.



  • 9.  RE: KB Article: 1016106 and vSphere ESXi 5

    Broadcom Employee
    Posted Sep 14, 2011 04:39 AM

    Thanks ashley. Have forwarded your report to the right people. Suggest contacting Nexenta support too.



  • 10.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 14, 2011 05:30 AM

Thanks for your help. To eliminate as much garbage as possible from the log files (as I may have appended some incorrect information), I cleared all logs and then rebooted. It took around 6 minutes on the nested ESXi box; the bulk of the time was spent during the iSCSI phase, after the "vmw_satp_alua loaded successfully" message on the console. On a UCS blade this process takes around 15 minutes, and on a DL360 G7 around 30 minutes; see

    http://communities.vmware.com/thread/326077?tstart=0

    I've summarised the logs as a single small attachment.

When I look closely at the vmkernel.log file, I see the bulk of the time is spent in this section:

    <pre>

    ...

    ...

    2011-09-14T04:50:40.503Z cpu0:2604)ScsiDevice: 3121: Successfully registered device "naa.600144f02aa50c0000004e640a430001" from plugin "NMP" of type 0
    2011-09-14T04:50:40.524Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
    2011-09-14T04:50:40.555Z cpu0:2604)VC: 1449: Device rescan time 20 msec (total number of devices 5)
    2011-09-14T04:50:40.555Z cpu0:2604)VC: 1452: Filesystem probe time 29 msec (devices probed 5 of 5)
    2011-09-14T04:50:43.471Z cpu0:2050)LVM: 13188: One or more LVM devices have been discovered.
    2011-09-14T04:51:06.754Z cpu1:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
    2011-09-14T04:51:06.775Z cpu1:2604)VC: 1449: Device rescan time 22 msec (total number of devices 5)
    2011-09-14T04:51:06.775Z cpu1:2604)VC: 1452: Filesystem probe time 19 msec (devices probed 5 of 5)
    2011-09-14T04:51:32.987Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported

    ...

    </pre>

For some reason it looks like it is repeatedly trying to access vmhba32, which appears to be the controller the CD-ROM device hangs off. Sigh...

I guess this is a bug in vSphere 5? Please advise.



  • 11.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 19, 2011 10:44 AM

    I managed to make a little progress on this today. To the point where the host rescan times at least have come down to a minute. Thanks to @kchowksey for some good suggestions. I noticed that my QNAP was being picked up as an ALUA array. This was in addition to the failed IO with sense data 0xb 0x24 0x0.

    The claim rule I applied was as follows:

    esxcli nmp satp rule add -d "<naa.deviceid>" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

From what I can tell this problem impacts QNAP and Netgear. I've also got OpenFiler, and it didn't appear to be impacted, but I have done limited testing. Note that none of these storage systems are currently on the HCL. I believe the reason for the problem is that the iSCSI targets do not implement the T10 standards correctly. I'm going to be working with VMware support on this as well. So far the only iSCSI storage I've got that works is the HP P4000 (aka LeftHand Networks) VSAs with SAN/iQ 9.x.



  • 12.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 20, 2011 11:59 AM

For iSCSI access to targets from vSphere 5 hosts, the host will try to access every target for discovery from every vmkernel port that is bound to the initiator. It will try a number of times for each combination until it finally gives up and moves on.
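To make the multiplication concrete, here is a back-of-the-envelope sketch. All three numbers below are assumptions for illustration, not values taken from the ESXi initiator; the point is how the counts multiply, not the exact figures.

```shell
# Illustrative arithmetic only; the three counts below are assumed.
TARGETS=4        # iSCSI targets visible to the host (assumed)
BOUND_PORTS=2    # vmkernel ports bound to the initiator (assumed)
RETRIES=9        # attempts per target/port combination before giving up (assumed)
echo "$(( TARGETS * BOUND_PORTS * RETRIES )) worst-case login attempts per boot"
```

If any of those combinations cannot reach its target, every failed attempt has to time out before the next one starts, which is where the minutes add up.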



  • 13.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 27, 2011 01:54 AM

I see the KB is updated now. It's a pain to go through all of the ESX hosts and run the command against each RDM LUN.

The following link has a PowerCLI script to find the RDM LUNs. Hopefully the same can be extended to run the recommendation in the KB article.

    http://www.virtu-al.net/2008/12/23/list-vms-with-rdm/
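Building on that idea, a minimal shell sketch of the loop, assuming the RDM device ids have already been extracted (the two naa ids below are made up). It is written as a dry run with echo; removing the echo would actually apply the KB's setconfig command on an ESXi 5 host.

```shell
# Hypothetical RDM LUN ids; in practice the list would come from the
# PowerCLI script linked above.
RDM_LUNS="naa.600140500000000000000001 naa.600140500000000000000002"

# Dry run: print the KB command once per LUN. Drop "echo" to apply for real.
for lun in $RDM_LUNS; do
  echo esxcli storage core device setconfig -d "$lun" --perennially-reserved yes
done
```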



  • 14.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 27, 2011 02:49 PM

    Hello,

We have QNAP too. We tried the command:

    esxcli nmp satp rule add -d "<naa.deviceid>" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

    but this error arises:

    ~ # esxcli nmp satp rule add -d naa.6001405a0f1cc60ddafed4daedbc09df  -s VMW_SATP_DEFAULT_AA -o disable_ssd
    Error: Unknown command or namespace nmp satp rule add

We have ESXi 5.0.



  • 15.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 29, 2011 07:27 AM

I have the same problem too: ESXi 5.0 on an HP DL380 G7 with an iSCSI LUN on a QNAP 809 U PRO. I have a slow boot time; it takes 8 minutes to boot the ESXi host:

    2011-09-29T07:01:00.482Z cpu0:4120)ScsiDeviceIO: 2305: Cmd(0x4124003ed900) 0x12, CmdSN 0x43 to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.
    2011-09-29T07:01:00.494Z cpu0:4120)ScsiDeviceIO: 2305: Cmd(0x4124003ed900) 0x12, CmdSN 0x43 to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.
    2011-09-29T07:01:00.531Z cpu0:4120)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x12 (0x4124003ed900) to dev "naa.600140550c978cedf241d4b5fda8eedb" on path "vmhba32:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.Act:NONE

If I use the indicated command, I get this output:

    # esxcli nmp satp rule add -d "naa.600140550c978cedf241d4b5fda8eedb" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

    Error: Unknown command or namespace nmp satp rule add

    Anyone have this problem?
    Best Regards
    Andrea



  • 16.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Oct 01, 2011 09:55 AM

    Hi Guys,

My fault, sorry: in ESXi 5.0 the esxcli commands changed slightly. Now you need to specify the top-level namespace, i.e. esxcli storage or esxcli network.

    So the command you'd run would be:

    esxcli storage nmp satp rule add -d "naa.600140550c978cedf241d4b5fda8eedb" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

    Sorry that my earlier post was incorrect and neglected to include the "storage" part of the command.



  • 17.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Oct 04, 2011 07:45 AM

    Hi Micheal,

I tried to reconnect my QNAP to ESX. I created a VMkernel port (bound only to vnic0 because I use NIC teaming), connected the iSCSI LUN, and after some minutes (the scan for new storage is very slow) launched the command from the CLI:

    esxcli storage nmp satp rule add -d "naa.600140550c978cedf241d4b5fda8eedb" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

However, nothing changed. In the storage adapter section I see the QNAP LUN flapping up and down between dead and error operational states. :(

    In the Log file vmkernel.log i have this error now:

    2011-10-04T07:53:21.907Z cpu14:4110)ScsiDeviceIO: 2316: Cmd(0x412441852f40) 0x9e, CmdSN 0x291d to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
    2011-10-04T07:53:31.907Z cpu16:4112)ScsiDeviceIO: 2316: Cmd(0x412441852f40) 0x25, CmdSN 0x291e to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

The error code changed from 0xb 0x24 0x0 to 0x0 0x0 0x0.



  • 18.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Oct 04, 2011 12:22 PM

Firstly, for the port group you are binding to, can you confirm there is only one active uplink and no standby uplinks? All other uplinks must be set to unused.

    Secondly, after applying the SATP rule you should rescan the datastore again. That rule will only help with rescan times; it will not help with boot times.

    I'm working with VMware on a fix for the slow boot time issue. The workaround right now is to ensure that all iSCSI initiators and all bound ports can log in successfully to the iSCSI targets, and to reduce the number of targets used in the environment to as few as possible for faster boot times. If you're seeing the datastore flap, something else may be wrong.



  • 19.  RE: KB Article: 1016106 and vSphere ESXi 5

    Broadcom Employee
    Posted Sep 13, 2011 08:27 AM

    For ESX 5.0:

On the ESX hosts that are running MSCS VMs, identify the LUNs exported as RDMs to VMs,

    e.g. naa.<lunid>

    For each LUN identified above, perform this configuration from the ESX command line:

    esxcli storage core device setconfig -d naa.<lunid> --perennially-reserved yes

    The subsequent ESX reboot should no longer be slow. KB 1016106 will be updated ASAP with this information.

    Thanks.
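A hedged addition (not from the post above): after running setconfig, it is worth confirming the flag stuck. On ESXi 5, esxcli storage core device list -d naa.<lunid> reports an "Is Perennially Reserved" field; below is an abbreviated, made-up sample of that output, checked with grep, since esxcli itself is only available on the host.

```shell
# Abbreviated, assumed sample of "esxcli storage core device list -d <id>"
# output captured after the setconfig above (device id is made up).
cat > /tmp/device_list.txt <<'EOF'
naa.600140500000000000000001
   Display Name: Example RDM LUN (naa.600140500000000000000001)
   Is Perennially Reserved: true
EOF
# Count confirmations; expect one per device that took the setting.
grep -c "Is Perennially Reserved: true" /tmp/device_list.txt
```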



  • 20.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 13, 2011 12:38 PM

    Thanks. I will try this out and post the results later. 



  • 21.  RE: KB Article: 1016106 and vSphere ESXi 5

    Posted Sep 09, 2011 02:36 PM

    Hi,

I have the same problem. Previously, in ESXi 4.1, I changed the value of Scsi.CRTimeoutDuringBoot to 1 and it worked for me.

    Right now, in ESXi 5, I don't see this parameter.

    Any idea or solution?