ESXi

  • 1.  VMware vSphere ESXi Lockup

    Posted Sep 02, 2009 10:36 PM

    We are running VMware vSphere 4 ESXi on Sun X6450 blades with the Intel six-core Dunnington processors, boot from Compact Flash, iSCSI storage, 10GbE, and 92GB of RAM. The blades run fine for 10 to 30 days before completely locking up. There is no PSOD and no indication in any log that something has gone wrong. The machine simply stops responding. The issue is intermittent and apparently random. We can't find any correlation to the problem except the hardware: we also run X6250 blades with a similar configuration, and they have no problem. We see the problem under high load, under moderate load, and under no load. We have even seen it on blades that have ESXi in maintenance mode.

    Has anyone seen a similar problem? If so, what is your config?



  • 2.  RE: VMware vSphere ESXi Lockup

    Posted Sep 02, 2009 10:37 PM

    Correction: 96GB of RAM



  • 3.  RE: VMware vSphere ESXi Lockup

    Posted Sep 02, 2009 11:10 PM

    I've seen this same issue brought up in the forums before. Turns out it was iSCSI SAN related.

    Here is a related link: http://communities.vmware.com/thread/213710

    The other articles I read mentioned timeouts, rebooting the SAN without taking VMs offline, and other issues.

    Regards



  • 4.  RE: VMware vSphere ESXi Lockup

    Posted Sep 03, 2009 05:13 PM

    That link looks similar, but not the same. The post you refer to says that the server is temporarily hanging. In our case, the server is completely locked up and doesn't free up again. Also, the post refers to some indicators prior to the lockup and in the logs to help guide troubleshooting. We have none of that. As we are running other blades in the exact same storage configuration without the problem, I find it highly unlikely that iSCSI is related. However, we have ruled nothing out at this time.



  • 5.  RE: VMware vSphere ESXi Lockup

    Broadcom Employee
    Posted Sep 15, 2009 06:37 AM

    Could you collect vm-support logs using the vSphere Client after rebooting the locked-up server?

    It's a serious problem, and you can submit your logs to VMware.



  • 6.  RE: VMware vSphere ESXi Lockup

    Posted Mar 17, 2010 01:41 PM

    I have exactly the same symptoms. Our config is 4 Blade6000 with 6 to 7 X6450 each, and 2 ST6140 with 2 to 3 JBODs each. We use 2 NEM (X4250A) on each Blade and a dual-port FC card for each X6450.

    The only common point with my config is the X6450 module.

    We upgraded all modules to vSphere U1, but the lockup persists.

    Have you found a solution to this lockup?



  • 7.  RE: VMware vSphere ESXi Lockup

    Posted Mar 17, 2010 05:56 PM

    We are experiencing this issue as well with the Sun x6450 blades. Here's our configuration:

    • 2 Sun 6000 Blade Chassis (1 per datacenter)

    • 10 Sun x6450 Blade Modules (6 in one, 4 in the other)

    • 1 4-Port Gigabit Ethernet PCIe card per blade

    • 1 2-Port Gigabit Ethernet + 2-Port Fibre Channel PCIe card per blade

    • VMware ESX 3.5 Update 4, ESX 3.5 Update 5, ESX 4.0, ESX 4.0 Update 1

    Description of the issue:

    • ESX server is marked as Not Responding in vCenter

    • Console of the server is unresponsive; no errors logged to the console

    • Network connectivity to the server stops

    • All virtual machines running on the server are restarted on other hosts (HA host failure response)

    • Hard reset allows the server to rejoin the cluster

    • No errors for the lock-up are logged in ESX

    • No errors are logged in the blade ILOM

    • No errors are logged in the blade chassis ILOM

    • Occurs on all 10 blades in both chassis at each datacenter

    • Have not been able to reliably reproduce the issue

    • Frequency can be from one day to two months; seemingly random
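
    Because nothing is logged on the host or in either ILOM, an external heartbeat check is one way to at least narrow down when a blade stops answering. The following is a minimal sketch under stated assumptions, not a tested monitoring tool: the function names, the 60-second polling interval, and the choice of TCP port 443 as the probe target are all hypothetical and chosen only for illustration.

```python
import socket
import time
from datetime import datetime


def host_is_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 443 is an assumption; use any port the host is known to listen on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def lockup_windows(samples):
    """Find each up -> down transition in chronological samples.

    samples is a list of (timestamp, reachable) pairs. The result is a list
    of (last_seen_up, first_seen_down) pairs, one per transition, which
    brackets the moment the host stopped responding.
    """
    windows = []
    last_up = None
    prev_ok = None
    for ts, ok in samples:
        if ok:
            last_up = ts
        elif prev_ok and last_up is not None:
            windows.append((last_up, ts))
        prev_ok = ok
    return windows


def monitor(hosts, interval=60):
    """Poll each host forever and print one line per up -> down transition."""
    history = {h: [] for h in hosts}
    while True:
        now = datetime.now()
        for h in hosts:
            history[h].append((now, host_is_reachable(h)))
            history[h] = history[h][-2:]  # only the last two samples matter
            for up, down in lockup_windows(history[h]):
                print(f"{h}: last answered {up}, silent by {down}")
        time.sleep(interval)
```

    Calling monitor() with a list of blade hostnames from a machine outside the chassis would give a timestamp window for each lockup, which could then be compared against chassis, storage, and network events.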

    We have someone at Sun support who may have found a bug with the x6450 and ESX. This has not been confirmed yet.



  • 8.  RE: VMware vSphere ESXi Lockup

    Posted Mar 25, 2010 04:42 PM

    I'm having a very similar problem, do you have any information about the (potential) bug?



  • 9.  RE: VMware vSphere ESXi Lockup

    Posted Mar 26, 2010 02:53 PM

    Sun support has not provided any more information about the potential bug.

    We just noticed that the x6450 blade modules are no longer sold on Oracle's website, which makes resolving this issue moot for us. Even if the issue were resolved, we'd have 10 blades across two chassis that we couldn't expand. Moving all ten x6450 blades to one chassis and putting other blades with potentially different processors in the other chassis would cause issues with Site Recovery Manager. It makes more sense to replace these blades with ones that work than to keep waiting until a blade locks up, generating diagnostic logs, forwarding those to support, and getting the call that the logs don't include any clues as to the cause of the lockup.

    We've focused on working with our Sun sales rep to replace the x6450s with their AMD counterpart, the x6440s. From what I can see, the x6440 is the only 4-socket, six-core blade that's both available for sale and supported by VMware. Of course, the x6450 is supported by VMware, too. We'll see how that goes.



  • 10.  RE: VMware vSphere ESXi Lockup

    Posted May 07, 2010 07:37 PM

    More info on this:

    We were able to see that this issue does generate a purple screen error, but only after changing the console's TTY to the vmkernel's diagnostic logging screen (Alt+F12) (see attached screenshot):

    CpuSched: VcpuWaitForSwitch: timed out

    I also found this thread on Oracle's forum website:

    Looks like it's definitely an issue with the hardware or firmware, since this issue also occurs on Red Hat and Fedora servers. I guess since the ESX service console is based on Red Hat, there could be a connection.