ESXi

  • 1.  bnx2 Purple Screen - BL460c G7

    Posted May 31, 2012 03:13 PM

    Hey guys – I have four identically configured BL460c G7s (ESX 4.1 Build 348481) – recently one of them purple screened on a bnx2 Broadcom issue.  I brought the host back up thinking it was a one-off, but the following afternoon a number of the machines on it lost their network connection.  Not all of the machines on the host, but most, including a Windows and a Unix machine.

    I’m thinking the card might be going bad, because if it were a driver issue, wouldn’t I be having the problem on all four blades?

    Does anyone know how to test the card?  Here’s the ethtool output for the bnx2 cards:

    … ethtool -i vmnic0

    driver: bnx2

    version: 2.0.7d-3vmw

    firmware-version: 5.2.3

    bus-info: 0000:06:00.0

    … ethtool -i vmnic1

    driver: bnx2

    version: 2.0.7d-3vmw

    firmware-version: 5.2.3

    bus-info: 0000:06:00.1
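For anyone comparing NICs or blades by hand, here is a minimal sketch (not from the original post) that parses `ethtool -i` output and reports which of the driver, version, and firmware fields differ. The helper names `parse_ethtool_info` and `differing_fields` are made up for illustration; the sample text is the output quoted above.

```python
def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i <nic>` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as 0000:06:00.0 stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def differing_fields(reports, fields=("driver", "version", "firmware-version")):
    """Return {field: {name: value}} for each field that is not identical across reports."""
    diffs = {}
    for field in fields:
        values = {name: info.get(field) for name, info in reports.items()}
        if len(set(values.values())) > 1:
            diffs[field] = values
    return diffs


# Sample input: the two outputs quoted in the post.
vmnic0 = parse_ethtool_info("""driver: bnx2
version: 2.0.7d-3vmw
firmware-version: 5.2.3
bus-info: 0000:06:00.0""")
vmnic1 = parse_ethtool_info("""driver: bnx2
version: 2.0.7d-3vmw
firmware-version: 5.2.3
bus-info: 0000:06:00.1""")

print(differing_fields({"vmnic0": vmnic0, "vmnic1": vmnic1}))  # prints {}
```

An empty result only says the reported driver and firmware strings match. It says nothing about whether the card itself is healthy.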

    dump below:

    345:20:05:44.376 cpu22:4118)NMP: nmpCompleteRetryForPath: Retry world recovered device "naa.600508b1001cd1a3cb7fcd11a57c0061"

    345:20:06:11.426 cpu16:4112)Backtrace for current CPU #16, worldID=4112, ebp=0x417f80087558

    345:20:06:11.427 cpu16:4112)0x417f80087558:[0x41802c455685]PanicLogBacktrace@vmkernel:nover+0x18 stack: 0x4100021ac080, 0x417f8

    345:20:06:11.427 cpu16:4112)0x417f80087698:[0x41802c4558c4]PanicvPanicInt@vmkernel:nover+0x1ab stack: 0x3000000010, 0x417f80087

    345:20:06:11.428 cpu16:4112)0x417f80087778:[0x41802c455d46]Panic_ExceptionMsg@vmkernel:nover+0xa5 stack: 0x41802c947f94, 0x80,

    345:20:06:11.428 cpu16:4112)0x417f80087878:[0x41802c455dcc]Panic_Exception@vmkernel:nover+0x83 stack: 0x417f80087ab8, 0x41802c9

    345:20:06:11.429 cpu16:4112)0x417f800878c8:[0x41802c42dd81]IDTReturnPrepare@vmkernel:nover+0x254 stack: 0x417f80087af8, 0x41802

    345:20:06:11.429 cpu16:4112)0x417f800878d8:[0x41802c4ded47]gate_entry@vmkernel:nover+0x46 stack: 0x4018, 0x4018, 0x41000cc07ed0

    345:20:06:11.430 cpu16:4112)0x417f80087af8:[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f stack: 0x417f80087b48, 0x402c4e9321,

    345:20:06:11.430 cpu16:4112)0x417f80087b48:[0x41802c949490]bnx2_poll@esx:nover+0x143 stack: 0x417f80087c64, 0x41000cc075e8, 0x4

    345:20:06:11.431 cpu16:4112)0x417f80087bc8:[0x41802c85319a]napi_poll@esx:nover+0x10d stack: 0x417fecca9f78, 0x41000cc18810, 0x4

    345:20:06:11.431 cpu16:4112)0x417f80087c98:[0x41802c4d77eb]WorldletBHHandler@vmkernel:nover+0x442 stack: 0x417fecbc84a0, 0x0, 0

    345:20:06:11.432 cpu16:4112)0x417f80087cf8:[0x41802c4063b6]BHCallHandlers@vmkernel:nover+0xc5 stack: 0x100410002408000, 0x13767

    345:20:06:11.432 cpu16:4112)0x417f80087d38:[0x41802c4066b0]BH_Check@vmkernel:nover+0xcf stack: 0x417f80087de8, 0x1000000002, 0x

    345:20:06:11.433 cpu16:4112)0x417f80087e48:[0x41802c5cdee5]CpuSchedIdleLoopInt@vmkernel:nover+0x6c stack: 0x417f80087e88, 0x418

    345:20:06:11.433 cpu16:4112)0x417f80087e58:[0x41802c5d3fce]CpuSched_IdleLoop@vmkernel:nover+0x15 stack: 0x10, 0x4, 0x10, 0x4, 0

    345:20:06:11.434 cpu16:4112)0x417f80087e88:[0x41802c432c57]Init_SlaveIdle@vmkernel:nover+0x11e stack: 0x0, 0x200000000, 0x0, 0x

    345:20:06:11.434 cpu16:4112)0x417f80087fe8:[0x41802c6a5668]SMPSlaveIdle@vmkernel:nover+0x45f stack: 0x0, 0x0, 0x0, 0x0, 0x0

    345:20:06:11.435 cpu16:4112)VMware ESX 4.1.0 [Releasebuild-348481 X86_64]



  • 2.  RE: bnx2 Purple Screen - BL460c G7

    Posted May 31, 2012 03:19 PM

    Hi,

    Could be a memory problem.

    Thanks

    Sa



  • 3.  RE: bnx2 Purple Screen - BL460c G7

    Posted May 31, 2012 03:46 PM

    Well, I guess anything could be a memory problem, lol. What makes you think so?



  • 4.  RE: bnx2 Purple Screen - BL460c G7

    Posted May 31, 2012 08:50 PM

    I ran the full HP Diagnostics off the SmartStart CD and everything passed.  Now I'm running Memtest86+ and so far so good - 53% through pass 1.  Takes a while with 64GB :)

    I'm thinking maybe I should put the host in a separate cluster, move a few test machines there and do some network load testing?



  • 5.  RE: bnx2 Purple Screen - BL460c G7

    Posted May 31, 2012 10:29 PM

    Can you engage VMware support to decode the core dump properly? The log file is only part of the story. It indicates hardware (which can include firmware and BIOS) or the device driver: BH = bottom half handler, which is part of the device driver.  The real decode will show what world ID 4112 actually was.  If it was idle, that indicates hardware.  If it was a function in the kernel or the device driver, that could be a bug.

    Don't assume that because it only happened to one host it can't be a driver issue. It could be a bug in the driver triggered by some set of events that only this server, running a specific set of VMs, experienced.  One thing to check: if you leave the host completely idle over the weekend, does it PSOD again, or does it require a load to crash?  Do all the PSODs have the same kernel stack trace, or does it move around?  That is another way to determine whether it is software or hardware.  It could also be something the NICs are talking to, such as a port on the switch, a bad cable, or a setting on the network switch. It's too early to tell from this info alone.

    I recommend that all my customers who get PSODs open support cases with both VMware and the hardware vendor.  There may be things that only support can tell you, such as a firmware or BIOS bug or a known issue with a device driver, or perhaps there really is bad hardware and it isn't showing up in the diagnostics you have.

    In my experience ESX is one of the most sensitive hardware diagnostic tools out there ;-) It is quite sensitive to issues that even true hardware diagnostic tools, or other operating systems, can't detect.  That makes it harder, but the sooner you can get the proper support engaged, the better your chance of getting it resolved.  Make sure to take screenshots, or even take a picture with your camera, of the actual purple screen of death; it has info that you sometimes can't get from the logs.  Also make sure to record all of them, not just one, since each can have detailed information that is not obvious to people who don't look at PSODs every day like some of us do :-)
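The question of whether every PSOD shows the same kernel stack trace can be checked roughly by reducing each backtrace to its ordered list of function names, since timestamps and addresses differ between crashes. The following is a minimal sketch, assuming the frame format seen in the dump at the top of this thread. `FRAME_RE`, `stack_signature`, and `same_signature` are names made up for illustration, and the second "dump" is a hypothetical stand-in built from other lines of the same backtrace purely to show a mismatch.

```python
import re

# Matches frames such as "[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f".
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\](\w+)@([\w.]+):\S*\+0x[0-9a-fA-F]+")


def stack_signature(log_text):
    """Return the ordered tuple of (function, module) frames found in a backtrace."""
    return tuple(FRAME_RE.findall(log_text))


def same_signature(dump_a, dump_b):
    """True when both backtraces list the same functions in the same order."""
    return stack_signature(dump_a) == stack_signature(dump_b)


# Two frames copied from the backtrace in the first post.
first = """345:20:06:11.430 cpu16:4112)0x417f80087af8:[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f stack: 0x417f80087b48, 0x402c4e9321,
345:20:06:11.430 cpu16:4112)0x417f80087b48:[0x41802c949490]bnx2_poll@esx:nover+0x143 stack: 0x417f80087c64, 0x41000cc075e8, 0x4"""

# Hypothetical second crash: other lines of the same dump, used only as sample input.
second = """345:20:06:11.431 cpu16:4112)0x417f80087bc8:[0x41802c85319a]napi_poll@esx:nover+0x10d stack: 0x417fecca9f78, 0x41000cc18810, 0x4
345:20:06:11.431 cpu16:4112)0x417f80087c98:[0x41802c4d77eb]WorldletBHHandler@vmkernel:nover+0x442 stack: 0x417fecbc84a0, 0x0, 0"""

print(stack_signature(first))
print(same_signature(first, second))  # prints False
```

This compares only what the log text shows. It is no substitute for the core dump decode that support can provide.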



  • 6.  RE: bnx2 Purple Screen - BL460c G7

    Posted Jun 01, 2012 02:42 PM

    Hanna, fantastic reply, thanks for that.  I did open a case with VMware, and I have a screenshot of the PSOD!  I took a shot at decoding the MCE - but not much luck there.  I haven’t opened a case with HP (but I will), and I’ll go back and see if VMware can provide me with the full decode and the idle / operation info - that's valuable insight.

    I agree ESX exposes the hardware grit in an interesting way.  I’ve run some ESX 3.5 boxes for a year easy without a reboot.  Don’t take that for granted.
    One more thing – I recently spun up a VM for mobile app development, including an Android emulator from the Google SDK…  call me crazy, but I had a gut feeling that the emulator had the potential to cause problems...  Do you think a Google AVD could somehow PSOD an ESX host?
    Attaching PSOD screenshots…


  • 7.  RE: bnx2 Purple Screen - BL460c G7
    Best Answer

    Posted Jun 01, 2012 04:00 PM

    You are welcome.  I am glad to help when I have free time.  The screenshot clearly shows the CPU was in the idle loop at the time of the PSOD.  This means the CPU was not executing any code at the time, which indicates hardware (including firmware and BIOS). The device driver is software that would have to be executing, so it may not be the device driver, but it is still hard to tell at this point.  Definitely push HP to look at firmware/BIOS as well, not just a blatant hardware error; it might be subtle and may not show up right away.
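As a rough way to see the same thing in the text log, the sketch below lists whether any idle-loop frame appears in a backtrace and which frames come from modules other than `vmkernel`. It assumes the frame format of the dump in the first post. `summarize_backtrace` is a made-up helper name, and matching on the substring "IdleLoop" is a simplifying assumption. It only restates what the log says and does not replace a proper decode by support.

```python
import re

# Captures (function, module) from frames such as "[0x41802c5d3fce]CpuSched_IdleLoop@vmkernel:".
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\](\w+)@([\w.]+):")


def summarize_backtrace(log_text):
    """Report idle-loop presence and non-vmkernel frames in a backtrace."""
    frames = FRAME_RE.findall(log_text)
    return {
        # Assumption: idle-loop frames can be recognised by "IdleLoop" in the name.
        "idle_loop_in_stack": any("IdleLoop" in fn for fn, _ in frames),
        "non_vmkernel_frames": [fn for fn, mod in frames if mod != "vmkernel"],
    }


# Three frames copied from the backtrace in the first post.
sample = """345:20:06:11.430 cpu16:4112)0x417f80087af8:[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f stack: 0x417f80087b48, 0x402c4e9321,
345:20:06:11.430 cpu16:4112)0x417f80087b48:[0x41802c949490]bnx2_poll@esx:nover+0x143 stack: 0x417f80087c64, 0x41000cc075e8, 0x4
345:20:06:11.433 cpu16:4112)0x417f80087e48:[0x41802c5cdee5]CpuSchedIdleLoopInt@vmkernel:nover+0x6c stack: 0x417f80087e88, 0x418"""

print(summarize_backtrace(sample))
# prints {'idle_loop_in_stack': True, 'non_vmkernel_frames': ['bnx2_poll_work', 'bnx2_poll']}
```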