
 VSAN disk broken or not?

PonF posted Oct 09, 2024 08:15 AM

We have a vSAN 6.0 cluster where we got the alarm "Error occurred on the disk(s) of a Virtual SAN host". But if you check under hardware there is no error, and we can't see any error in the HP Gen9 iLO interface. There is one place where we can see the error: under Cluster->Manage->Disk Management. The disk group says "Unhealthy", and the disk itself is mounted but says "Permanent disk failure".

But then the error disappears? And then we get an alarm again. This happened several times (maybe because we also did a disk rebalance?), but the error is gone for now.

So what does this mean? Do we have a broken disk or not? See the two attached pictures of the errors.

Broadcom Employee Duncan Epping

Difficult to say if the disk is broken, but it definitely isn't very stable. It could be an issue with the disk controller as well; difficult to say, to be honest. We've had some issues in the past with disks being marked as faulty too soon, but I cannot recall which release that was. I know it was solved in 6.7 or so.

Any particular reason you are still running 6.0?

TheBobkin

@PonF
Out-of-band management such as iLO neither monitors nor is aware of about 90% of the conditions that can cause ESXi/vSAN to offline a disk or mark it as being in an error state. You should check the vmkernel.log and vmkwarning.log of this host to see why the disk is being marked as in an error state. vobd.log can be used to narrow down the timings of these events (e.g. it will show PDL or 'under permanent error' events going back a long time, and then you can look at vmkernel.log and vmkwarning.log at that time).
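As a rough illustration of that workflow, here is a minimal Python sketch that pulls the timestamps of suspicious lines out of a copied log file. It assumes each line starts with an ISO-8601 timestamp, as in the vmkernel.log excerpt quoted later in this thread. The marker strings and the function names `find_disk_error_events` and `scan_log_file` are illustrative choices, not part of any VMware tooling.

```python
import re

# Assumption: every log line starts with an ISO-8601 UTC timestamp,
# as in the vmkernel.log excerpt quoted in this thread.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z)\s")

# Illustrative markers only; extend the tuple for other messages of interest.
MARKERS = ("permanent error", "valid sense data")


def find_disk_error_events(lines):
    """Return (timestamp, line) pairs for lines that contain a marker."""
    events = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in MARKERS):
            match = TIMESTAMP.match(line)
            events.append((match.group(1) if match else None, line.rstrip("\n")))
    return events


def scan_log_file(path):
    """Scan one copied log file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_disk_error_events(handle)


# Two lines copied from the excerpt in this thread, plus a filler line
# (the filler is made up and only shows that unrelated lines are skipped).
SAMPLE = [
    '2024-10-08T08:09:26.375Z cpu8:32823)ScsiDeviceIO: 2652: Cmd(0x43b797de1540) 0x28, CmdSN 0x279a from world 0 to dev "naa.5000c500c1f84c6b" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.',
    "placeholder line without any marker",
    "2024-10-08T08:09:28.718Z cpu2:43497)WARNING: LSOM: LSOMEventNotify:6537: Virtual SAN device 52f7d83a-1cc9-8d1e-9b25-21742a23f0c9 is under permanent error.",
]

for timestamp, line in find_disk_error_events(SAMPLE):
    print(timestamp, "|", line[:80])
```

Running `scan_log_file` on a copy of vobd.log first, and then on vmkernel.log around the same timestamps, mirrors the order suggested above.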

If you can upload or PM me the vmkernel.log and vmkwarning.log, we can take a look at this (feel free to obfuscate it of IPs, hostnames, anything else considered private).

PonF

@TheBobkin

I have uploaded the files you asked for. I think vmkernel.log was close to rotating past the error, so I also attached vmkernel_old, which I copied the same day we had the problem. I don't think there is any sensitive information in the logs; please tell me if I am showing too much...

Can you understand any of the messages related to the disk error?

pcgeek2009

So, I recently had a similar issue with vSAN 7 on a Dell host. I ended up opening a support ticket with Broadcom. They determined it was a bad disk. I then contacted Dell. There began the whole "the DRAC, BIOS, adapter, and disk firmware is out of date". However, on the DRAC the disk showed 0 capacity. I put the host in maintenance mode and updated all of the firmware as they insisted I do. I still had the problem. At that point support tried to point out that the adapter driver in VMware was slightly behind. I pointed out that this was part of a VCF environment and that I was trying to do an upgrade to 5.1.1. Unfortunately, I had a FAILED VSAN DISK ERROR and could not proceed. He then relented and sent me a drive, which did fix the problem.

TheBobkin

@PonF
This device was marked as under permanent error due to UREs (Unrecovered Read Errors) in the metadata region of the disk:
2024-10-08T08:09:26.375Z cpu8:32823)NMP: nmp_ThrottleLogForDevice:3333: Cmd 0x28 (0x43b797de1540, 0) to dev "naa.5000c500c1f84c6b" on path "vmhba3:C2:T9:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. Act:NONE
2024-10-08T08:09:26.375Z cpu8:32823)ScsiDeviceIO: 2652: Cmd(0x43b797de1540) 0x28, CmdSN 0x279a from world 0 to dev "naa.5000c500c1f84c6b" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
2024-10-08T08:09:26.375Z cpu8:32823)LSOMCommon: IORETRYIOCompletionInt:1195: Throttled:  0x43b757848dc0 IO type 269 (READ) isOdered:NO since 7247 msec status I/O error
2024-10-08T08:09:26.375Z cpu8:32823)LSOMCommon: IORETRYCompleteIO:504: Throttled:  0x43b757848dc0 IO type 269 (READ) isOdered:NO since 7247 msec status I/O error
2024-10-08T08:09:26.375Z cpu8:32823)WARNING: LSOM: RCIOCompletionLoop:72: Throttled: Virsto IO failed. Wake up 0x43b70121b1c0 with status I/O error
2024-10-08T08:09:28.718Z cpu11:43552)ScsiDeviceIO: 2652: Cmd(0x43b7d4733380) 0x28, CmdSN 0x279b from world 0 to dev "naa.5000c500c1f84c6b" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
2024-10-08T08:09:28.718Z cpu11:43552)LSOMCommon: IORETRYIOCompletionInt:1195: Throttled:  0x43b757848cc0 IO type 269 (READ) isOdered:NO since 9590 msec status I/O error
2024-10-08T08:09:28.718Z cpu11:43552)LSOMCommon: IORETRYCompleteIO:504: Throttled:  0x43b757848cc0 IO type 269 (READ) isOdered:NO since 9590 msec status I/O error
2024-10-08T08:09:28.718Z cpu2:43497)WARNING: LSOM: RCIOCompletionLoop:72: Throttled: Virsto IO failed. Wake up 0x43b7011d8100 with status I/O error
2024-10-08T08:09:28.718Z cpu2:43497)WARNING: LSOM: RCDrainAfterBERead:6100: Changing the status of child state from Success to I/O error
2024-10-08T08:09:28.718Z cpu2:43497)WARNING: LSOM: LSOMEventNotify:6537: Virtual SAN device 52f7d83a-1cc9-8d1e-9b25-21742a23f0c9 is under permanent error.

Human Translation of this sense code:
H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
Sense Key    [0x3]    MEDIUM ERROR
Additional Sense Data    11/00    UNRECOVERED READ ERROR
https://www.virten.net/vmware/vmware-esxi-scsi-sense-code-decoder-v2/?scsiCode=H%3A0x0+D%3A0x2+P%3A0x0+Valid+sense+data%3A+0x3+0x11+0x0
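For readers who want to see how such a string breaks down, here is a minimal Python sketch that decodes the fields by hand. The lookup tables are deliberately partial: they hold a handful of standard SCSI values, including the ones that appear in this log, and are no substitute for the full decoder linked above. The name `decode_sense` is illustrative.

```python
import re

# Matches the "H:.. D:.. P:.. Valid sense data: key asc ascq" form seen in vmkernel.log.
SENSE_PATTERN = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)"
    r"\s+Valid sense data:\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)

# Partial tables of standard SCSI values; anything missing is reported as UNKNOWN.
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ADDITIONAL_SENSE = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_sense(text):
    """Decode the first sense string found in text, or return None."""
    match = SENSE_PATTERN.search(text)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,
        "plugin_status": plugin,
        "device_status": DEVICE_STATUS.get(device, "UNKNOWN"),
        "sense_key": SENSE_KEYS.get(key, "UNKNOWN"),
        "additional_sense": ADDITIONAL_SENSE.get((asc, ascq), "UNKNOWN"),
        "asc_ascq": f"{asc:02X}/{ascq:02X}",
    }


decoded = decode_sense("H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0")
print(decoded)
```

The device status 0x2 (CHECK CONDITION) is what tells the host that sense data is available; the sense key and the ASC/ASCQ pair then carry the actual reason.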

iLO/iDRAC/any OOBM won't detect such a thing, as they do not monitor SCSI sense codes or react to such I/O errors.

Older versions of vSAN don't have the later-added feature (SSDs only) of reacting to this by removing the disk, issuing TRIM/UNMAP via the device's firmware, then re-partitioning the device and putting it back in service:
https://core.vmware.com/blog/improving-vsans-resilience-against-unrecovered-read-errors-devices

This feature could be enabled manually from ESXi/vSAN 6.7 P03 onwards, so if you are on a lower build, I would advise updating your hosts (all of them) to at least that version (you should really be on the latest 6.7 build if you are staying on 6.x for whatever reason) and trying this:
https://knowledge.broadcom.com/external/article/326767/vsan-disk-or-diskgroup-fails-with-medium.html

PonF

@TheBobkin

Since we are on 6.0, can we solve this by replacing the disk? Maybe we can't get HP support to change it, since iLO does not see the error, but could we just buy a new disk and replace it?

There is no support for 6.0, but is there none for 6.7 either? So if we upgrade and run into problems, we would not have any official support? We want to do as little as possible for the moment and pray it will hold one more year. A project to upgrade to version 8 has started, but the actual upgrade will be completed in October next year (a new environment). It is a complex environment in a 24/7 process industry.

PonF

There was a break of a few days, but now the alarms keep coming. When I log in later there is no error under Disk Management, so I guess the alarm is intermittent.

Can someone confirm that I could just remove the old disk and replace it with a new one, as in this guide? My system is 6.0, but it should be similar?

Replace a Capacity Device

The disk in the HP ProLiant Gen9 is model "EG001200JWJNQ". Should I buy the exact same model? I think it is an old model. Or is there a newer model I can buy? When I checked the VMware Compatibility Guide I could not select vSAN 6.0. Does anyone know of a replacement disk, or should I find the exact model to be sure?

PonF

So last week I followed the VMware link from my last post and it worked. The guide says it will evacuate all data on the host, but I think the correct term would be to evacuate all data on the disk. The disk was already empty, probably because vSAN evacuated it automatically when the errors occurred, and since we do not have automatic rebalancing enabled I think that is why it stayed empty. After the disk change we did a rebalance, and now the disk contains components of the VMs. No new alarms so far.

I found it a bit hard to know which disk to buy as a replacement, since the ones we have are an old model and I don't understand how to find a supported disk, or maybe that is not possible when you are still on 6.0. We asked a supplier if they had the same disk and they said they did, but when it arrived it did not have exactly the same part number. The old disk was EG001200JWJNQ and I think the new one is EG001200JWFVA. I opened the HCL database file in WordPad and could find both numbers in that large file, so I hoped that indicated it is supported. No alarms so far; I guess VMware would complain if the new disk was not supported...? The new disk has a higher firmware version.
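The WordPad search can also be done with a few lines of Python. This is only a sketch: it treats the HCL database file as plain text, makes no assumption about its internal structure, and the function names are made up for illustration. A hit only shows that the model string occurs somewhere in the file, not which vSAN release it is listed for.

```python
def find_models(text, models):
    """Map each model string to True if it occurs in text (case-insensitive)."""
    haystack = text.casefold()
    return {model: model.casefold() in haystack for model in models}


def find_models_in_file(path, models):
    """Read the whole file as text and search it; path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_models(handle.read(), models)


# Made-up stand-in text, NOT the real layout of the HCL database file.
example_text = "... eg001200jwjnq ... EG001200JWFVA ..."
print(find_models(example_text, ["EG001200JWJNQ", "EG001200JWFVA", "NOT-A-REAL-MODEL"]))
```

For the real file, call `find_models_in_file` with the path to the downloaded HCL database and the two part numbers.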

pcgeek2009

So, in your comments, you seem to have posted the same part number twice:

"The old disk was EG001200JWJNQ and I think the new is EG001200JWJNQ."

That may be a mistake, and the numbers may be different. However, if they were both shown on the HCL for version 6 you should be good. I would think if they were not compatible, you would have received some type of error or warning, but that is not a guarantee. If all has been fine for a week or so, I would say you are probably in good shape now. 

PonF

You are correct, I typed it wrong. The new model is EG001200JWFVA.

I don't think there is any special HCL database for 6.0. We regularly follow the link below and it still says 6.x, so maybe there is no difference depending on version (or maybe the file contains different lists depending on version, I don't know).

Updating the vSAN HCL database manually

pcgeek2009

There used to be a drop-down for version 6. However, when Broadcom migrated it over to the new site, they removed all of the older software versions. More than likely the new disk is a directly compatible replacement for the old one. HP may have had to change a vendor or something, which results in a PN change on their end.