Virtual SAN device is under permanent error

1. Virtual SAN device is under permanent error

SebastianGrugel
Posted Apr 27, 2016 01:55 PM

Hi,
We have:
vCenter 5.5 U2 x 1
Two clusters, Compute and Management
ESXi 5.5 U2 - 4 hosts in cluster
and VSAN
Today in the MGT (Management) cluster we have the issue below:
- Virtual SAN device is under permanent error
- Virtual SAN device has gone offline
We don't see information about used storage:
In MANAGE > Virtual SAN > Disk Management
the disk group looks healthy,
but the drives inside the disk groups on this host don't show any health status:
How can I troubleshoot this case further, or try to fix it?
2. RE: Virtual SAN device is under permanent error

zdickinson
Posted Apr 27, 2016 06:56 PM

It sounds like the SSD in a disk group has failed and needs to be replaced. This happened to us: we deleted the disk group, replaced the SSD, re-created the disk group, and let everything re-balance. There might have been some trickiness around deleting the disk group; I cannot remember. Thank you, Zach.
3. RE: Virtual SAN device is under permanent error

SebastianGrugel
Posted Apr 29, 2016 07:53 PM

Thanks, Zach, for the fast reaction.
We have opened an SR with VMware. Here is what we know so far from the VMware engineer's investigation.
Report after troubleshooting the first (MGT) cluster:
"It appears SSD naa.5001e8200282f398 on host XXXXXXXXXXXX experienced a hardware issue:
### vmkernel.log ###
2016-04-27T07:47:14.165Z cpu2:32803)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x1a (0x412e8089cf00, 0) to dev "naa.5001e8200282f398" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0xcd 0x0. Act:NONE
2016-04-27T07:47:14.165Z cpu2:32803)ScsiDeviceIO: 2363: Cmd(0x412e8089cf00) 0x1a, CmdSN 0x107 from world 0 to dev "naa.5001e8200282f398" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0xcd 0x0.
Sense key 0x4 translates to "Hardware Error".
ASC/ASCQ 0xcd 0x0 is not listed on www.t10.org (http://www.t10.org/lists/asc-alph.htm), so I can't say what further information the RAID controller actually supplied here. I assume this value is vendor specific in this case.
Besides that I only see "I/O error" and "Disk naa... not found in healthy state" messages in the logs for the disks.
As the SCSI error with sense data 0x4 0xcd 0x0 was only reported for one of the two SSDs, I'm not sure why the 2nd SSD didn't get mounted either.
It might still be related though. Looking at the HBAs in use, there is only 1 RAID controller, correct? So if there is actually a hardware issue with the controller itself, and not just with SSD naa.5001e8200282f398, this might have a knock-on effect on the other disks as well.
Hence, my recommendation is to open a ticket with the hardware vendor, Dell, to investigate the hardware error further."
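
The decoding the engineer describes (H/D/P status fields, then sense key, ASC and ASCQ) can be done mechanically. Below is a minimal Python sketch, not a VMware tool: the function name `decode_scsi_line` and the lookup tables are illustrative assumptions, and the tables cover only a small subset of the standard SCSI values. Note that vmkernel prints either "Valid sense data" or "Possible sense data"; as far as I understand VMware's guidance, the "Possible" values should not be relied on, so the sketch reports that flag separately.

```python
import re

# Subset of SCSI sense keys (standard values, not exhaustive).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Only the host status codes that appear in this thread.
HOST_STATUS = {0x0: "OK", 0x5: "ABORT"}

LINE_RE = re.compile(
    r'dev "(?P<dev>[^"]+)".*?'
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<device>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+) '
    r'(?P<kind>Valid|Possible) sense data: '
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)


def decode_scsi_line(line):
    """Return the decoded status fields of one vmkernel SCSI line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    host = int(m.group("host"), 16)
    key = int(m.group("key"), 16)
    return {
        "device": m.group("dev"),
        "host_status": HOST_STATUS.get(host, hex(host)),
        "device_status": int(m.group("device"), 16),
        "plugin_status": int(m.group("plugin"), 16),
        # True only when vmkernel printed "Valid sense data".
        "sense_valid": m.group("kind") == "Valid",
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": int(m.group("asc"), 16),
        "ascq": int(m.group("ascq"), 16),
    }


if __name__ == "__main__":
    sample = (
        '2016-04-27T07:47:14.165Z cpu2:32803)ScsiDeviceIO: 2363: '
        'Cmd(0x412e8089cf00) 0x1a, CmdSN 0x107 from world 0 to dev '
        '"naa.5001e8200282f398" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x4 0xcd 0x0.'
    )
    print(decode_scsi_line(sample))
```

The ASC/ASCQ pair is returned as raw numbers on purpose: 0xcd falls outside the codes listed on t10.org, so any name for it would have to come from the hardware vendor.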
Interestingly, the day after this we had the same warning in the second cluster, CMP (Compute).
We found many entries like these in the logs:
========= vmkernel.log ==============
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e803f0180) 0x2a, CmdSN 0x19998ae4 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f07305fc0) 0x2a, CmdSN 0x19998aec from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e8040c940) 0x2a, CmdSN 0x19998aed from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f22113180) 0x2a, CmdSN 0x19998b10 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e8040cbc0) 0x2a, CmdSN 0x19998afc from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f3ce16bc0) 0x2a, CmdSN 0x19998b07 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e80426ac0) 0x2a, CmdSN 0x19998afe from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e803c51c0) 0x2a, CmdSN 0x19998b0d from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e803dbb00) 0x2a, CmdSN 0x19998b11 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e8044ec00) 0x2a, CmdSN 0x19998af9 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x2a (0x412f092e3a00, 0) to dev "naa.5001e8200282656c" on path "vmhba0:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f092e3a00) 0x2a, CmdSN 0x19998aeb from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f0b331d00) 0x2a, CmdSN 0x19998b06 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f0a43c440) 0x2a, CmdSN 0x19998ae3 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f0afaa5c0) 0x2a, CmdSN 0x19998af8 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f3ce14dc0) 0x2a, CmdSN 0x19998b13 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f08f26600) 0x2a, CmdSN 0x19998af1 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
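
With this many near-identical entries, a per-device tally is easier to read than the raw log. The following is a minimal Python sketch under stated assumptions: `tally_failures` is a hypothetical helper name, the regular expression is fitted only to the `ScsiDeviceIO` lines shown in this thread, and the opcode table lists only the three SCSI commands that appear here (0x1a, 0x28, 0x2a).

```python
import re
from collections import Counter

# SCSI operation codes seen in this thread (standard values).
OPCODES = {0x1A: "MODE SENSE(6)", 0x28: "READ(10)", 0x2A: "WRITE(10)"}

FAIL_RE = re.compile(
    r'ScsiDeviceIO: \d+: Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+), .*?'
    r'to dev "(?P<dev>[^"]+)" failed H:(?P<host>0x[0-9a-f]+)'
)


def tally_failures(lines):
    """Count failed commands per (device, command name, host status)."""
    counts = Counter()
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            op = int(m.group("op"), 16)
            name = OPCODES.get(op, hex(op))
            counts[(m.group("dev"), name, m.group("host"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: '
        'Cmd(0x412e803f0180) 0x2a, CmdSN 0x19998ae4 from world 0 to dev '
        '"naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 '
        'Possible sense data: 0x5 0x20 0x0.',
        '2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: '
        'Cmd(0x412f07305fc0) 0x2a, CmdSN 0x19998aec from world 0 to dev '
        '"naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 '
        'Possible sense data: 0x0 0x0 0x0.',
    ]
    for key, n in tally_failures(sample).items():
        print(n, key)
```

For the entries above, every failure is a WRITE(10) to the same device with host status 0x5, which is why the differing "Possible sense data" values at the end of each line carry little weight.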
We tried to mount the drives manually, without success:
esxcli vsan storage diskgroup mount -s naa.5001e8200282656c

For this issue, a server reboot helped.

Unfortunately, I still don't know why "Health status" is not showing up in one disk group. Maybe somebody knows?
After this we received a short description from the VMware engineer:
"H:0x5 (Aborts) on the affected host XXXXXXXXXX:
2016-04-28T09:27:31.280Z cpu42:27106261)ScsiDeviceIO: 2363: Cmd(0x412f4b776ac0) 0x28, CmdSN 0x7bcc36e9 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:27:31.280Z cpu48:27106244)ScsiDeviceIO: 2363: Cmd(0x412f4b773280) 0x28, CmdSN 0x7bcc36e2 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:27:31.280Z cpu42:27106261)ScsiDeviceIO: 2363: Cmd(0x412f4b771fc0) 0x28, CmdSN 0x7bcc36e7 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2016-04-28T09:27:31.280Z cpu29:27106294)ScsiDeviceIO: 2363: Cmd(0x412f4b775a80) 0x28, CmdSN 0x7bcc36e6 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
Prior to that we could see megasas aborts in the vmkernel logs.
This particular issue is described in the following KB article: http://kb.vmware.com/kb/2109665
One of the main steps to resolve this on a long term basis, is to increase the values for /LSOM/diskIoTimeout and /LSOM/diskIoRetryFactor (exact steps are also described in the mentioned KB article):
esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor"
-------------------------------------------------------------------------------------------------
The case is still open because we will now open an additional case with Dell to investigate the hardware in the first cluster, which still has the issue (a reboot doesn't help).
In the second cluster we still don't see the health state of the disks in one disk group.
I will post an update after further investigation.
4. RE: Virtual SAN device is under permanent error

elerium
Posted Apr 30, 2016 12:15 AM

Sandisk/Seagate combo: are you by chance using Dell H730 or FD332-PERC RAID controllers? If so, you've run into the LSI 3108 firmware/driver issues that have been plaguing this discussion: "VSAN Node Crashed - R730xd - PF Exception 14 in world 33571: Cmpl-vmhba0- IP 0x41802c3abd44 addr 0x50"

The issue is worked around by adding the timeout settings that VMware support provided to you. New firmware/drivers were also released yesterday that are supposed to be a permanent fix to the problem, but you may not want to apply those yet, as the VSAN HCL only has VSAN 6.0+ versions tested with them so far.
5. RE: Virtual SAN device is under permanent error

elerium
Posted May 02, 2016 09:06 PM

Looks like the new H730 firmware/drivers are certified for all 6.* versions now:
VMware Compatibility Guide - vsanio
6. RE: Virtual SAN device is under permanent error

SebastianGrugel
Posted May 03, 2016 08:15 PM

We use a PERC H730 Mini (Embedded) controller with SanDisk (SSD) and Seagate (HDD) disks.
I will read your post today. We have a similar issue at another of our locations. Restarting the host is not a solution.
We will think about a proper solution. These timeouts are only a kind of workaround.
7. RE: Virtual SAN device is under permanent error

SebastianGrugel
Posted May 03, 2016 08:38 PM

Update after the next step:
Here's a summary of what we've done during troubleshooting with the VMware engineer:
1) Executed mount command for diskgroup with fronting SSD naa.5001e82002826808 on host XXXcmp001:
esxcli vsan storage diskgroup mount -s naa.5001e82002826808
Afterwards the Health Status was correctly displayed in the Web Client again.
2) Executed mount command for both diskgroups on host XXXmgt002:
esxcli vsan storage diskgroup mount -s naa.5001e8200282efa0
esxcli vsan storage diskgroup mount -s naa.5001e8200282f398
Afterwards the Health Status was again correctly displayed in the Web Client.
After: the disks inside the disk group are back to healthy.
The "Capacity" information is back:
We will reboot those hosts again to check whether the disk groups are still healthy after a reboot.
After the mounts, the capacity came back to our datastores:
Before the manual mount:
After the manual mount:
For now the issue is resolved, but we will check what we can do to avoid a similar situation in the future.
8. RE: Virtual SAN device is under permanent error

SebastianGrugel
Posted May 03, 2016 08:50 PM

Hi all,
For your information, here is the latest update after the VMware troubleshooting:
"I've looked through the syslog files from host XXXmgt002 that you uploaded last week and noticed that it contains the same abort messages that we've seen on host XXXcmp001 before the diskgroup failed:

grep ABORT messages-2016-04-2*
messages-2016-04-26:Apr 26 23:46:06 192.168.110.12 vmkernel: cpu12:33105)megasas: ABORT sn 12997396290 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:09 192.168.110.12 vmkernel: cpu15:27933954)megasas: ABORT sn 12997396548 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:11 192.168.110.12 vmkernel: cpu9:27933955)megasas: ABORT sn 12997396506 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:13 192.168.110.12 vmkernel: cpu27:27933984)megasas: ABORT sn 12997396556 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:15 192.168.110.12 vmkernel: cpu6:27933985)megasas: ABORT sn 12997396475 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:17 192.168.110.12 vmkernel: cpu25:27933986)megasas: ABORT sn 12997396507 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:19 192.168.110.12 vmkernel: cpu19:27933987)megasas: ABORT sn 12997396521 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:35 192.168.110.12 vmkernel: cpu4:27934057)megasas: ABORT sn 12997396513 cmd=0x28 retries=0 tmo=0

That was on the night from 26/04/2016 to 27/04/2016, so just before the morning when you noticed the issue.
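
When the syslog files are already on a workstation, the same search as the grep above can be summarized per host and day. This is a minimal Python sketch, assuming the line layout shown in the grep output; `summarize_aborts` is a hypothetical helper name and the pattern is not guaranteed to match other megasas driver versions.

```python
import re
from collections import Counter

ABORT_RE = re.compile(
    r'(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) vmkernel: .*?megasas: ABORT sn (?P<sn>\d+) '
    r'cmd=(?P<cmd>0x[0-9a-f]+)'
)


def summarize_aborts(lines):
    """Count megasas ABORT entries per (host, date, SCSI command)."""
    counts = Counter()
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            date = f'{m.group("month")} {m.group("day")}'
            counts[(m.group("host"), date, m.group("cmd"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'messages-2016-04-26:Apr 26 23:46:06 192.168.110.12 vmkernel: '
        'cpu12:33105)megasas: ABORT sn 12997396290 cmd=0x28 retries=0 tmo=0',
        'messages-2016-04-26:Apr 26 23:46:09 192.168.110.12 vmkernel: '
        'cpu15:27933954)megasas: ABORT sn 12997396548 cmd=0x28 retries=0 tmo=0',
    ]
    for key, n in summarize_aborts(sample).items():
        print(n, key)
```

A sudden cluster of aborts on one host shortly before a disk group goes offline is the pattern described here; the counts only make that pattern easier to spot across several hosts.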

I've checked the mentioned configuration settings from KB article http://kb.vmware.com/kb/2144936 and they haven't been applied on that host either yet:

/config/LSOM/intOpts/> get diskIoTimeout
Vmkernel Config Option {
   Default value:20000
   Min value:100
   Max value:120000
   Current value:20000
   hidden config option:1
   Description:Disk IO timeout in msec
}
/config/LSOM/intOpts/> get diskIoRetryFactor
Vmkernel Config Option {
   Default value:3
   Min value:1
   Max value:100
   Current value:3
   hidden config option:1
   Description:Disk IO retry factor
}

So that KB article has to be applied on that host too (and on any other host that still has those settings at their default values and uses that RAID controller).

Nevertheless, I think it's still a good idea to have Dell check for hardware errors on the RAID controller (due to sense key 0x4 reported on the morning of 27/04/2016), just to be on the safe side."
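
Checking many hosts for these two settings by eye is error-prone, so the comparison can be scripted once the option output has been captured as text. The sketch below is only an illustration: `parse_config_option` and `check_option` are hypothetical names, the parser assumes exactly the "Vmkernel Config Option { ... }" layout quoted above, and the recommended value is the one VMware support gave earlier in this thread (100000 for diskIoTimeout).

```python
import re

FIELD_RE = re.compile(
    r'\s*(Default value|Min value|Max value|Current value):\s*(-?\d+)'
)


def parse_config_option(text):
    """Parse one 'Vmkernel Config Option' block into a dict of ints."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            # "Default value" -> "default", "Current value" -> "current", ...
            fields[m.group(1).split()[0].lower()] = int(m.group(2))
    return fields


def check_option(text, recommended):
    """Compare the current value with the default and a recommended value."""
    opt = parse_config_option(text)
    return {
        "current": opt["current"],
        "at_default": opt["current"] == opt["default"],
        "recommended_in_range": opt["min"] <= recommended <= opt["max"],
        "needs_change": opt["current"] != recommended,
    }


if __name__ == "__main__":
    disk_io_timeout = """Vmkernel Config Option {
   Default value:20000
   Min value:100
   Max value:120000
   Current value:20000
   hidden config option:1
   Description:Disk IO timeout in msec
}"""
    print(check_option(disk_io_timeout, recommended=100000))
```

The range check is included because the recommended timeout (100000 ms) sits close to the option's maximum of 120000 ms, so a typo with an extra digit would be out of range.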
9. RE: Virtual SAN device is under permanent error

elerium
Posted May 03, 2016 09:50 PM

Sounds like you were able to remount the disk groups without a restart. If so, then it's not the RAID controller/firmware issue I mentioned (those cases result in a crash where only a host restart corrects the problem; a remount wouldn't work).
You may indeed have a problematic or failing disk in your disk group. You can read more about why VSAN would unmount a disk group, and the possible options, here:
VMware KB: VMware Virtual SAN 6.1 or 5.5 Update 3 Disk Groups show as Unmounted in the vSphere Web Client (DDH)
VSAN 6.1 New Feature - Handling of Problematic Disks - CormacHogan.com
VSAN 6.2 Part 10 - Problematic Disk Handling - CormacHogan.com
10. RE: Virtual SAN device is under permanent error

elerium
Posted May 03, 2016 09:51 PM

Actually, I just noticed you're on 5.5 U2 so these links may not apply as they are for 5.5 U3 changes to unmount handling.
11. RE: Virtual SAN device is under permanent error

SebastianGrugel
Posted May 04, 2016 11:19 AM

I double-checked this, and I made a mistake at the beginning.
According to the information at https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1014508, we are running:
vCenter build: 21821111 - vCenter: vCenter Server 5.5 Update 2b
ESXi build 3116895 - ESXi 5.5 Update 3a (Express Patch 8)
