I have a two-node Windows Server Failover Cluster (WSFC). The nodes are running Windows Server 2012 R2, and they are on different ESXi 6.5 U3 hosts.
Each node is configured with two local disks C: and D: and a shared disk E:.
Disks C: and D: are on a VMFS volume connected via FC and attached to an LSI Logic SAS SCSI controller (SCSI0) set to Not Shared. C: is SCSI0:0 and D: is SCSI0:1.
Disk E: is an RDM accessed over FC and connected to a Paravirtual SCSI controller (SCSI1) configured with Physical SCSI Bus Sharing. The RDM was attached to one node in the cluster, and the other node was configured by adding an existing disk and selecting the RDM pointer from the first node. Disk E: is SCSI1:0 on both VMs.
There is a Role configured on the Windows Cluster containing the E: drive as a Storage resource and a number of Windows Services.
When I increase the size of the non-shared D: drive on the node that is currently running the role, the E: drive resource fails on the cluster. This takes the services offline because they depend on the E: drive.
[HKLM]\SYSTEM\CurrentControlSet\Services\disk\TimeoutValue is set to 190 on both nodes.
I also have the following two settings on both VMs:
scsi0.returnNoConnectDuringAPD = "TRUE"
scsi0.returnBusyOnNoConnectStatus = "FALSE"
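To make it explicit which controller these two settings apply to, here is a minimal Python sketch that groups .vmx-style lines by their controller prefix. The function name `settings_by_controller`, the regular expression, and the `SHARED_CONTROLLER` constant are only illustrations and not part of any VMware tool. The sketch assumes the settings are written with straight quotes, as they would be in a .vmx file.

```python
import re

# The two settings from the post, written with straight quotes.
VMX_LINES = '''\
scsi0.returnNoConnectDuringAPD = "TRUE"
scsi0.returnBusyOnNoConnectStatus = "FALSE"
'''

# Hypothetical pattern for lines of the form: scsiN.key = "value"
SETTING = re.compile(r'^\s*(scsi\d+)\.(\w+)\s*=\s*"([^"]*)"\s*$')

def settings_by_controller(text):
    """Group per-controller settings by their scsiN prefix."""
    result = {}
    for line in text.splitlines():
        match = SETTING.match(line)
        if match:
            controller, key, value = match.groups()
            result.setdefault(controller, {})[key] = value
    return result

parsed = settings_by_controller(VMX_LINES)

# Per the description above, the RDM (E:) sits on SCSI1.
SHARED_CONTROLLER = "scsi1"

for controller, options in parsed.items():
    print(controller, options)
print("settings on", SHARED_CONTROLLER, "->", parsed.get(SHARED_CONTROLLER, {}))
```

With the two lines above, both settings land under `scsi0` and nothing is listed for `scsi1`, the controller that carries the shared RDM. Whether that has any bearing on the failure is not something this sketch can show.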
The sequence of events is:
Increase the size of Hard Disk 2 (D:) on the VM.
In the vmkernel.log of the ESXi host the VM is running on, I see:
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8192(vscsi0:0):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8193(vscsi0:1):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8194(vscsi1:0):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.905Z cpu2:73226)VSCSI: 3801: handle 8195(vscsi0:0):Creating Virtual Device for world 73030 (FSS handle 5640191) numBlocks=125829120 (bs=512)
2020-04-24T16:38:42.905Z cpu2:73226)VSCSI: 273: handle 8195(vscsi0:0):Input values: res=0 limit=-1 bw=-1 Shares=-1
2020-04-24T16:38:42.906Z cpu2:73226)VSCSI: 3801: handle 8196(vscsi0:1):Creating Virtual Device for world 73030 (FSS handle 4067328) numBlocks=71303168 (bs=512)
2020-04-24T16:38:42.906Z cpu2:73226)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2020-04-24T16:38:42.910Z cpu2:73226)VSCSI: 3801: handle 8197(vscsi1:0):Creating Virtual Device for world 73030 (FSS handle 5574657) numBlocks=62926605 (bs=512)
2020-04-24T16:38:42.910Z cpu2:73226)VSCSI: 273: handle 8197(vscsi1:0):Input values: res=0 limit=-1 bw=-1 Shares=-1
2020-04-24T16:38:42.912Z cpu14:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2020-04-24T16:38:42.914Z cpu14:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2020-04-24T16:38:42.914Z cpu14:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2020-04-24T16:38:42.919Z cpu20:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2020-04-24T16:38:42.920Z cpu20:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2020-04-24T16:38:42.920Z cpu20:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
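To make the log excerpt easier to read, here is a minimal Python sketch that pairs each "Destroying Device" entry with its matching "Creating Virtual Device" entry and converts numBlocks to a size. The function name `summarize` and the two regular expressions are my own illustration, written against the line format shown above only; they are not an official vmkernel.log parser.

```python
import re

# Six lines copied from the vmkernel.log excerpt above.
LOG = """\
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8192(vscsi0:0):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8193(vscsi0:1):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8194(vscsi1:0):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.905Z cpu2:73226)VSCSI: 3801: handle 8195(vscsi0:0):Creating Virtual Device for world 73030 (FSS handle 5640191) numBlocks=125829120 (bs=512)
2020-04-24T16:38:42.906Z cpu2:73226)VSCSI: 3801: handle 8196(vscsi0:1):Creating Virtual Device for world 73030 (FSS handle 4067328) numBlocks=71303168 (bs=512)
2020-04-24T16:38:42.910Z cpu2:73226)VSCSI: 3801: handle 8197(vscsi1:0):Creating Virtual Device for world 73030 (FSS handle 5574657) numBlocks=62926605 (bs=512)
"""

# Patterns assumed from the excerpt's format.
DESTROY = re.compile(r"\((vscsi\d+:\d+)\):Destroying Device")
CREATE = re.compile(
    r"\((vscsi\d+:\d+)\):Creating Virtual Device.*numBlocks=(\d+) \(bs=(\d+)\)"
)

def summarize(log_text):
    """Return (destroyed device list, {device: size in bytes after recreation})."""
    destroyed = []
    created = {}
    for line in log_text.splitlines():
        match = DESTROY.search(line)
        if match:
            destroyed.append(match.group(1))
            continue
        match = CREATE.search(line)
        if match:
            device, blocks, block_size = match.groups()
            created[device] = int(blocks) * int(block_size)
    return destroyed, created

destroyed, created = summarize(LOG)
for device in destroyed:
    size = created.get(device)
    status = f"recreated at {size / 2**30:.2f} GiB" if size else "not recreated"
    print(f"{device}: destroyed, {status}")
```

Run against these lines, the sketch lists all three devices as destroyed and recreated within the same second, including vscsi1:0 (the shared RDM on SCSI1), even though only Hard Disk 2 (vscsi0:1) was resized.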
In the Windows System Event Log I see the following:
FailoverClustering Event ID 1038 Physical Disk Resource
Ownership of cluster disk ‘Cluster Disk’ has been unexpectedly lost by this node. Run the Validate a Configuration Wizard to check your storage configuration
FailoverClustering Event ID 1069 Resource Control Manager
Cluster Resource ‘Cluster Disk’ of type ‘Physical Disk’ in clustered role ‘MyRole’ failed.
All of the services configured on the cluster role stop.
About 30 seconds later, I get the following in the Windows System Event Log:
Ntfs (Microsoft-Windows-Ntfs) Event ID 98
Volume E: (\Device\HarddiskVolume5) is healthy. No action needed
The Cluster Disk resource comes back online and the services start up again, but by then my services have had an outage. I can reproduce this 100% of the time. Any ideas why this is happening? I have two environments, one based on ESXi 6.0 and the other on ESXi 6.5, and I get the same symptoms on both.