VMware vSphere

WSFC Failing when expanded non-shared disk

  • 1.  WSFC Failing when expanded non-shared disk

    Posted Apr 27, 2020 05:07 PM

    I have a two-node Windows Server Failover Cluster (WSFC). The nodes run Windows Server 2012 R2, each on a different ESXi 6.5U3 host.

    Each node is configured with two local disks, C: and D:, and a shared disk, E:.

    Disks C: and D: are on a VMFS volume connected via FC and attached to an LSI Logic SAS SCSI controller (SCSI0) set to Not Shared: C: is SCSI0:0 and D: is SCSI0:1.

    Disk E: is an RDM accessed over FC and connected to a Paravirtual SCSI controller (SCSI1) configured with Physical SCSI Bus Sharing. The RDM was attached to one node in the cluster, and the other node was configured by adding an existing disk and selecting the RDM pointer from the first node. Disk E: is SCSI1:0 on both VMs.

    There is a Role configured on the Windows Cluster containing the E: drive as a Storage resource and a number of Windows Services.

    When I increase the size of the non-shared D: drive on the node that is currently running the role, the E: drive resource fails on the cluster, taking the services offline because they depend on the E: drive.

    [HKLM]\SYSTEM\CurrentControlSet\Services\disk\TimeoutValue is set to 190 on both nodes.

    I also have the following two settings on both VMs:

    scsi0.returnNoConnectDuringAPD = "TRUE"

    scsi0.returnBusyOnNoConnectStatus = "FALSE"
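
    As posted, both settings carry the scsi0 prefix, while the shared RDM sits on SCSI1. Whether that matters for this failure is not established anywhere in this thread. The following is only a minimal sketch for checking which controllers a set of such settings names; the helper name and the inline sample text are illustrative, not part of any VMware tooling.

```python
import re

# The two settings exactly as listed in the post (with plain quotes).
SETTINGS = '''\
scsi0.returnNoConnectDuringAPD = "TRUE"
scsi0.returnBusyOnNoConnectStatus = "FALSE"
'''

# Matches lines of the form: scsiN.someKey = "value"
LINE = re.compile(r'^\s*(scsi\d+)\.(\w+)\s*=\s*"([^"]*)"\s*$')


def settings_by_controller(text):
    """Group scsiN.* key/value lines by their controller prefix."""
    result = {}
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match:
            controller, key, value = match.groups()
            result.setdefault(controller, {})[key] = value
    return result


if __name__ == "__main__":
    parsed = settings_by_controller(SETTINGS)
    for controller in ("scsi0", "scsi1"):
        # An empty dict means no such setting was listed for that controller.
        print(controller, parsed.get(controller, {}))
```

    Run against the two lines above, the sketch lists entries for scsi0 only, which simply restates what was posted.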

    The sequence of events is:

    Increase the size of Hard Disk 2 (D:) on the VM.

    In the vmkernel.log of the ESXi host that the VM is running on, I see:

    2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8192(vscsi0:0):Destroying Device for world 73030 (pendCom 0)

    2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8193(vscsi0:1):Destroying Device for world 73030 (pendCom 0)

    2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8194(vscsi1:0):Destroying Device for world 73030 (pendCom 0)

    2020-04-24T16:38:42.905Z cpu2:73226)VSCSI: 3801: handle 8195(vscsi0:0):Creating Virtual Device for world 73030 (FSS handle 5640191) numBlocks=125829120 (bs=512)

    2020-04-24T16:38:42.905Z cpu2:73226)VSCSI: 273: handle 8195(vscsi0:0):Input values: res=0 limit=-1 bw=-1 Shares=-1

    2020-04-24T16:38:42.906Z cpu2:73226)VSCSI: 3801: handle 8196(vscsi0:1):Creating Virtual Device for world 73030 (FSS handle 4067328) numBlocks=71303168 (bs=512)

    2020-04-24T16:38:42.906Z cpu2:73226)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000

    2020-04-24T16:38:42.910Z cpu2:73226)VSCSI: 3801: handle 8197(vscsi1:0):Creating Virtual Device for world 73030 (FSS handle 5574657) numBlocks=62926605 (bs=512)

    2020-04-24T16:38:42.910Z cpu2:73226)VSCSI: 273: handle 8197(vscsi1:0):Input values: res=0 limit=-1 bw=-1 Shares=-1

    2020-04-24T16:38:42.912Z cpu14:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000

    2020-04-24T16:38:42.914Z cpu14:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000

    2020-04-24T16:38:42.914Z cpu14:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000

    2020-04-24T16:38:42.919Z cpu20:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000

    2020-04-24T16:38:42.920Z cpu20:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000

    2020-04-24T16:38:42.920Z cpu20:73029)VSCSI: 273: handle 8196(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
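
    The excerpt above shows handles for all three virtual disks, including vscsi1:0 (the shared RDM), being destroyed and recreated within the same quarter of a second. A minimal sketch for summarising that from the log text follows; the regular expression is written against the lines quoted here only and is an assumption about the format, not a general vmkernel.log parser.

```python
import re
from collections import defaultdict

# Destroy/create lines copied from the vmkernel.log excerpt in this post.
LOG = """\
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8192(vscsi0:0):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8193(vscsi0:1):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.655Z cpu1:73226)VSCSI: 6590: handle 8194(vscsi1:0):Destroying Device for world 73030 (pendCom 0)
2020-04-24T16:38:42.905Z cpu2:73226)VSCSI: 3801: handle 8195(vscsi0:0):Creating Virtual Device for world 73030 (FSS handle 5640191) numBlocks=125829120 (bs=512)
2020-04-24T16:38:42.906Z cpu2:73226)VSCSI: 3801: handle 8196(vscsi0:1):Creating Virtual Device for world 73030 (FSS handle 4067328) numBlocks=71303168 (bs=512)
2020-04-24T16:38:42.910Z cpu2:73226)VSCSI: 3801: handle 8197(vscsi1:0):Creating Virtual Device for world 73030 (FSS handle 5574657) numBlocks=62926605 (bs=512)
"""

# Captures the handle, the device name, the action, and (for create
# lines only) the block count and block size.
EVENT = re.compile(
    r"handle (?P<handle>\d+)\((?P<dev>vscsi\d+:\d+)\):"
    r"(?P<action>Destroying Device|Creating Virtual Device)"
    r"(?:.*?numBlocks=(?P<blocks>\d+) \(bs=(?P<bs>\d+)\))?"
)


def summarize(log_text):
    """Return, per device, the destroyed/created handles and the new size."""
    devices = defaultdict(
        lambda: {"destroyed": [], "created": [], "size_bytes": None}
    )
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue
        entry = devices[match["dev"]]
        if match["action"].startswith("Destroying"):
            entry["destroyed"].append(int(match["handle"]))
        else:
            entry["created"].append(int(match["handle"]))
            if match["blocks"]:
                entry["size_bytes"] = int(match["blocks"]) * int(match["bs"])
    return dict(devices)


if __name__ == "__main__":
    for dev, info in sorted(summarize(LOG).items()):
        size_gib = info["size_bytes"] / 2**30 if info["size_bytes"] else None
        print(dev, info["destroyed"], "->", info["created"], size_gib)
```

    The summary only makes the observation explicit: the device on the shared controller gets a new handle along with the two local disks. It does not by itself explain why the cluster disk resource fails.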

    In the Windows System Event Log I see the following:

    FailoverClustering Event ID 1038 Physical Disk Resource
    Ownership of cluster disk ‘Cluster Disk’ has been unexpectedly lost by this node. Run the Validate a Configuration Wizard to check your storage configuration

    FailoverClustering Event ID 1069 Resource Control Manager
    Cluster Resource ‘Cluster Disk’ of type ‘Physical Disk’ in clustered role ‘MyRole’ failed.

    All of the services configured on the cluster role stop.

    About 30 seconds later, I get the following in the Windows System Event Log:

    Ntfs (Microsoft-Windows-Ntfs) Event ID 98
    Volume E: (\Device\HarddiskVolume5) is healthy. No action needed

    The Cluster Disk resource comes back online and the services start up again, but by then my services have had an outage. I can reproduce this 100% of the time. Any ideas why this is happening? I have two environments, one based on ESXi 6.0 and the other on ESXi 6.5, and I get the same symptoms on both.



  • 2.  RE: WSFC Failing when expanded non-shared disk

    Posted Jun 13, 2022 03:58 PM

    Hi,

    Did you solve the problem?

    Thanks

    Lorenzo