vSAN1

  • 1.  vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 29, 2023 12:34 PM

    A Dell PowerEdge R730xd is part of a vSAN cluster running a fairly recent version (7.0.3). The vSAN interface reports a permanent disk failure on it, but the server's iDRAC (out-of-band) interface does not report an issue with any disk.
    Is this a known hardware problem or a vSAN bug? Is there any documentation that explains this behaviour? Thanks



  • 2.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 29, 2023 12:39 PM

    iDRAC neither detects nor reacts to conditions such as medium errors in the metadata region, issues detected in PSA, vSAN DDH, or impending failure flagged in SMART fields, whereas ESXi and/or vSAN does and marks the disk(s) offline accordingly. Can you please retrieve and attach the vmkernel.log from this host so we can see why it was marked as offline? If it has rolled over, it should be in /var/run/log/vmkernel.0.gz (or vmkernel.1.gz etc. if longer ago).
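
    To narrow down the relevant entries before attaching the logs, something like the following might help. It is only a sketch: the function name `search_vmkernel` and the device identifier in the example are made up for illustration, and it assumes the rotated logs sit next to vmkernel.log as gzip archives, as described above.

```shell
# Search the live and rotated vmkernel logs for lines mentioning one device.
# Arguments: DEVICE_ID (required), LOG_DIR (optional, defaults to /var/run/log).
search_vmkernel() {
    device_id="$1"
    log_dir="${2:-/var/run/log}"

    # Live log first, if it exists.
    if [ -f "$log_dir/vmkernel.log" ]; then
        grep -F -- "$device_id" "$log_dir/vmkernel.log"
    fi

    # Rotated logs are gzip-compressed (vmkernel.0.gz, vmkernel.1.gz, ...).
    for archive in "$log_dir"/vmkernel.*.gz; do
        [ -f "$archive" ] || continue
        zcat "$archive" | grep -F -- "$device_id"
    done
    return 0
}

# Example call (the identifier is hypothetical, use the one of the failed disk):
# search_vmkernel naa.5000c500a1b2c3d4
```

    The matching lines can then be attached instead of the full log, although the full log is still the safer thing to keep.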



  • 3.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 29, 2023 03:01 PM

    Just checked, but unfortunately the rotated vmkernel.log archive is already gone (the failure happened 3 days ago).
    I think a server reboot will reset the alarm, and since it will occur again we can grab the vmkernel.log sooner next time.
    So what fix does VMware propose in such cases? The hardware vendor does not replace disks for a bad sector.
    Is there any way to demonstrate the disk issue so that we can get a replacement, or is the vmkernel.log the only way to go?
    Or is there an option to prevent a disk from being marked offline by vSAN until it is really dead (from the iDRAC/iLO/OOB point of view), if we prefer it that way?



  • 4.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 29, 2023 06:51 PM

    Hi,

    Besides keeping the vSphere software on a current version, did you do the same with the hardware-related components?
    I.e., which version of iDRAC do you run on that server, and have you also updated the firmware used by those SSDs?

    When I run a quick check on the Dell R730xd download site, filtering for SAS drives, it offers 88 packages.

    Here's a snippet from the most current ones.

    kastlr_0-1693334889585.png

    You should verify which firmware the drive in question uses and check the release notes if a newer firmware is available from Dell.



  • 5.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 30, 2023 08:07 AM

    Hi,

    Yes, iDRAC8 is at almost the latest firmware (2.83.83.83), just one version behind. The latest update is only rated as recommended and contains unrelated fixes.
    As for disk firmware, this server has only SEAGATE model ST2000NX0463 disks at firmware level NT31. There is a recommended upgrade to NT32 from 2018, so not really recent, but we can try the upgrade if it helps to fail the disk officially ;o)
    Is there no vSAN option to prevent the disk from being marked "dead" when that is not yet the case, or any vSAN tool result that can be shown to the manufacturer to convince them that the disk is faulty and needs replacement?

    Thank you for your feedback so far.



  • 6.  RE: vSAN + iDRAC do not agree on disk failure or not...
    Best Answer

    Posted Aug 30, 2023 08:34 AM

    Hi,

    Additionally, you could use the following commands to collect the SMART status of the SAS drive in question.

    esxcli storage core device list
    esxcli storage core device smart get -d <device identifier of SSD in question>

     



  • 7.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 30, 2023 03:20 PM

    FYI, this is what I get. Does vSAN remove the disk as soon as there is a write or read error count?
    Is there a threshold? It shows N/A below, as you can see.

    1000ouzh_0-1693408715415.png

     



  • 8.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Aug 30, 2023 03:59 PM

    Looks like this SSD reports more than 1.2 billion read errors.

    When you compare these numbers with the stats of the remaining SSDs in your host/cluster, do they look similar?
    As I assume that most of your SSDs were bought at the same time, comparing the numbers would show whether this SSD is way above the others.
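
    For the comparison, a small loop over all devices on the host may save some typing. This is a rough sketch rather than a tested tool: the function name `smart_all_devices` is made up, and it assumes that device identifiers in the list output start at the beginning of the line with one of the prefixes naa., t10., eui. or mpx., while their properties are indented.

```shell
# Print the SMART data of every storage device on this host, one after the
# other, so the values can be compared side by side.
smart_all_devices() {
    # Assumption: identifier lines start at column 0 with a known prefix;
    # the indented property lines below each identifier are skipped.
    esxcli storage core device list \
        | grep -E '^(naa|t10|eui|mpx)\.' \
        | while read -r dev; do
            echo "=== $dev"
            # Redirect stdin so the inner call cannot consume the device list.
            esxcli storage core device smart get -d "$dev" < /dev/null
        done
}

# Run on the ESXi host:
# smart_all_devices
```

    Devices that do not expose SMART data may just return an error or N/A values, so expect some noise in the output.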



  • 9.  RE: vSAN + iDRAC do not agree on disk failure or not...

    Posted Sep 01, 2023 07:35 AM

    The first 2 you see below are SSDs (boot), but all the rest are standard HDDs, and yes, all were bought at the same time.
    Definitely seeing some odd SMART values for the drive in question compared to the others, so we will try to get it replaced even though it is not showing as dead. Let's see.
    Thank you !
    BRA.disks.2.png