ESXi

  • 1.  How to identify a failing physical SSD device

    Posted Aug 16, 2023 04:33 PM

    As part of my planned preventative maintenance, I'm looking to be able to identify SSD devices that will need to be replaced, prior to a failure.  In essence, I'd like to access predicted failure information.

    Our ESXi installation itself is running on RAID-1, but the main VM host area has no RAID.

    I'm running ESXi 7.0U3 on Dell XR12.

    Any pointers to documents or KB articles would be welcomed as I've not had that much success finding anything.



  • 2.  RE: How to identify a failing physical SSD device

    Posted Aug 17, 2023 02:00 AM

    You can use the hardware tab in ESXi to check the status of the underlying hardware. Better still, use iDRAC to check it.

    Regards,

    Sachchidanand



  • 3.  RE: How to identify a failing physical SSD device

    Posted Aug 17, 2023 05:47 AM

    Hello,

    You can check the hardware health status of ESXi servers from the GUI or the CLI. Here are some references:

    From vSphere Client: Monitor Hardware Health Status in the vSphere Client 

    From CLI: KB 2040405 



  • 4.  RE: How to identify a failing physical SSD device

    Posted Aug 17, 2023 07:54 AM

    Hey! Identifying a failing SSD is crucial to ensure data safety and maintain a smooth operating environment, especially in an ESXi environment.

    For Dell servers, the iDRAC interface is a valuable tool. It often provides predictive failure alerts for storage devices. Here's what you can do:

    Dell iDRAC: Log into the iDRAC web interface and navigate to the hardware section to check the status of the SSDs. Any issues are typically flagged, including predictive failures.

    ESXi: From your ESXi host, you can utilize the esxcli command to fetch storage device information. Here's a quick command:

    Code
    esxcli storage core device list    # find the device identifier (e.g. naa.*)
    esxcli storage core device smart get -d <device_id>

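    Look for attributes such as 'Health Status', 'Media Wearout Indicator', 'Reallocated Sector Count', 'Program Fail Count', etc. (not every drive reports every attribute; unsupported ones show N/A). The output lists a Value, Threshold and Worst column for each attribute, and a value that drifts significantly from its usual level toward its threshold can hint at an impending SSD failure.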
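
    As a rough illustration of that check, here is a minimal Python sketch that parses the table printed by the SMART command and flags attributes sitting close to their threshold. Everything in it is an assumption for illustration: the function names, the `margin` parameter and the sample numbers are made up, and it assumes normalized SMART values where lower is worse and a threshold of 0 means "informational only". Drive vendors differ, so treat it as a starting point rather than a definitive health check.

```python
import re

# Illustrative sample only: the layout mimics the table printed by
# `esxcli storage core device smart get`, but the numbers are made up.
SAMPLE_OUTPUT = """\
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       12     10         12
Reallocated Sector Count      100    36         100
Drive Temperature             33     0          53
"""


def parse_smart_table(text):
    """Return {parameter: {"value": ..., "threshold": ..., "worst": ...}}."""
    rows = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        cells = re.split(r"\s{2,}", line.strip())
        # Skip blank lines, the header row and the dashed separator row.
        if len(cells) != 4 or cells[0] == "Parameter" or set(cells[0]) == {"-"}:
            continue
        name, value, threshold, worst = cells
        rows[name] = {"value": value, "threshold": threshold, "worst": worst}
    return rows


def flag_attributes(rows, margin=5):
    """List attributes whose value is within `margin` of their threshold."""
    warnings = []
    health = rows.get("Health Status", {}).get("value")
    if health is not None and health != "OK":
        warnings.append("Health Status is " + health)
    for name, row in rows.items():
        if not (row["value"].isdigit() and row["threshold"].isdigit()):
            continue  # skip N/A and other non-numeric cells
        value, threshold = int(row["value"]), int(row["threshold"])
        # Assumption: a threshold of 0 marks an informational attribute.
        if threshold > 0 and value <= threshold + margin:
            warnings.append(f"{name}: value {value} is near threshold {threshold}")
    return warnings


if __name__ == "__main__":
    for warning in flag_attributes(parse_smart_table(SAMPLE_OUTPUT)):
        print(warning)
```

    To use it against a real drive, replace SAMPLE_OUTPUT with the captured text of the SMART command for that device, and pick a `margin` that matches how early you want to be warned.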

    vCenter Server: If you're using vCenter, it might provide alerts and notifications related to hardware health, including SSD status.

    Dell OMSA (OpenManage Server Administrator): This tool provides a comprehensive health status of Dell server components, including SSDs. If it's installed on your ESXi host, it can be used to monitor hardware health.

    Finally, for detailed procedures and potential alarms, check Dell's official documentation or VMware's Knowledge Base articles. Dell's community forums can also be a valuable resource, as many administrators share their experiences and solutions there.

    Remember, while predictive failures give you a heads-up, it's always a good idea to maintain regular backups of crucial data.

    Hope this helps, and wishing you smooth maintenance!
    Cheers,
    Ansar



  • 5.  RE: How to identify a failing physical SSD device

    Posted Aug 17, 2023 09:05 AM

    Hello,


    As Sachchidanand (and others) already told you, I also think the better option is to use iDRAC, ideally setting it up to send you alarms/alerts for a whole range of events related to the underlying hardware. There is more than one method available for this.


    In practice, however, this information can be somewhat misleading, because the time between the detection of conditions that "could lead to a malfunction" and the actual "malfunction" can be too short to intervene proactively. It has happened to me on a couple of occasions that not even ten minutes passed between the "alarm" and the subsequent "fault". In a RAID array (or another inherently reliable design) it is different.


    Regards,
    Ferdinando



  • 6.  RE: How to identify a failing physical SSD device

    Posted Aug 17, 2023 04:31 PM

    Better to use esxcli commands, of course.



  • 7.  RE: How to identify a failing physical SSD device

    Posted Aug 17, 2023 08:14 PM

    Hello,


    Sorry, but I don't really agree with this. Receiving a warning that something "may go wrong" is quite different from finding out when the "failure has already occurred". I don't check every day, repeatedly and at regular intervals, via the command line what a host running ESXi might have to tell me. I do that, if at all, when my monitoring systems warn me of a (possible) unfortunate event.


    That said, everyone manages their infrastructure as they see fit (for their own good reasons), and I will never question that.


    Regards,
    Ferdinando