VMware vSphere

  • 1.  ipmi sensor number mapping with DIMM module

    Posted Jul 07, 2021 06:18 AM

    Hi

    I am trying to detect correctable memory errors in the DIMM modules of my servers, which run ESXi 6.5.

    I ran the following esxcli command to detect the errors:

    -----------------------

    esxcli hardware ipmi sel list | grep -B5 -A 3 -i -E "memory|correctable"

    Record:390
       Record Id: 390
       When: 2019-02-28T01:08:16
       Event Type: 111 (Unknown)
       SEL Type: 2 (System Event)
       Message: Assert + Memory Correctable ECC
       Sensor Number: 83
       Raw:
       Formatted-Raw:
    --
    Record:393
       Record Id: 393
       When: 2019-04-25T06:29:14
       Event Type: 111 (Unknown)
       SEL Type: 2 (System Event)
       Message: Assert + Memory Correctable ECC
       Sensor Number: 83
       Raw:
       Formatted-Raw:

    -------------------------

    It shows two events reported by sensor number 83. How can I use this information to find out which memory module (the actual slot number) they occurred in?

    Basically, how can I map the sensor number from the command output above to a DIMM slot, e.g. DIMMA1?

    Thank you

    Dee



  • 2.  RE: ipmi sensor number mapping with DIMM module
    Best Answer

    Posted Jul 07, 2021 02:05 PM

    Hello.
    A standard server has a hardware management interface that is generically known as IPMI. Depending on the vendor, it goes by names such as IMM, BMC, XClarity, iLO and more.
    The IPMI has a dedicated (labeled) network port that by default is configured to obtain an IP address from a DHCP service; it can also be given a fixed IP address from the server's UEFI (BIOS) setup.

    If you have access to the IPMI of your server, you can find more details about the reported memory event there.

    What make/model of server do you have?
    If it is an IBM or Lenovo server, you can get a lot of hardware data online using the DSA tool.

    Memory Correctable ECC events are not considered serious errors, but a count is kept (PFA), and when it exceeds the limit defined by the manufacturer, it is recommended to plan a replacement.

     



  • 3.  RE: ipmi sensor number mapping with DIMM module

    Posted Jul 08, 2021 12:40 AM

    Hi e_espinel,

    Thank you for the response.

    I have Dell PowerEdge and HP servers.

    | Re: Memory Correctable ECC events are not considered serious errors, but a count is kept (PFA) that when exceeding the limit defined by the manufacturer it is recommended to plan the change.

    Yes, that is exactly what I am trying to monitor: how many times the correctable error was reported. To do that I run the command:

    esxcli hardware ipmi sel list 

    Record:390
       Record Id: 390
       When: 2019-02-28T01:08:16
       Event Type: 111 (Unknown)
       SEL Type: 2 (System Event)
       Message: Assert + Memory Correctable ECC
       Sensor Number: 83
       Raw:
       Formatted-Raw:

    There were more events like this....

    This tells me that a correctable ECC memory event happened at the given date and time, but not which memory module it happened in. It only says Sensor Number: 83. Is there any command or CLI tool that can tell me which memory module this sensor number belongs to? I have multiple DIMM modules in my server.
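
For the counting part of this question, here is a minimal Python sketch that tallies correctable ECC events per sensor number from saved `esxcli hardware ipmi sel list` output. It assumes the text layout is exactly as pasted in this thread (a `Record:NNN` line followed by indented `Key: value` lines). The function names `parse_sel` and `count_correctable` are made up for illustration, and the sketch does not resolve a sensor number to a DIMM slot.

```python
import re
from collections import Counter

# Sample text copied from the output shown in this thread, including the
# "--" separator that grep inserts between matches.
SAMPLE = """\
Record:390
   Record Id: 390
   When: 2019-02-28T01:08:16
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Memory Correctable ECC
   Sensor Number: 83
   Raw:
   Formatted-Raw:
--
Record:393
   Record Id: 393
   When: 2019-04-25T06:29:14
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Memory Correctable ECC
   Sensor Number: 83
   Raw:
   Formatted-Raw:
"""


def parse_sel(text):
    """Split SEL list text into one dict of fields per record."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if re.fullmatch(r"Record:\s*\d+", line):
            # A bare "Record:NNN" line starts a new record.
            current = {}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def count_correctable(records):
    """Count correctable ECC events, keyed by sensor number (as a string)."""
    counts = Counter()
    for rec in records:
        message = rec.get("Message", "").lower()
        # "uncorrectable" contains "correctable", so exclude it explicitly.
        if "correctable" in message and "uncorrectable" not in message:
            counts[rec.get("Sensor Number", "unknown")] += 1
    return counts


if __name__ == "__main__":
    for sensor, n in count_correctable(parse_sel(SAMPLE)).items():
        print(f"sensor {sensor}: {n} correctable ECC event(s)")
```

To use it on real data, the command output could be saved to a file and its contents passed to `parse_sel` in place of `SAMPLE`.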

    Thank you so much