DX Unified Infrastructure Management

  • 1.  HW monitoring

    Posted Jan 11, 2008 03:37 AM

    Hi there,

    A while ago I put a question to Nimsoft support regarding HW monitoring on servers:

    Hello
    What is the best way to monitor HW according to Nimsoft? I chose the cim_trap component, since this seems to be one way to do it on HP Compaq servers. I haven't tried it yet, though. But in general, since we might use other brands:
    Are there any other effective ways to do this and still keep the cooperation with NimBUS?
    We need to detect RAID errors, disk problems, fan failures, etc.

    This was the answer:

    The best way to monitor H/W devices, according to Nimsoft, is to use the SNMP-based probes. The probes using SNMP are SNMPGET, SNMPGTW, SNMPTD and INTERFACE_TRAFFIC. You may also use cim_trap, which converts traps from Compaq messages into NimBUS alarms.

    Well, I have now tried the NimBUS way of HW monitoring: snmptd with the cim_trap extension!

    OK, I must say that the problem descriptions added to the alarms are a bit short!! :(

    For example, I removed a disk from the RAID on a lab server, and yes, an alarm was created: a critical one saying "Status is now 3". Hmm, OK!

    Let's say this alarm is received by a "stressed tech" or viewed by the operations NOC: what is status 3? On what?

    After a bit of investigation, there was one hint in the alarm as seen through the NimBUS Manager:

    Suppression key snmptd/cpqDa6LogDvrStatusChange..

    OK, they might identify this as drive or storage related.

    Hmm, OK, let's look at the Event Viewer.

    This is what I can see:

    “Drive Array Logical Drive Status Change.  Logical drive number 1 on the array controller in Slot 4 has a new status of 3.

    (Logical Drive status values: 1=other, 2=ok, 3=failed, 4=unconfigured, 5=recovering, 6=readyForRebuild, 7=rebuilding, 8=wrongDrive, 9=badConnect, 10=overheating, 11=shutdown, 12=expanding, 13=notAvailable, 14=queuedForExpansion)”
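
The status table in the event log message can double as a lookup when building a more readable alarm text. Below is a minimal Python sketch of that idea. It is not part of snmptd or cim_trap, and the helper name `describe_logical_drive_status` is hypothetical. It only assumes that the drive number, controller slot and numeric status are available, as the event log message suggests.

```python
# Status values copied from the event log message quoted in the post.
LOGICAL_DRIVE_STATUS = {
    1: "other", 2: "ok", 3: "failed", 4: "unconfigured", 5: "recovering",
    6: "readyForRebuild", 7: "rebuilding", 8: "wrongDrive", 9: "badConnect",
    10: "overheating", 11: "shutdown", 12: "expanding", 13: "notAvailable",
    14: "queuedForExpansion",
}


def describe_logical_drive_status(drive: int, slot: int, status: int) -> str:
    """Build a readable alarm text (hypothetical helper, not a Nimsoft API)."""
    # Fall back to "unknown" for codes that are not in the quoted table.
    name = LOGICAL_DRIVE_STATUS.get(status, "unknown")
    return (
        f"Logical drive {drive} on the array controller in slot {slot} "
        f"has status {status} ({name})"
    )


# Values from the lab test in the post: drive 1, slot 4, status 3.
print(describe_logical_drive_status(drive=1, slot=4, status=3))
```

For the lab test described here, the resulting text names the failed state instead of only reporting "Status is now 3".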

    This is a bit more informative.

    Well, I know the MIB itself doesn't provide all the detailed info found in the event log, and from what I can see there are some more variables that can be added to the alarm text. But to interpret this you really have to go through each and every possible MIB and manually edit and add every profile in order to get an understandable alarm text.

    If we were to use another server brand, what then? Using the event log instead isn't going to help us on a Linux server either.

    OK, this is better than nothing, but :/

    Well, I'm a bit of a novice at interpreting the traps, so I rely on the monitoring software to do this for me. So please share your knowledge.

    Any tips and tricks? Other tools or gadgets that still cooperate with NimBUS to create alarms? Has someone else already done this? Created your own extensions to snmptd, or …?

    What do you guys out there use to secure HW monitoring?

  • 2.  HW monitoring
    Best Answer

    Posted Jan 12, 2008 07:13 AM
    We primarily expect the vendor-provided hardware agents to generate meaningful log messages when something is wrong, and we alert on those. In your example, you could use the ntevl probe to get that message from the event log. I think we have a pretty solid setup of this for Windows. I am not sure whether the hardware agents on our Unix servers write to syslog, but that is how it should work there.

    Keith


  • 3.  HW monitoring

    Posted Jan 16, 2008 10:59 PM
    Hi Keith,
    and thank you for your input.
    Yes, this seems to be the best way to do it: creating a profile in the ntevl probe that filters on the source of the vendor-provided agents.
    In our case we did a test on our lab server using the regular expression
    /^(N100|Storage Agents|NIC Agents|Server Agents|CPQDAEN|cpqasm)+$/
    Then we did some controlled HW damage :) on disk, NIC and fan. It seems to work fine.
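
To sanity-check which event sources the filter lets through, the expression can be tried outside the probe. A minimal sketch in Python follows. Python's `re` module is an assumption here and may not behave exactly like the regex engine in ntevl, the helper name `is_hw_agent_source` is hypothetical, and `Service Control Manager` is only an illustrative non-hardware source.

```python
import re

# Pattern from the post, without the surrounding / / delimiters.
SOURCE_FILTER = re.compile(
    r"^(N100|Storage Agents|NIC Agents|Server Agents|CPQDAEN|cpqasm)+$"
)


def is_hw_agent_source(source: str) -> bool:
    """Return True if an event source name passes the filter (hypothetical helper)."""
    return SOURCE_FILTER.match(source) is not None


# Sources named in the pattern, plus one that should be rejected.
for source in ["Storage Agents", "NIC Agents", "cpqasm", "Service Control Manager"]:
    print(f"{source}: {is_hw_agent_source(source)}")
```

In this sketch the match is case sensitive, so `storage agents` would not pass. The trailing `+` also accepts a name repeated back to back (for example `N100N100`), which is harmless for this filter.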

    Perhaps logmon is the option for Linux? Though I haven't seen how any HW-related errors are presented in the messages log.

    So are there any Linux gurus out there who know? Hmm, perhaps time to schedule another lab.

    //J.L 


  • 4.  Re: HW monitoring

    Posted Feb 19, 2013 11:54 AM
    Try the following setup: HPSIM -> (logfile / SNMP traps) -> Nimsoft snmptd/logmon.

    You will have a very strong hardware monitoring environment.