DX Unified Infrastructure Management

  • 1.  Clearing alarm 'x' when alarm 'y' is received

    Posted Jun 28, 2019 08:25 AM

    Hi all,

    Another noob question inbound!

    I am monitoring a Linux service with logmon (checking the exit code of a command).

    If the service fails, the alarm triggers. I would like to clear this alert automatically when the service recovers.

    Could anyone point me in the right direction?

    Thanks for your time and help! 



  • 2.  RE: Clearing alarm 'x' when alarm 'y' is received
    Best Answer

    Posted Jun 28, 2019 09:14 AM
    For something like that the processes probe may be a better fit. 
    https://docops.ca.com/ca-unified-infrastructure-management-probes/ga/en/alphabetical-probe-articles/processes-process-monitoring

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 3.  RE: Clearing alarm 'x' when alarm 'y' is received

    Posted Jun 28, 2019 09:14 AM
    Oh and this page shows the compatibility between probes and OS versions. 
    https://docops.ca.com/ca-unified-infrastructure-management/9-0-2/en/files/490068425/537402493/6/1561451411753/Platform_Support_Availability_current.pdf

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 4.  RE: Clearing alarm 'x' when alarm 'y' is received

    Posted Jun 28, 2019 09:26 AM

    Hi David!

    Thanks for your quick response,

    So I have the processes probe running as well, but I was also running logmon to verify that the service is not just up but working as expected.

    Is it possible to use the auto-operator to close out the alert generated by logmon, say if the command returns '0' on the next run, or if the server is rebooted?

    Thanks again!




  • 5.  RE: Clearing alarm 'x' when alarm 'y' is received

    Posted Jun 28, 2019 11:58 AM
    You need to script this (I suggest Lua) in the auto-operator (AO).

    First find the alarm you want to close with something like:

    alarm1 = alarm.list("where", "robot = '" .. robot .. "' and supp_key = '" .. supp .. "'")

    Adjust the where criteria to match what you need. Here robot and supp are variables I had already set, because I knew the robot and supp_key of the alarm I was looking for.

    alarm1 then holds a list of the matching alarms. Identify the one you need to close (call that entry a) and then

    action.close(a.nimid)

    will close it.




  • 6.  RE: Clearing alarm 'x' when alarm 'y' is received

    Posted Jun 28, 2019 12:11 PM
    So we can have the logmon probe run a verification command every 5 minutes and check the exit code. If the exit code is not 0 it generates an alert, but when the service later recovers and the verification command succeeds with an exit code of 0, it requires custom scripting to close out the alert?
    Since this is how Nagios and Sensu operate, I'd expect this to be a fairly normal feature. Is it just not supported by the logmon probe when running commands, or is there a better probe to use for this?


  • 7.  RE: Clearing alarm 'x' when alarm 'y' is received

    Posted Jul 01, 2019 11:01 AM
    My apologies - I understood the question to be that you wanted to detect the problem with logmon but clear it with the processes probe. That case needs scripting because you are crossing the boundary between probes. If you are doing all the testing with logmon then, generally speaking, all you need is a second logmon watcher that sends a clear and has the same suppression ID as the watcher that created the alarm. For return code checking, I believe the clear happens automatically when the command returns zero.

    From my experience, though, I'd suggest avoiding return code checking and having the script print a filterable text value instead (OK or FAIL, for instance). If nothing else, it makes debugging the whole thing easier.
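
    A minimal sketch of the OK/FAIL approach suggested above, assuming the check is a small wrapper script that logmon runs as its command. The check_status helper, the rc= suffix, and the myservice unit name are hypothetical, and systemctl is-active --quiet is only one possible verification command.

```shell
#!/bin/sh
# Print one filterable status line for logmon to match on.
# Usage: check_status <command> [args...]
check_status() {
    # Discard the command's own output so only the status line is printed.
    if "$@" >/dev/null 2>&1; then
        rc=0
    else
        rc=$?
    fi

    if [ "$rc" -eq 0 ]; then
        echo "OK"
    else
        # Include the exit code to make debugging easier.
        echo "FAIL rc=$rc"
    fi
}

# Hypothetical check: replace with the real verification command.
check_status systemctl is-active --quiet myservice
```

    With output like this, one watcher matching FAIL could raise the alarm and a second watcher matching OK, with the same suppression ID, could send the clear. The script itself always exits 0, so the result is carried by the text rather than by the exit code.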


  • 8.  RE: Clearing alarm 'x' when alarm 'y' is received

    Posted Jul 09, 2019 11:02 AM
    If I understand correctly, you need a heartbeat monitor. We check that text is being written to a log file: if it is, all is good; if no text is written, we alert.

    Here's a profile which will alert you to no activity within a log file, and clear itself when activity is detected:

       <MS27 - DEV2>
          active = yes
          interval = 10 min
          scanfile = /home/rs.log
          fileencoding =
          scanmode = updates
          alarm = yes
          qos = no
          message = no
          subject =
          user =
          reccur_directory = no
          reccur_directory_level = 10
          resetFile = no
          initialfileptr = 2
          resumefileptr = 4
          command_timeout_active = no
          command_timeout =
          command_severity = 2
          command_timeout_alarm = 0
          alarmFOpenFail = no
          clearFOpenFailRestart = no
          monitor_exit_code = No
          max_alarm_sev = 5
          max_alarms =
          max_alarm_msg =
          password =
          <watchers>
             <Heartbeat>
                active = yes
                match = *
                level = minor
                subsystemid = 1.1
                message = Heartbeat outage detected
                i18n_token =
                restrict =
                expect = yes
                abort = no
                sendclear = no
                count = no
                separator =
                suppid = DEV2
                source =
                target =
                qos =
                runcommandonmatch = no
                alarm_on_first_match = yes
                commandexecutable =
                commandarguments =
                pattern_threshold_severity = information
                pattern_threshold_message =
                timeout = 1
                pattern_threshold =
                expect_message = Heartbeat detected - DEV2
                expect_level = information
                regexfromexternalfile = no
                patternfilepath =
                token =
                variable_threshold =
                variable_threshold_message =
                variable_threshold_severity = information
                variable_threshold_supp =
             </Heartbeat>
             <Heartbeat Clear>
                active = yes
                match = *
                level = clear
                subsystemid = 1.1
                message = Heartbeat detected
                i18n_token =
                restrict =
                expect = no
                abort = no
                sendclear = no
                count = no
                separator =
                suppid = DEV2
                source =
                target =
                qos =
                runcommandonmatch = no
                alarm_on_first_match = yes
                commandexecutable =
                commandarguments =
                pattern_threshold_severity = information
                pattern_threshold_message =
                timeout = 1
                pattern_threshold =
                expect_message =
                expect_level =
                regexfromexternalfile = no
                patternfilepath =
                token =
                variable_threshold =
                variable_threshold_message =
                variable_threshold_severity = information
                variable_threshold_supp =
             </Heartbeat Clear>
          </watchers>
       </MS27 - DEV2>
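
    One way to apply this profile to the service check from earlier in the thread is to append a line to the scanned file only while the check passes, so that a failing service shows up as a silent log. A rough sketch under that assumption: write_heartbeat and the myservice unit are hypothetical names, and /home/rs.log is simply the scanfile from the profile above.

```shell
#!/bin/sh
# Append one timestamped line to the heartbeat file when the check passes.
# Usage: write_heartbeat <logfile> <command> [args...]
write_heartbeat() {
    logfile=$1
    shift
    # Write nothing on failure, so the log goes quiet while the service is down.
    if "$@" >/dev/null 2>&1; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') heartbeat" >> "$logfile"
    fi
}

# Hypothetical check: replace with the real verification command.
write_heartbeat /home/rs.log systemctl is-active --quiet myservice
```

    If this is run from cron, it would need to run more often than the profile's 10 min interval so that a healthy service always produces at least one new line per scan.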


    ------------------------------
    CA - UIM administrator
    ------------------------------