Idea Details

Robot Down and Ping fail alert suppression

Last activity 07-13-2018 05:57 AM
Anon Anon
04-02-2015 12:47 PM

When a robot goes down, both the Ping fail and Robot inactive alerts are triggered.
Please define a new process so that when a robot goes down, only the Ping fail alert is triggered, not Robot inactive.


Comments

07-13-2018 05:57 AM

This might help:

 

-- Find inactive robots and ping them to see whether just the robot is down or the whole server.
-- The script assumes the hub's robot inactive alarms have been changed to major.
-- The script could set that itself: uncomment these two lines in the ping-success branch below
-- a.level = 4
-- a.severity = "major"

-- Find inactive robot alarm(s)
local al = alarm.list("message", "Robot % is inactive")
if al ~= nil then
   for i = 1, #al do
      -- Place current row al[i] into a (for readability)
      local a = al[i]
      -- Print index, source, severity and message for troubleshooting
      printf("%02d %s %s %s", i, a.source, a.severity, a.message)
      -- Get the IP of the robot from the alarm source
      local ip_addr = a.source
      -- Print for troubleshooting
      print(ip_addr)
      -- Ping the IP
      local ping_success = action.ping(ip_addr)
      if ping_success then
         -- Print the status for troubleshooting
         print("Ping success "..ip_addr)
         -- Edit the alarm message to assist ops
         local message_add_OK = "but server responds to ping OK"
         a.message = a.message.." "..message_add_OK
         -- Change severity to major
         --a.level = 4
         --a.severity = "major"
         alarm.set(a)
      else
         -- Print the status for troubleshooting
         print("Ping fail "..ip_addr)
         -- Edit the alarm message to assist ops
         local message_add_fail = "and no response to ping!"
         a.message = a.message.." "..message_add_fail
         -- Change the severity to critical (the severity name must be a quoted string)
         a.level = 5
         a.severity = "critical"
         alarm.set(a)
      end
   end
end

06-09-2017 10:26 AM

Hi,

 

That's a very bad idea. First, "Robot is inactive" means that the Nimsoft agent is probably down (but the VM/physical server may be fine).

 

The two are not monitoring the same thing at all (and the returned QOS does not report the same metrics). Here is how I see these alarms:

 

Robot is inactive : The monitoring is probably down.

Ping failed : The server is probably down.

 

The difference between them is important: for big customers it changes how the incident is created and which team is contacted first.

 

Best Regards,

Thomas

09-11-2015 06:01 AM

If I'm not mistaken, "Robot is inactive" alarms are generated because the hub probe has not received a message (actively or passively) from the robot within a specified time period. A ping fail is generated because the icmp (or net_connect) probe is not getting an ICMP reply from the server.


They are two separate probes and therefore unaware of each other's alarms - I don't think it's possible to have only the ping when both conditions exist (with the current architecture).


You might be able to code something in nas (or use fault_correlation?) that would produce the result you are looking for.
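
As a rough illustration of what "coding something in nas" could look like, here is a minimal sketch of the correlation step only, written in plain Lua so it can be tried outside the probe. Everything in it is an assumption rather than nas behaviour: the correlate helper and the two matcher functions are hypothetical names, the alarm rows are ordinary tables carrying just source and message, the ping-fail message text is a placeholder, and both alarms are assumed to carry the affected host in source. Fetching and updating real alarms is not shown.

```lua
-- Hypothetical matchers: adjust the patterns to your probes' actual alarm text.
local function is_robot_inactive(a)
   return a.message:find("^Robot .+ is inactive") ~= nil
end

local function is_ping_fail(a)
   -- Placeholder text, matched as a plain substring (not a Lua pattern)
   return a.message:find("ping failed", 1, true) ~= nil
end

-- Split alarms into those to keep and the robot-inactive alarms that
-- share a source with a ping-fail alarm.
local function correlate(alarms, robot_inactive, ping_fail)
   -- First pass: remember every source that has a ping-fail alarm
   local ping_failed = {}
   for _, a in ipairs(alarms) do
      if ping_fail(a) then
         ping_failed[a.source] = true
      end
   end
   -- Second pass: set aside robot-inactive alarms on those sources
   local keep, suppressed = {}, {}
   for _, a in ipairs(alarms) do
      if robot_inactive(a) and ping_failed[a.source] then
         suppressed[#suppressed + 1] = a
      else
         keep[#keep + 1] = a
      end
   end
   return keep, suppressed
end

-- Made-up sample rows for illustration
local sample = {
   { source = "10.0.0.5", message = "Robot web01 is inactive" },
   { source = "10.0.0.5", message = "ping failed for 10.0.0.5" },
   { source = "10.0.0.9", message = "Robot db01 is inactive" },
}

local keep, suppressed = correlate(sample, is_robot_inactive, is_ping_fail)
for _, a in ipairs(suppressed) do
   print("would suppress: " .. a.source .. " " .. a.message)
end
```

The helper returns the suppressed alarms instead of discarding them, so a script could annotate or downgrade them rather than hide them, which keeps the "monitoring down" versus "server down" distinction raised in the other comment.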