This might help:
-- Find inactive robots and ping them to see whether only the robot is down or the whole server.
-- The script assumes robot inactive alarms from the hub have been changed to major. This could also be handled by the script itself:
-- just enable the following lines after line 26
-- a.level = 4
-- a.severity = "major"
-- Find inactive robot alarm(s)
al = alarm.list("message", "Robot % is inactive")
if al ~= nil then
   for i = 1, #al do
      -- Place current row al[i] into a (for readability)
      a = al[i]
      -- Print nimid, hostname, severity and message for troubleshooting
      printf("%02d %s %s %s", i, a.source, a.severity, a.message)
      -- Get the ip of the robot from the alarm
      ip_addr = a.source
      -- Print for troubleshooting
      -- Ping the ip
      ping_success = action.ping(ip_addr)
      if ping_success then
         -- Print the status for troubleshooting
         print("Ping success "..ip_addr)
         -- Edit the alarm message to assist ops
         message_add_OK = "but server responds to ping OK"
         a.message = a.message.." "..message_add_OK
         -- Change severity to major
         --a.level = 4
         --a.severity = "major"
      else
         -- Print the status for troubleshooting
         print("Ping fail "..ip_addr)
         -- Edit the alarm message to assist ops
         message_add_fail = "and no response to ping!"
         a.message = a.message.." "..message_add_fail
         -- Change the severity to critical
         a.level = 5
         a.severity = "critical"
      end
      -- Note (assumption): the edits above only change the local table a;
      -- they probably still have to be written back to nas to take effect.
   end
end
That's a very bad idea. First, "Robot is inactive" means that the Nimsoft agent is probably down (the VM or physical server can still be fine).
The two checks do not monitor the same thing at all (and the QoS they return do not report the same metrics). This is how I read these alarms:
Robot is inactive: the monitoring is probably down.
Ping failed: the server is probably down.
The difference between them is important. For a big customer, it changes how the incident is created and which team is contacted first.
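To make the distinction concrete, here is a rough sketch in plain Lua of how the two observations could be combined into a likely cause. The function name classify and the returned labels are made up for illustration; nothing here is part of the Nimsoft API, and the third case (robot still reporting, ping failing) is my own guess at a sensible reading.

```lua
-- Hypothetical helper, not a Nimsoft function.
-- robot_inactive: true if the hub raised "Robot ... is inactive"
-- ping_ok:        true if the host still answers ICMP
function classify(robot_inactive, ping_ok)
   if robot_inactive and ping_ok then
      -- Host reachable, agent silent: a monitoring problem
      return "agent down"
   elseif robot_inactive and not ping_ok then
      -- Host unreachable, so the robot is silent as a consequence
      return "server down"
   elseif not ping_ok then
      -- Robot still reports but ping fails (assumed: ICMP blocked or path issue)
      return "ping path problem"
   else
      return "ok"
   end
end

print(classify(true, true))
print(classify(true, false))
```

Each label could then drive which team gets the incident first.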
If I'm not mistaken, "Robot is inactive" alarms are generated because the hub probe has not received a message (actively or passively) from the robot within a specified time period. A ping failure is generated because the icmp (or net_connect) probe is not getting an ICMP reply from the server.
They are two separate probes and therefore unaware of each other's alarms. I don't think it's possible to get only the ping alarm when both conditions exist (with the current architecture).
You might be able to code something in nas (or use fault_correlation?) that would produce the result you are looking for.
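As a rough idea of what such correlation logic might look like, here is a plain-Lua sketch that runs outside nas. The alarm tables are hand-built stand-ins with only source and message fields, the "ping failed" wording is an assumed message text, and it assumes both alarms carry the same source value. The function name correlate is hypothetical.

```lua
-- Group alarms by source and decide, per inactive robot, whether the
-- server also failed ping. Not real nas alarm objects, just plain tables.
function correlate(alarms)
   local inactive, ping_failed = {}, {}
   for _, a in ipairs(alarms) do
      if a.message:find("^Robot .+ is inactive") then
         inactive[a.source] = true
      elseif a.message:find("ping failed", 1, true) then -- assumed wording
         ping_failed[a.source] = true
      end
   end
   local result = {}
   for source in pairs(inactive) do
      -- Both alarms present: treat the ping failure as the root cause
      result[source] = ping_failed[source] and "server down" or "robot down"
   end
   return result
end

-- Example input with made-up hosts and addresses
local verdicts = correlate({
   { source = "10.0.0.1", message = "Robot web01 is inactive" },
   { source = "10.0.0.1", message = "ping failed for 10.0.0.1" },
   { source = "10.0.0.2", message = "Robot db01 is inactive" },
})
for source, verdict in pairs(verdicts) do
   print(source, verdict)
end
```

In a real nas script the input would come from the alarm list instead of a literal table, and the verdict would decide which of the two alarms to keep visible.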