I came across the following problem at a number of customers recently:
When we get a robot inactive alert, ops don’t know how to prioritise the alarm because they can’t tell whether only the robot is down or the whole server.
I wrote the following simple script as a solution. It uses action.ping rather than a callback to net_connect to run a profile, mainly because the customers I’ve dealt with on this don’t want to set up a profile in net_connect for each and every robot. It should be pretty straightforward to change it to use a callback to net_connect; I’ll give it a go.
The script runs on robot inactive alarms, gets the IP of the offending robot, pings it, and updates the alarm message and severity. The script assumes that robot inactive alarms have been set to major on the hub, mainly because the customers I spoke to wanted it this way.
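As a rough illustration of that flow (this is not the attached script), the decision logic can be sketched in Python. The function names, the message wording, the numeric severity values and the choice to escalate to critical when the ping fails are all assumptions made for this sketch; the ping simply shells out to the system `ping` command.

```python
import platform
import subprocess
from typing import Callable, Tuple

# Severity values are assumptions for this sketch; the thread only says the
# hub raises robot inactive alarms as major.
MAJOR = 4
CRITICAL = 5


def ping_host(ip: str, timeout_s: int = 5) -> bool:
    """Return True if a single ICMP echo to `ip` succeeds within the timeout."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def classify_inactive_robot(
    robot: str, ip: str, ping: Callable[[str], bool] = ping_host
) -> Tuple[int, str]:
    """Return the (severity, message) to write back onto the alarm."""
    if ping(ip):
        # The server answers, so only the robot is unreachable.
        return MAJOR, f"Robot {robot} is inactive but server {ip} responds to ping"
    # No answer: treat the server itself as down and escalate (assumed policy).
    return CRITICAL, f"Robot {robot} is inactive and server {ip} does not respond to ping"
```

Passing the ping function in as a parameter keeps the decision separate from the network call, so the logic can be checked without touching a real host.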
We’re also assuming network connectivity from the primary hub to the robot. In multi-site tunnelled environments we’d have to run this on a nas on the remote hubs; I’ll set up a test environment and make any required additions to the script later.
I don’t know about anybody else but in my opinion this is the sort of functionality we should be incorporating into NMS in future.
I’d welcome any feedback good or bad (I’ll just ignore the bad stuff :-), only joking)
David Higginbotham, CA Technologies, Sr Consultant, Pre-Sales
Works great! It displays the helpful message, BUT moments later all inactive alarms are converted back to the original default message. I haven't figured out what is triggering that. It seems the script works when the alarm count is 1, but once the count increases above 1 on the next check, the script does not fire and the alarm reverts to the default message. Still playing around with AO settings.
Found it, but there is a secondary issue. First, make sure that the following is set to greater than or equal in your AO.
Second, after setting the above correctly, the alarm console has gone mad. It appears AO is too slow to respond and does not process the alarm before it is first displayed. When the alarm comes in it is displayed with the default message, then a couple of seconds later AO comes through and changes it with the script. On the next alarm count increase the message is changed back to the default, then seconds later AO updates it with the script again, and this cycle repeats nonstop.
The AO needs to be processed before the alarm is displayed.
Did you try setting this up as a pre-processing script rather than an AO?
We have a similar situation but we took the AO pre-processing route that Gene suggested above. The challenge with that is you lose some of the CA custom extensions to Lua like the action.* functions. The code we utilize is available in GitHub at the link below.
robot-inactive/robot-inactive.lua at master · adgayle/robot-inactive · GitHub
Our pre-processing setup is below. We change the down server alarms to informational but you can set the level to 0 which will throw them away. The code is commented to tell you how to do that.
Looks like you changed the Hub setting for Robot Alarms (Major by Default) to Critical, since you have the Critical filter checked on the Pre-Processing Rule screen shot.
Your Pre-Processing Rule is leaving the Alarm at Critical if the Ping Test fails to ping the server.
Otherwise if the problem is only communication with the Robot (really Inactive), then the script itself changes the Alarm to Informational [or whatever Severity # you put into the line: event.level = # ].
Looks good, especially since the HUB has no built-in Ping Test. Utilizing the Pre-Processing Rules is a better solution than the Profile Rules, as they process Before the Alarm is published. The trick is to work with limited commands, as only a small subset of the normal Lua methods is supported on Events. The os.execute initiating the PING command was a good work-around.
The UIM Developers should consider building this Ping Test solution into the product, so it's part of UIM Out-Of-The-Box, rather than everyone having to reinvent the wheel (or Copy & Paste from here if they manage to find this posting).
I don't think they understand how many people are having this problem with these ambiguous "Robot server_name is inactive" Alarms.
Care needs to be taken with these types of actions. Using Alquin's example, we have seen the os.execute
call running ping -n xx cause a delay in nas processing. A ping against a device that is down can take
nearly 10 seconds to complete (about 1 second when the device responds). This delay holds up nas processing
and will cause the probe to fall behind on its alarm queue if multiple robots are inactive.
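To put rough numbers on this: if the pings run one after another and each one blocks, the stall grows linearly with the number of unreachable servers. A back-of-envelope sketch using the timings quoted above (about 10 seconds for a dead host, about 1 second for a live one); the function name and the example counts are made up for illustration.

```python
def nas_stall_seconds(down_hosts: int, up_hosts: int,
                      down_ping_s: float = 10.0, up_ping_s: float = 1.0) -> float:
    """Estimate how long sequential, blocking pings hold up alarm processing."""
    return down_hosts * down_ping_s + up_hosts * up_ping_s


# Example: 30 servers are really down, 5 more only have an unresponsive robot.
print(nas_stall_seconds(down_hosts=30, up_hosts=5))  # 305.0
```

So a single site outage could plausibly keep the alarm queue waiting for several minutes, and every repeat of the inactive alarm adds to that.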
The original script attached to this case calls `alarm.list()`, which reads all alarms and tries to match the
pattern. If your AO rule is matching on the alarm message, it would be better to use `alarm.get()`
instead. I am not sure whether the `action.ping` used in the initial script causes the same amount of delay.
Greg is 100% right on this. nas is single threaded, so any task you give it can cause the queue to back up if it is not completed in a timely manner. Coupled with the frequency of robot inactive alarms, this can take a significant toll on nas. I should have mentioned this, and for that I apologize.
For our installation we are looking at alternative ways of doing this, but based on the information above it has to be external to nas, using a polling method via the API to manipulate the alarms after arrival.
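One possible shape for such an external poller, sketched in Python. This is only an outline under assumptions: the `triage_alarms` name and the `"id"` and `"ip"` alarm keys are hypothetical, and fetching alarms from and writing them back to UIM is left to caller-supplied functions because the thread does not cover those API calls. The point it shows is that pings run in parallel in a separate process, so a batch of dead hosts costs roughly one ping timeout instead of one per host, and nas is never blocked.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def triage_alarms(
    alarms: Iterable[Dict[str, str]],
    ping: Callable[[str], bool],
    update: Callable[[str, bool], None],
    max_workers: int = 16,
) -> List[str]:
    """Ping every alarm's host in parallel, then report each result via `update`.

    Each alarm is a dict with the hypothetical keys "id" and "ip". Returns the
    ids of the alarms whose server did not answer.
    """
    alarm_list = list(alarms)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with alarm_list.
        reachable = list(pool.map(lambda a: ping(a["ip"]), alarm_list))

    server_down = []
    for alarm, is_up in zip(alarm_list, reachable):
        # `update` is where the alarm message/severity would be rewritten.
        update(alarm["id"], is_up)
        if not is_up:
            server_down.append(alarm["id"])
    return server_down
```

A scheduler or simple loop would call this every polling interval with the current robot inactive alarms. The trade-off is the one already seen with AO: the alarm is shown with its default message until the next poll rewrites it.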