Tech Tip: UIM - correlating several alarms in the NAS 

Jun 22, 2017 10:33 AM

This document explains how to correlate alarms in the NAS using an Auto Operator rule and a Lua script.

The use case is the following: as a customer, I want to be alerted when two related alarms are active in the console at the same time, for instance a high CPU load and a long processor queue length.

 

As both alerts come from the CDM probe, we can start by creating a NAS Auto Operator (AO) rule that matches either of the two alarms.

 

 

The Action Type in this case is set to "script" because we want to check in the backend whether both alerts (CPU load and queue) are active at the same time for the same robot.

The content of the script is:

(Note: you will need to replace the database server, user, and password with your own.)

-- Connect to the UIM database (edit the server and credentials first).
database.open("Provider=SQLOLEDB;Initial Catalog=CA_UIM;Data Source=<databaseserver>,1433;User ID=sa;Password=<databasepassword>;Network Library=dbmssocn;Language=us_english")
-- Alarm that triggered the Auto Operator rule; it can be nil.
local a = alarm.get()
if a ~= nil then
  -- Active CPU load and processor queue alarms for the same robot.
  local rs = database.query("select * from nas_alarms WITH (NOLOCK) where robot = '"..a.robot.."' and (message like '%total cpu is now%' or message like '%processor queue length%')")
  if rs ~= nil and #rs == 2 then
    -- Both alarms are active: escalate the alarm that triggered the rule.
    local new_alarm = {}
    new_alarm.nimid = a.nimid
    new_alarm.message = "ATTENTION: "..a.message
    new_alarm.sid = a.sid
    new_alarm.level = 5
    new_alarm.severity = "critical"
    new_alarm.user_tag1 = "Detected high CPU load and high CPU queue simultaneous alarms"
    alarm.set(new_alarm)
  end
end
database.close()

 

This script is simple and straightforward. The average runtime is 5 ms in an environment with low load.

The script updates the alarm that triggered the rule to:
a. Raise its severity to critical.
b. Prefix the alarm message with "ATTENTION:" to bring it to the operators' attention.
c. Set the user_tag1 field to "Detected high CPU load and high CPU queue simultaneous alarms".

You can adapt it to your needs and change the query to detect other active alarms in the environment.
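As one way to adapt the query, here is a minimal sketch that builds the same SQL from a robot name and a list of message fragments. The helper names sql_quote and correlation_query and the robot name "web01" are made up for illustration; they are not part of the NAS API. Doubling single quotes is a small safety measure in case a robot name ever contains one.

```lua
-- Hypothetical helper: wraps a value in single quotes and doubles any
-- single quote inside it, so it is safe as a SQL string literal.
local function sql_quote(value)
  local escaped = string.gsub(tostring(value), "'", "''")
  return "'" .. escaped .. "'"
end

-- Hypothetical helper: builds the correlation query for one robot and any
-- number of message fragments (each fragment is wrapped in % wildcards).
local function correlation_query(robot, fragments)
  local likes = {}
  for i, fragment in ipairs(fragments) do
    likes[i] = "message like " .. sql_quote("%" .. fragment .. "%")
  end
  return "select * from nas_alarms WITH (NOLOCK) where robot = " .. sql_quote(robot) ..
         " and (" .. table.concat(likes, " or ") .. ")"
end

-- "web01" is a made-up robot name; in an AO script you would pass a.robot.
print(correlation_query("web01", { "total cpu is now", "processor queue length" }))
```

With these two fragments the function produces the same query text as the script above, so detecting other alarm pairs only means changing the list.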

 

Keep in mind that Lua scripts can put a noticeable load on the NAS when it handles a high volume of alarms, so tune the Auto Operator rule so that the script runs for as few alerts as possible.

Note that there are other ways to accomplish alarm correlation (e.g. via ems probe, NAS triggers).

 

HTH,

Nestor


Comments

Jun 26, 2018 07:04 AM

Thanks Garin, it really helped.

 I am not sure why you are not able to see my post.

Jun 21, 2018 11:08 AM

Don't seem to have the authority to view the location you posted your script at.

 

The nas tech doc is at https://c.na53.content.force.com/servlet/fileField?id=0BE60000000PBnT 

 

database.open ( [ FileName | ConnectionString , [BreakOnError]] )
Opens a database handle to the specified file or database. Subsequent database operations will now be
referenced through this handle, until it is closed using database.close or through an implicit close when
opening another database using database.open. Set BreakOnError to false if you want to catch the error
instead of letting the script halt. The default database is called user.db. See examples below:
Opens a separate SQLite database file:
database.open("my_private.db")
Opens the NiS:
database.open("provider=nis;database=nis;driver=none")
Opens MS Access database:
database.open ("Driver={Microsoft Access Driv

 

To access the NIS database, I use:

rc, err = database.open("provider=nis;database=nis;driver=none")

 

Sometimes you can get the same information faster by accessing the local NAS databases in the nas probe install directory - database.db, transactionlog.db, and user.db will hold most of the interesting local stuff.

 

Also note that you might be able to get the info you need from alarm.query() or alarm.list(). You have to know the column names ahead of time, but you can craft a pretty specific query. In one script I use this:

 

alarm1=alarm.list("where","origin = '" .. origin .. "' and supp_key = '" .. supp .. "'")

 

Regarding the differences in behavior: the environment that the script runs in as an AO has some data already provided. For example, the alarm that caused the script to fire is available by default using:

 

al = alarm.get ()

 

The trick with the correlation scripts is not to overthink the process. And to remember that when you modify an alarm it gets processed again like it was new.
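Because a modified alarm is processed again, a script that prefixes the message could end up prefixing it twice. A minimal sketch of a guard against that, assuming the "ATTENTION: " prefix from the original post; the helper names already_escalated and escalated_message are made up for illustration.

```lua
local PREFIX = "ATTENTION: "

-- Hypothetical helper: true when the message already carries the prefix,
-- which suggests the script has updated this alarm before.
local function already_escalated(a)
  local message = a.message or ""
  return string.sub(message, 1, #PREFIX) == PREFIX
end

-- Hypothetical helper: returns the new message, adding the prefix at most once.
local function escalated_message(a)
  if already_escalated(a) then
    return a.message
  end
  return PREFIX .. (a.message or "")
end

-- Sample alarm tables; in an AO script the table would come from alarm.get().
print(escalated_message({ message = "Total CPU is now 95%" }))
print(escalated_message({ message = "ATTENTION: Total CPU is now 95%" }))
```

In an AO script the same check could be used to skip alarm.set() entirely when the alarm has already been updated.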

 

-Garin

Jun 21, 2018 10:26 AM

Hello Garin,

 

Could you share an example of querying the local NAS database?

 

I have created a script to query the UIM DB. It runs fine when I run it manually from the editor, but when I run it from an AO profile it doesn't function as it should.

So I am thinking it is related to the time it takes to sync the UIM database.

You can check my script on my question:

Set an alarm invisible based on correlation  

 

Thanks in advance.

 

Cheers,

Rui Dinis

Apr 23, 2018 12:42 PM

Couple other performance things you could do:

 

You are only concerned with the count of matches so why not do "select count(*) from..." in your SQL instead of returning the result set and then counting in LUA the number of records returned?
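A minimal sketch of the count(*) variant. It assumes database.query() returns an array of rows keyed by column name, which is why the count gets the alias cnt; the helper name count_from is made up for illustration.

```lua
-- Hypothetical helper: reads the number from a one-row
-- "select count(*) as cnt ..." result set.
-- Assumption: rows are tables keyed by column name.
local function count_from(rs)
  if type(rs) ~= "table" or type(rs[1]) ~= "table" then
    return 0
  end
  return tonumber(rs[1].cnt) or 0
end

-- Inside the NAS, with the database open and "a" holding the triggering
-- alarm, the original check would become something like:
--   local rs = database.query("select count(*) as cnt from nas_alarms WITH (NOLOCK) where robot = '"..a.robot.."' and (message like '%total cpu is now%' or message like '%processor queue length%')")
--   if count_from(rs) == 2 then ... end

-- Sample result set, shaped the way this sketch assumes.
print(count_from({ { cnt = "2" } }))
```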

 

Set up a trigger that matches the where clause on your SQL query. Then you can use trigger.alarms() to get the list of alarms. Faster than going out and querying the SQL database.

 

You can also use the alarm.list() function like:

 

alarm.list("where","robot = '" .. robot .. "' and supp_key = '" .. supp .. "'")
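A rough sketch of how the result could replace the SQL check. It assumes alarm.list() returns an array-like table of alarm tables with a message field; the helper name count_matching is made up for illustration. The NAS calls are guarded so the file also loads outside the NAS.

```lua
-- Hypothetical helper: counts alarms whose message contains any of the
-- given plain-text fragments (case-insensitive, no pattern matching).
local function count_matching(alarms, fragments)
  local n = 0
  for _, al in ipairs(alarms or {}) do
    local msg = string.lower(al.message or "")
    for _, fragment in ipairs(fragments) do
      if string.find(msg, fragment, 1, true) then
        n = n + 1
        break
      end
    end
  end
  return n
end

-- The "alarm" table only exists inside the NAS script environment.
if alarm then
  local a = alarm.get()
  if a ~= nil then
    local list = alarm.list("where", "robot = '" .. a.robot .. "'")
    if count_matching(list, { "total cpu is now", "processor queue length" }) == 2 then
      print("Both CPU alarms are active for robot " .. a.robot)
    end
  end
end
```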

 

And finally, how about querying the local NAS database instead? The tables are the same as in the NiS but, again, you don't have the overhead of logging into a remote DB. It's SQLite based, so the syntax will be a little different but similar enough.
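One syntax difference is certain: WITH (NOLOCK) is a SQL Server table hint that SQLite does not accept. Here is a minimal sketch that strips it from an existing query. The helper name to_sqlite is made up, and the commented usage assumes that the local file is database.db in the nas probe directory and that database.open() accepts that file name, in the same way as the my_private.db example from the NAS documentation.

```lua
-- Hypothetical helper: removes the SQL Server-only NOLOCK table hint so
-- the query can be sent to a SQLite database.
local function to_sqlite(sql)
  local converted = string.gsub(sql, "%s+WITH%s+%(NOLOCK%)", "")
  return converted
end

-- Assumed usage inside the NAS (file name and location are assumptions):
--   database.open("database.db")
--   local rs = database.query(to_sqlite(original_sql))
--   database.close()

-- 'web01' is a made-up robot name.
print(to_sqlite("select * from nas_alarms WITH (NOLOCK) where robot = 'web01'"))
```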

 

Regarding the Lua performance tuning, wrap all the code you can in a do/end pair and declare the variables inside it with local. That will keep them out of the global scope.
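A small illustration of the scoping rule in plain Lua. The variable names are made up; the point is that only names declared with local stay inside the block.

```lua
-- Without "local", an assignment creates a global for the whole script.
leaked_count = 2

do
  -- Locals declared here go out of scope at the matching "end".
  local match_count = 2
  local escalate = (match_count == 2)
  if escalate then
    print("would call alarm.set() here")
  end
end

-- match_count and escalate are not visible here; leaked_count still is.
print(match_count, leaked_count)
```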

 

And from a cleanliness standpoint, you should always be testing the return codes - alarm.get() will sometimes return nil even though the script is launched.

 

And finally, the script will bump the priority and message of only one of the two matching alerts and leave the other one as it was. Then, when the next CDM update comes in, it will revert the priority and message back. This is because the AO is set to an "overdue age" of 5 s, which guarantees that it runs only once. The only way it would affect both is if both alerts were stored within 5 seconds of each other, so that at the specified overdue age the SQL query returned 2 for each of them.

 

So, this creates kind of a rabbit hole to go down because you need to make sure that your updates to the alert don't get overwritten by subsequent updates. 

 

-Garin

Apr 18, 2018 11:22 AM

Hi Nestor,

 

We are using Oracle DB. What are the equivalent statements to the ones below for connecting to an Oracle DB and running the query?

database.open("Provider=SQLOLEDB;Initial Catalog=CA_UIM;Data Source=<databaseserver>,1433;User ID=sa;Password=<databasepassword>;Network Library=dbmssocn;Language=us_english")
local rs = database.query("select * from nas_alarms WITH (NOLOCK) where robot = '"..a.robot.."' and (message like '%total cpu is now%' or message like '%processor queue length%')")

Thank you.

Rajashekar

Jun 26, 2017 03:44 AM

Thanks for the tips Thomas.

 

Edited the original post to reflect the changes.

Nestor

Jun 24, 2017 03:48 PM

Hi,

 

 

Don't forget, if you are on MS SQL, to put "WITH (NOLOCK)" in your SQL request. Locking the nas_alarms table can be catastrophic.

 

For your Lua code, declare the "a" and "rs" variables with the "local" keyword (no global assignment; keeping them local speeds up execution). Assigning new_alarm field by field is not very useful either: just pass a table with all the values to the alarm.set method. That's a best practice for every compiler (JIT or AOT).
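A minimal sketch of the table-literal form, reusing the field values from the original post. The helper name escalated is made up for illustration; the sample alarm at the end uses invented values only to show the shape.

```lua
-- Hypothetical helper: returns the whole update as one table literal,
-- built from the alarm that triggered the rule.
local function escalated(a)
  return {
    nimid = a.nimid,
    sid = a.sid,
    message = "ATTENTION: " .. a.message,
    level = 5,
    severity = "critical",
    user_tag1 = "Detected high CPU load and high CPU queue simultaneous alarms",
  }
end

-- In an AO script:
--   local a = alarm.get()
--   if a ~= nil then alarm.set(escalated(a)) end

-- Made-up sample alarm, just to show the resulting message.
local update = escalated({ nimid = "XX00000000-00001", sid = "1.1", message = "Total CPU is now 95%" })
print(update.message)
```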

 

Best Regards,

Thomas
