Now that I have my new UIM lab, I started working today on a complete, high-performance rafale mode probe.
What's rafale mode?
Rafale mode ("rafale" is French for "burst") is the name I chose two years ago for a script created to answer a real production need: the customer wanted to catch all logmon alarms and only trigger a new alarm if they had received 'X' occurrences in less than 'X' seconds with a severity 'X' (three independent, configurable values).
Some of the services monitored by this customer normally trigger only a few alarms (which means nothing critical is happening). But when a service triggers a lot of alarms, it means there is a problem. So the purpose was to avoid triggering alarms for nothing.
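To make the rule concrete, here is a minimal sketch of the "X occurrences in less than X seconds with severity X" check as a sliding window. It is written in Python purely for illustration, and the `BurstDetector` class and its parameter names are hypothetical, not taken from the old script or from the probe.

```python
from collections import deque


class BurstDetector:
    """Fires once `required_count` alarms arrive in less than `interval` seconds.

    Hypothetical helper for illustration only.
    """

    def __init__(self, required_count, interval, severity=None):
        self.required_count = required_count
        self.interval = interval
        self.severity = severity      # None means "no severity check"
        self.timestamps = deque()     # arrival times still inside the window

    def observe(self, timestamp, severity):
        """Record one alarm; return True when the burst threshold is reached."""
        if self.severity is not None and severity != self.severity:
            return False
        self.timestamps.append(timestamp)
        # Drop occurrences that fell out of the time window.
        while self.timestamps and timestamp - self.timestamps[0] >= self.interval:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.required_count:
            self.timestamps.clear()   # start a fresh window after firing
            return True
        return False
```

A deque keeps the cost per alarm small: each timestamp is appended once and removed once, whatever the window size.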
Old script: https://github.com/fraxken/rafale_mode
So what's the problem?
- The Lua script blocks the NAS while it runs (and SQLite is not a very good solution for performance and high availability).
- Putting a pattern in the alarm message is not very clean (and not possible for every probe).
The solution!
The solution is to create a probe like alarm_enrichment (AE). We attach the probe to a queue (rafale) that subscribes to the subject 'alarm1', we update the AE route subject to 'alarm1', and our probe posts to 'alarm2' for the NAS.
Alarm_enrichment > rafale > NAS
The probe is multithreaded (pool of threads). Based on an older, identical probe, I can say it should be able to handle around 300 alarms/second per thread (more if I find a way to post alarms in bulk). But on this side we will need a real production benchmark with the whole stack!
This time there is no SQLite database, for performance and high availability reasons (the data will be hosted directly in the UIM database).
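The pool-of-threads idea can be sketched as workers that pull alarms from an inbound queue (standing in for the 'alarm1' subscription) and push results to an outbound queue (standing in for the post to 'alarm2'). This is only an illustration in Python under those assumptions: `run_pipeline`, `worker` and `handle` are hypothetical names, and no real UIM bus call is shown.

```python
import queue
import threading


def worker(inbound, outbound, handle):
    """Consume alarms until a None sentinel is received."""
    while True:
        alarm = inbound.get()
        if alarm is None:          # sentinel: stop this worker
            break
        outbound.put(handle(alarm))


def run_pipeline(alarms, handle, pool_threads=3):
    """Process `alarms` with a fixed pool; result order is not guaranteed."""
    inbound, outbound = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbound, outbound, handle))
        for _ in range(pool_threads)
    ]
    for thread in threads:
        thread.start()
    for alarm in alarms:
        inbound.put(alarm)
    for _ in threads:
        inbound.put(None)          # one sentinel per worker
    for thread in threads:
        thread.join()
    results = []
    while not outbound.empty():    # safe: all producers have finished
        results.append(outbound.get())
    return results
```

Throughput figures such as 300 alarms/second per thread cannot be inferred from a sketch like this; they depend on the real bus and database calls.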
<setup>
   loglevel = 1 <!-- classical nimsoft loglevel -->
   logsize = 1024 <!-- logsize in KB -->
   debug = 0 <!-- advanced debug mode -->
   post_subject = alarm2 <!-- subject where pds are posted when enrichment is done -->
   pool_threads = 3 <!-- number of threads in the pool -->
   <!-- queue_attach = queueName -->
   <!-- login = administrator -->
   <!-- password = password -->
</setup>
<rafale-rules>
   <!-- Break on the first rafale rule matched, set to 'yes' by default -->
   exclusive_rafale = yes
   <100>
      <!-- Alarm field to match -->
      match_alarm_field = udata.message
      <!-- regexp to match on the field (like alarm_enrichment) -->
      match_alarm_regexp = .*Your\salarm\smessage\shere.*
      <!-- Trigger an alarm if we have 2 alarm in less than 60 seconds with a severity of 5. Put no will reverse the behavior. Default value = yes -->
      trigger_alarm_on_match = yes
      <!-- Number of alarm rows we want to have the alarm before triggering a new one! -->
      required_alarm_rowcount = 2
      <!-- The interval where we want to check alarm rowcount (in second) -->
      required_alarm_interval = 60
      <!-- Alarm severity, if no value is entered it will leave no severity check -->
      required_alarm_severity = 5
   </100>
</rafale-rules>
<database>
   provider = MSSQL
   connectionString =
</database>
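As an illustration of how `match_alarm_field`, `match_alarm_regexp` and `exclusive_rafale` could work together, here is a small Python sketch. It assumes the rules are already loaded into a dict and the alarm into nested dicts; `get_field` and `matching_rules` are hypothetical helpers, not part of the probe.

```python
import re

# Rules as they might look once the <rafale-rules> section is loaded (assumed shape).
RULES = {
    "100": {
        "match_alarm_field": "udata.message",
        "match_alarm_regexp": r".*Your\salarm\smessage\shere.*",
    },
}


def get_field(alarm, dotted_path):
    """Resolve a dotted path such as 'udata.message' in a nested dict."""
    value = alarm
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matching_rules(alarm, rules, exclusive=True):
    """Return the names of the rules matched by `alarm`, in rule order.

    With exclusive=True, evaluation stops at the first matching rule.
    """
    matched = []
    for name, rule in rules.items():
        value = get_field(alarm, rule["match_alarm_field"])
        if value is not None and re.match(rule["match_alarm_regexp"], str(value)):
            matched.append(name)
            if exclusive:
                break
    return matched
```

In a real probe the regular expressions would more likely be compiled once at startup rather than on every alarm.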
I have seen many integrators/customers with the same kind of need (and people are writing weird rules in the NAS to handle this).
Common mistakes to avoid
The biggest mistake to avoid is implementing "custom" cases that are not really needed (they can bring performance issues). The goal is defined and we have to stick to a fixed implementation (where every step is mastered and known).
My goal is to support 1000 alarms/s with 10-20 rules on ~5 threads (maybe less).
I am aiming for a beta at the beginning of next week. If you have any ideas or something to tell me, don't hesitate.
Some hard situations/challenges:
- Rafale with multiple prid (Ex: a scenario where we have to trigger multiple alarms from multiple probes).
- Rafale with multiple sources (Ex: cluster monitoring).
- Rafale correlation on a field (Ex: "Robot is inactive" + ping down, correlated on hostname).
- Keep one SQL row for one rule.
- Work with shared memory for rules (and update the SQL table every 30s in a separate thread). The memory cost is low, so that seems to be the best idea.
- Keep fields from the first alarm matched by the rule (source, etc...)
- Let people write some kind of "alarm template" to trigger a custom alarm when needed (useful for the cluster case).
Correlation seems to be impossible right now without performance degradation.
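The "shared memory plus periodic SQL update" idea from the list above can be sketched like this in Python. It is only an illustration under stated assumptions: `RuleStateStore` and the `write_rows` callback are hypothetical, and the callback stands in for whatever actually updates the UIM database (one row per rule).

```python
import threading


class RuleStateStore:
    """In-memory rule state, flushed periodically by a background thread."""

    def __init__(self, write_rows, flush_interval=30.0):
        self._write_rows = write_rows          # stand-in for the SQL update
        self._flush_interval = flush_interval  # seconds between flushes
        self._lock = threading.Lock()
        self._state = {}                       # rule id -> one row of state
        self._dirty = set()                    # rule ids changed since last flush
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def record(self, rule_id, timestamp):
        """Called by worker threads each time a rule matches an alarm."""
        with self._lock:
            row = self._state.setdefault(rule_id, {"count": 0, "last_seen": None})
            row["count"] += 1
            row["last_seen"] = timestamp
            self._dirty.add(rule_id)

    def flush_once(self):
        """Write changed rows only; return how many rows were written."""
        with self._lock:
            rows = {rid: dict(self._state[rid]) for rid in self._dirty}
            self._dirty.clear()
        if rows:
            self._write_rows(rows)             # done outside the lock
        return len(rows)

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._flush_interval):
            self.flush_once()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self.flush_once()                      # final flush on shutdown
```

Copying the rows under the lock and writing them outside it keeps the worker threads from waiting on the database, at the cost of losing up to one interval of state if the process crashes.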