Idea Details

SOI Alarm functionality to detect Alarm Storms

Last activity 02-12-2019 06:19 AM
Britta Hoffner
07-07-2017 06:25 AM

I am creating this idea on behalf of one of our customers, who has repeatedly run into the problem that an Alarm Storm from one of the connectors causes the SOI Manager Job Queues to grow dramatically, which makes Alarm processing very slow. As a result, Alarms appear at the console with a delay until the Job Queues have finished processing the large number of Jobs. There are two requests to address this behavior:

 

1. Implement some type of monitoring in SOI that raises an alert when an Alarm Storm is detected.

2. Improve job queue processing during Alarm Storms so that delivery of Alarms to the SOI Console is not delayed.


Comments

02-12-2019 06:19 AM

So this means it will not be implemented in SOI, but only in DOI?

11-15-2018 05:36 AM

Great idea and really needed in SOI. If this is not going to be added to SOI, you could look at using a custom enrichment (possibly a Java function) at the connector to identify storm events based on time and message and mark them. Then use a separate policy to filter them out.
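The marking-and-filtering approach can be sketched roughly as follows. This is only an illustration in Python, not SOI's enrichment or policy API (in SOI the enrichment would more likely be a Java function). The event field names `message` and `timestamp`, the added `is_storm` flag, and both thresholds are assumptions made for the example.

```python
from collections import defaultdict, deque


class StormMarker:
    """Flags events whose message repeats too often within a time window."""

    def __init__(self, window_seconds=60.0, max_repeats=50):
        self.window_seconds = window_seconds
        self.max_repeats = max_repeats
        # message text -> timestamps of recent events with that message
        self._seen = defaultdict(deque)

    def enrich(self, event):
        """Return a copy of the event with a hypothetical 'is_storm' flag."""
        stamps = self._seen[event["message"]]
        now = event["timestamp"]
        stamps.append(now)
        # Forget occurrences that have fallen out of the window.
        while now - stamps[0] > self.window_seconds:
            stamps.popleft()
        return {**event, "is_storm": len(stamps) > self.max_repeats}


def filter_policy(events):
    """Stand-in for the separate policy: keep only unmarked events."""
    return [e for e in events if not e["is_storm"]]
```

Each event would pass through `enrich` first and `filter_policy` afterwards. Note that the sketch keeps one timestamp queue per distinct message, so a real implementation would also need to expire idle messages.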

 

Alan 

05-02-2018 06:54 AM

This idea is being taken care of as part of next-generation SOI, i.e. Digital OI.

12-20-2017 10:25 AM

This idea is really great. Also, it could prevent performance problems in the SOI environment.

10-10-2017 01:06 PM

Bumping, as we just had this occur. We had an alert trap storm from one of our Spectrum landscapes and no new alerts on the console for hours until I was looped in. It turned out to be thousands of alerts thrown from a device, and that device had an AEP policy on it. Thus the queue was in the 40k range...

We had to block that device's traffic to Spectrum to stop it from alerting, and things finally went back to normal after an SOI restart.

So yeah, there needs to be some fail-safe mechanism to prevent these situations from totally killing the SOI MGR's ability to process new alarms.

10-10-2017 12:59 AM

Hello Ashay,

 

It is really good to see you here in the SOI community.

 

Yan

08-11-2017 09:39 AM

I agree that a "fail safe" mechanism in the connectors would be helpful in case of an alert storm; however, the actual storm should preferably be managed on the domain manager side. This would prevent a storm of alerts towards SOI in the first place. The "fail safe" would then only have to come into effect if everything else fails, so to say.

07-20-2017 02:21 AM

Hello Yan, usually a restart of the SOI environment helps to resolve the situation, but in some cases you have to clear all files from the activemq data folder: <SOI_HOME>\tomcat\webapps\activemq-web\activemq-data

 

Kind regards,

Britta Hoffner

CA Support

07-19-2017 12:15 PM

Right now, restarting the sam application service is the only way to clear the queue.

 

Regards

 

Andre

07-19-2017 11:15 AM

I agree with this enhancement. By the way, I think I also faced this problem recently; alarms were delayed by about 2 hours. If I may ask, is there any workaround to reset the SOI job queue when alarm processing is delayed? Restarting the SOI and connector services didn't help.

 

Regards,

Yan

07-12-2017 09:36 AM

Basically, when an alarm storm occurs, the connector / IFW may produce a process / heap dump, which can be controlled by switching IFW to 64-bit and allocating more memory.

 

But the main problem is at the SOI Manager (AMQ), which will die or slow down while processing the alert storm. I can think of two options here.

1. Let the connector send alerts in chunks when a storm happens.

Maybe 100-200 alerts at a time, with the next chunk following after a 1-1.5 second delay, giving the SOI Manager some time to process all previous alerts.

 

2. SOI should have an individual AMQ for each connector, from which alerts go into the main AMQ. The SOI Manager processes alerts from the main AMQ, so it is not killed or impacted completely when a storm happens. Only alerts from that particular connector are impacted.
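Both options can be illustrated with a small sketch. It is written in Python for brevity and does not use any connector, IFW or ActiveMQ API; the function names, the `send` callback, and the chunk size, delay and per-turn defaults are hypothetical. The second function assumes one in-memory queue per connector that is drained fairly into a main list.

```python
import time
from collections import deque
from itertools import islice


def send_in_chunks(alerts, send, chunk_size=200, delay_seconds=1.0, sleep=time.sleep):
    """Option 1: forward alerts in fixed-size chunks with a pause in between."""
    it = iter(alerts)
    sent_chunks = 0
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        if sent_chunks:
            # Give the manager time to work off the previous chunk.
            sleep(delay_seconds)
        send(chunk)
        sent_chunks += 1
    return sent_chunks


def drain_round_robin(queues, per_turn=100):
    """Option 2: drain per-connector queues (dict of name -> deque) into one
    main list, taking at most per_turn alerts from each connector per round."""
    main = []
    while any(queues.values()):
        for name, q in queues.items():
            for _ in range(min(per_turn, len(q))):
                main.append((name, q.popleft()))
    return main
```

With the round-robin drain, an alert from a quiet connector waits behind at most `per_turn` alerts of each other connector per round, rather than behind the whole storm.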

 

Thanks,

Ashay

07-09-2017 08:52 PM

This would be a good enhancement. Adding to what Darryl mentioned, we should be able to set thresholds based on the customer's requirements, e.g. upper threshold, alarm storm length, alarms per second. The application should be able to detect a storm and maybe suppress the alarms until the alarm rate comes back to normal.
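A threshold-based detector along these lines might look like the sketch below. It is a minimal Python illustration under assumed semantics, not an existing SOI feature: the rate is measured over a sliding window, a storm is declared once the rate has stayed above the upper threshold for the configured storm length, and suppression ends as soon as the rate drops back. All parameter names and defaults are hypothetical.

```python
from collections import deque


class StormDetector:
    """Sliding-window alarm rate check with a minimum storm length."""

    def __init__(self, max_per_second=50.0, window_seconds=10.0, storm_length_seconds=30.0):
        self.max_per_second = max_per_second              # upper threshold
        self.window_seconds = window_seconds              # span used to measure the rate
        self.storm_length_seconds = storm_length_seconds  # time above threshold before a storm is declared
        self._stamps = deque()
        self._over_since = None

    def observe(self, timestamp):
        """Record one alarm; return True if it should be suppressed."""
        self._stamps.append(timestamp)
        # Keep only alarms that are still inside the window.
        while timestamp - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        rate = len(self._stamps) / self.window_seconds
        if rate > self.max_per_second:
            if self._over_since is None:
                self._over_since = timestamp
            return timestamp - self._over_since >= self.storm_length_seconds
        # Rate is back to normal: end any storm in progress.
        self._over_since = None
        return False
```

Instead of suppressing alarms, the same `True` result could be used to raise a monitoring alert or to disable the offending connector.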

07-07-2017 08:28 AM

It is a great idea. If we could set an upper threshold so that the connector is automatically disabled when these alarm storms occur, that would be the icing on the cake.