Tech Tip: Alarm Enrichment probe not processing alarms 

Feb 21, 2017 05:32 AM

This issue has been detected in UIM 8.4 SP2.

In some cases, the Alarm Enrichment (AE) probe stops processing alarms and messages begin to queue up.

The AE log then repeats the following pair of messages:

Jan 25 02:59:00:962 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count waiting for alarms to be flushed
Jan 25 02:59:20:963 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count done waiting for alarms to be flushed time: 20000
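
To check a log for this pattern, a small script can count how often the "done waiting" message reports a full wait. The following is a minimal sketch, not a supported tool: the function names, the 20000 ms threshold (taken from the "time: 20000" value in the log above) and the min_repeats default are assumptions chosen for illustration.

```python
import re

# Matches the "done waiting" entry and captures the reported wait in milliseconds.
DONE_WAITING = re.compile(
    r"AlarmQueueReader alarm count done waiting for alarms to be flushed time: (\d+)"
)


def count_flush_timeouts(lines, timeout_ms=20000):
    """Count 'done waiting' entries whose reported wait is at least timeout_ms.

    timeout_ms is an assumption based on the 'time: 20000' value seen in the log.
    """
    count = 0
    for line in lines:
        match = DONE_WAITING.search(line)
        if match and int(match.group(1)) >= timeout_ms:
            count += 1
    return count


def looks_stalled(lines, min_repeats=3):
    """Heuristic only: flag the log if the full-wait entry repeats min_repeats times."""
    return count_flush_timeouts(lines) >= min_repeats
```

The functions take any iterable of lines, so an open log file can be passed directly, for example looks_stalled(open(path)) where path points to the alarm_enrichment log on your system.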

 

As a workaround, restart the AE probe to return it to normal operation.

The solution is to upgrade nas to 8.42 or later, the robot to 7.80 HF21 and the hub probe to 7.80 HF22, and then add the following keys to the nas and hub probes.

Nas:
<setup> overwrite
lower_memory_usage_threshold_percentage = 0.90
upper_memory_usage_threshold_percentage = 0.90
memory_usage_exceeded_threshold = 3
</setup>

 

 

Hub:
<hub> overwrite
post_max_age = 60
</hub>
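
After editing, it can be useful to confirm that the keys ended up in the right section. The following is a minimal sketch that parses text in the <section> ... </section> layout shown above and reports keys that are missing or have a different value. The parser and the function names are hypothetical and written only for this illustration; it is not a UIM utility and makes no claim to cover every detail of the real cfg format.

```python
import re

# "<setup>" or "<setup> overwrite" opens a section; "</setup>" closes it.
SECTION_OPEN = re.compile(r"^<([^/<>\s]+)>(?:\s+\w+)?$")
SECTION_CLOSE = re.compile(r"^</([^<>\s]+)>$")

# Keys and values as given in this Tech Tip.
NAS_EXPECTED = {
    "setup": {
        "lower_memory_usage_threshold_percentage": "0.90",
        "upper_memory_usage_threshold_percentage": "0.90",
        "memory_usage_exceeded_threshold": "3",
    }
}
HUB_EXPECTED = {"hub": {"post_max_age": "60"}}


def parse_cfg(text):
    """Return {section_path: {key: value}}; nested sections are joined with '/'."""
    sections = {}
    stack = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SECTION_CLOSE.match(line):
            if stack:
                stack.pop()
            continue
        opened = SECTION_OPEN.match(line)
        if opened:
            stack.append(opened.group(1))
            sections.setdefault("/".join(stack), {})
            continue
        if "=" in line and stack:
            key, _, value = line.partition("=")
            sections["/".join(stack)][key.strip()] = value.strip()
    return sections


def missing_keys(cfg_text, expected):
    """List (section, key, actual_value) for every key that is absent or differs."""
    parsed = parse_cfg(cfg_text)
    problems = []
    for section, keys in expected.items():
        actual = parsed.get(section, {})
        for key, value in keys.items():
            if actual.get(key) != value:
                problems.append((section, key, actual.get(key)))
    return problems
```

An empty list from missing_keys(nas_cfg_text, NAS_EXPECTED) or missing_keys(hub_cfg_text, HUB_EXPECTED) means all expected keys were found with the values listed above.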

Comments

Aug 02, 2017 05:00 AM

Just a note: the problem still occurs with nas 8.42.

 

Aug 02 09:24:06:737 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count done waiting for alarms to be flushed time: 20000
Aug 02 09:24:08:755 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count waiting for alarms to be flushed
Aug 02 09:24:28:755 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count done waiting for alarms to be flushed time: 20000
Aug 02 09:24:30:767 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count waiting for alarms to be flushed
Aug 02 09:24:50:769 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count done waiting for alarms to be flushed time: 20002
Aug 02 09:24:52:775 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count waiting for alarms to be flushed
Aug 02 09:25:12:782 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count done waiting for alarms to be flushed time: 20007
Aug 02 09:25:14:790 [attach_clientsession, alarm_enrichment] AlarmQueueReader alarm count waiting for alarms to be flushed

..............

Jul 06, 2017 05:18 AM

Hi Dan,

The key post_max_age = 60 sets a timeout for messages from the message bus. By default, messages older than 2 seconds are dropped, and this key increases the timeout to 60 seconds. So rather than causing messages (such as alarms) to be lost, it should do the opposite.

If you are unsure about its impact, you could increase it more gradually (maybe start with 10 or 15).

cheers

Rowan
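
The effect Rowan describes can be pictured with a small age filter. This is only an illustration of the behaviour as stated in the comment above, not the hub's actual implementation; the function name keep_fresh and the posted_at field are hypothetical.

```python
def keep_fresh(messages, now, post_max_age):
    """Keep messages whose age in seconds (now - posted_at) is at most post_max_age.

    Illustration only: 'posted_at' is a hypothetical field holding the post time.
    """
    return [m for m in messages if now - m["posted_at"] <= post_max_age]


# Four messages posted 1, 5, 30 and 90 seconds before 'now'.
now = 1000
messages = [{"id": i, "posted_at": now - age} for i, age in enumerate([1, 5, 30, 90])]

kept_default = keep_fresh(messages, now, post_max_age=2)   # default described above
kept_raised = keep_fresh(messages, now, post_max_age=60)   # value from this Tech Tip
```

Under these assumptions, a larger post_max_age can only keep more messages, which matches the point that raising the value should not cause alarms to be lost.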

May 11, 2017 10:51 AM

Hi Daniel,

 

Yes, the post_max_age setting should also be added to the hub probe via Raw Configure. In my last comment I was mainly pointing out that the three extra lines have broken several customer environments. The following are the correct settings to add via Raw Configure in the probe setup section.

 

*  nas

lower_memory_usage_threshold_percentage = 0.90
upper_memory_usage_threshold_percentage = 0.90
memory_usage_exceeded_threshold = 3

 

*  hub

post_max_age = 60 

 

Thank you also for your time in pointing this out.

 

Regards,

 

Ryan Currey

May 11, 2017 10:39 AM

Hi Yu, I know that. I'm asking Ryan whether we need that additional line added to the hub's cfg, or just the 3 lines added to the nas's cfg.

May 11, 2017 01:42 AM

Hello.

 

post_max_age is a key in hub.cfg (under the <hub> section).

 

Regards,

Yu Ishitani

May 10, 2017 11:11 AM

Hi Ryan,

What about that extra hub.cfg line? Is it needed, or just the 3 lines in the nas.cfg <setup> section?

Is this needed:

Hub:
<hub> overwrite
post_max_age = 60
</hub>

 

Thanks,

Dan

May 09, 2017 04:04 PM

Thanks Ryan, I have edited the original post to reflect your comments.

May 09, 2017 01:50 PM

In order to correct this issue, ONLY the following keys need to be added:

 

lower_memory_usage_threshold_percentage = 0.90
upper_memory_usage_threshold_percentage = 0.90
memory_usage_exceeded_threshold = 3

 

Please refer to the steps in the below KB article:

 

*  alarm_enrichment probe become unresponsive 

 

Several customers have told me that adding the following extra keys broke alarm_enrichment:

 

restart_memory_usage_threshold_percentage = 0.85
force_flush_active_alarms_timeout = 300
force_flush_active_alarms_before_shutdown = yes

 

Regards,

 

Ryan Currey

UIM Support

May 08, 2017 08:12 AM

Hi Daniel,

I followed up on the ticket you opened, and it seems CA support provided you with a fix (I'm not sure whether you installed it) and the issue no longer happens. Can you confirm, so we can shed some light on this thread?

Thanks!

Nestor

Apr 18, 2017 09:27 AM

This did not help with my issue. I have the exact same problem. I tried running nas versions 4.94 and 8.43, and with these options set I am still hitting the alarm_enrichment queue backup and stoppage.

Have a case open.

Feb 28, 2017 10:56 AM

I second Daniel's statement. I experienced a similar case interaction several months back, when I last requested CA assistance on this topic.

Feb 27, 2017 03:54 PM

Honestly, this is another blatant example of why CA needs an announcement board for defects discovered in this product.

I opened the exact same case a few weeks ago, and it was handled by another engineer. I got a subset of these suggestions, only for the nas probe, and it seemed to fix this exact issue. Overall, since upgrading to 8.4 SP2 our nas has had many issues. I shared this topic with him and he had no idea.

 

CA, how about communicating within your own support teams about issues that pop up, so it's not a runaround from scratch each time?

Feb 27, 2017 03:19 PM

CA, is this fixed in 8.5? We are running 8.4 SP2 and have hit this exact issue many times. I added the hot fixes and the options mentioned above, but I just want to know: is this fixed in 8.5, and will it work without the additional options added to the nas and hub probes?

 

Can someone explain exactly what "post_max_age=60" means? I'd like to know what this does in case we start seeing odd behavior.

Feb 21, 2017 09:03 PM

Feb 21, 2017 01:12 PM

Not really, I would use the settings during implementations to avoid this problem.

Feb 21, 2017 01:02 PM

Hi Luc,

If you do encounter the repetitive message as above, then yes, go ahead and make the above changes.

Thanks.

Regards

-Sayeed

Feb 21, 2017 12:16 PM

Can we add these settings as a default?
