Hello fellow friendly community of happy users!
Every now and then we run into a problem with alarm_enrichment where, for no obvious reason, queue processing slows to a crawl.
We need to restart the alarm_enrichment probe; it then quickly processes the queue and is back on track for a seemingly random number of days.
Just now it slowed down again, and the queue had built up to ~50k messages. After a restart, the queue was processed within 1-2 minutes and is not building up anymore, so it's not that our message flow is so high that it can't keep up. It just.. chokes.
Anyone else experiencing this?
Yes, I am/was.
What I have gleaned from comments from support is that the probe makes the incorrect assumption that the backlog is getting larger because its bulk size is too big, and so it reduces its bulk size. This introduces more overhead, so the probe goes slower and slower, the backlog grows, and the probe keeps reducing its bulk size.
UIM 8.1 seems not to be affected by this in my install.
I've also got my bulk read size set to 250.
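The suspected feedback loop can be sketched as a toy simulation. Everything here is illustrative guesswork (the per-message cost, per-batch overhead, and the "shrink when the backlog is big" rule are my assumptions, not the probe's actual code), but it shows why halving the bulk size under backlog makes things strictly worse:

```python
# Toy model of the suspected misfeature: the probe shrinks its bulk read
# size whenever the backlog is large, which adds per-batch overhead and
# makes it fall further behind. All numbers are illustrative guesses.

def simulate(ticks, arrival, bulk, backlog=50_000, adaptive=True,
             per_msg=0.001, per_batch=0.05, watermark=1_000):
    """Queue reader with a fixed 1.0 s processing budget per tick."""
    for _ in range(ticks):
        backlog += arrival                       # new alarms arrive
        budget = 1.0
        while backlog and budget >= per_batch + min(bulk, backlog) * per_msg:
            batch = min(bulk, backlog)
            budget -= per_batch + batch * per_msg
            backlog -= batch
        if adaptive and backlog > watermark:     # "backlog is big, so the
            bulk = max(1, bulk // 2)             #  bulk size must be too big"
    return backlog, bulk

# With the adaptive shrink the bulk size collapses and the backlog explodes;
# with the bulk size held fixed, the very same load drains steadily.
slow_backlog, slow_bulk = simulate(100, 500, 250, adaptive=True)
ok_backlog, ok_bulk = simulate(100, 500, 250, adaptive=False)
print(slow_bulk, ok_bulk)         # bulk collapses to 1 vs. stays at 250
print(slow_backlog > ok_backlog)  # True: the adaptive reader falls behind
```

This would also explain the restart behavior in the thread: a restart resets the bulk size to its configured value, so the probe briefly runs at full throughput again.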
That sounds like a very clever feature, but it could explain it. Every now and then there might be a burst of alarm messages, and that could very well trigger the behavior you described, making it worse. I wonder why anyone thought reducing the bulk size would make it better...
And a restart would reset the bulk size, causing it to quickly chew through the queue, and everything is ok again. Until the next time it happens. My current bulk size is 100, but I guess I could increase it a bit more.
Glad to hear it might be fixed in 8.1. We have a 7.6 -> 8.1 upgrade scheduled in 2 weeks.
So, we've run into a bit of a problem over the weekend related to this.
After reading https://na4.salesforce.com/articles/Case_Summary/alarm-enrichment-probe-become-unresponsive?popup=true we attempted to add these config keys so it would auto restart when needed:
lower_memory_usage_threshold_percentage = 0.90
upper_memory_usage_threshold_percentage = 0.90
memory_usage_exceeded_threshold = 1
Last night, it ended up restarting so often that the controller stopped restarting it.
So, I guess we have the choice between it not processing at all and it restarting so often that it dies. Good choices.
It's "functioning" at the moment, but still restarting quite often (every 10-20min). We get the following in the logfile:
feb 22 10:18:47:900 [attach_clientsession, alarm_enrichment] AlarmQueueReader: Upper capacity check : memory (free/used/total): 10139736/102057896/112197632 OR 0.9096261140342071% used
feb 22 10:18:47:900 [attach_clientsession, alarm_enrichment] AlarmQueueReader upperCount: 1
feb 22 10:18:48:016 [attach_clientsession, alarm_enrichment] Nas alarm_enrichment INTERNAL_RESTART checkForCapacityToAddMoreAlarms
feb 22 10:18:48:058 [attach_clientsession, alarm_enrichment] TimeOverThesholdService shutting down
feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] CmdbMessageEnricherError closing : 'by_source' problem: null
feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] Nas: Shutdown complete in 610ms
So, it's clearly the 0.90 threshold restart that is being triggered.
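The numbers in the log line bear that out: the three figures are free/used/total heap in bytes, and the used fraction works out to just over the 0.90 upper threshold. (Note the "%" label in the log is misleading; it is a fraction, not a percentage.) A quick sanity check, using the values copied from the log above:

```python
# Heap figures from the "Upper capacity check" log line above (bytes).
free, used, total = 10_139_736, 102_057_896, 112_197_632

assert free + used == total   # the three figures are consistent
ratio = used / total
print(ratio)                  # 0.9096261140342071, matching the log
assert ratio > 0.90           # crosses the upper threshold -> restart
```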
Naturally, these config options aren't documented, so does anyone know what memory_usage_exceeded_threshold actually is? Could it be the number of times it needs to breach the memory threshold before restarting?
Also, when it restarts, it's using 112 MB of memory, which is quite far from the 1024 MB it has configured as max:
../../../../jre/jre7/bin/java.exe -Xms64m -Xmx1024m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar
Increasing -Xms might at least increase the time between restarts, as it would hopefully have more memory allocated initially, but hard to ****. How to tune the memory settings in alarm_enrichment isn't really documented either. Since its config is part of the nas config, I'm guessing it might be different from the other Java probes? Does anyone know?
Adding the usual:
<startup> <opt> java_mem_init = -Xms512m </opt> </startup>
to nas.cfg doesn't do any good at least.
So forgive me for asking the obvious, but the parameters and text refer to percent using a fraction.
What happens if you use
lower_memory_usage_threshold_percentage = 90
upper_memory_usage_threshold_percentage = 90
memory_usage_exceeded_threshold = 1
Good question! It's logging "0.9096261140342071% used", so it seems to be using a fraction (percent/100) internally. And from my testing, using the 0.<percent> fractions seems to be correct. As we all know.. percent is hard.
However, after playing around with it some, memory_usage_exceeded_threshold seems to be the number of times usage can go above the threshold before a restart is triggered. The counter seems to follow the usual upper/lower mechanics: it has to go above upper to increment, and below lower to reset.
After increasing memory_usage_exceeded_threshold from 1 to 3, we stopped getting the restarts every 10-20 min.
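To illustrate the behavior described above, here is a hypothetical reconstruction of that counter logic. This is inferred from observed behavior, not the probe's actual source; the parameter names just mirror the cfg keys:

```python
# Hypothetical reconstruction of the upper/lower hysteresis counter, inferred
# from the probe's observed behavior -- not the actual alarm_enrichment code.
class MemoryWatcher:
    def __init__(self, lower=0.90, upper=0.90, exceeded_threshold=3):
        self.lower = lower                            # a dip below this resets
        self.upper = upper                            # a breach above this counts
        self.exceeded_threshold = exceeded_threshold  # breaches before restart
        self.count = 0

    def check(self, used_ratio):
        """Return True when the probe should trigger its internal restart."""
        if used_ratio > self.upper:
            self.count += 1
        elif used_ratio < self.lower:
            self.count = 0
        return self.count >= self.exceeded_threshold

# With threshold 3: two breaches followed by a dip below 'lower' do not
# trigger a restart, but three breaches in a row do.
w = MemoryWatcher(exceeded_threshold=3)
samples = [0.91, 0.92, 0.85, 0.91, 0.91, 0.91]
print([w.check(r) for r in samples])
# [False, False, False, False, False, True]
```

This would match what we saw: with the threshold at 1, every single breach restarted the probe; at 3, transient spikes that dip back under the lower bound no longer do.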
We currently have bulk_read_size set to 250, but I'm thinking we should perhaps reduce it, so the memory footprint after each bulk read is smaller, because clearly something is fishy in the timing between when it increases the heap size and when the memory usage check is performed.
Or if someone figured out how to configure the initial heap size, increasing it from 64 MB to 512 MB or something. And/or possibly some GC tuning.
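For anyone experimenting: if you do find where the probe's java command line is assembled, standard JVM options for initial heap and GC tuning would look like the line below. The flags themselves are standard JVM options (G1 is available on JRE 7); whether and where the probe actually lets you set them is exactly the open question here, so treat this as a sketch, not a verified fix:

```
../../../../jre/jre7/bin/java.exe -Xms512m -Xmx1024m -XX:+UseG1GC -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar
```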
Looks like there is another article that might help you
That article doesn't seem to provide a solution either. First of all, the alarm_enrichment queue only subscribes to the alarm subject, so one wonders how the QOS_MESSAGE subjects got into the queue in their example.
Secondly, it doesn't stop processing entirely. The sent value on the queue is steadily increasing, so it's not a broken message that is blocking alarm_enrichment. It's just not grabbing and processing them fast enough anymore.
Thirdly, the restart mentioned earlier in this thread is not triggered when this happens, so I guess that setting did nothing good (for this problem, at least).
After restarting alarm_enrichment, it processed the queue it was previously unable to handle in less than 20 seconds (from 70k messages to 0). So there is something else going on here.
I'm running 8.31 and just experienced this issue with the alarm_enrichment queue backing up this morning. I opened a case with support, then found this thread. Interesting read, to say the least. Restarting the alarm_enrichment probe drained the queue and alarms started flowing again. For a longer-term solution, support provided 2 tec articles, which it seems most have already discussed in this thread:
http://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.tec000004902.html
http://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.tec000004293.html
I followed the steps in document 4293, increasing the memory allocation and the queue bulk size, but I am hesitant to add the following keys based on others' experience:
lower_memory_usage_threshold_percentage = 0.90
upper_memory_usage_threshold_percentage = 0.90
memory_usage_exceeded_threshold = 1
Appreciate everyone's feedback on this topic!
-Mike
> I followed the steps in document 4293 increasing the memory allocation and increased the queue bulk size, but I am hesitant to add the following keys based on others' experience:
There is an internal memory usage checker in the alarm_enrichment probe.
These 3 parameters allow the probe to self-restart when memory usage is near its maximum.
Setting those values will make enrichment much more stable and reliable.
No reason not to do it.
Make sure that you also increase the Java VM allocations. I am running with the options:
-Xms1024m -Xmx4096m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar