DX Infrastructure Manager


Problems with alarm_enrichment queue buildup

  • 1.  Problems with alarm_enrichment queue buildup

    Posted 02-02-2015 12:41 PM

    Hello fellow friendly community of happy users!

     

    Every now and then, we run into a problem with alarm_enrichment
    where, for no obvious reason, queue processing slows to a crawl.

     

    We need to restart the alarm_enrichment probe, after which it quickly
    processes the queue and is back on track for a seemingly random number of days.

     

    Just now it slowed down, and the queue had built up to ~50k messages.
    After a restart, the queue was processed within 1-2 min and is not building
    up anymore, so it's not that we have such a high message flow that it
    isn't able to keep up. It just.. chokes.

     

    Anyone else experiencing this?



  • 2.  Re: Problems with alarm_enrichment queue buildup

    Posted 11-02-2016 11:09 AM

    I'm running 8.31 and just experienced this issue with the alarm_enrichment queue backing up this morning.  I opened a case with support, then found this thread.  Interesting read, to say the least.  Restarting the alarm_enrichment probe drained the queue and alarms started flowing again.  As a longer-term solution, support provided two tech articles, most of which seems to have already been discussed in this thread -

    http://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.tec000004902.html and
    http://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.tec000004293.html

     

    I followed the steps in document 4293, increasing the memory allocation and the queue bulk size, but I am hesitant to add the following keys based on others' experience:

     

    lower_memory_usage_threshold_percentage=0.90
    upper_memory_usage_threshold_percentage=0.90
    memory_usage_exceeded_threshold=1


    Appreciate everyone's feedback on this topic!

    -Mike


  • 3.  Re: Problems with alarm_enrichment queue buildup

    Posted 11-05-2016 03:04 PM

    Setting those values will make enrichment way more stable and reliable.

     

    No reason not to do it.

     

    Make sure that you also increase the Java VM allocations. I am running with the options:

     

    -Xms1024m -Xmx4096m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar

     

    -Garin



  • 4.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-02-2015 11:22 PM

    Yes, I am/was.

     

    What I have gleaned from support's comments is that the probe incorrectly assumes that a growing backlog means its bulk size is too large, so it reduces its bulk size. This introduces more overhead, so the probe goes slower and slower, the backlog grows, and the probe continues to reduce its bulk size.
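
    The feedback loop described above can be sketched with a toy model. This is only an illustration, not the probe's actual logic: the fixed per-read overhead, the per-message cost, and the "halve the bulk size whenever the backlog grows" policy are all assumptions made up for the example.

```python
def throughput(bulk_size, per_read_overhead_s=0.05, per_message_s=0.001):
    """Messages per second for a hypothetical reader that pays a fixed
    overhead on every bulk read plus a small cost per message."""
    return bulk_size / (per_read_overhead_s + bulk_size * per_message_s)


def simulate(arrival_rate, start_bulk, seconds, min_bulk=1):
    """Simulate one-second steps. ASSUMED policy: whenever the backlog
    grew during the step, halve the bulk size (never below min_bulk)."""
    backlog, bulk = 0.0, start_bulk
    history = []
    for _ in range(seconds):
        pending = backlog + arrival_rate
        processed = min(pending, throughput(bulk))
        new_backlog = pending - processed
        if new_backlog > backlog:
            bulk = max(min_bulk, bulk // 2)
        backlog = new_backlog
        history.append((bulk, round(backlog)))
    return history


if __name__ == "__main__":
    # Slightly too small a bulk size: the policy makes things worse each step.
    for bulk, backlog in simulate(arrival_rate=700, start_bulk=100, seconds=10):
        print(f"bulk={bulk:3d} backlog={backlog}")
```

    In this model a smaller bulk size always means lower throughput, so once the backlog starts growing the assumed policy can only make it grow faster. A restart that resets the bulk size to its configured value would clear the backlog quickly, which matches what is reported in this thread.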

     

    UIM 8.1 seems not to be affected by this in my install.

     

    I've also got my bulk read size set to 250.

     

    -Garin



  • 5.  Re: Problems with alarm_enrichment queue buildup

    Broadcom Employee
    Posted 11-03-2016 07:57 PM

    Hi.

    There is an internal memory usage checker in alarm_enrichment probe.

    These 3 parameters allow the probe to restart itself when memory usage is near its maximum.

     

    Regards,

    Yu



  • 6.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-03-2015 01:03 PM

    That sounds like a very clever feature, but it could explain the problem. Every now and then there might be a burst of alarm messages, and that could very well trigger the behavior you described, making it worse. I wonder why someone thought reducing the bulk size would in any way make it better...

     

    And a restart would reset the bulk size setting again, causing it to quickly chew through the queue, and everything is OK again. Until the next time it happens. My current bulk size is 100, but I guess I could increase it a bit more.

     

    Glad to hear it might be fixed in 8.1. Got a 7.6 -> 8.1 upgrade scheduled in 2 weeks.



  • 7.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-22-2015 02:42 PM

    So, we've run into a bit of a problem over the weekend related to this.

     

    After reading https://na4.salesforce.com/articles/Case_Summary/alarm-enrichment-probe-become-unresponsive?popup=true we attempted to add these config keys so it would auto restart when needed:

     

        lower_memory_usage_threshold_percentage = 0.90

        upper_memory_usage_threshold_percentage = 0.90

        memory_usage_exceeded_threshold = 1

     

    Last night, it ended up restarting so often that the controller stopped restarting it.

     

    So, I guess we have the choice between it stopping processing and it restarting so often that it dies. Good choices :)

     

    It's "functioning" at the moment, but still restarting quite often (every 10-20min). We get the following in the logfile:

     

    feb 22 10:18:47:900 [attach_clientsession, alarm_enrichment] AlarmQueueReader: Upper capacity check : memory (free/used/total): 10139736/102057896/112197632 OR 0.9096261140342071% used

    feb 22 10:18:47:900 [attach_clientsession, alarm_enrichment] AlarmQueueReader upperCount: 1

    feb 22 10:18:48:016 [attach_clientsession, alarm_enrichment] Nas alarm_enrichment INTERNAL_RESTART checkForCapacityToAddMoreAlarms

    feb 22 10:18:48:058 [attach_clientsession, alarm_enrichment] TimeOverThesholdService shutting down

    feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] CmdbMessageEnricherError closing : 'by_source' problem: null

    feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] CmdbMessageEnricherError closing : 'by_source' problem: null

    feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] CmdbMessageEnricherError closing : 'by_source' problem: null

    feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] CmdbMessageEnricherError closing : 'by_source' problem: null

    feb 22 10:18:48:627 [attach_clientsession, alarm_enrichment] Nas: Shutdown complete in 610ms

     

    So, it's clearly the 0.90 threshold restart that is being triggered.

     

    Naturally, these config options aren't documented, so does anyone know what memory_usage_exceeded_threshold actually is? Could it be number of times it needs to breach the memory threshold before restarting?

     

    Also, when it's restarting, it's using 112 MB of memory, which is far below the 1024 MB maximum it is configured with:

     

        ../../../../jre/jre7/bin/java.exe -Xms64m -Xmx1024m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar
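
    To put numbers on that, here is a small sketch that parses the "Upper capacity check" log line quoted above and compares the used memory against both the logged total and the configured -Xmx. The interpretation is an assumption: the logged fraction matches used/total, so the check appears to be made against the heap the JVM has currently allocated rather than against the -Xmx limit. The function name and dictionary keys are made up for the example.

```python
import re

LOG_LINE = (
    "feb 22 10:18:47:900 [attach_clientsession, alarm_enrichment] "
    "AlarmQueueReader: Upper capacity check : memory (free/used/total): "
    "10139736/102057896/112197632 OR 0.9096261140342071% used"
)


def heap_ratios(line, xmx_bytes):
    """Extract free/used/total from a capacity-check line and return the
    used fraction relative to the logged total and relative to -Xmx."""
    match = re.search(r"\(free/used/total\): (\d+)/(\d+)/(\d+)", line)
    if match is None:
        raise ValueError("not a capacity-check log line")
    free, used, total = (int(group) for group in match.groups())
    return {
        "free": free,
        "used": used,
        "total": total,
        "used_of_total": used / total,    # what the probe seems to log
        "used_of_xmx": used / xmx_bytes,  # relative to the configured maximum
    }


if __name__ == "__main__":
    ratios = heap_ratios(LOG_LINE, xmx_bytes=1024 * 1024 * 1024)
    print(f"used/total = {ratios['used_of_total']:.4f}")
    print(f"used/Xmx   = {ratios['used_of_xmx']:.4f}")
```

    If that reading is right, a JVM started with -Xms64m can cross the 0.90 threshold while using less than a tenth of its 1024 MB maximum, simply because the heap has not grown yet.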

     

    Increasing -Xms might at least increase the time between restarts, as it would hopefully have more memory allocated initially, but hard to ****. How to tune the memory settings in alarm_enrichment isn't really documented either. Since its config is part of the nas config, I'm guessing it might be different from the other Java probes? Does anyone know?

     

    Adding the usual:

        <startup>
            <opt>
                java_mem_init = -Xms512m
            </opt>
        </startup>

    to nas.cfg doesn't do any good at least.



  • 8.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-23-2015 04:24 AM

    So forgive me for asking the obvious but the parameters and text refer to percent using a fraction.

     

     

    What happens if you use

     

    lower_memory_usage_threshold_percentage = 90

    upper_memory_usage_threshold_percentage = 90

    memory_usage_exceeded_threshold = 1

     

    -Garin



  • 9.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-23-2015 01:24 PM

    Good question! It's logging "0.9096261140342071% used", so it seems to be using %/100 internally. And from my testing, it seems to be correct to use the 0.<percent> fractions. As we all know.. percent is hard.

     

    However, after some playing around with it, memory_usage_exceeded_threshold seems to be how many times it can go above the threshold before it should trigger a restart. Counter is based on normal upper/lower mechanics it seems. So it needs to go above upper to trigger, but under lower to reset.
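
    The behavior observed above can be written down as a small counter with upper/lower thresholds. This is a sketch of what testing suggests, not the probe's real code: the class name is made up, and it assumes every check above the upper threshold increments the counter and that a restart fires once the counter reaches memory_usage_exceeded_threshold.

```python
class MemoryRestartCounter:
    """Hypothetical model of the restart check: count readings above
    `upper`, reset the count on a reading below `lower`."""

    def __init__(self, lower=0.90, upper=0.90, exceeded_threshold=1):
        self.lower = lower
        self.upper = upper
        self.exceeded_threshold = exceeded_threshold
        self.count = 0

    def observe(self, used_fraction):
        """Record one memory check; return True if a restart would fire."""
        if used_fraction > self.upper:
            self.count += 1
        elif used_fraction < self.lower:
            self.count = 0
        return self.count >= self.exceeded_threshold


if __name__ == "__main__":
    strict = MemoryRestartCounter(exceeded_threshold=1)
    print(strict.observe(0.9096))  # a single reading above 0.90 fires at once

    relaxed = MemoryRestartCounter(exceeded_threshold=3)
    for reading in (0.91, 0.92, 0.50, 0.91, 0.91, 0.91):
        print(reading, relaxed.observe(reading))
```

    With a threshold of 1, one reading above 0.90 is enough to restart. With a threshold of 3, a dip below the lower threshold resets the count, so short spikes no longer trigger a restart.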

     

    After increasing memory_usage_exceeded_threshold from 1 to 3, we stopped getting the restarts every 10-20 min.

     

    We currently have bulk_read_size set to 250, but I'm thinking we should perhaps reduce it so that the memory footprint after each bulk read is smaller, because clearly there is something fishy in the timing between when the heap size is increased and when the memory usage check is performed.

     

    Alternatively, if someone has figured out how to configure the initial heap size, increasing it from 64M to 512M or so might help, possibly along with some GC tuning.



  • 10.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-23-2015 08:00 AM
    Heh, that was my first comment when I read the article. Why is it called a percentage if it doesn't use one as a value?

    -jon


  • 11.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-23-2015 10:44 PM


  • 12.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-25-2015 01:58 PM

    That article doesn't seem to provide a solution either. First of all, the alarm_enrichment queue only subscribes to the alarm subject, so one can wonder how the QOS_MESSAGE subjects are getting in there in their example.

     

    Secondly, it doesn't stop processing the queue. The Sent value on the queue is steadily increasing, so it's not a broken message that is blocking alarm_enrichment. It's just not grabbing and processing messages fast enough anymore.

     

    Thirdly, it does not trigger the restart mentioned earlier in this thread when this happens, so I guess that setting did nothing good (for this problem at least).

     

    After restarting alarm_enrichment, it processes the queues that it was previously unable to handle in less than 20 sec (from 70k messages to 0). So there is something else going on here.



  • 13.  Re: Problems with alarm_enrichment queue buildup

    Posted 02-25-2015 04:44 PM
    Did anyone else notice that 112 is roughly 11% of 1024, which is above 0.90%? Maybe try 90? I gave up on it and took it out of the alert stream. Good luck, and note that you'll have to run one, along with a nas, on every hub where you want predictive alerting, together with the prediction and baseline engines, as per the change logs for snmpcollector and ICMP.