DX Application Performance Management


Historical metrics limit

  • 1.  Historical metrics limit

    Posted 04-09-2013 12:02 PM
    Hi,

    I am seeing the below error in the EM logs, and because of it Introscope is not capturing any new metrics.

    4/09/13 02:09:30.237 PM GMT [WARN] [Harvest Engine Pooled Worker] [Manager.Agent] [The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents.  Current count = 1,825,036. Max count = 1,200,000

    I have checked the APM status console; no clamp has been applied so far for the historical metrics limit. I have also set introscope.enterprisemanager.metrics.historical.limit=5000000 to increase the historical limit.

    But even after that, the same error keeps appearing. I am wondering where this max count of 1,200,000 comes from when I have set the limit to 5,000,000.

    Can someone please clarify and help me in fixing this issue?

    Thanks,
    Karthik


  • 2.  RE: Historical metrics limit

    Posted 04-09-2013 06:59 PM
    nkarthik,

    introscope.enterprisemanager.metrics.historical.limit=5000000 takes into account live plus historical metrics, not just historical. That said, did you restart your MOM as well as your Collectors?

    take a look at the following properties:

    transport.override.isengard.high.concurrency.pool.max.size=
    transport.override.isengard.high.concurrency.pool.min.size=

    transport.outgoingMessageQueueSize=

    You might want to tweak these numbers a bit and restart your cluster to see if that helps.

    Also, increase your introscope.enterprisemanager.metrics.live.limit to 1,000,000.

    Let me know if this helps.


  • 3.  RE: Historical metrics limit

    Posted 04-10-2013 04:01 AM
    Hi,

    I have updated the settings mentioned above and restarted everything, but it still didn't work. My question is: where does this max count of 1,200,000 come from?

    If I know that, I will tweak that value.

    Thanks,
    Karthik


  • 4.  RE: Historical metrics limit

    Posted 04-10-2013 11:35 AM

    The file you are looking for is apm-events-thresholds-config.xml in the <EM_HOME>/config directory. By default it contains the following entry:

            <clamp id="introscope.enterprisemanager.metrics.historical.limit">
                <description>
                    Per EM limit. Takes into account metrics with Smartstor data (i.e. live and historical metrics)
                </description>
                <threshold value="1200000"/>
            </clamp>

    EDIT: As of version 9.1 (I believe), the previous properties in the IntroscopeEnterpriseManager.properties file, such as introscope.enterprisemanager.metrics.historical.limit, no longer apply; this file is now where such limits are stored.



  • 5.  RE: Historical metrics limit

    Posted 04-10-2013 11:53 AM
    Hi,

    Please see the below entry from that file

    <clamp id="introscope.enterprisemanager.metrics.historical.limit">
    <description>
    Per EM limit. Takes into account metrics with Smartstor data (i.e. live and historical metrics)
    </description>
    <threshold value="5000000"/>
    </clamp>

    I have set the limit to 5000000 here as well, but it is still taking 1200000 from somewhere. That is why I am confused and would like clarification.

    Thanks,
    Karthik


  • 6.  RE: Historical metrics limit

    Posted 04-10-2013 12:00 PM
    This is probably a silly question, but have you A) restarted the EM(s) since updating the apm-events-thresholds-config.xml file(s) and B) made that change for all EMs in the cluster (if you're using one)?

    Unfortunately, I can only find information in the documentation that you already are familiar with: https://support.ca.com/cadocs/1/CA%20Application%20Performance%20Management%209%201%201-ENU/Bookshelf_Files/HTML/APM_Config_Admin_EN/1826604.html#o1478550

    It's possible that you may need to open a Support ticket. From the sound of it, you're doing everything correctly.


  • 7.  RE: Historical metrics limit

    Posted 04-10-2013 12:03 PM

    Oh, and just in case, make sure that you don't have an extra introscope.enterprisemanager.metrics.historical.limit property floating around in the IntroscopeEnterpriseManager.properties file.
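    A quick grep can reveal a stray duplicate. This is a hedged sketch against a throwaway sample file (the real path would be <EM_HOME>/config/IntroscopeEnterpriseManager.properties); a count above 1 means a later duplicate entry is silently overriding your value, since the last occurrence of a key wins when a properties file is loaded.

```shell
# Build a sample properties file in a temp dir (stand-in for the real
# <EM_HOME>/config/IntroscopeEnterpriseManager.properties).
EM_HOME="$(mktemp -d)"
PROPS="$EM_HOME/IntroscopeEnterpriseManager.properties"
cat > "$PROPS" <<'EOF'
introscope.enterprisemanager.metrics.live.limit=500000
introscope.enterprisemanager.metrics.historical.limit=5000000
# ... other settings ...
introscope.enterprisemanager.metrics.historical.limit=1200000
EOF

# Count occurrences of the historical-limit key; more than 1 means a
# duplicate entry is overriding the value you set.
grep -c '^introscope\.enterprisemanager\.metrics\.historical\.limit=' "$PROPS"
# prints: 2
```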



  • 8.  RE: Historical metrics limit

    Posted 04-22-2013 12:16 PM
    Hi,

    Finally we found why we are getting this error.

    4/15/13 12:03:15.108 PM GMT [WARN] [Harvest Engine Pooled Worker] [Manager.Agent] [The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents.  Current count = 1,821,838. Max count = 1,200,000

    This error started occurring after we increased the tablespace of the APM database. But I am wondering how that affects the historical metrics clamp.

    Can someone please help me understand how the tablespace change is causing this issue?

    Thanks,
    Karthik


  • 9.  RE: Historical metrics limit

    Posted 04-22-2013 07:38 PM
    How many agents do you have? What is the average metric load per agent?

    You should target your metric load at ~5,000 per agent. Any more than that and you may have a metric explosion happening.

    You can certainly adjust the thresholds, but consider the above first.


  • 10.  RE: Historical metrics limit

    Posted 06-14-2013 07:37 AM
    This week I also had this problem. It is a known error in 9110 and 9111. So you have to upgrade.

    Bug #7868: Collector threshold missed with large SmartStor
    Symptoms: When bouncing a Collector, it sometimes does not pick up the
    thresholds from apm-events-thresholds-config.xml. The only way we found
    to make it work was to recreate the SmartStor DB; upon doing that it
    started working.
    Analysis: The Spring framework dependency injection, which was used for
    clamp setting read/notification, failed because the dependency waiter
    timed out, due to the long EM startup time caused by a large SmartStor.
    Solution: Fixed in code; increased the Spring dependency waiter timeout
    from the default of 5 minutes to 15 minutes.


  • 11.  Re: Historical metrics limit

    Posted 09-09-2014 05:04 PM

    We are facing this problem in our 9.5.2 environment. Was this bug reintroduced with the 9.5.2 upgrade? We used to have Collectors with well over 1.2 million metrics in our 9.1.4 environment. 1.2 million metrics seems like a small number considering we have over 1000 agents and 5 Collectors, and all Collectors are only running at about 25% utilization. We are not going to add additional Collectors simply because the CA default value is 1.2 million.



  • 12.  Re: Historical metrics limit

    Posted 09-09-2014 06:09 PM

    There seem to be a lot of threads here... but the IMMEDIATE solution is to simply "drop" the SmartStors and get APM back in business. Simply stop the Collectors, rename the <EM_HOME>/data directory, and restart; the historical metrics are eliminated.
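    The "move the data directory aside" step can be sketched in a few shell commands. This is illustrative only: the temp directory stands in for a real <EM_HOME>, and the actual stopping and starting of the Collector is site-specific and not shown.

```shell
# Stand-in for a real <EM_HOME>; SmartStor lives under <EM_HOME>/data.
EM_HOME="$(mktemp -d)"
mkdir -p "$EM_HOME/data"
touch "$EM_HOME/data/metrics.metadata"   # pretend SmartStor content

# 1. Stop the Collector first (site-specific, not shown).
# 2. Move the data directory aside instead of deleting it, so it can be
#    restored if needed; the EM recreates an empty one on startup.
mv "$EM_HOME/data" "$EM_HOME/data.old.$(date +%Y%m%d)"

# 3. Restart the Collector (site-specific, not shown).
```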

     

    Next, let's address some apparent confusion amongst "historical" and "live" metrics.

     

    "Live" metrics are those that are being generated by the agent and reporting every 7.5 seconds to the Collector.  This is later aggregated to give the 15 sec reporting interval.  Check out the APM Performance and Capacity Management Guide... on the BookShelf.

     

    Historical metrics are ANY metric that has been "aged out" of the current display. This happens after 60 minutes, when a metric goes from "live" to "historical". This is simply metadata about the agent that is preserved, so that IF and WHEN an agent reconnects with that metric, the Collector 'knows' that metric and reuses the metadata. This scheme works great... until it doesn't... and then excess METADATA (the historical metrics count) accumulates until the working memory of the Collector is compromised... and bad things happen.

     

    The SmartStor also stores metric data (no strings) for up to one year. This is, technically, "historical" data, but it IS NOT the same thing as the historical-metric metadata we are talking about.

     

    So why not just add... like, unlimited RAM??? Simple. The metadata is a DATA STRUCTURE which is navigated whenever a metric arrives (in the simplest sense). The bigger the data structure, the longer the time spent 'walking' it to find the metadata. Of course, there is a cache, which is what the 60 minute interval is for, so that 'active' metrics are found quickly... but the problem is the growing size of the "historical metrics": those which have been inactive for at least 60 minutes.

     

    So how do we end up with excess historical metrics?  Lots of ways...

    #1 offender: unique SQL statements. Basically, a statement that is encountered once... AND NEVER AGAIN. We set aside (5) metrics every time we hit one of those puppies... which simply floods the metadata. The solution is to apply "SQL NORMALIZATION", which effectively wild-cards those variable SQL statements into a SINGLE statement, so only (5) metrics will be collected no matter how many statements are encountered.
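    For reference, SQL normalization is typically configured in the agent profile. The property names below are recalled from the agent's regex SQL normalizer extension and may differ by version; treat this as a hedged sketch and verify against the IntroscopeAgent.profile documentation for your release.

```properties
# Hedged sketch: enable the regex-based SQL normalizer (verify property
# names for your agent version).
introscope.agent.sqlagent.normalizer.extension=RegexSqlNormalizer
introscope.agent.sqlagent.normalizer.regex.keys=key1
# Collapse literal numeric values so each statement shape maps to one
# metric set instead of one per unique literal.
introscope.agent.sqlagent.normalizer.regex.key1.pattern=[0-9]+
introscope.agent.sqlagent.normalizer.regex.key1.replaceAll=true
introscope.agent.sqlagent.normalizer.regex.key1.replaceFormat=?
introscope.agent.sqlagent.normalizer.regex.key1.caseSensitive=false
```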

     

    #2 offender: lots of variety in agent naming. This is often the case in an initial deployment, when agent names might change. Since the metric has an FDN (fully distinguished name) of PLATFORM|PROCESS|AGENT, this can create huge piles of metadata (historical metrics) which are never seen again.

     

    #3 offender: web services that are themselves just tons of unique calls. Imagine a unique transaction ID carried in each web service call; it spews historical metrics like mad. This problem is harder to deal with, but the quick solution is always to TURN IT OFF until you have time to correct the configuration.

     

    Figuring out which type of problem you have means actually looking at the Workstation Investigator for huge, steaming piles of 'greyed-out' metrics; these are the source of the problem. Anything else and you will likely be chasing your tail for weeks and weeks. The easy place to start is to look for agents with an excessive number of LIVE metrics (>8k)... and then check the usual suspects. This works pretty much EVERY TIME ;-) No exotic settings required.



  • 13.  Re: Historical metrics limit

    Posted 09-10-2014 09:10 AM

    spyderjacks, thanks for this great write-up. All of these suggestions are good options for reducing the number of live and historical metrics. However, most of them are not viable or practical for us, and they do not address the underlying problem: the Enterprise Managers are simply overriding custom historical metric thresholds with the default of 1.2 million. Also, we are regularly asked for historical data, so simply dropping the SmartStor DB and starting fresh is not an option. To me it's pretty obvious that this is a bug within Introscope, because our limit has been set at 2m for over a year and recently, without any settings changing (in fact, after pruning the DB on all 5 collectors), it defaulted back to 1.2m after our collectors and MOM were restarted. After another restart of the EMs, who knows, they will probably go back to accepting our custom threshold.

     

    I personally appreciate the suggestions you've made, and I'm sure they will help some customers, but we are not going to make a ton of changes for something CA should be held responsible for. We pay a ton of money for this software, and we expect it to work as advertised. Why have an ability to change a threshold if it's not going to make a difference?

     

    I have opened a priority 2 case with CA for this and expect them to provide a resolution that does not require dropping SmartStor, turning off unique SQL statements (which we have already done for certain apps), or turning off Web Services data (which is of high importance to our company).

     

    And spyderjacks, please don't think I'm upset with you or your suggestions (I think they are great options); there have just been several things that have been highly frustrating of late with this product, especially the whole implementation of domains and the need to restart the entire environment to add new ones or update existing ones. That is an entirely different issue that I could rant about all day. The feeling we get whenever we open cases with CA is that the problem is isolated to our environment and no other customers are experiencing it. Well, this thread is proof that that isn't true, and at some point we need to stop fixing things for CA when they should truly be fixing them in software releases. OK, I'm done ranting. Thanks again, spyderjacks, for your suggestions!



  • 14.  Re: Historical metrics limit

    Posted 09-10-2014 09:50 AM

    Hi,

     

    Sorry to hear of your frustration. With regard to the threshold values being ignored: this does appear to be a problem that has been experienced, and there is a workaround that you can employ.

     

    Once the EM has started, if you manually change the value in the XML file, that forces a reload, and the new value always seems to be picked up. The loading at startup sometimes does not pick up all of the threshold values from the file and takes some as defaults. I only know of this having been reported in 9.5.2 or 9.5.3.
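    The workaround can be as simple as editing the value in place and saving, which the running EM picks up. Here is a sketch against a throwaway copy of the file; the real one lives at <EM_HOME>/config/apm-events-thresholds-config.xml.

```shell
# Throwaway copy of the clamp file (real path:
# <EM_HOME>/config/apm-events-thresholds-config.xml).
CONF="$(mktemp -d)/apm-events-thresholds-config.xml"
cat > "$CONF" <<'EOF'
<clamp id="introscope.enterprisemanager.metrics.historical.limit">
    <description>Per EM limit (live and historical metrics)</description>
    <threshold value="2000000"/>
</clamp>
EOF

# Bump the threshold (2.0m -> 2.1m); saving the edited file after the EM
# is up forces the clamp to be reloaded.
sed -i 's/value="2000000"/value="2100000"/' "$CONF"
grep 'threshold value' "$CONF"
```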

     

    Thanks

    Mike



  • 15.  Re: Historical metrics limit

    Posted 09-10-2014 10:08 AM

    Thanks, Mike!

     

    I updated the threshold from 2m to 2.1m on all Collectors and the MOM. As you mentioned, the change was picked up hot, and we are no longer dropping metrics. Thanks so much for providing a viable, simple solution. After Collectors are restarted from now on, we'll just make sure to update this threshold from 2.1m to 2m or vice versa. We never let the Collectors actually reach that many historical metrics; we usually prune them around 1.7 million, but we don't like dropping metrics because it can burn us.

     

    Thanks again!

    Brad



  • 16.  Re: Historical metrics limit

    Posted 09-10-2014 10:41 AM

    Hi Brad,

     

    Glad it worked for you. I will get a Tech Doc published for this; I thought I had done one already, but evidently not.

     

    Cheers

    Mike



  • 17.  Re: Historical metrics limit

    Posted 11-07-2014 07:47 PM

    This is a great thread with lots of information. We have struggled with historical metrics for quite some time. I have a question.

     

    If historical metrics are truly metadata, is it possible to delete all historical metrics every night or every week? I understand there might be some overhead the first time an agent sends metrics again, but we wouldn't face all the issues we see in this thread.

     

    Is this feasible?



  • 18.  Re: Historical metrics limit

    Posted 11-09-2014 10:22 AM

    There are the historical metrics (*.data files in <EM_Home>/data/archive) and the (historical) metrics metadata (metrics.metadata). You can stop the MOM and all Collectors, delete all the *.data files you no longer need from the archive directory, and also delete the metrics.metadata file (including backups). The EMs will automatically rebuild the metrics.metadata file after restart. The EM is built to be very resilient and to handle (or clamp on) all kinds of abuse!

     

    DON'T DO THIS!

     

    But again: this only cures the symptoms, not the root cause of the issue. Try to get your metric count well below 5000 per agent, maybe 8000 for portals. The number of meaningful/useful metrics is usually at least one order of magnitude lower than that. How do you identify those KPIs? Read @spyderjacks' book, his blog posts, or attend his session here at #caworld.



  • 19.  Re: Historical metrics limit

    Posted 01-06-2015 01:35 PM

    Hi,

     

    My apologies: this was a misunderstanding on my part! The EM will NOT rebuild metadata for already stored metrics. They will still be in SmartStor but not accessible.

     

    You can only move the SmartStor + metadata file to a "read only" archive EM/cluster and start your live cluster from scratch. The archive cluster needs far fewer resources because there is no real-time crunching and storing of metrics, only query access. It can run on "normal" VMs that don't need reserved resources. You might even shut it down and only power it up when you really need to query data. Track how often you really access the data, and after a few months take a look at your records. You will most probably be convinced that a few screenshots or reports suffice for most cases.

     

    Again my apologies for causing confusion here,

    Guenter