The file you are looking for is the apm-events-thresholds-config.xml file in the <EM_HOME>/config directory. By default, it contains an entry described as:
Per EM limit. Takes into account metrics with Smartstor data (i.e. live and historical metrics)
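For reference, the entry looks roughly like the sketch below. The element names are as best I recall them (the file uses "clamp" entries in the 9.x releases); verify against your own apm-events-thresholds-config.xml, since the schema can vary by version. The 1,200,000 value is the default discussed later in this thread.

```xml
<!-- Sketch only: confirm element names and layout against your own
     apm-events-thresholds-config.xml before editing. -->
<clamp id="introscope.enterprisemanager.metrics.historical.limit">
  <description>Per EM limit. Takes into account metrics with Smartstor data (i.e. live and historical metrics)</description>
  <threshold value="1200000"/>
</clamp>
```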
EDIT: As of, I believe, version 9.1, the previous properties in the IntroscopeEnterpriseManager.properties file, such as introscope.enterprisemanager.metrics.historical.limit, no longer apply; this file is where such limits are now stored.
Oh, and just in case, make sure that you don't have an extra introscope.enterprisemanager.metrics.historical.limit property floating around in the IntroscopeEnterpriseManager.properties file.
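A quick way to check for a stale override is to grep the properties file. The snippet below demonstrates this against a scratch file so it is safe to run as-is; point EM_HOME at your real install when checking for real.

```shell
# Demo setup: scratch EM_HOME with a leftover property in it.
# (Replace this block with EM_HOME=/path/to/your/install in practice.)
EM_HOME="$(mktemp -d)"
mkdir -p "$EM_HOME/config"
printf '%s\n' \
  "introscope.enterprisemanager.metrics.historical.limit=1200000" \
  > "$EM_HOME/config/IntroscopeEnterpriseManager.properties"

# The actual check: any hit here means the old property is still present
# and should be removed, since the limit now lives in
# apm-events-thresholds-config.xml.
grep -n "introscope.enterprisemanager.metrics.historical.limit" \
  "$EM_HOME/config/IntroscopeEnterpriseManager.properties"
```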
We are facing this problem in our 9.5.2 environment. Was this bug reintroduced with the 9.5.2 upgrade? We used to have collectors with well over 1.2 million metrics in our 9.1.4 environment. 1.2 million metrics seems to be a small number considering we have over 1000 agents, 5 collectors, and all collectors are only running at about 25% utilization. We are not going to add additional collectors simply because the CA default value is 1.2 million.
There seems to be a lot of threads here... but the IMMEDIATE solution is to simply "drop" the smartstors - and get APM back in business. Simply stop the collectors, rename the ~/data directory, and restart - and the historical metrics are eliminated.
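The stop/rename/restart sequence can be sketched as below. It runs against a throwaway directory so it is safe to execute as shown; in practice, set EM_HOME to your real install, and note that the EMCtrl.sh commands are an assumption about your service setup (they are commented out here).

```shell
# Stand-in install dir so this demo is safe to run; use your real
# EM_HOME in practice.
EM_HOME="$(mktemp -d)/introscope"
mkdir -p "$EM_HOME/data"     # pretend SmartStor data exists

# 1. Stop the collector first (platform-specific; EMCtrl.sh is an
#    assumption about your setup):
#      "$EM_HOME/bin/EMCtrl.sh" stop

# 2. Rename the data directory. The EM creates a fresh one on restart,
#    which eliminates the accumulated historical metrics.
mv "$EM_HOME/data" "$EM_HOME/data.old.$(date +%Y%m%d)"

# 3. Restart:
#      "$EM_HOME/bin/EMCtrl.sh" start
```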
Next, let's address some apparent confusion between "historical" and "live" metrics.
"Live" metrics are those that are being generated by the agent and reporting every 7.5 seconds to the Collector. This is later aggregated to give the 15 sec reporting interval. Check out the APM Performance and Capacity Management Guide... on the BookShelf.
Historical metrics are ANY metric that has been "aged out" of the current display. This happens after 60 minutes, when a metric goes from "live" to "historical". This is simply metadata about the agent that is preserved, so that IF and WHEN an agent reconnects with that metric, the Collector 'knows' that metric and reuses the metadata. This scheme works great... until it doesn't... and then excess METADATA (the historical metrics count) accumulates until the working memory of the Collector is compromised... and bad things happen.
The SmartStor also stores metric data (no strings) for up to one year. This is, technically, "historical" data... but this IS NOT the same thing as the historical-metric-metadata we are talking about.
So why not just add... like - unlimited RAM??? Simple. The metadata is a DATA STRUCTURE which is navigated whenever a metric arrives (in the simplest sense). The bigger the data structure, the longer the time spent 'walking' that data structure in order to find the metadata. Of course, there is a cache - which is what the 60 minute interval is for - so that 'active' metrics are found quickly.... but the problem is the growing size of the "historical metrics" - those which have been inactive for at least 60 minutes.
So how do we end up with excess historical metrics? Lots of ways...
#1 offender - unique SQL statements. Basically a statement that is encountered once... AND NEVER AGAIN. We set aside (5) metrics every time we hit one of those puppies... which simply flood the metadata. The solution is to do a "SQL NORMALIZATION" which effectively 'wild-cards' those variable SQL metrics into a SINGLE statement - and thus only (5) metrics will be collected... no matter how many statements are encountered.
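One way to do that normalization is the regex-based SQL normalizer in the agent profile. The property names below are from memory (a RegexSqlNormalizer shipped with the 9.x agents) and should be treated as a sketch; confirm them against the comments in your version's IntroscopeAgent.profile.

```properties
# Sketch only -- property names as I recall them; verify against your
# IntroscopeAgent.profile before relying on this.
introscope.agent.sqlagent.normalizer.extension=RegexSqlNormalizer
introscope.agent.sqlagent.normalizer.regex.matchFallThrough=false
introscope.agent.sqlagent.normalizer.regex.keys=key1

# key1: collapse numeric literals, so "WHERE id = 12345" and
# "WHERE id = 67890" both normalize to "WHERE id = ?" -- five metrics
# total, no matter how many distinct statements arrive.
introscope.agent.sqlagent.normalizer.regex.key1.pattern=\d+
introscope.agent.sqlagent.normalizer.regex.key1.replaceAll=true
introscope.agent.sqlagent.normalizer.regex.key1.replaceFormat=?
introscope.agent.sqlagent.normalizer.regex.key1.caseSensitive=false
```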
#2 offender - lots of variety in agent naming. This is often the case in the initial deployment, when agent names might change. Since the metric has a FDN (fully distinguished name) of PLATFORM|PROCESS|AGENT - this can create huge piles of metadata (historical metrics) - which are never seen again.
#3 offender - web services which themselves are just tons of unique calls - imagine a unique transaction ID carried by a single web service call - spews historical metrics like mad. This problem is harder to deal with but the quick solution is always TURN IT OFF - until you have time to correct the configuration.
Figuring out which type of problem you have means actually looking at the Workstation Investigator and looking for huge, steaming piles of 'greyed-out' metrics - these are the source of the problem. Anything else... and you will likely be chasing your tail for weeks and weeks. The easy place to start is to simply look for agents that have an excessive number of LIVE metrics (>8k)... and then check the usual suspects. This works pretty much EVERY TIME ;-) No exotic settings required.
spyderjacks, Thanks for this great write up. All of these suggestions are good options for reducing the number of live and historical metrics; however, most of them are not viable or practical for us and do not address the underlying problem: the Enterprise Managers are simply overriding custom historical metric thresholds with the default of 1.2 million. Also, we are regularly asked for historical data, so simply dropping the SmartStor DB and starting fresh is not an option. To me it's pretty obvious that this is a bug within Introscope, because our limit has been set at 2m for over a year and recently, without changing any settings (in fact after pruning the DB on all 5 collectors), defaulted back to 1.2m after our collectors and MOM were restarted. After another restart of the EMs, who knows, they will probably go back to accepting our custom threshold.
I personally appreciate the suggestions you've made, and I'm sure they will help some customers, but we are not going to make a ton of changes for something CA should be held responsible for. We pay a ton of money for this software, and we expect it to work as advertised. Why have an ability to change a threshold if it's not going to make a difference?
I have opened a priority 2 case with CA for this and expect them to provide a resolution that does not require dropping smartstor, turning off unique SQL statement (which we have already done for certain apps), or turning off Web Services data (which is of high importance to our company).
And spyderjacks, please don't think I'm upset with you or your suggestions (I think they are great options); there have just been several things that have been highly frustrating of late with this product, especially the whole implementation of domains and the need to restart the entire environment to add new ones or update existing ones. That is an entirely different issue altogether that I could rant about all day. The feeling that we get whenever we open cases with CA is that the problem is isolated to our environment and no other customers are experiencing it. Well, this thread is proof that that isn't true, and at some point we need to stop fixing things for CA when they should be truly fixing them in software releases. OK, I'm done ranting. Thanks again, spyderjacks, for your suggestions!
Sorry to hear of your frustration. With regard to the threshold values being ignored – this does appear to be a problem that has been experienced, and there is a workaround that you can employ.
Once the EM has started, if you manually change the value in the xml file then that will force a reload, and that seems to always be picked up – the loading at startup seems to sometimes not pick up all of the threshold values from the file and takes some as default. I only know of this having been reported in 9.5.2 or 9.5.3.
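The "nudge the value to force a hot reload" workaround can be sketched as below. It runs against a scratch copy of the file so it is safe to execute as shown; the element layout inside the heredoc is an assumption – mirror whatever your real apm-events-thresholds-config.xml actually contains.

```shell
# Scratch copy of the config so this demo is safe to run; edit the real
# <EM_HOME>/config/apm-events-thresholds-config.xml in practice.
WORK="$(mktemp -d)"
cat > "$WORK/apm-events-thresholds-config.xml" <<'EOF'
<clamp id="introscope.enterprisemanager.metrics.historical.limit">
  <threshold value="2000000"/>
</clamp>
EOF

# Nudge the value (2m -> 2.1m). The running EM watches this file, so the
# edit is picked up without a restart; nudge it back the same way if the
# limit reverts to the default after the next restart.
sed -i 's/value="2000000"/value="2100000"/' \
  "$WORK/apm-events-thresholds-config.xml"

grep 'value="2100000"' "$WORK/apm-events-thresholds-config.xml"
```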
I updated the threshold from 2m to 2.1m on all collectors and the MOM. As you mentioned, the change was picked up on the fly, and we are no longer dropping metrics. Thanks so much for providing a viable, simple solution. After collectors are restarted from now on, we'll just make sure to update this threshold from 2.1m to 2m or vice versa. We never let the collectors actually get to that many historical metrics (we usually prune them around 1.7 million), but we don't like dropping metrics because it can burn us.
Glad it worked for you – will get a Tech Doc published for this – thought I had done one already but evidently not
This is a great thread with lots of information. We have struggled with historical metrics for quite some time. I have a question.
If historical metrics are truly metadata, is it possible to delete all historical metrics every night or every week? I understand there might be some overhead the first time any agent sends metrics but we won't face all the issues we see in this thread.
Is this feasible?
There are the historical metrics (*.data files in <EM_Home>/data/archive) and the (historical) metrics metadata (metrics.metadata). You can stop the MOM and all collectors, delete all the *.data files from the archive directory you don't need anymore, and also delete the metrics.metadata file (including backups). The EMs will automatically rebuild the metrics.metadata file after restart. The EM is built to be very resilient and handle (or clamp on) all kinds of abuse!
DON'T DO THIS!
But again: this only cures the symptoms, not the root cause of the issue. Try to get your metric count way below 5000, maybe 8000 for portals. The number of meaningful/useful metrics usually is again at least one order of magnitude lower than that. How do you identify those KPIs? Read @spyderjacks' book, his blog/posts, or attend his session here at #caworld.
My apologies: this was a misunderstanding on my part! The EM will NOT rebuild metadata for already stored metrics. They will still be in SmartStor but not accessible.
You can only move the SmartStor + metadata file to a "read only" archive EM/Cluster and start your live cluster from scratch. The archive cluster needs a lot fewer resources because there is no real-time crunching and storing of the metrics, only query access. It can run on "normal" VMs that don't need reserved resources. You might even shut it down and only power it up if you really need to query data. Track how often you really access the data and after a few months take a look at your records. Then you will most probably be convinced that a few screenshots or reports suffice for most of the cases.
Again my apologies for causing confusion here,