I've found the following information about the Historical Metrics limitation in the Bookshelf for version 9.5.x:
Note about introscope.enterprisemanager.metrics.historical.limit:
The default value of the introscope.enterprisemanager.metrics.historical.limit property is set to 1.2 million. This value can be increased to 5 million without any significant performance impact. (74765, 76890)
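For reference, this property lives in the Enterprise Manager's IntroscopeEnterpriseManager.properties file. A sketch of the relevant line, with the documented default (verify the exact value in your own install before changing it):

```properties
# IntroscopeEnterpriseManager.properties
# Default is 1200000 (1.2 million). Per the Bookshelf note, it can be
# raised to 5000000 (5 million) without significant performance impact.
introscope.enterprisemanager.metrics.historical.limit=1200000
```

The EM must be restarted for changes to this property to take effect.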
Does this mean that the EM supports 5 million instead of 1.2 million, or that the value can only be increased up to 5 million?
I've installed version 9.7, each Collector has 2.5 million historical metrics, and I constantly see peaks in Smartstor Duration (>3500 ms).
I'm investigating the I/O in parallel, but one of the known causes of high Smartstor Duration is metric overload.
Should I be worried about the 2.5 million historical metrics on version 9.7?
All of the recommendations assume that SmartStor is on a dedicated disk with dedicated I/O and that the traces folder is in a separate location from SmartStor; however, the documentation does say that traces data can be collocated for a few Collectors.
Without the ideal configuration, you are going to get I/O contention, which is probably one reason the Smartstor Duration is getting higher.
Another thing that drives Smartstor Duration higher is a constantly changing set of reporting metrics, and that would also explain, to a certain extent, how the historical metric count could get so high.
It sounds like the data is really not under control and there are one or more agents with metric explosions.
More historical data is going to make queries take longer to complete and that will then have an impact on Harvest Duration.
Raising the limit to 5 million is a possibility, but if performance is not good at 2.5 million, I wouldn't raise the value any higher. The value can be set anywhere up to 5 million, and indeed even higher than that, but then you can definitely expect the system to grind to a halt eventually. Some systems cope with 2.5 million historical metrics; some struggle at 1 million.
Apart from maybe reducing the data retention period for tier3 SmartStor data, as mentioned above, it's worth reviewing the agents that report a lot of metrics and seeing if they can be better controlled. That is likely to help not only the Enterprise Manager but probably application performance as well. It also helps to increase the Collector heap if it isn't anywhere near 4 GB.
I would agree with David on this: you have several issues that need to be researched to find why your Smartstor Duration is above the 3.5 second level.
The 1.2 million clamp applies to the whole EM, while the default per-agent metric clamp is 50,000, so if the EM is holding 24 times a single agent's metric allowance, there's a really good chance you have issues with metrics. Typically your historic metric load will vary with the re-periodization configuration you have set. Our settings are 15 seconds for 7 days, 60 seconds for the next 23 days (30 days total), then 10 minutes for the next 60 days (90 days total). This has worked pretty well, and our historic metric count stays below 1.2 million.
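As a sketch, the re-periodization settings described above would look something like this in IntroscopeEnterpriseManager.properties (frequency is in seconds, age in days; double-check the property names and defaults against your own version's Bookshelf before editing):

```properties
# SmartStor re-periodization tiers (frequency in seconds, age in days)
introscope.enterprisemanager.smartstor.tier1.frequency=15
introscope.enterprisemanager.smartstor.tier1.age=7
introscope.enterprisemanager.smartstor.tier2.frequency=60
introscope.enterprisemanager.smartstor.tier2.age=23
introscope.enterprisemanager.smartstor.tier3.frequency=600
introscope.enterprisemanager.smartstor.tier3.age=60
```

Shortening the tier3 age is the usual lever for trimming long-term historic metric volume.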
We have worked with the application team to use more prepared statements with ordered column lists, so that "select a, b where ..." does not also get issued as "select b, a where ...". We have also worked with the mainframe cross-enterprise agent admin to turn off metrics that were storing PID values as metadata, which made the historic metric count grow very large over a month. There is also SQL statement normalization within the Java/.NET agents to trim down the noise and provide actionable metrics.
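To illustrate why column order matters: each distinct statement text becomes a distinct metric, so equivalent-but-reordered statements inflate the metric count. A toy normalizer (purely illustrative, not the agents' actual algorithm) might sort the select list before using the statement as a metric name:

```python
def normalize_select(sql: str) -> str:
    """Sort the column list of a simple 'select ... where ...' statement
    so reordered-but-equivalent statements map to one metric name.
    Toy example only; real agent SQL normalizers are far more thorough."""
    _, _, rest = sql.lower().partition("select ")
    cols_part, sep, tail = rest.partition(" where")
    cols = sorted(c.strip() for c in cols_part.split(","))
    return f"select {', '.join(cols)}{sep}{tail}"

# Both variants collapse to one metric name instead of two:
a = normalize_select("select a, b where id = ?")
b = normalize_select("select b, a where id = ?")
assert a == b == "select a, b where id = ?"
```

The same idea is why prepared statements with a fixed column order keep the SQL metric namespace bounded.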
There are several EM metrics that might be able to help with your research in this matter.
Number of Metrics
Number of Historic Metrics
Graph both over 24 hours, 7 days, and 30 days; you are looking for large (50% to 100%+) growth rates, and the historic metric count should stay roughly in line with the number of metrics.
Historical Metric Count
Error Snapshot Events Per Interval
Transaction Tracing Events per Interval
The agent metrics would help you figure out which agent(s) to target; go after the agents reporting the largest metric counts and work your way down.
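As a sketch of that triage, using hypothetical per-agent counts (the agent names and numbers below are made up for illustration): compute the growth rate between two historic-count samples, then rank agents by metric count to pick targets:

```python
def growth_rate(earlier: int, later: int) -> float:
    """Percentage growth between two metric-count samples."""
    return (later - earlier) / earlier * 100.0

# Hypothetical per-agent live metric counts exported from the EM.
agent_counts = {
    "AppA/agent1": 48_000,
    "AppB/agent2": 12_500,
    "Mainframe/xe1": 95_000,
}

# Flag overall historic growth above the 50% threshold mentioned above.
rate = growth_rate(1_200_000, 2_500_000)
print(f"historic metric growth: {rate:.0f}%")  # well above 50%

# Work the worst offenders first.
for name, count in sorted(agent_counts.items(), key=lambda kv: -kv[1]):
    flag = "  <-- over the 50k default agent clamp" if count > 50_000 else ""
    print(f"{name}: {count}{flag}")
```

An agent sitting at or above the 50,000 clamp is the natural first candidate for instrumentation review.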
On an agent, error and transaction traces carry quite a bit of overhead. For error traces, you might want to triage the errors with the application development team and try to get the error event count down. Transaction traces can be controlled in the agent configuration via the polling period; if you have a very high number of applications in your environment, you might want to increase the period between traces.
Your total Collector capacity will depend heavily on your I/O channel for writing to and reading from physical storage, and on the amount of data (metric values, metric metadata such as labels, and transaction and error traces) being stored within the 7.5 second data window.
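To put a rough number on that, here is a back-of-the-envelope estimate; the metric count and bytes-per-point figures are illustrative assumptions, not CA-documented values:

```python
# Back-of-the-envelope SmartStor write-throughput estimate.
# All figures below are illustrative assumptions.
live_metrics = 500_000   # metrics reporting a value each cycle (assumed)
bytes_per_point = 50     # assumed bytes stored per data point
window_s = 7.5           # data window from the discussion above

mb_per_s = live_metrics * bytes_per_point / window_s / 1e6
print(f"sustained write load: ~{mb_per_s:.1f} MB/s")
```

Even a few MB/s of sustained small writes is enough to suffer badly on contended or shared storage, which is why the dedicated-disk recommendation matters.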
Hope this helps,