I have a MOM with two collectors. One collector has about 900K traces and the other about 600K traces in traces.db, which causes harvesting performance issues. According to CA, no collector should have more than 500K traces. I have turned off random sampling, yet the traces database quickly fills up. If I empty (remove) the traces database, the number of traces quickly climbs back up, so something is sending a lot of traces. I am attempting to determine where the traces are coming from and what they are, but no luck. Any ideas?
Traces are stored in traces.db not SmartStor.
You can easily delete the traces by deleting this file.
You need to look for the problem agent(s) and determine if someone has enabled some PBDs that are generating too many metrics. Clamping should take care of this, but not the historical data in SmartStor.
You will need to take this Collector offline and remove metrics using the KB article on CSO.
Sorry, my bad. I got the two confused. I have deleted traces.db and it simply fills back up to 900K and 800K traces within a day. This high number of traces causes the EM to "drop out" when harvesting. I'll have a look at the KB article. Hopefully it will point me toward determining who is sending/generating the traces.
here's the KB article: http://goo.gl/idlubM
I had a similar issue with a huge number of records in traces.db. It turns out that errors are also stored in traces.db, and I had an application constantly throwing SQL errors. I found which application it was by looking at the Backends > Errors Per Interval metric.
I had a suspicion that errors were recorded in traces.db. Not just SQL errors, but other errors as well. Is there any way to shut error reporting off and prevent them from being dumped into traces.db?
Not sure I would want to shut off all error detection, but the Java Agent guide explains how to turn it off or filter errors in the ErrorDetector section.
Yes. All types of "events" are stored in traces.db: traces, errors and stalls.
You can use the "Live Error Viewer" or "Query Historical Events" in the Webview (under Tools) or workstation to identify the agents generating the events and then look in the traces and errors tab of the agent in the Investigator metric tree.
But in general, 800K or 900K traces should not impact the EM.
You can also set a maximum storage amount or number of days (default 14) for traces.db via the introscope.enterprisemanager.transactionevents.storage properties in IntroscopeEnterpriseManager.properties. Note that traces.db is only truncated once a night, so the storage limit may be exceeded during the day.
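For reference, here is a sketch of what such settings look like in IntroscopeEnterpriseManager.properties. The exact property names and defaults vary by EM version, so treat these as assumptions to verify against your release's configuration guide before applying:

```properties
# Age-based retention: events older than this many days are pruned
# during the nightly traces.db maintenance run (assumed default: 14).
introscope.enterprisemanager.transactionevents.storage.max.data.age=14
```

Remember that pruning only happens during the nightly run, so a collector that receives a flood of events can still blow past these limits intraday.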
Thanks. I see we are generating a large amount of MQ errors.
CA recommends not having more than 500K in the traces.db. It impacts the harvesting.
I have a feeling our db would fill up within the day, so nightly truncation would not help. But excluding some errors would.
You're welcome. Yes, investigating where these errors come from and how to avoid them is the best way to go: fight the root cause, not the symptoms. This is a good showcase of how CA APM gives you visibility into problems you had no idea even existed.
I can see you've already had good information about this but just adding something.
The traces information is stored using Apache Lucene and there are tools that can be used to review the traces data outside APM itself.
If you take a copy of the traces folder and open it with the lukeall jar file (linked below), it will show the data in text form, which may give you another way to work out where so much data is coming from.
luke - Luke - Lucene Index Toolbox - Google Project Hosting
On the first page when the tool loads, select a field name, in this case 'type', then click Show top terms; it shows you which type of trace is prevalent. If it is a case with a lot of MQ traces, you would likely see many of the MQ correlation values as values to search on.
I would only do this on an offline copy of the traces data.
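The snapshot procedure described above can be sketched as a small script. The install path, stop command, and folder layout are assumptions about a typical EM install, not verified details, so adjust them for your environment:

```shell
#!/bin/sh
# Sketch: take an offline copy of the traces data for inspection with Luke.
# EM_HOME is an assumed install path -- adjust for your environment.
EM_HOME=${EM_HOME:-/opt/introscope}
SNAPSHOT=/tmp/traces-snapshot

# 1. Stop the EM first so all Lucene segment files are flushed and
#    consistent (the actual stop script name varies by EM version).

# 2. Copy the whole traces folder, including its index subfolder.
mkdir -p "$SNAPSHOT"
cp -r "$EM_HOME/traces" "$SNAPSHOT/" 2>/dev/null \
  || echo "no traces folder under $EM_HOME -- adjust EM_HOME"

# 3. Restart the EM, then point Luke at the copy:
#    java -jar lukeall.jar   # open $SNAPSHOT/traces in the Luke UI
```

Working on the copy keeps Luke's reads away from the live index, which is the point of doing this offline.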
For the types: errorsnapshot is any type of error (including stalls); whatsinteresting covers the What's Interesting baseline views, where, for example, a process CPU might have changed; sampled are traces taken by automatic sampling; and normal are traces taken by manual transaction trace sessions.
Thanks for the Lucene tip. Very cool. When I copy the traces folder over I get an error when using Luke. Luke complains about an index file (in this case segments_z8m6) not found.
Are these index files being created and dropped as the EM runs? Am I somehow getting a copy snapshot that is incomplete?
Have you tried copying the index folder, too?
Yes, it contains a number of cfs files, a segments.gen file and an incomplete segments_z8m6 index file (0 bytes). Some files appear to be transient as the EM runs.
Yeah, you should always stop the EM when copying the files!
Stopping the EM worked. Thanks.
Yeah, as Guenter says, it's best to take the copy when the EM is not running.
You could try to use the IndexRebuilder.bat/sh in the /tools folder of the EM with the copy you already have to see if that fixes it for you as well.