Posted 07-31-2015 02:10 AM

We are not happy with the performance of CA APM.

It's almost unusable in our environment.

I opened a support case with CA (00151073), but I only received a list of things to check.

We did those steps 8 months ago and nothing changed.

CPU usage is 1%, there is 8 GB of free memory and no disk I/O, yet CA APM response time is more than 3 minutes.

Why?

Posted 07-31-2015 05:15 AM

Hello,

I can see the case. To investigate further, support would need the information below; please attach it to the case:

- Copy of the logs directory from all the EMs (MOM and collectors); they will need VERBOSE logs.

- Screenshot of the "Custom Metric Host (Virtual) | Custom Metric Process (Virtual) | Custom Metric Agent (Virtual) | Enterprise Manager | Data Store | Smartstor | Metadata | Metrics with Data" supportability metric from all Collectors.

- Screenshot of MOM > “Status console”

- A series of thread dumps from the MOM and the collectors, to help find the root cause.

Confirm if you are using a physical or virtual environment.

I have quickly checked the logs; below are some observations:

From the MOM perflog, I see Performance.MetricDataManager.QueryMemory is ~27,151,588

From the MOM log, I see:

- a huge number of these errors: "data seems to be OK - not sure what to do with append inputs",

If possible, restart the MOM with a fresh SmartStor database.

-[WARN] [PO:client_main Mailman 1][Manager.AsyncQueryResultStateMachine] Received tardy historical data from slow collector

This other error indicates a performance issue in the cluster (MOM and collector communication)

-[WARN] [Collector ***@5001] [Manager.Cluster] Collector clock is too far skewed from MOM. Collector clock is skewed from MOM clock by 56,085 ms. The maximum allowed skew is 3,000 ms. Please change the system clock on the collector EM.

Make sure the clocks of all EMs are in sync; you must configure an NTP server.
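To get a quick picture of how far each collector has drifted, the skew warnings can be pulled out of the MOM log with a small script. The following is only a sketch in Python, not a CA APM tool: the pattern is modeled on the single WARN line quoted above, and the exact wording may differ between releases.

```python
import re

# Pattern modeled on the WARN line quoted in this thread (assumption:
# other releases may word the message differently).
SKEW_RE = re.compile(
    r"skewed from MOM clock by ([\d,]+) ms\. "
    r"The maximum allowed skew is ([\d,]+) ms"
)


def parse_skew(line):
    """Return (skew_ms, max_allowed_ms), or None if the line is not a skew warning."""
    match = SKEW_RE.search(line)
    if match is None:
        return None
    skew, allowed = (int(group.replace(",", "")) for group in match.groups())
    return skew, allowed


# The log line quoted above, used as sample input.
sample = (
    "[WARN] [Collector ***@5001] [Manager.Cluster] Collector clock is too far "
    "skewed from MOM. Collector clock is skewed from MOM clock by 56,085 ms. "
    "The maximum allowed skew is 3,000 ms. Please change the system clock on "
    "the collector EM."
)

skew_ms, allowed_ms = parse_skew(sample)
print(f"skew {skew_ms} ms is {skew_ms / allowed_ms:.1f}x the allowed {allowed_ms} ms")
```

For the line quoted above, this reports a skew of roughly 18.7 times the allowed maximum.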

- [Collector xxxxx@5001] [Manager.Cluster] The Introscope Enterprise Manager will continue to attempt to re-connect to the Introscope Enterprise Manager at xxxx@5001.  Further failures will not be logged

- [WARN] [pool-1-thread-1] [Manager.Cluster] The Collector 10.0.6.122@5001 is responding slower than 10000ms and may be hun

The above indicates that the MOM keeps disconnecting from the collectors.

Check how the individual collectors are performing and check their logs. Keywords to search for: outgoing, capacity, WARN, ERROR, reached, CancelledKeyException, java.io.IOException, outofmemory.
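The keyword search can be scripted so that each collector log is summarized the same way. Below is a minimal Python sketch, assuming a plain-text log read line by line; the keyword list comes from the post above, while the sample lines are hypothetical and only loosely based on messages quoted in this thread.

```python
from collections import Counter

# Keywords from the list above; matching is case-insensitive.
KEYWORDS = [
    "outgoing", "capacity", "WARN", "ERROR", "reached",
    "CancelledKeyException", "java.io.IOException", "outofmemory",
]


def count_keywords(lines):
    """Count how many lines mention each keyword."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


# Hypothetical sample lines for illustration only.
sample_lines = [
    "[WARN] [pool-1-thread-1] [Manager.Cluster] The Collector 10.0.6.122@5001 "
    "is responding slower than 10000ms",
    "[ERROR] [Manager] java.io.IOException: connection reset",
    "[INFO] [Manager] Started",
]

counts = count_keywords(sample_lines)
for keyword, hits in counts.most_common():
    print(f"{keyword}: {hits}")
```

To run it against a real log, pass an open file object to `count_keywords` instead of the sample list; the function only needs an iterable of lines.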

Also, I noticed you are using 914. If possible, I suggest you upgrade to the latest release, as many issues affecting the EM, clustering, the load-balancing mechanism, etc. have been fixed in later releases. Below is the link to the master list:

http://www.ca.com/us/support/ca-support-online/product-content/knowledgebase-articles/tec1075326.aspx?intcmp=searchresultclick&resultnum=4

I hope this helps,

Regards,

Sergio

Posted 07-31-2015 06:39 AM

Hi Sergio,

I don't know this product so I need detailed instructions.

- What is verbose mode and how will I enable it?

- if possible restart the MoM with a refresh new smartstor db / How will I do this? Do you mean delete all data?

How will I restart MOM? By the way, what is MOM?

- Make sure clocks of all EMs are in synch, you must configure a NTP server. // My colleague says things get worse when NTP is enabled. He has a script to sync time.

Since I do not know the product, it is not possible to follow your instructions.

Is EM the same thing as Collector?

What is MOM?

Posted 07-31-2015 08:22 AM

Hi:

It may be helpful to review the APM Overview document (part of the official documentation), which contains an architectural overview, a glossary and more.

Thanks

Hal German

APM Support

Posted 07-31-2015 09:01 AM

These are just copy/paste instructions that do not contribute anything towards a solution.

I've received all this info many times but the result is the same: poor performance, unusable product.

You wrote:

-[WARN] [PO:client_main Mailman 1][Manager.AsyncQueryResultStateMachine] Received tardy historical data from slow collector

This other error indicates a performance issue in the cluster (MOM and collector communication)

Yes, I'm aware of the performance problem and that's the reason I opened a case.

Posted 07-31-2015 09:11 AM

Hi:

I am marking this as assumed answered since a case is open. If you are asking what a MOM is and about other common terms, this is covered in the APM Glossary. At this point, please work through the case for further assistance.

Thanks

Hal German

APM Support

Posted 07-31-2015 09:20 AM

Hi,

I've also looked at your case. You have a MOM and two collectors; they are all Enterprise Managers, but they have different roles.

The MOM (Manager of Managers) is the server that you connect to in order to see data; this is where you load your dashboards.

You have two collectors, which are the ones that the agents send their data to. The MOM then connects to the collectors, so you get to see all the data from your collectors in one place.

I noticed in the MOM configuration that the live metric limit for the MOM was raised to 1 million and the historical metric limit for the MOM was raised to 2.5 million.

The default values are 500,000 for live data and 1.2 million for historical data, so both have been roughly doubled from their defaults.
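As a quick check of how far the limits were raised, the ratios can be worked out from the figures quoted above. This is plain arithmetic in Python; the dictionary keys are illustrative labels, not property names from apm-events-thresholds-config.xml.

```python
# Values quoted in this post; the keys are illustrative labels only,
# not actual property names from the EM configuration file.
DEFAULTS = {"live": 500_000, "historical": 1_200_000}
CONFIGURED = {"live": 1_000_000, "historical": 2_500_000}


def clamp_ratios(defaults, configured):
    """Return configured/default for each limit."""
    return {name: configured[name] / defaults[name] for name in defaults}


ratios = clamp_ratios(DEFAULTS, CONFIGURED)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x default")
```

The live limit comes out at exactly 2x and the historical limit at slightly more than 2x.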

Generally, the more historical data you have, the slower the response from the product will be. I'm mentioning this because someone clearly decided to raise those values in the apm-events-thresholds-config.xml file in the config folder of the Enterprise Manager.

We can guess that you managed to hit a metric clamp on the server. When that happens, you would see data missing in the system and data looking greyed out, so it is normal to raise the values.

However it is vitally important in those scenarios that you look at the data coming into the system as well - if your agents are reporting a large amount of metrics, not only does that give you a large amount of data, generally agents that report a large amount of metrics also affect the performance of the monitored application.

As Sergio said, we need to see the logs from the collectors and the other data he has requested. There is a very old saying in our product that your APM cluster (you have a cluster because you have a MOM and collectors) is only as good as your worst-performing collector.

Based on the data that Sergio has requested, it would be much easier to suggest something concrete to improve the performance of the system.

Thanks,

David

Posted 07-31-2015 10:09 AM

Posted 07-31-2015 10:13 AM

If someone can show me which parameter should be changed, that would be great.

Posted 08-01-2015 06:15 AM

Hello,

In addition to what others have already suggested, I would like to add a few more points, since we faced the same type of issue in our environment.

1) Please go through the APM Overview guide as Hal suggested. If you are short on time, at least understand the MOM, collectors, SmartStor, Workstation, APM DB and agents, and how they interact with each other.

2) Once you have a good understanding of APM, please do an end-to-end health check. Don't apply any config changes until you understand the behavior of your environment. The link below will help you do a comprehensive health check.

Cookbook - EM HealthCheck v20.pdf

3) Capture the observations/behavior of your environment once you have finished the health check. By this time, you should be able to determine the cause of the slow performance.

4) Now apply the config changes or do necessary steps based on the pattern you noticed.

We faced a similar issue and noticed that a lot of agents had been added, which increased the load, and the collectors were not able to handle it. We had to add a new collector to the cluster to resolve the performance issues.

Hope this helps.

Thanks,

Karthik

Posted 08-12-2015 06:25 AM

Hi Gökhan,

Nice to catch up with you. And I'm indeed sorry to hear that you're still having Performance issues.

Aside from the advice you have been given on this thread as well as in the support case (I'm sorry, I have been on vacation, so I have not had the opportunity to step in lately), I would like to take a step back to our previous collaboration on resolving your problem.

As you know, heap memory was increased on your EMs, and you reported back that this did in fact improve performance.

However, as we agreed during one of our WebEx sessions, you have far too many database metrics (in Introscope lingo, a "metric explosion"). These metrics originate from

1) data exchange using Excel spreadsheets as data sources (which all have different names - hence the explosion); and

2) from your very high number of database queries using distinct names

As we discussed, the number of these metrics can be vastly reduced by "normalization" (i.e. aggregating related metrics, thereby eliminating individual metrics in favor of aggregated ones). You mentioned that

1) You would simply want to get rid of these metrics altogether.

My offer to assist you with this of course still stands: if you can get information from your developers on how your application connects to the Excel spreadsheets, we can eliminate these metrics.

2) You would also like to aggregate these metrics using normalization. However, you were not sure what aggregation would make sense. You would investigate this with your peers and come back.

My offer to assist you with this of course still stands: if you and your peers can decide on the normalization that would make sense, we can resolve this too.

Vastly reducing the number of these metrics should have a noticeable positive impact on the performance of retrieving data, which is your problem.
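To illustrate the general idea of normalization, here is a minimal Python sketch. It is not the Introscope normalizer or its configuration: the statements are hypothetical examples, and the rule (replace quoted strings and numeric literals with "?") is just one possible aggregation.

```python
import re


def normalize(metric_name):
    """Collapse quoted strings and numeric literals to '?' (illustrative rule only)."""
    name = re.sub(r"'[^']*'", "?", metric_name)
    return re.sub(r"\b\d+\b", "?", name)


# Hypothetical statements, each of which would otherwise become its own metric.
raw = [
    "SELECT * FROM orders WHERE id = 1001",
    "SELECT * FROM orders WHERE id = 1002",
    "SELECT * FROM orders WHERE customer = 'acme'",
    "SELECT * FROM orders WHERE customer = 'globex'",
]

normalized = sorted({normalize(name) for name in raw})
print(len(raw), "distinct names ->", len(normalized), "after normalization")
for name in normalized:
    print(name)
```

Four distinct names collapse into two here. The same principle applies to any part of a metric name that changes on every call, such as a generated file name.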

This is indeed the preferred approach before eventually adding EMs to your cluster.

I'm looking forward to working with you on resolving your performance issue.

Please feel free to reach out to me directly via e-mail.

Regards

Henrik

Posted 08-12-2015 11:30 AM

Hi Henrik,

Excel files are used when loading/saving some data and they are deleted.

Since we are using stored procedures, how will you apply normalization?

Normalization docs talk about ad-hoc queries, but there are only SPs in our applications.

Your proposals did not sound feasible to me.

Regards,

// Gokhan

• 13.  Re: CA APM Very Poor Performance

Hi Gokhan,

Please continue to work with CA Support on the ticket and with ravhe01. In parallel, you should follow nkarthik's advice from above: check the EM supportability metrics. From your description above, I would primarily check Historical Metric Count (should not be visibly increasing in a 30-day view) and Harvest Duration (should be well below 3.5 s, ideally below 1 s during normal operation, e.g. in a 6-hour view taken in the late afternoon).
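Those two rules of thumb can be written down as a small check. The Python below is only a sketch: the 1 s and 3.5 s thresholds come from the paragraph above, while the category names and the 5% growth tolerance are assumptions made for illustration, and the values would have to be read off the supportability metrics by hand or exported first.

```python
def classify_harvest(duration_s):
    """Classify a harvest duration in seconds (category names are assumptions)."""
    if duration_s < 1.0:
        return "ok"
    if duration_s < 3.5:
        return "watch"
    return "overloaded"


def is_visibly_increasing(counts, tolerance=0.05):
    """True if the last sample exceeds the first by more than `tolerance`.

    The 5% default is an assumed stand-in for "visibly increasing".
    """
    if len(counts) < 2 or counts[0] == 0:
        return False
    return (counts[-1] - counts[0]) / counts[0] > tolerance


# Hypothetical readings for illustration.
for seconds in (0.4, 2.0, 4.2):
    print(seconds, classify_harvest(seconds))
print(is_visibly_increasing([1_000_000, 1_300_000]))
```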

What exactly do you mean by "CA APM response time is more than 3 minutes"? What are you trying to do? Open the workstation, open webview, display a certain dashboard?

Ciao,

Guenter