DX Application Performance Management

CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance issues

  • 1.  CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance issues

    Posted 03-29-2013 11:24 AM
    CA Wily Tuesday Tip by Sergio Morales, Principal Support Engineer for 3/29/2013

    Hi Everyone,
    Here is an update of my previous post from 2011. Below is a checklist of the points you must review whenever you see: performance issues, missing data points in graphs and dashboards, frequent MOM/Collector/Agent/Workstation disconnections, OutOfMemory errors, logging that takes a long time, or clock-sync issues:

    Checklist:

    1.
    Outgoing message delivery queue/thread pool size needs to be increased:
    Make sure the following settings are set in ALL EMs (MOM and Collectors) properties files:
    transport.outgoingMessageQueueSize=6000
    transport.override.isengard.high.concurrency.pool.min.size=10
    transport.override.isengard.high.concurrency.pool.max.size=10

    A restart of the EMs is required for the changes to take effect.
    Increasing the outgoing message queue gives you a bigger buffer. Increasing the thread pool size gives you more worker threads to send outgoing messages. These adjustments are required when sending messages, usually between Collector and MOM, becomes a performance bottleneck.
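
    To audit many EMs quickly, the three settings above can be checked with a small script. The following is a minimal sketch, not a CA-provided tool: it assumes the properties file uses plain key=value lines, and the function names (parse_properties, find_mismatches) and the sample excerpt are hypothetical.

```python
# Recommended values taken from checklist item 1.
RECOMMENDED = {
    "transport.outgoingMessageQueueSize": "6000",
    "transport.override.isengard.high.concurrency.pool.min.size": "10",
    "transport.override.isengard.high.concurrency.pool.max.size": "10",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def find_mismatches(text, recommended=RECOMMENDED):
    """Return {key: (found, expected)} for each missing or different setting."""
    props = parse_properties(text)
    return {
        key: (props.get(key), expected)
        for key, expected in recommended.items()
        if props.get(key) != expected
    }


if __name__ == "__main__":
    # Hypothetical excerpt of an EM properties file, for illustration only.
    sample = (
        "transport.outgoingMessageQueueSize=3000\n"
        "transport.override.isengard.high.concurrency.pool.min.size=10\n"
    )
    for key, (found, expected) in find_mismatches(sample).items():
        print(f"{key}: found {found!r}, expected {expected!r}")
```

    Run it against the properties file of every MOM and Collector; an empty result means all three settings match the values listed above.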

    2.
    Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx) in the Introscope Enterprise Manager.lax or EMService.conf. Since no heap expansion or contraction occurs, this can result in significant performance gains in some situations.
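
    The -Xms/-Xmx comparison can also be scripted. This is a rough sketch under the assumption that the JVM options appear as plain text somewhere in the .lax or .conf file; it simply looks for the last -Xms and -Xmx flags in the text. The helper names are hypothetical.

```python
import re

# Multipliers for the optional unit suffix on JVM heap flags.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_bytes(text, flag):
    """Return the size in bytes of the last -Xms or -Xmx flag found, or None."""
    matches = re.findall(rf"-{flag}(\d+)([kKmMgG]?)", text)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _UNITS[unit.lower()]


def heap_is_fixed(text):
    """True when both flags are present and specify the same size."""
    xms = heap_bytes(text, "Xms")
    xmx = heap_bytes(text, "Xmx")
    return xms is not None and xms == xmx


if __name__ == "__main__":
    # Hypothetical option strings, for illustration only.
    print(heap_is_fixed("-Xms512m -Xmx1024m"))
    print(heap_is_fixed("-Xms1024m -Xmx1g"))
```

    Sizes are normalized to bytes before comparing, so 1024m and 1g are treated as equal.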

    3.
    EM heap Sizing:
    a) Not enough heap allocated to the Collector, especially when serving CEM services.
    b) Not enough heap allocated to the MOM, especially with a huge number of MMs, calculators, and alerts.

    4.
    If EM is running on UNIX: Make sure nohup mode has been configured correctly. The property "lax.stdin.redirect" in Enterprise Manager.lax file should be empty. From ConfigAdminGuide.pdf:

    " Do not run the Enterprise Manager in nohup mode without performing the configuration described above. Otherwise, the Enterprise Manager might not start, or might start and consume excessive system resources."

    5.
    Make sure DEBUG logging is disabled in IntroscopeEnterpriseManager.properties. Depending on the queries you perform, it can cause serious performance issues on the Introscope EM.
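
    One quick way to spot leftover DEBUG settings is to scan the properties file for uncommented lines whose value mentions DEBUG. The sketch below assumes plain key=value lines; the function name and the logger names in the sample are hypothetical and are not taken from the product documentation.

```python
def active_debug_lines(properties_text):
    """Return uncommented property lines whose value mentions DEBUG."""
    hits = []
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        _, value = line.split("=", 1)
        if "DEBUG" in value.upper():
            hits.append(line)
    return hits


if __name__ == "__main__":
    # Hypothetical logger names, for illustration only.
    sample = (
        "# log4j.logger.Example=DEBUG, logfile\n"
        "log4j.logger.Example=INFO, logfile\n"
        "log4j.logger.Other=DEBUG, console\n"
    )
    for line in active_debug_lines(sample):
        print(line)
```

    Commented-out lines are ignored, so only settings that are actually in effect are reported.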

    6.
    Make sure the SmartStor database is on a dedicated disk/disk controller. Once SmartStor is reconfigured to have its own disk, set the EM property introscope.enterprisemanager.smartstor.dedicatedcontroller=true, which allows the EM to take full advantage of it. From SizingGuide.pdf:

    “When the dedicated controller property is set to false, the Collector assumes that there is one disk for all Enterprise Manager operations, and therefore uses one disk-writing lock. This means that only one area at a time is written. For example, the Collector will write only to SmartStor or only to the heuristics database that supports the Investigator Overview dashboard.
    Performance disadvantages to having the dedicated controller property set to false are:
    a.
    Only one I/O task can be running at a time.
    b.
    SmartStor writes are in shorter segments.
    c.
    The disk's seek pointer is invalidated after each context switch.
    If there is a second disk for SmartStor, but the property is set to false, there is no performance gain by having a second disk for SmartStor.
    d.
    Collector sizing recommendations are reduced by 50%.”

    7.
    Huge metadata causing EM performance problem or new metrics not showing up. Check the "Custom Metric Host (Virtual) | Custom Metric Process (Virtual) | Custom Metric Agent (Virtual) | Enterprise Manager | Data Store | Smartstor | Metadata | Metrics with Data" supportability metric and verify if it is higher than 300K for v8.x and 600K for v9.x. Solutions:
    a) Historical metric count limit on EM can be increased or
    b) SmartStor data can be pruned: use the SmartStor Tools utility to reduce the historical metric count.

    8.
    Are you running multiple collectors on the same server? From SizingGuide:

    “a) Run the OS in 64-bit mode to take advantage of a large file cache.
    The file cache is important for the Collectors when doing SmartStor maintenance, for example spooling and reperiodization. File cache resides in the physical RAM, and is dynamically adjusted by the OS during runtime based on the available physical RAM. CA Wily recommends having 3 to 4 GB RAM per Collector.
    b) There should not be any disk contention for SmartStor, meaning you use a separate physical disk for each SmartStor instance. If there is contention for SmartStor write operations, the whole system can start to fall behind, which can result in poor performance such as combined time slices and dropped agent connections.

    c) The Baseline.db and traces.db files from up to four Collectors can reside on a separate single disk. In other words, up to four Collectors can share the same physical disk to store all of their baseline.db and traces.db files.”

    9.
    Check whether virtual agents have been defined. If so, disable them in EM\config\agentdomains.xml.

    10.
    Are the Collectors and MOM on the same subnet? From SizingGuide:

    “Whenever possible, a MOM and its Collectors should be in the same data centre; preferably in the same subnet. Even when crossing through a firewall or passing through any kind of router, the optimal response time is difficult to maintain. If the MOM and Collector are across a router or, worse yet, a packet-sniffing firewall protection router, response time can slow dramatically.”

    For transatlantic Agent->EM connections, or any frequently interrupted network, HTTP works better: configure Agent->EM communications to use HTTP tunnelling instead.

    11.
    If you use SAN for SmartStor storage, then each logical unit number (LUN) requires a dedicated physical disk. If you have configured two or more LUNs to represent partitions or subsets of the same physical disk, this does not meet the requirements needed for SmartStor dedicated disk.

    12.
    Check how big the traces database is. Rename perflog.txt to change its extension from txt to csv, open it in Excel, and review the "Performance.Transactions.Num.Traces" column. If the value is higher than 500K and increasing rapidly, this could be the cause of the problem. If possible, start the EM with a fresh traces database to isolate the problem, disable transaction sampling on the EM side by setting introscope.agent.transactiontracer.sampling.perinterval.count=0, and set introscope.enterprisemanager.transactionevents.storage.max.data.age=1.
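
    If Excel is not at hand, the same column can be inspected with a script. This is only a sketch: it assumes the file is comma-separated with a header row that contains the column name exactly as written above, and it treats "increasing" simply as "the latest value is higher than the first one", which is a simplification of my own. The function names and the sample data are hypothetical.

```python
import csv
import io

COLUMN = "Performance.Transactions.Num.Traces"
THRESHOLD = 500_000  # the 500K figure from the checklist item above


def trace_counts(csv_text, column=COLUMN):
    """Return the integer values found in the given column, in file order."""
    counts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value.isdigit():
            counts.append(int(value))
    return counts


def traces_look_suspicious(counts, threshold=THRESHOLD):
    """True when the latest value exceeds the threshold and the first value."""
    return bool(counts) and counts[-1] > threshold and counts[-1] > counts[0]


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    sample = (
        "Timestamp,Performance.Transactions.Num.Traces\n"
        "1,400000\n"
        "2,520000\n"
        "3,610000\n"
    )
    counts = trace_counts(sample)
    print(counts[-1], traces_look_suspicious(counts))
```

    To use it on a real file, read the contents of perflog.txt into a string and pass it to trace_counts.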

    13.
    Incorrect or bad Management Module definition:
    For testing purposes, start the EM without any Management Modules (MMs): rename EM_HOME\config\modules to modules-original and restart the EM.
    This will allow us to confirm whether the problem is related to an incorrect design of one of the MMs.
    If the problem no longer occurs, re-introduce the Management Modules one by one until you identify the problematic one(s).

    14.

    Is the Introscope EM running on a JVM version other than a supported one?
    Supported JVM versions for the EM: for v8, JVM 1.6u15; for v9, 1.6u34 or later is recommended.

    15.
    If the problem only occurs when connecting to the MOM and not to the Collector, it is most likely caused by a feature specific to both the Workstation and the MOM. Try disabling the new v9 AppMap feature by adding introscope.apm.feature.enabled=false to IntroscopeEnterpriseManager.properties and restarting the EM.

    16.
    If SOA Performance Management is enabled:

    a) SOA Deviation Calculator needs to be turned off to prevent hourly harvest duration spikes:
    Set com.wily.introscope.soa.deviation.enable=false in all the EMs (Collectors and MOM). If this change resolves the issue, the problem could be related to bug# 76056. We have partially fixed this issue in the latest 9.1 releases and are planning to fix it in the next major release. For now, you can also try lowering the refresh rate and mean days:
    com.wily.introscope.soa.deviation.dependency.refreshrate=24
    com.wily.introscope.soa.deviation.mean.days=1
    com.wily.introscope.soa.deviation.datapoints.cached.mean=240

    b) EM/WS OOM caused by 8.x - 9.x Agent compatibility: see Bug# 74797, fixed in 9.1.2. To enable the fix, the new SOA caller name normalization property must be turned on in all Collectors: com.wily.introscope.soa.dependencymap.normalizecallername.enable=true.
    c) SOA boundary tracing can be turned off (too many traces, or very large traces sent to the EM, can sometimes cause Agent OOM and EM crash/OOM): com.wily.introscope.agent.transactiontrace.boundaryTracing.enable=false

    17.
    Query returned/retrieved data points can be clamped (too huge historical query can cause EM OOM):
    If you notice the “memory in use” starting to increase and the Collector becoming unusable, you can try setting the clamps for historical queries to 100k to prevent huge queries from increasing the memory footprint of the Collector and MOM:
    introscope.enterprisemanager.query.datapointlimit=100000
    introscope.enterprisemanager.query.returneddatapointlimit=100000

    18.
    Poor network performance:
    In a cluster, a high “Ping Time” on the MOM is an indicator of:
    a) Poor network times between the MOM and collectors, or
    b) Overloaded collectors unable to respond to the ping request.
    To view the ping metric, use the Search tab to view the metric named "ping" in the supportability metric section of the Investigator tree. You will find a ping metric reported for each Collector. If the ping time exceeds 60 seconds, the MOM disconnects from the Collector. This is normal and prevents the entire cluster from hanging but indicates a network issue.

    19.
    Collector disconnects from MOM throwing java.nio.channels.CancelledKeyException: This is a random communication problem on the EM with no obvious cause. NIO transport can be turned off by adding transport.enable.nio=false to the EM properties file. A restart of the EM is required.

    20.
    “Collector clock is skewed from MOM clock by” messages in the EM logs:
    a) Set up the clustered systems so that machines running Enterprise Managers synchronize their system clocks with a time server such as an NTP server
    b) VMware should be tuned to avoid clock skew: if you run in a virtual environment, note that there are known clock-sync issues with VMware, especially with Linux. The following VMware documents describe the issues:
    http://www.vmware.com/pdf/vmware_timekeeping.pdf
    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427
    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1318
    c) This could be due to a Sun JVM bug. Add the following JVM flag: -XX:+ForceTimeHighResolution
    Refer to the following link for more information: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6464007


    What to do if the problem persists:

    Collect the following information from ALL Introscope EMs (MOM and collectors) and open an incident with CA Support.
    1.
    Zipped content of EM_HOME\logs
    2.
    EM_HOME\config\agentdomains.xml – will help us confirm if there are virtual agents defined.
    3.
    Hardware specs of the servers and a general overview of the implementation indicating where the collectors and MOM are
    4.
    Screenshot of the "Custom Metric Host (Virtual) | Custom Metric Process (Virtual) | Custom Metric Agent (Virtual) | Enterprise Manager | Data Store | Smartstor | Metadata | Metrics with Data" supportability metric from all Collectors.
    5.
    From the Investigator, use the Search tab to view the metric named “ping” in the supportability metrics section of the Investigator tree. You will find a ping metric reported for each Collector; take a screenshot.

    Make sure to move all existing Introscope log files to another location before starting the tests.

    Regards,
    Sergio


  • 2.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 04-01-2013 09:14 AM
    Hi Sergio,

    I have a question regarding the SmartStor tools. There is a historical metric explosion in our EMs, and I am thinking of removing unnecessary agents and excess metrics that are not required. I went through
    "Using SmartStor tools to tune SmartStor data" in the Configuration and Administration Guide. For the usage of remove_metrics and remove_agents, it says the following:

    <EM_Home>\tools\SmartStorTools.bat remove_metrics -dest C:\SWDump\destination -metrics ".*Socket.*" -src <EM_Home>\data
    When the command executes successfully, all the metrics except the socket metrics are present in the destination directory and the source directory has all the metrics intact.

    As I understand it, the tool writes the metrics, minus the socket metrics, to the destination directory, while the source directory still has all the metrics. So it does not seem to delete the metrics permanently from the source directory.

    For context, we use two Collectors reporting to a MOM. I didn't see any destination directory, and we don't have any destination EMs. Can you explain clearly how these tools work?

    Thanks,
    Karthik


  • 3.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 05-30-2013 07:08 AM
    Hi, this is normal. Each time you run the SmartStor tool, it creates a new SmartStor database; once you finish cleaning up, you need to set the temporary SmartStor database as the primary.
    Please see the attached document explaining the process. I created it in the past for a customer; it illustrates the steps you need to take. For more details, refer to the Config Admin guide. Although it is for 907, it can be used with any 9.x version.

    Regards,
    Sergio


  • 4.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 04-01-2013 10:01 AM
    Thank you Sergio for sharing this information with the community.

    Mary


  • 5.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 04-01-2013 02:31 PM
    Hi,
    About 14): didn't we discover that EM version 8.2.x.y on Solaris supports at most 1.5.0_22, and that 1.6.0_xy does not work?
    Regards - Fred


  • 6.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 06-01-2013 05:47 AM
    Yes, I think this is very important. 14) 8.2.x.x only supports at most 1.5.0_22. It will start up with higher versions, but you will see weird behaviour, with Collectors freezing at some point.

    This was "reproduced" on 15 different Enterprise Manager clusters:

    - We had 15 clusters set up with different Java versions 1.5.0_31+, and all of them had freezing Collectors from time to time.
    (They were not completely dying, so we were losing agents from the cluster until we found out that the Collector was unhealthy and manually restarted it: the agents stayed connected to the Collector, but the MOM was no longer able to talk to the Collectors.)
    (If the Collector freeze goes undetected for too long, it even causes the MOM to run out of memory, because it still tries to talk to the Collectors but fails halfway through.)
    - After downgrading all Enterprise Manager clusters to 1.5.0_22, the problem disappeared and has not shown up again for about a year now.


  • 7.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 06-03-2013 04:12 AM
    Hi Aaron, Fred,

    Yes, I remember the issue. For this particular case, our analysis showed that there was a JVM-related problem when using 1.5.0_31+.
    Please remember that v8.x has been out there for years and we haven't seen or heard of a similar issue again. However, thank you for the reminder.


    To all customers:
    If you have the need to upgrade the JVM, here are my suggestions:

    a) Make sure to use a certified JVM version (for most installers, it is bundled).
    For exact details, see APM compatibility guides > EM tab, column B:
    https://support.ca.com/irj/portal/anonymous/phpsupcontent?contentID=883df031-705e-425b-9a0e-73130da8a204&productID=5974

    b) Make sure to validate the new setup in a QA/Test environment.
    In parallel, you can open a support incident with CA Support. We will do our best to help you verify that you will not run into any issues.

    c) If you experience a crash, OOM, or hang after upgrading the JVM, it is most probably a JVM problem. However, if you need further analysis, raise a support ticket with both the third-party JVM vendor and CA Support.
    To start the analysis, we would need a series of thread dumps and, if possible, the third-party vendor's analysis report for reference.

    Several customers have reported being unable to take thread dumps when the EM hangs. Attached is a script that will help you collect this information: HowToFoceThreadumpGeneration.zip

    Thanks,
    Sergio


  • 8.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 04-01-2013 04:42 PM
    Sergio
    Excellent document. I'm sharing it with my APM customers.
    Regards,
    Pilar Lara


  • 9.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 04-03-2013 01:05 PM
    Thanks for sharing


  • 10.  RE: CA Tuesday Tip:MARCH 2013 - Top 20 common causes for EM performance iss

    Posted 09-27-2013 09:05 AM
    Very interesting tip Sergio.
    Thanks for posting it.
    Regards,
    Ollivier