DX Application Performance Management

  • 1.  MOM Harvesting process

    Posted Jul 15, 2015 01:43 PM

    Hello All,

     

    I can see that the MOM's harvest capacity is high in my environment. "Calculator harvest time" is also high, and its graph matches the harvest capacity graph.

     

    So is high harvest capacity (raised for some other reason) causing the calculators to run with a high harvest time?

    Or is the increased harvest time of the calculators driving the harvest capacity up? Which way does it work?

     

    Also, what work does the harvesting process do on the MOM? Is it alert processing, metric group regex matches, queries, data points returned, calculators, etc.?

     

    I run into a lot of questions like this when I try to dig into the details. Where can I learn about each of these processes at a very granular level?

     

    Thanks,

    Karthik



  • 2.  Re: MOM Harvesting process

    Broadcom Employee
    Posted Jul 15, 2015 06:08 PM

    Hi Karthik,

     

    It's the other way round: a high calculator harvest time leads to a high harvest duration, which means much of the harvest capacity is used. The target is for harvest duration (and SmartStor duration, too) to stay well below 3.5s, so that there is enough time left to satisfy ad-hoc queries.

     

    The harvest cycle is a complex real-time loop with many dependencies, most of which you have listed above: MM calculators, js calculators, built-in calculators, alert processing, ...

     

    Some of it is broken down in the metrics under Internal and Internal|Harvest. Last week we found at one customer that a single js calculator raised the harvest time to above 10s, which resulted in many aggregated time slices and a totally unresponsive EM cluster. To check for this, you can remove all js calculators from the scripts and, if harvest and calculator time drop significantly, add them back one by one to identify the culprits. It could also be too many queries, e.g. from integrations.
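
The "add them back one by one" procedure can be sketched in a few lines. This is only an illustration, not a CA/Broadcom tool: `find_slow_calculators`, the `measure_harvest_ms` callback, the 1000 ms jump threshold, and the script names are all hypothetical. In practice the measurement step would be a person redeploying the scripts and reading the harvest duration metric after a few cycles.

```python
from typing import Callable, Iterable, List, Tuple


def find_slow_calculators(
    scripts: Iterable[str],
    measure_harvest_ms: Callable[[List[str]], float],
    jump_ms: float = 1000.0,  # assumed threshold, tune to your environment
) -> List[Tuple[str, float]]:
    """Re-add calculators one at a time and report those that raise harvest time.

    measure_harvest_ms is a hypothetical callback: it deploys exactly the given
    scripts, waits a few harvest cycles, and returns the observed harvest
    duration in milliseconds.
    """
    deployed: List[str] = []
    previous = measure_harvest_ms(deployed)  # baseline with no js calculators
    culprits: List[Tuple[str, float]] = []
    for script in scripts:
        deployed.append(script)
        current = measure_harvest_ms(deployed)
        increase = current - previous
        if increase >= jump_ms:
            culprits.append((script, increase))
        previous = current
    return culprits


# Simulated costs for demonstration only; these numbers are made up.
FAKE_COST_MS = {"a.js": 50.0, "b.js": 9000.0, "c.js": 120.0}


def fake_measure(deployed: List[str]) -> float:
    """Pretend harvest duration: a fixed base plus each script's cost."""
    return 800.0 + sum(FAKE_COST_MS[name] for name in deployed)


print(find_slow_calculators(["a.js", "b.js", "c.js"], fake_measure))
```

With the simulated costs above, only `b.js` is reported, because it is the only script whose addition raises the harvest duration by more than the assumed threshold.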

     

    In general, these are the probable causes of EM performance degradation and their indicator metrics:

    Agents Sending Too Many Metrics and/or Leaks

    • Number of Metrics Handled
    • Harvest: Metrics From All Agents     (note: before clamping)
    • Number of Historical Metrics (going up)

    Calculators/Alerts matching too many metrics

    • Calculators: Total Number of Evaluated Metrics
    • Alerts: Total Number of Evaluated Metrics

    Too Many Ongoing Queries or Transaction Trace Events

    • Number of Registered Async Data Queries
    • Connections: Number of Events Processed

    Too Many Broad Historical Queries

    • SmartStor Queries Per Interval  / Cached Queries Per Interval
    • Metric Matches per Interval
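
The cause/indicator list above can be turned into a small lookup for triage. The sketch below is a hedged illustration: the metric names are copied from the list, but `probable_causes`, the idea of comparing against a baseline, the 1.5x growth ratio, and the sample values are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Probable causes and their indicator metrics, as listed in the post.
INDICATORS: Dict[str, List[str]] = {
    "Agents sending too many metrics and/or leaks": [
        "Number of Metrics Handled",
        "Harvest: Metrics From All Agents",
        "Number of Historical Metrics",
    ],
    "Calculators/alerts matching too many metrics": [
        "Calculators: Total Number of Evaluated Metrics",
        "Alerts: Total Number of Evaluated Metrics",
    ],
    "Too many ongoing queries or transaction trace events": [
        "Number of Registered Async Data Queries",
        "Connections: Number of Events Processed",
    ],
    "Too many broad historical queries": [
        "SmartStor Queries Per Interval",
        "Cached Queries Per Interval",
        "Metric Matches per Interval",
    ],
}


def probable_causes(
    current: Dict[str, float],
    baseline: Dict[str, float],
    ratio: float = 1.5,  # assumed growth factor that counts as "elevated"
) -> List[Tuple[str, List[str]]]:
    """Return causes whose indicators grew by at least `ratio`, most hits first."""
    hits: List[Tuple[str, List[str]]] = []
    for cause, metrics in INDICATORS.items():
        elevated = [
            name
            for name in metrics
            if name in current
            and baseline.get(name, 0) > 0
            and current[name] / baseline[name] >= ratio
        ]
        if elevated:
            hits.append((cause, elevated))
    hits.sort(key=lambda item: len(item[1]), reverse=True)
    return hits


# Made-up sample values for illustration.
baseline = {
    "Number of Metrics Handled": 300000,
    "Calculators: Total Number of Evaluated Metrics": 20000,
    "Alerts: Total Number of Evaluated Metrics": 5000,
}
current = {
    "Number of Metrics Handled": 310000,
    "Calculators: Total Number of Evaluated Metrics": 90000,
    "Alerts: Total Number of Evaluated Metrics": 5200,
}
print(probable_causes(current, baseline))
```

In this sample only the calculator metric has grown well beyond its baseline, so the calculators/alerts cause is the single result.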

     

    Ciao,

    Guenter



  • 3.  Re: MOM Harvesting process

    Posted Oct 13, 2015 07:59 AM

    Hi Guenter,

    I read your post a couple of times today and have been thinking about it alongside what I was told the other day. I am starting to understand what harvest time is from a collector's point of view, but I struggle with what makes up the MoM's harvest time: I can understand calculators, alerts, and query processing, but the number of metrics, the number of historical metrics, and SmartStor writing are all likely negligible on MoMs.

    So, my question: are there any MoM-specific KPI metrics to watch that would show whether the MoM needs more CPU/memory or whether its poor performance is due to one or more sick collectors?

     

    Rgds.- Fred



  • 4.  Re: MOM Harvesting process

    Broadcom Employee
    Posted Oct 13, 2015 07:38 PM

    Hi Fred,

     

    Harvest time on the MOM is mostly determined by the number of calculators (internal ones and those in MM and JS) and alerts, and of course by the performance of the downstream collectors (their harvest and SmartStor times + ping time).

     

    Harvest time and CPU/heap metrics are good indicators of MOM performance. We had one case where unnecessary full GCs were severely impacting MOM performance. Disabling explicit full GCs in the startup properties solved that.

     

    Ciao,

    Guenter



  • 5.  Re: MOM Harvesting process

    Posted Oct 14, 2015 06:25 PM

    Thanks - still thinking - how do you differentiate the MoM's (native) performance from the impact of its collectors:

    - does the value of the collectors ping metrics somewhat include the collectors' harvest time?

    - when it is said that the MoM operates as fast as the slowest collector is there some approximate formula that would suggest that the sum of all the collectors ping values must be less than ( 15 sec -MoM's harvest cycle)? and thus the worst ping value at the time would be the "bad" collector to fix?

    - I'll have to remember about DisableExplicitGC was it v9.7 (related to the use of Java 7 nio)?



  • 6.  Re: MOM Harvesting process

    Broadcom Employee
    Posted Oct 14, 2015 08:34 PM

    Fred.K wrote:

     

    Thanks - still thinking - how do you differentiate the MoM's (native) performance from the impact of its collectors:

    1. does the value of the collectors ping metrics somewhat include the collectors' harvest time?
    2. when it is said that the MoM operates as fast as the slowest collector is there some approximate formula that would suggest that the sum of all the collectors ping values must be less than ( 15 sec -MoM's harvest cycle)? and thus the worst ping value at the time would be the "bad" collector to fix?
    3. I'll have to remember about DisableExplicitGC was it v9.7 (related to the use of Java 7 nio)?

    1. No. Ping time is just network plus some protocol time. It should be less than 500ms.
    2. No. This statement relates primarily to queries but also to the "harvest cycle". But metrics are sent to the MOM after the first stage of the harvest cycle and therefore should not severely impact the MOM harvest time.
    3. In the case I mentioned, an external script was querying the EM every 10s or so. Each query included an LDAP authentication, which triggered a manual System.gc() in the EM code, which caused totally unnecessary stop-the-world full GCs. So if you see an EM doing lots of GCs and showing an unusually high GC time despite lots of free heap space, start the EM with -XX:+DisableExplicitGC. I don't remember a case related to Java 7 nio.

     

    Ciao,

    Guenter



  • 7.  Re: MOM Harvesting process

    Posted Jul 16, 2015 09:03 AM

    excerpt from troubleshooting guide, credit to Sergio:

    Harvest duration spikes:

    a) If harvest duration is high all the time (> 7,500 ms), the EM is overloaded: review the hardware capacity and reduce the number of metrics and calculators.

    b) If harvest duration spikes regularly at 5-minute intervals, the cause is a volume polling thread that runs every 5 minutes to get the disk/directory/file size of data, data/archive, traces.db, the log file, and baselines.db. To disable it, set introscope.enterprisemanager.supportability.volumespace.enable=false.

    c) If harvest duration spikes regularly at 1-hour intervals and SPM is enabled, disable SOA Deviation: set com.wily.introscope.soa.deviation.enable=false

    Other possible symptoms: the calculators you created start reporting zero every hour.

    d) If harvest duration spikes at the same time as GC duration, this might be due to a lack of memory: check that the heap size is sufficient, and whether a huge query, a large number of queries, or too many traces are occurring.
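
Cases (b) and (c) come down to spotting a regular spacing between spikes. A minimal sketch of that check follows. The function name, the 10% tolerance, and the input format (spike start times in seconds) are assumptions for the example; the property names in the comments are the ones quoted in the guide excerpt.

```python
from statistics import median
from typing import List

# Period in seconds -> label for the matching case in the guide excerpt.
KNOWN_PERIODS = (
    # (b) volume polling thread:
    # introscope.enterprisemanager.supportability.volumespace.enable=false
    (300, "5-minute"),
    # (c) SOA Deviation with SPM enabled:
    # com.wily.introscope.soa.deviation.enable=false
    (3600, "hourly"),
)


def classify_spike_pattern(spike_times_s: List[float], tolerance: float = 0.1) -> str:
    """Label a series of spike start times as 5-minute, hourly, or irregular.

    The median gap between consecutive spikes is compared with each known
    period; a gap within `tolerance` (as a fraction of the period) counts
    as a match.
    """
    if len(spike_times_s) < 3:
        return "insufficient data"
    times = sorted(spike_times_s)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    typical_gap = median(gaps)
    for period_s, label in KNOWN_PERIODS:
        if abs(typical_gap - period_s) <= tolerance * period_s:
            return label
    return "irregular"


print(classify_spike_pattern([0, 300, 600, 905]))
print(classify_spike_pattern([0, 3600, 7200]))
print(classify_spike_pattern([0, 100, 900]))
```

The median is used instead of the mean so that a single missed or extra spike does not hide an otherwise regular pattern.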