DX Application Performance Management

Duane_Nielsen

Apr 07, 2015 04:18 PM

The main things to look out for with calculators are CPU utilization and metric explosion (generating a lot of historical metrics).

Slow EMs are usually slow because of large amounts of historical metrics and long "harvest durations", i.e. the amount of time it takes to read data from the spool and store it in SmartStor.

This is generally caused by too many metrics coming in and not enough disk IO speed.

What I'm trying to say is that the APM cluster is generally IO bound, not CPU bound, so unless you write a super crazy calculator, there are generally plenty of resources available for calculators.

Legacy User

Apr 07, 2015 04:05 PM

Hi Duane,

Very interesting, thank you. I am bookmarking this for a quiet day.

I also have to learn to monitor the manager metrics so I can detect the impact (or lack thereof) of the scripts.

Duane_Nielsen

Apr 07, 2015 03:44 PM

Hi Fred, here is an example script that calculates a difference between the current and previous value.

diffNetStat_calculator.txt
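
The attachment itself is not reproduced in this thread, so the following is only a minimal sketch of the idea, not Duane's actual script. It assumes that a variable declared outside the function survives from one calculator run to the next, which you would need to confirm on your own EM. The names `previousValues` and `diffSincePrevious` are hypothetical.

```javascript
// Hypothetical state: last value seen per metric name.
// Assumption: this variable persists between calculator runs.
var previousValues = {};

// Returns current - previous for a metric, or null on the first
// observation (there is nothing to diff against yet).
function diffSincePrevious(metricName, currentValue) {
  var hasPrevious = Object.prototype.hasOwnProperty.call(previousValues, metricName);
  var diff = hasPrevious ? currentValue - previousValues[metricName] : null;
  previousValues[metricName] = currentValue;
  return diff;
}
```

The first call for a given metric name returns null and only records the value; later calls return the change since the previous call. Returning null rather than 0 on the first run keeps "no baseline yet" distinguishable from "no change".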

Legacy User

Apr 07, 2015 03:06 PM

I found an old email reporting that there were gaps in the data due to metricData.length == 0, so I am adding this:

if( metricData.length <= 0 ) { log.info("metricData.length is 0"); }

Would this happen only when the MoM is too busy?

Is there an elegant way to remember the previous values and use them when the script fails to evaluate the current metric values?
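
One possible shape for "remember the previous values" is a small cache around whatever the script computes. This is a sketch under assumptions, not a confirmed pattern for APM calculators: it assumes a script-level variable persists between runs, and `lastResults`, `resultsWithFallback`, and the `compute` callback are hypothetical names.

```javascript
// Hypothetical cache of the last results the calculator produced.
// Assumption: script-level variables persist between runs.
var lastResults = null;

// compute: a (hypothetical) function that turns metricData into results.
function resultsWithFallback(metricData, compute) {
  if (!metricData || metricData.length === 0) {
    // Nothing arrived this interval: reuse the previous results, if any.
    return lastResults;
  }
  lastResults = compute(metricData);
  return lastResults;
}
```

Until the first non-empty interval the function returns null, so the caller still has to handle "no data at all". Reusing old values indefinitely could also hide a real outage, so in practice you would probably want to limit how many intervals in a row the cached results are replayed.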

Duane_Nielsen

Feb 27, 2015 03:02 PM

Ok that makes sense, in a funny kind of way... Looks like we may need to handle unmounting of dead agents in the script somehow.

Let me throw it on the to-do list...

Legacy User

Jan 22, 2015 03:15 PM

Update 1:

I got a few log.info "ZZC timeslicedValue.value is empty undefined or null" messages on the MoM. Interestingly, they also appeared on the collector, where I observed that they might be related to the automatic unmount logged just before (we set the timeout to 10080):

1/22/15 11:24:41.757 MET [INFO] [TimerBean] [Manager.Agent] Automatically unmounting Agent "SuperDomain|foo|bar|server" after no communication for 10080 minute(s).

1/22/15 11:24:41.758 MET [INFO] [TimerBean] [Manager.Agent] Unmount completed of Agent "SuperDomain|foo|bar|server"

In this case there were no false alerts, as we set the alert to trigger a notification only "when severity increases".

Duane_Nielsen

Jan 14, 2015 12:57 PM

Awesome, let’s see how it goes!

Duane Nielsen

CA Technologies

Principal Consultant, Presales

Tel: 480 760 1559

Mobile: 480 760 1559

Duane.Nielsen@ca.com

Legacy User

Jan 13, 2015 06:41 PM

Thanks. I am going to push the script at the same time as the other and see if the corresponding alerts behave differently.

Since I wonder whether a timeslicedValue can be undefined or null, and whether that matters at all, I made two temporary changes for my testing. Unfortunately it is going to take a couple of weeks to see anything.

if( ! metricData[i].timeslicedValue.value ) { log.info("ZZC timeslicedValue.value is empty undefined or null"); }

if ( agentStatus[thisUid] < metricData[i].timeslicedValue.value ) { agentStatus[thisUid]= metricData[i].timeslicedValue.value; }
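
One thing to watch with the first check: in plain JavaScript, `!x` is also true when x is 0, so if 0 ever arrives as a legitimate value, the "empty undefined or null" message would be logged for it too. A stricter test could look like the sketch below. `isMissing` is a hypothetical helper, and how the script engine presents the underlying Java value is an assumption you would need to verify.

```javascript
// True only for null or undefined; 0 is treated as a real value.
function isMissing(value) {
  return value === null || value === undefined;
}
```

With this helper, the log line would only fire for values that are genuinely absent, which makes the test results easier to interpret.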

Duane_Nielsen

Jan 05, 2015 06:19 PM

Hi Fred, basically what the change does is:

1. If we see the agent connected anywhere, set it to connected and leave it set as connected for that interval.

That way if the agent is reporting in twice, connected to one collector and not connected to another, then it will show connected.

2. If the agent shows up more than once, each time in a non-connected status, just take the highest status number. (If it only shows up once, it just takes the current number.)

I suspect 2. might be causing your false alerts. Maybe try reversing the > in the script to a < (see below) so it takes the lowest value, and see if the alerts go away. If that works, then we will need to study the problem in more detail and see whether the alerts are really false.

// if the agent is connected, set it connected
if ( metricData[i].timeslicedValue.value == 1 ) {
    agentStatus[thisUid]= metricData[i].timeslicedValue.value;
}
// else if we have already seen this agent connected to any collector, then leave the status as connected (1 means connected)
else if (agentStatus[thisUid] != 1) {
    // else, with the comparison reversed to <, use the lowest error value we have seen for this agent
    if ( metricData[i].timeslicedValue.value < agentStatus[thisUid] ) {
        agentStatus[thisUid]= metricData[i].timeslicedValue.value;
    }
}
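
To make the two rules and the > versus < experiment easier to reason about, here is the same merge logic written as a stand-alone function. This is an illustrative sketch, not part of the posted script: `mergeStatus` and the `pickLowest` flag are hypothetical, and the only status meaning taken from the thread is that 1 means connected.

```javascript
// Sketch of the per-agent merge rule as a pure function.
// previous:   status already recorded for this agent in this interval
// value:      status reported for the same agent by one more collector
// pickLowest: false = the ">" rule, true = the "<" experiment
function mergeStatus(previous, value, pickLowest) {
  if (value == 1) return 1;           // connected anywhere wins
  if (previous == 1) return previous; // already seen connected: keep it
  if (pickLowest) return value < previous ? value : previous;
  return value > previous ? value : previous;
}
```

For example, an agent seen with statuses 2 and then 3 ends up at 3 under the ">" rule and stays at 2 under the "<" rule, while any sighting of 1 gives 1 in both cases.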

Haruhiko Davis

Dec 18, 2014 06:09 PM

I posted it originally, so I can do that.

Chris_Kline

Dec 18, 2014 05:44 PM

Good idea, Fred. @Hiko, wanna add this to the list for us to share?

Legacy User

Dec 18, 2014 05:40 PM

To everyone who installed this script: do you get false alerts?

Btw, is the script in Git, in case someone wants to add comments? :-)

- From time to time we get alerts during the night that are false. We set the alert to go to Danger when the value equals the threshold of 3 over 20 periods (5 min).

- One interesting difference between the script posted here and the one I received a year ago is the following section.
This script:

// if the agent is connected, set it connected
if ( metricData[i].timeslicedValue.value == 1 ) {
    agentStatus[thisUid]= metricData[i].timeslicedValue.value;
}
// else if we have already seen this agent connected to any collector, then leave the status as connected (1 means connected)
else if (agentStatus[thisUid] != 1) {
    // else, just use the highest error value we have seen for this agent
    if ( metricData[i].timeslicedValue.value > agentStatus[thisUid] ) {
        agentStatus[thisUid]= metricData[i].timeslicedValue.value;
    }
}

whereas I was given

         if(metricData[i].timeslicedValue.value == 3) {
            //log.info("ZZZ Agent Status Value is now 3 and previous one is: " + agentStatus[thisUid]);
            if(agentStatus[thisUid]== 3) {
               agentStatus[thisUid]= metricData[i].timeslicedValue.value;
            }
            else if (agentStatus[thisUid]== 0) {
               agentStatus[thisUid]= metricData[i].timeslicedValue.value;
            }

         } else if (metricData[i].timeslicedValue.value == 1) {
            agentStatus[thisUid]= metricData[i].timeslicedValue.value;
         }
         //log.info("ZZZ Agent Status Value after: " + agentStatus[thisUid]);
         // log.info("ZZZ thisUid: " + thisUid);
         // debug stmts
         // log.info("AgentUpTotal: " + agentUpTotal[thisUid]);
         // log.info("Agent " + agentList[agent] + " is up/down: " + agentIsUp[agent]);
      }
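
One way to see where the two versions diverge is to replay the same sequence of reported values through both update rules. The harness below is a hypothetical sketch: it restates each snippet as a pure function and assumes the per-agent status starts at 0 for each interval, which the snippets do not show.

```javascript
// Update rule from the script posted in this thread.
function postedRule(prev, value) {
  if (value == 1) return value;
  if (prev != 1 && value > prev) return value;
  return prev;
}

// Update rule from the older script quoted above.
function olderRule(prev, value) {
  if (value == 3) {
    if (prev == 3 || prev == 0) return value;
    return prev;
  }
  if (value == 1) return value;
  return prev;
}

// Replays a list of reported values, starting from 0 (an assumption).
function replay(rule, values) {
  var status = 0;
  for (var i = 0; i < values.length; i++) {
    status = rule(status, values[i]);
  }
  return status;
}
```

Under that assumption both rules agree on sequences made only of 1s and 3s, for example [3, 1] and [1, 3] both end at 1. They differ when a value other than 1 or 3 is reported: replaying [2] gives 2 with the posted rule and leaves the status at 0 with the older rule, because the older rule ignores every value except 1 and 3.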

I wonder what Duane (see script comments) fixed and why (??).

- Another interesting thing, which we still have trouble confirming, is that in a few cases there are gaps in the metrics. We could not see similar gaps in the agent data or EM metrics, so there is no obvious sign that the MoM was too stressed to keep up with the calculator. In most cases the gap was less than 5 min, so no alerts should have been sent to begin with.

- Another curious thing I observed is that when you restart the MoM, the agent down alerts are sent again, which could be a problem if you use an enterprise system to monitor alerts. In that case you might have already acknowledged the alert, only to have it re-opened by an unrelated MoM restart.

Anon Anon

Dec 17, 2014 01:51 PM

Thank you Duane!

Todd Hall (who told me that you are very sharp) is working with me this week.

Your JavaScript came in very handy for us and saved us a lot of work. The script works like a dream!

Very nice code. (I will be studying it - cool stuff!)

Thanks again,

Steve Stewart

Todd Hall

Dec 17, 2014 09:43 AM

Thanks for putting together this Wiki. In particular, this JavaScript really came in useful yesterday. We recently developed new dashboards that include alarms on agent connection status. We then introduced a new collector, and the load balancing that followed caused many of the alerts to go to 'danger', even though this was a natural event and everything was fine. While considering options to resolve this dilemma (we were thinking of using CLW and a script to validate status across the cluster), we thought to search the community site and found this. We validated it in the test environment and then deployed it to production. The only thing we modified was to comment out the one log.info call early in the script; otherwise it was sending these messages to the EM's log file. Good job Duane Nielsen and everyone else who contributed, this worked great!

ConnectionStatus.zip