This is a JavaScript calculator which brings the "ConnectionStatus" metric from the Collectors to the MOM, where it can be managed across the cluster.
The calculator is most useful when agents are load balanced by the MOM from one Collector to another.
The main things to look out for with calculators are CPU utilization and metric explosion (generating a lot of historical metrics).
Slow EMs are usually slow because of large amounts of historical metrics and long "harvest durations" (the amount of time it takes to read data from the spool and store it in SmartStor).
This is generally caused by too many metrics coming in and not enough disk IO speed.
What I'm trying to say is that the APM cluster is generally IO bound, not CPU bound, so unless you write a super crazy calculator, there are generally plenty of resources available for calculators.
Hi Duane,
Very interesting, thank you. I am bookmarking this for a quiet day.
I also have to learn to monitor the manager metrics so I can detect the impact (or lack thereof) of the scripts.
Hi Fred, here is an example script that calculates a difference between the current and previous value.
diffNetStat_calculator.txt
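For readers without the attachment, here is a minimal sketch of the general idea of computing a difference between the current and previous value. It does not reproduce diffNetStat_calculator.txt. The names `previousValues` and `diffFromPrevious` are hypothetical, and it assumes that a script-level variable keeps its contents between calculator runs, which should be verified in your environment.

```javascript
// Hypothetical cache: metric key -> value seen on the previous run.
// Assumption: this variable survives between calculator executions.
var previousValues = {};

// Returns the change since the previous call for this key,
// or null the first time the key is seen (there is nothing to compare to).
function diffFromPrevious(key, currentValue) {
    var hasPrevious = Object.prototype.hasOwnProperty.call(previousValues, key);
    var delta = hasPrevious ? currentValue - previousValues[key] : null;
    previousValues[key] = currentValue;
    return delta;
}
```

Returning null on the first sighting avoids reporting the whole counter value as a spike when the calculator starts.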
I found an old email reporting that there were gaps in the data due to metricData.length == 0, so I am adding this:
if( metricData.length <= 0 ) { log.info("metricData.length is 0"); }
Would this happen only when the MoM is too busy?
Is there an elegant way to remember the previous values and use them when the script fails to evaluate the current metric values?
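One possible way to remember previous values, sketched under assumptions: keep the last status seen per agent in a script-level cache and fall back to it when no usable data arrives. `lastKnownStatus`, `resolveStatuses` and the `getUid` callback are hypothetical names, only the `timeslicedValue.value` field comes from the scripts in this thread, and whether script-level variables persist between runs is an assumption to verify.

```javascript
// Hypothetical cache: agent id -> last status seen.
// Assumption: this variable survives between calculator executions.
var lastKnownStatus = {};

// Updates the cache from whatever data arrived and returns it.
// When metricData is empty, or a value is missing, the cached
// entries are left untouched, so the previous status is reused.
function resolveStatuses(metricData, getUid) {
    var i, uid, value;
    for (i = 0; i < metricData.length; i++) {
        value = metricData[i].timeslicedValue.value;
        if (value === undefined || value === null) {
            continue; // keep the previous value for this agent
        }
        uid = getUid(metricData[i]);
        lastKnownStatus[uid] = value;
    }
    return lastKnownStatus;
}
```

Note that entries in this cache never expire, so an agent that has gone away for good would keep its last status until something removes it.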
Ok that makes sense, in a funny kind of way... Looks like we may need to handle unmounting of dead agents in the script somehow.
Let me throw it on the to-do list...
Update 1:
I got a few log.info "ZZC timeslicedValue.value is empty undefined or null" messages on the MoM and, interestingly, also on the Collector, where I observed that they might be related to the agent unmount shown in the log lines below (we set the timeout to 10080 minutes):
1/22/15 11:24:41.757 MET [INFO] [TimerBean] [Manager.Agent] Automatically unmounting Agent "SuperDomain|foo|bar|server" after no communication for 10080 minute(s).
1/22/15 11:24:41.758 MET [INFO] [TimerBean] [Manager.Agent] Unmount completed of Agent "SuperDomain|foo|bar|server"
In this case there were no false alerts, as we set the alert to trigger notifications only "when severity increases".
Awesome, let’s see how it goes!
Duane Nielsen
CA Technologies
Principal Consultant, Presales
Tel: 480 760 1559
Mobile: 480 760 1559
Duane.Nielsen@ca.com
Thanks. I am going to push the script at the same time as the other and see if the corresponding alerts behave differently.
Since I wonder whether it is possible for timeslicedValue.value to be undefined or null, and whether that matters at all, I made two temporary changes for my testing. Unfortunately it is going to take a couple of weeks to see anything.
if( ! metricData[i].timeslicedValue.value ) { log.info("ZZC timeslicedValue.value is empty undefined or null"); }
if ( agentStatus[thisUid] < metricData[i].timeslicedValue.value ) { agentStatus[thisUid]= metricData[i].timeslicedValue.value; }
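One thing to keep in mind with the first check: in JavaScript, `! value` is also true when the value is the number 0, so the log message would fire for a status of 0 as well as for undefined or null. If 0 can be a legitimate status value (an assumption here), a stricter test separates the two cases. `isMissing` is a hypothetical helper name.

```javascript
// True only when the value is genuinely absent (undefined or null).
// A numeric 0 counts as a real value, unlike with the `!value` test.
function isMissing(value) {
    return value === undefined || value === null;
}

// Comparison of the two tests for a status of 0:
var looseCheckForZero = !0;             // true: `!value` treats 0 as missing
var strictCheckForZero = isMissing(0);  // false: 0 is kept as a value
```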
Hi Fred, Basically what the change does is...
1. If we see the agent connected anywhere, then set it connected, and leave it set as connected for that interval.
That way, if the agent is reporting in twice, connected to one collector and not connected to another, it will show as connected.
2. In the case where the agent shows up more than once with two non-connected statuses, just take the highest status number (if it only shows up once, it just takes the current number).
I suspect 2. might be causing your false alerts. Maybe try reversing the > in the script to a < (see below) so it takes the lowest value, and see if the alerts go away. If that works, then we will need to study the problem in more detail and see whether the alerts are really false or not.
// if the agent is connected, set it connected
if ( metricData[i].timeslicedValue.value == 1 ) {
    agentStatus[thisUid]= metricData[i].timeslicedValue.value;
}
// else if we have already seen this agent connected to any collector, then leave the status as connected (1 means connected)
else if (agentStatus[thisUid] != 1) {
    // else, just use the lowest error value we have seen for this agent (comparison reversed from > to <)
    if ( metricData[i].timeslicedValue.value < agentStatus[thisUid] ) {
        agentStatus[thisUid]= metricData[i].timeslicedValue.value;
    }
}
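To make the two rules easier to experiment with outside the EM, here is a small sketch of the merge logic as a pure function. `mergeStatus` and the `preferLowest` flag are hypothetical names, not part of the posted script. How the real script initialises agentStatus is not shown in this thread, so the handling of the first sighting is an assumption.

```javascript
// seen:    status already recorded for this agent (undefined if none yet)
// current: status reported by the Collector being processed now
// preferLowest: false = keep the highest error value (original rule 2),
//               true  = keep the lowest (the reversed comparison)
function mergeStatus(seen, current, preferLowest) {
    if (current === 1) { return 1; }             // connected anywhere wins
    if (seen === 1) { return 1; }                // already seen connected: keep it
    if (seen === undefined) { return current; }  // first sighting (assumption)
    if (preferLowest) {
        return current < seen ? current : seen;
    }
    return current > seen ? current : seen;
}
```

For example, an agent reported as 3 by one Collector and 1 by another merges to 1 regardless of the order in which the two values are processed.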
Good idea, Fred. @Hiko, wanna add this to the list for us to share?
To everyone who installed this script: do you get false alerts?
Btw, is the script in Git in case someone wants to add comments :-)
- From time to time we are getting false alerts during the night. We set the alert to Danger when the value equals the threshold of 3 over 20 periods (5 min).
- One interesting difference between the script posted here and the one I received a year ago is in the following section. This script has:
// else if we have already seen this agent connected to any collector, then leave the status as connected (1 means connected)
else if (agentStatus[thisUid] != 1) {
    if ( metricData[i].timeslicedValue.value > agentStatus[thisUid] ) {
        agentStatus[thisUid]= metricData[i].timeslicedValue.value;
    }
}
whereas I was given
if(metricData[i].timeslicedValue.value == 3) {
    //log.info("ZZZ Agent Status Value is now 3 and previous one is: " + agentStatus[thisUid]);
    if(agentStatus[thisUid]== 3) {
        agentStatus[thisUid]= metricData[i].timeslicedValue.value;
    } else if (agentStatus[thisUid]== 0) {
        agentStatus[thisUid]= metricData[i].timeslicedValue.value;
    }
} else if (metricData[i].timeslicedValue.value == 1) {
    agentStatus[thisUid]= metricData[i].timeslicedValue.value;
}
//log.info("ZZZ Agent Status Value after: " + agentStatus[thisUid]);
// log.info("ZZZ thisUid: " + thisUid);
// debug stmts
// log.info("AgentUpTotal: " + agentUpTotal[thisUid]);
// log.info("Agent " + agentList[agent] + " is up/down: " + agentIsUp[agent]);
}
I wonder what Duane (see script comments) fixed and why (??).
- Another interesting thing, which we still have trouble confirming, is that in a few cases there are gaps in the metrics. We could not see similar gaps in the agent data or EM metrics, so there was no obvious sign that the MoM could not keep up with the calculator. In most cases the gap was less than 5 min, so no alerts should have been sent to begin with.
- Another curious thing I observed is that when you restart the MoM, the agent down alerts are sent again, which could be a problem if you use an enterprise system to monitor alerts. In that case you might have already acknowledged the alert, only to get it re-opened by an unrelated MoM restart.
Thank you Duane!
Todd Hall (who told me that you are very sharp) is working with me this week.
Your JavaScript came in very handy for us and saved us a lot of work. The script works like a dream!
Very nice code. (I will be studying it - cool stuff!)
Thanks again,
Steve Stewart
Thanks for putting together this Wiki - and in particular - this JavaScript really came in useful yesterday. We have recently developed new dashboards that include alarms on agent connection status. We introduced a new collector and load balancing occurred, which caused many of the alerts to go to 'danger' even though this was a natural event and everything was fine. While weighing options to resolve this dilemma, we considered using CLW and a script to validate status across the cluster, then thought to search the community site and found this. We validated it in the test environment and then deployed it to production. The only thing we modified was to comment out the one log.info call early in the script, since otherwise it was sending these messages to the EM's log file. Good job Duane Nielsen and everyone else who contributed - this worked great!