DX Application Performance Management


Best method for implementing a JVM "Offline" check

  • 1.  Best method for implementing a JVM "Offline" check

    Posted 09-13-2012 04:17 PM
    Hello all,

    I'm looking for ideas on the best way to implement an alert for an offline JVM. Essentially, I need the most reliable way to monitor and alert when the WebLogic process that I'm monitoring has died. I tried using the Custom Metric Agent (Virtual)|Agents|serverName|WebLogic|agentName|Connection Status metric, but it looks like this metric stops reporting about a minute and a half after the process goes down. Unfortunately, I need the alert to stay active until the process has restarted and reconnected to the Enterprise Manager, because I would like to keep sending pages every 20 minutes until service has been restored (or the alert has been disabled).

    Any assistance you can give is appreciated. I understand I might be able to do this with Calculators, but I was really hoping to have only 1 metric grouping, and use wildcards to roll all of my agents under this one Metric Grouping.

    Thanks,
    Joseph


  • 2.  RE: Best method for implementing a JVM "Offline" check

    Posted 09-13-2012 06:38 PM
    Would using an EPAgent and looking for the JVM process work for you? It should be easy to do on a Unix-flavored OS. I'm not sure about Windows. (But if the server itself goes down, that would be another problem.)


  • 3.  RE: Best method for implementing a JVM "Offline" check
    Best Answer

    Posted 09-13-2012 07:38 PM
    There are a few other threads that have tackled this same topic. I recommend checking out the ones below, as they all relate to your question to some degree:

    Counting available instances
    Getting the Agent Connection Status on a other domain than Superdomain
    introscope uptime report
    Calculate Application DownTime in Introscope

    As well as this KB article:
    Creating an Alert When Application Server or JVM Shuts Down

    The trick is selecting the most appropriate metric to indicate agent availability. Often, this would be the agent's Connection Status metric: SuperDomain|Custom Metric Host (Virtual)|Custom Metric Process (Virtual)|Custom Metric Agent (Virtual)|Agents|<hostname>|<Agent Name>:Connection Status. However, as you noticed, this metric can disappear, and in a load-balanced or fail-over environment it may show an agent as down on one collector when the agent has simply switched to a different collector. In cases like this, since a Calculator will report zeroes when its source data is missing, you can create a calculated metric that is the sum of something like the GC Heap size (e.g. <domain>|<hostname>|<Process Name>|<Agent Name>|GC Heap:Bytes Total). A value of zero would mean the agent is down, and a value greater than zero would mean the agent is up. You could then build your report against the selected metric with the necessary criteria.
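
    To make the zero-means-down idea concrete, here is a minimal Python sketch of the logic only. It is not Introscope Calculator code; the function names and the sample values are hypothetical, and it simply assumes that a missing sample contributes nothing to the sum.

```python
from typing import Iterable, Optional


def calculated_sum(samples: Iterable[Optional[float]]) -> float:
    """Mimic a summing calculator: missing samples (None) add nothing,
    so an agent with no source data at all yields 0."""
    return sum(value for value in samples if value is not None)


def agent_is_up(samples: Iterable[Optional[float]]) -> bool:
    """Treat a heap-size sum greater than zero as 'agent up'."""
    return calculated_sum(samples) > 0


# Hypothetical per-collector samples of GC Heap:Bytes Total for one agent.
# The agent reports through only one collector at a time, so the other
# collector has no data for it (None).
print(agent_is_up([None, 536870912]))  # agent reporting via second collector
print(agent_is_up([None, None]))       # no data anywhere
```

    Because the sum is taken across every place the metric could appear, the result does not depend on which collector the agent happens to be connected to.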


  • 4.  RE: Best method for implementing a JVM "Offline" check

    Posted 09-14-2012 09:09 AM
    Hey guys, thanks for getting back with me!

    jfaldmo: That is certainly an option, but I'm still teaching myself how to build EPAgent scripts, so for now, I was hoping to avoid that. We already do something like that with a Perl script, which does port Up/Down checks on the JVMs. Thanks for the suggestion though!

    jakbutler,

    I can't seem to get that KB article to open (it just opens to a blank page in the APM Community Forums view). I'll try again in a few; we're having some ISP slowness here today. I did do the Calculator approach for an Agent (I got the suggestion from the "Getting the Agent Connect Status on a domain..." thread that you linked), but this is a little tough, because we have over 200 JVMs that we monitor between QA and Production, so creating a calculator for each one isn't very supportable.

    Thanks guys for the feedback though; I'll keep plugging away at it.


  • 5.  RE: Best method for implementing a JVM "Offline" check

    Posted 09-14-2012 09:51 AM
    The KB article says to use Connection Status, which, as noted, does not work in a load-balanced environment because the agent can switch to a different collector (which is what we are seeing, by the way). I would also be interested in a better, more general solution.

    Roger Meli


  • 6.  RE: Best method for implementing a JVM "Offline" check

    Broadcom Employee
    Posted 09-14-2012 12:30 PM
    Here's the old KB article (attached as a PDF).


  • 7.  RE: Best method for implementing a JVM "Offline" check

    Posted 09-14-2012 01:53 PM
    Thanks hdavis for the .pdf. So it looks like the only way to do this is either EPAgent or an individual calculator for each JVM. That's a little disappointing; I was hoping for something a little more scalable.

    I'll leave this thread "unanswered" for a couple days in case someone has a better method, but if not, I'll mark a solution.

    Thanks,
    Joseph


  • 8.  RE: Best method for implementing a JVM "Offline" check

    Posted 09-20-2012 05:30 PM
    We use a standalone Java program with a property file that goes like this -

    JVM_NAME1,http://server:<port>/context/root
    JVM_NAME2,http://server2:<port>/context/root
    .
    .
    .
    JVM_NAMEn,http://servern:<port>/context/root

    The code runs on the EPAgent and issues an HTTP GET against each of these URLs at 15-second intervals, returning 1 if we get a 200 response and 0 for everything else. We configured alerts based on this number. This is much simpler, as it can be integrated with any EPAgent, and all JVMs can be grouped into a single metric grouping. Your major effort would be creating the property file, which would be a pain if you have around 200 distinct JVMs.
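
    For readers who want to try this approach, here is a rough Python sketch of the same idea (the original is a Java program that was not posted). All function names, the "JVM Availability" metric path, and port 8080 in the usage note are hypothetical. The XML metric line is an assumption about what an EPAgent plugin expects on stdout, so check it against your EPAgent documentation before relying on it.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional


def parse_targets(text: str) -> Dict[str, str]:
    """Parse 'JVM_NAME,url' lines; skip blanks, comments and filler lines."""
    targets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "," not in line:
            continue
        name, url = line.split(",", 1)
        targets[name.strip()] = url.strip()
    return targets


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for a GET, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None      # refused, timed out, bad URL, and so on


def availability(url: str, fetch: Callable[[str], Optional[int]] = http_status) -> int:
    """1 for a 200 response, 0 for everything else."""
    return 1 if fetch(url) == 200 else 0


def poll_once(targets: Dict[str, str],
              fetch: Callable[[str], Optional[int]] = http_status) -> List[str]:
    """Build one metric line per JVM (line format is an assumption)."""
    return [
        f'<metric type="IntCounter" name="JVM Availability|{name}:Up" '
        f'value="{availability(url, fetch)}" />'
        for name, url in targets.items()
    ]
```

    A caller would load the property file once with parse_targets (for example, a line such as JVM_NAME1,http://server:8080/context/root), then print the lines from poll_once every 15 seconds. Passing a custom fetch function makes the check easy to test without touching the network.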

    Thanks.
    AD


  • 9.  RE: Best method for implementing a JVM "Offline" check

    Posted 09-20-2012 08:11 PM
    I've attached a custom JavaScript calculator that I created for this purpose. It looks at connected agents that have been assigned to a domain and checks for the GC Heap:Bytes In Use metric: if the metric is present, the agent is considered connected; if it is missing or zero, the agent is considered disconnected. Agents must be in a custom domain to be evaluated. Agents that are only in SuperDomain are not evaluated, which gives an administrator a place to adjust Agent and Process names before connectivity metrics are generated. The calculator always returns either a 1 or a 0 for any known agent; if an agent disconnects and gets unmounted, the previously generated Agent Connectivity metric is used to preserve visibility of the disconnected agent. Once an agent that is assigned to a domain has been evaluated by this calculator, the resulting metric is generated accurately regardless of which collector (in a cluster) the agent is connected to.

    For each evaluated agent, this produces a metric with the following path, where SuperDomain/Custom Domain is the name of the domain to which the agent is assigned, Host Name is the name of the server on which the agent is running, and Agent Name is the name of the agent in the Investigator:
    SuperDomain/Custom Domain|Custom Metric Host (Virtual)|Custom Metric Process (Virtual)|Custom Metric Agent (Virtual)|Host Name|Agent Name:Agent Connectivity

    You can easily create a Metric Group off of the output metric and create individual alerts on the Agent Connectivity metrics or the Metric Group for the domain as a whole.
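
    The attached script itself is not reproduced in this thread. As a rough illustration of the behaviour described above, here is a Python sketch of the bookkeeping only: the class name, the tuple keys, and the sample domain and host names are hypothetical, and none of this is the Introscope JavaScript calculator API.

```python
from typing import Dict, Optional, Set, Tuple

AgentKey = Tuple[str, str, str]  # (domain, host name, agent name)


class AgentConnectivityTracker:
    """Emit 1 or 0 for every agent ever seen in a custom domain."""

    def __init__(self) -> None:
        self._known: Set[AgentKey] = set()

    def evaluate(self, heap_in_use: Dict[AgentKey, Optional[float]]) -> Dict[AgentKey, int]:
        """heap_in_use holds the latest GC Heap:Bytes In Use per mounted agent."""
        # Remember newly seen agents, but skip those only in SuperDomain.
        for domain, host, agent in heap_in_use:
            if domain != "SuperDomain":
                self._known.add((domain, host, agent))

        # Every known agent gets a value, even if it has since been unmounted.
        result = {}
        for key in self._known:
            value = heap_in_use.get(key)
            result[key] = 1 if value is not None and value > 0 else 0
        return result


tracker = AgentConnectivityTracker()
prod = ("SuperDomain/Production", "host1", "AgentA")  # hypothetical names
print(tracker.evaluate({prod: 1048576}))  # agent reporting
print(tracker.evaluate({}))               # agent unmounted, still reported
```

    Keeping the set of known agents is what lets a disconnected agent continue to produce a 0 instead of silently dropping out of the metric group.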

    I'm very interested in hearing feedback on this script, as this is a first attempt.

    Attachment(s): Agent Connectivity.js (4K)


  • 10.  Re: RE: Best method for implementing a JVM "Offline" check

    Posted 06-30-2014 07:43 AM

    We also attempted to determine whether an agent is disconnected. In the case of a Java agent, if the agent is reporting, then its connection status is being reported to one of our collectors.

    Working through the various summary alert settings, we found that the following settings on a summary alert work:

    Metric grouping: a wildcard for the specific agent's connection status, so that it is picked up no matter which collector the agent is reporting to.

        Agent expression:

                (.*)\|Custom Metric Process \(Virtual\)\|Custom Metric Agent \(Virtual\) \((.*)@5001\)

        metric expression:

                Agents\|<agent host>\|AIXAgent\|PerfMonAgent:ConnectionStatus

    Comparison Operator: Greater than

    Trigger Alert: Whenever Severity Changes

    Combination: all

    Danger threshold: 1

    Danger periods over threshold: 80

    Observed periods: 80 (20 minutes)

    Caution threshold: 1

    Periods over threshold: 42

    Observed periods: 42 (10:30 minutes)
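
    As a quick sanity check on these settings, the Python sketch below tests the agent expression against two example agent paths and converts the period counts into time windows. The collector names and full agent paths are hypothetical, the 15-second period is inferred from 80 periods equalling 20 minutes, and it assumes the expression must match the whole agent path.

```python
import re

# Agent expression from the settings above, unchanged.
AGENT_EXPR = re.compile(
    r"(.*)\|Custom Metric Process \(Virtual\)\|Custom Metric Agent \(Virtual\) \((.*)@5001\)"
)


def matches_any_collector(agent_path: str) -> bool:
    """True if the path is a collector's virtual agent on port 5001."""
    return AGENT_EXPR.fullmatch(agent_path) is not None


def window(periods: int, period_seconds: int = 15) -> str:
    """Convert a number of reporting periods to minutes:seconds."""
    minutes, seconds = divmod(periods * period_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Hypothetical agent paths for two different collectors.
for collector in ("collector01", "collector02"):
    path = (
        "Custom Metric Host (Virtual)|Custom Metric Process (Virtual)|"
        f"Custom Metric Agent (Virtual) ({collector}@5001)"
    )
    print(collector, matches_any_collector(path))

print(window(80))  # danger window
print(window(42))  # caution window
```

    The second capture group is what makes the grouping collector-independent: any collector name before @5001 matches, so the alert follows the agent when it moves.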

     

    Then we have a summary alert that bundles all of the agent connection statuses into categories (Java, Windows, AIX, etc.), and we have an action defined on the category summary alerts.

     

    It has been working pretty well. The only drawback is when you have a long-duration ADS and the agent does not return after a reboot until after the 20-minute window has passed.

     

    To work around this, we built a dashboard that displays all of the summary alerts, and we review it after system reboots to ensure all alerts are reporting and not grayed out.

     

    Hope this helps,

     

    Billy