Automic Workload Automation

How do you monitor your AE?

  • 1.  How do you monitor your AE?

    Posted Mar 25, 2018 02:47 PM
    Hi all,

    We (my colleagues and I) recently had a discussion about how basically everything can be checked/monitored with Automic.

    So out of curiosity: how and what do you check on your Automation Engine with the AE itself (via scripts, jobs, EMI, ...) periodically?

    We do a basic system check once a day, plus periodic checks (monitoring).

    The basic check covers:
    * whether all CPs and WPs are running
    * whether all Core Agents are running
    * whether all Clients are in status GO
    * the number of activities


    The periodic checks are performed every 3 to 10 minutes and consist of:
    * Core Agents
    * whether the AE processes are running
    * whether the AE processes are writing data to their logfiles

    thanks for your inputs!

    cheers, Wolfgang


  • 2.  How do you monitor your AE?

    Posted Mar 26, 2018 05:54 AM
    Hi FrankMuffke

    Just out of curiosity, what kind of scripts do you run? Do you use AE scripting?

    Some time ago I posted this: https://community.automic.com/discussion/6927/system-healthcheck

    Not sure if it helps, what you do is probably rather similar (and more advanced :))


    Best regards,
    Antoine


  • 3.  How do you monitor your AE?

    Posted Mar 26, 2018 06:56 AM
    Hi.

    We monitor:

    • processes on the UNIX server (via Nagios)
    • processes and queue loads via a simple UC4 script I wrote. It only alerts via email if any thresholds are exceeded, like "load over last 15 minutes" or such (SYS_SERVER_ALIVE and friends ...)
    • I monitor key agents via a shell script and UNIX service manager (and restart them if crashed)
    • we monitor actual Job execution, by executing a heartbeat job periodically that writes a file with the time, which then gets verified by Nagios (because SYS_HOST_ALIVE only goes so far - we had agents hang but still report they're alive ...)
    • various additional UC4 scripts by MatthiasSchelp to alert in case of unavailable Java agents (SAP, RA)
    • I monitor changes to the agent list by reading agents from the DB with a shell script, and automatically comparing them against the list of the previous day (using sdiff on Linux: needed because other departments sometimes install agents without telling us, and Automic sadly does not allow full license control purely by the server, so new agents can eat licenses without the Server Admin even allowing them to - bad design!)
    • I monitor the various MQs with a shell script (via SQL), and alert in case of unusually high levels
    • another shell script monitors how many jobs each department has active, and alerts me at unusually high levels, so I can tell SAP to cut it out if they spawn 50000 jobs at once
    • I log (and incidentally analyze) the activation lag of jobs, i.e. time between activation and start of jobs (SQL query)
    • we monitor various DB parameters
    • (in preparation) monitoring for an unusually high number of DB deadlocks with Automic (after recent events)
    • probably some more shell scripts that monitor various things
    • I monitor the automic community via lynx, alerting me of newly found Automic issues by looking for any new posts by FrankMuffke (just kidding, I don't :p )
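The agent-list comparison from the list above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script actually used in this post: it assumes the agent names have already been exported from the DB into one plain-text file per day (one name per line), the function name `new_agents` is hypothetical, and it uses `comm` instead of `sdiff` so that only names that are new today are printed.

```shell
#!/bin/sh
# Print agents that appear in today's list but not in yesterday's.
# Assumption: each file holds one agent name per line (order does not matter).
# Usage: new_agents YESTERDAY_FILE TODAY_FILE
new_agents() (
  tmpdir=$(mktemp -d) || exit 1
  # comm requires sorted input, so sort both lists first
  sort -u "$1" > "$tmpdir/old"
  sort -u "$2" > "$tmpdir/new"
  # -13 suppresses lines found only in the old list and lines common to both,
  # leaving only the names that are new today
  comm -13 "$tmpdir/old" "$tmpdir/new"
  rm -rf "$tmpdir"
)
```

If the function prints anything, an alert mail could be sent with that output as the body; an empty result means no agent was added since the previous day.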
    Hth,
    Carsten

    p.s. monitoring is like money, old camera lenses and Battlefield 3 experience points: Amass any amount you can think of, it's still never enough.


  • 4.  How do you monitor your AE?

    Posted Mar 26, 2018 06:59 AM
    Hi Wolfgang,

    I'm sure you know about this, but for checking whether the AE and the agents are alive you can use the SYS_HOST_ALIVE and SYS_SERVER_ALIVE features. In my opinion, though, the more useful approach is not to check IF the components are alive but to get a message when they are NOT running, e.g. using the EXECUTE_ON_END feature in the UC_HOSTCHAR_* variable.

    As for checking whether the client is available and running, I have no idea, I'm afraid.


  • 5.  How do you monitor your AE?

    Posted Mar 26, 2018 07:20 AM
    Just to make myself clearer: we distinguish between monitoring (alert if a component isn't running) and health check (how is my system performing AT THE MOMENT; that's a report only).

    Thanks a lot guys for your replies!

    cheers, Wolfgang


  • 6.  How do you monitor your AE?

    Posted Mar 26, 2018 08:47 AM
    Oh, one more:

    I also monitor the number of objects in the Transport Case. You know, because someone once managed to put ALL the objects into the Transport Case, and transport them :)


  • 7.  How do you monitor your AE?

    Posted Mar 26, 2018 10:08 AM
    Carsten_Schmitz_7883

    THX for the input - especially "I log (and incidentally analyze) the activation lag of jobs, i.e. time between activation and start of jobs (SQL query)"

    That's a very good input I may steal from you :-)
    How do you distinguish between AE slowness and heavy scripting load?
    E.g. a workflow construct with many objects (and scripting effort) in it, especially waits, versus a job that really gets generated very slowly?

    many thanks, Wolfgang


  • 8.  How do you monitor your AE?

    Posted Mar 26, 2018 11:24 AM
    "That's a very good input I may steal from you :-)"
    Sure thing.

    "How do you distinguish between AE slowness and heavy scripting load? E.g. a workflow construct with many objects (and scripting effort) in it, especially waits, versus a job that really gets generated very slowly?"
    Here's what I do (output formatting heavily Larry Ellison specific):

    #!/bin/bash

    mkdir -p /var/log/scripts/activation_lag_analysis

    cat > /tmp/activation_lag_analysis.sql << EOF
    SET PAGESIZE 0
    SET NEWPAGE 0
    SET SPACE 0
    SET LINESIZE 32767
    SET ECHO OFF
    SET FEEDBACK OFF
    SET VERIFY OFF
    SET HEADING OFF
    SET MARKUP HTML OFF SPOOL OFF
    SET COLSEP '|'
    alter session set nls_territory = 'GERMANY';
    ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MM-RR HH24:MI:SS';

    select AH_OH_IDNR as OHID, AH_IDNR as Runid, AH_TIMESTAMP1 as Activation, AH_STATUS as Status, AH_HOSTDST as Destination, AH_TIMESTAMP2 as Launch, to_char(AH_TIMESTAMP2, 'DAY') as Day, round((AH_TIMESTAMP2 - AH_TIMESTAMP1) * 24 * 60 * 60,0) as seconds_diff from AH,dual where AH_OH_IDNR in (select OH_IDNR from OH where OH_NAME like '%JOB%DC1%HEARTBEAT%' AND OH_NAME NOT LIKE '%OLD.%' and OH_NAME not like 'JOBP%') AND AH_STATUS = 1900;
    EOF

    if [ ! -f /var/log/scripts/activation_lag_analysis/README.txt ] ; then
      cat > /var/log/scripts/activation_lag_analysis/README.txt << EOF2

      files here are vital for determining system performance via
      time between activation and launch of UC4 key jobs. Do NOT
      remove files here, and in any case none which are younger than
      one year.

      Thanks
      --cschmitz (via some script)

    EOF2
    fi

    # Make the Oracle client libraries visible to sqlplus64
    if [ -z "$LD_LIBRARY_PATH" ] ; then
      export LD_LIBRARY_PATH="/usr/lib/oracle/11.2/client64/lib"
    else
      export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib/oracle/11.2/client64/lib"
    fi

    # Run the query and store the compressed result under today's date
    cat /tmp/activation_lag_analysis.sql \
      | sqlplus64 -S "you_wish/my_password@(DESCRIPTION =(ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)(HOST = secret_sauce)(PORT = 1337)))(CONNECT_DATA =(SERVICE_NAME = burger_king_home_delivery_service)(SERVER = DEDICATED)))" \
      | bzip2 \
      > /var/log/scripts/activation_lag_analysis/$(date +%y-%m-%d).bz2

    This runs from cron (use UC4? Who am I? :) ) once every week. Every once in a while, I make a really nice Excel spreadsheet from it (you need to de-duplicate the data first, because you will have overlap due to the weekly collection), and it also (I just left that in as a best practice) tells the Linux admins not to touch my sh*t :)

    edit: If you look at the SQL, this only logs the times for a Job called "JOBS.DC1.HEARTBEAT". That's my monitoring job for actual agent operation that runs very often per day. Using just this often-running, but well known job as the basis for statistics also eliminates a lot of uncertainty.

    To answer your question: Engine slowness is when all jobs on all agents are slow. But usually I see individual agents being slow; that points to network issues or heavily loaded agents.
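To tell "everything is slow" apart from "one agent is slow", the collected rows can be summarised per agent. The following is only a sketch under assumptions, not part of the original script: it assumes the pipe-separated column order of the SELECT above (field 5 = destination agent, field 8 = seconds between activation and launch), and the function name `lag_per_agent` is hypothetical.

```shell
#!/bin/sh
# Summarise activation lag per agent from pipe-separated rows on stdin.
# Output per agent: name|row count|average lag (rounded, seconds)|maximum lag
lag_per_agent() {
  awk -F'|' '
    NF >= 8 {
      agent = $5; lag = $8
      # sqlplus pads columns with blanks, so trim both fields
      gsub(/^[ \t]+|[ \t]+$/, "", agent)
      gsub(/^[ \t]+|[ \t]+$/, "", lag)
      # skip rows without an agent name or without a numeric lag
      if (agent == "" || lag !~ /^-?[0-9]+$/) next
      n[agent]++; sum[agent] += lag
      if (!(agent in max) || lag + 0 > max[agent]) max[agent] = lag + 0
    }
    END {
      for (a in n) printf "%s|%d|%d|%d\n", a, n[a], int(sum[a] / n[a] + 0.5), max[a]
    }
  ' | sort
}
```

One possible use would be `bzcat /var/log/scripts/activation_lag_analysis/*.bz2 | sort -u | lag_per_agent`, where `sort -u` drops rows that are byte-for-byte duplicates from overlapping weekly collections. If every agent shows a high average, the engine is the likely culprit; if only one stands out, look at that agent or its network path.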

    Hth,
    Carsten


  • 9.  How do you monitor your AE?

    Posted Mar 26, 2018 11:34 AM
    Here. A silly Excel sheet says more than a thousand words. I am quite fond of the heat map-style colors though :)

    Attachment: act_lag_redacted.xlsx (70 KB)


  • 10.  How do you monitor your AE?

    Posted Mar 26, 2018 01:43 PM
    oh wow - many thanks for this input!
    so the heartbeat job is your reference for AE, I understand!

    and many THX for the excel example!

    cheers, Wolfgang


  • 11.  How do you monitor your AE?

    Posted Mar 30, 2018 07:48 PM
    Hello,

    I thought that we had a lot of monitoring solutions :D

    1. Silly old SNMP agent. A few of the TRAP codes are ignored (warm start of an agent?!). - Ticket
    It's too bad that the title for "System error of the UC4 Server" is so generic.
    LINK
    2. At OS level, psmon checks the process count of ucsrvwp, ucsrvcp and snmp1 (JWP is currently not included) - Ticket
    3. SYS_HOST_ALIVE every 30 minutes against all OS agents - Email
    4. Simple SAP job (RSUSR000) against all SAP agents every 30 min. MRT terminator at 3 min (I'm too generous) - Email
    5. HEARTBEAT - a Unix job prints a timestamp to a file every 10 min. UXMON checks the file age. Maximum age: 20 min - Ticket
    6. Job failure monitoring - post-process include. It parses the job name against a static variable. Depending on the priority and the app team, a ticket or an email is generated (OVO_MON.log and Send_Mail)
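The file-age test behind the HEARTBEAT item can be sketched in a few lines of shell. This is an illustration only, not the UXMON configuration referred to above: it assumes GNU `stat` and `date` (Linux), and the function name `heartbeat_is_fresh` is hypothetical. The default limit of 1200 seconds mirrors the 20-minute maximum age mentioned in the list.

```shell
#!/bin/sh
# Succeed (exit status 0) if FILE was modified within the last MAX_AGE seconds.
# Usage: heartbeat_is_fresh FILE [MAX_AGE_SECONDS]
heartbeat_is_fresh() (
  file=$1
  max_age=${2:-1200}            # default: 20 minutes
  [ -f "$file" ] || exit 2      # a missing heartbeat file counts as a failure
  now=$(date +%s)
  mtime=$(stat -c %Y "$file") || exit 2   # last modification, epoch seconds
  [ $((now - mtime)) -le "$max_age" ]
)
```

A monitoring wrapper would call it as `heartbeat_is_fresh /path/to/heartbeat.txt 1200 || raise_alert`, where both the path and `raise_alert` are placeholders. Because the check looks at the file the job writes rather than at the agent's own status, it also catches agents that hang but still report themselves as alive.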


  • 12.  How do you monitor your AE?

    Posted Mar 31, 2018 03:31 AM
    Many thanks for your Input!

    cheers, Wolfgang


  • 13.  RE: How do you monitor your AE?

    Posted Jan 28, 2021 07:10 PM
    Hi team,

    This thread leads to what I am looking for. I want to create an alert that tells me by mail when the AE is not working, but I have not succeeded so far. I have a PowerShell script that monitors all components.

    But in recent days we had a network problem while all components were up (WP, CP, JCP, etc.).

    We only noticed the error by looking at the AE logs: the AE had no connection to the DB because of the network problem.

    How could I detect this? Has anyone implemented a free tool or created an external script for this?

    I tried to monitor with Oracle Java Mission Control using EMI.JAR, but they removed the functionality to configure the mail and send alerts.

    Thank you very much for your comments

    ------------------------------
    Computer Engineer
    N.A
    ------------------------------



  • 14.  RE: How do you monitor your AE?

    Posted Jan 29, 2021 12:32 PM
    Edited by Pete Wirfs Jan 29, 2021 12:33 PM
    We run a process against the message log (table=MELD) to trap all severe errors that were not expected, and throw appropriate alarms.  This traps dropped communications with the various agents. We run it once every 30 minutes.

    This is the SQL Server query we use for that process:
    -- finds all RED messages on the message log

    select distinct
    -- converts gmt to pacific time
    dateadd(hour, datediff(hour, getutcdate(), getdate()), meld_timestamp) as TIMESTAMP
    , meld_msginsert as MESSAGE_DETAILS
    , meld_type
    , meld_msgnr as MESSAGE_NUM
    from meld
    where ((meld_seen = 1 and meld_client = &$CLIENT#) or meld_msgnr = 11885)

    -- *********************FILTER-RULES***************************************************
    and meld_type <> 30 -- filter out all non-critical messages
    and meld_msgnr <> 00050 -- filter out all "UC4 alarm popup was deliberately cancelled"
    and meld_msgnr <> 11003 -- filter out bad RC messages (oncall alerts handle this)
    and meld_msgnr <> 11007 -- filter out bad RC messages (oncall alerts handle this)
    and meld_msgnr <> 11067 -- filter out Notification Aborted
    and meld_msgnr <> 11051 -- filter out initiated CANCEL messages
    and meld_msgnr <> 11506 -- filter out CANCEL messages
    and meld_msgnr <> 12111 -- filter out CANCEL messages
    and meld_msgnr <> 20582 -- filter out CANCEL messages
    and meld_msgnr <> 11054 -- filter out "terminated with CANCEL"
    and meld_msgnr <> 11069 -- filter out "aborted by escalation"
    and meld_msgnr <> 11544 -- filter out "runtime of task 'X' has been exceeded. Task 'Y' will be started"
    and meld_msgnr <> 11547 -- filter out "Time Checkpoint for task 'X' has been exceeded. Task 'Y' will be started"
    and meld_msgnr <> 50031 -- filter out "SMTP client cannot get host information..." (This happens in DR tests)
    and meld_msgnr <> 11348 -- filter out "Task 'X' was stopped for modification." MRT.DAILY can cause this.
    and meld_msgnr <> 11104 -- filter out "Task 'X' was started." MRT.DAILY can cause this.
    -- *********************FILTER-RULES***************************************************

    and dateadd(hour, datediff(hour, getutcdate(), getdate()), meld_timestamp)
    >
    dateadd(minute, -180, getdate())
    order by 1
    ;


    ------------------------------
    Pete Wirfs
    SAIF Corporation
    Salem Oregon USA
    ------------------------------



  • 15.  RE: How do you monitor your AE?

    Posted Jan 29, 2021 12:36 PM
    Note that the idea I just posted would not throw an alarm if your AE was no longer able to communicate with its own database. It requires the AE to still be functional, and without a database it cannot function.

    ------------------------------
    Pete Wirfs
    SAIF Corporation
    Salem Oregon USA
    ------------------------------



  • 16.  RE: How do you monitor your AE?

    Posted Mar 22, 2021 12:31 PM
    Hi @Johnny andrey

    This still works for me. I'm using AdoptOpenJDK with its JDK Mission Control build and AE V12.3.4:

    [screenshot not preserved]

    Cheers
    Christoph 




    ------------------------------
    ----------------------------------------------------------------
    Automic AE Consultant and Trainer since 2000
    ----------------------------------------------------------------
    ------------------------------



  • 17.  RE: How do you monitor your AE?

    Posted Mar 29, 2021 04:11 PM
    Edited by Marcin Uracz Mar 29, 2021 04:32 PM
    Recently I started playing around with Telegraf pushing to InfluxDB, as the monitoring solution I had been using was sunset some time ago.

    Telegraf seems to integrate pretty well with the data received from EMI. I added some standard HTTP checks, and here is what I have settled on so far:

    [screenshot not preserved]


    ------------------------------
    Cheers,
    Marcin
    ------------------------------



  • 18.  RE: How do you monitor your AE?

    Posted Mar 30, 2021 02:42 AM
    Thank you Marcin.  I will look into whether we are using that tool.  Greatly appreciate the information.