Automic Workload Automation

How do you monitor your AE?

  • 1.  How do you monitor your AE?

    Posted 03-25-2018 02:47 PM
    Hi all,

    My colleagues and I recently had a discussion about how basically everything can be checked/monitored with Automic.

    So, out of curiosity: how and what do you check periodically on your Automation Engine with your AE (per script, jobs, emi, ...)?

    We do a basic system check once a day, plus periodic checks (monitoring).

    The basic check covers:
    * whether all CPs and WPs are running
    * whether all Core Agents are running
    * whether all Clients are in status GO
    * the number of activities


    The periodic checks run every 3 to 10 minutes and consist of:
    * Core Agents
    * whether the AE processes are running
    * whether the AE processes are writing data to their logfiles

    Thanks for your input!

    cheers, Wolfgang
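
A minimal sketch of the "do all CPs and WPs run" check described above. It assumes the server processes show up under the names ucsrvcp and ucsrvwp (the names mentioned further down in this thread) and that the expected minimum counts are known. The helper names `count_procs` and `check_min` are hypothetical. The functions read a process listing from stdin, so they can be fed from `ps`.

```shell
# count_procs NAME: print how many lines on stdin are exactly NAME.
count_procs() {
  local name="$1"
  # grep -c still prints 0 when nothing matches but exits 1, hence "|| true".
  grep -c -x -F -- "$name" || true
}

# check_min NAME MIN: succeed if stdin lists at least MIN processes called NAME.
check_min() {
  local name="$1" min="$2" found
  found=$(count_procs "$name")
  if [ "$found" -lt "$min" ]; then
    echo "ALERT: $name: $found running, expected at least $min"
    return 1
  fi
  echo "OK: $name: $found running"
}
```

Possible usage, with a made-up minimum of four work processes: `ps -eo comm= | check_min ucsrvwp 4`. Note that this only proves the processes exist, not that they do useful work, which is why the logfile check is done as well.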


  • 2.  How do you monitor your AE?

    Posted 03-26-2018 05:54 AM
    Hi FrankMuffke

    Just out of curiosity, what kind of scripts do you run? Do you use AE scripting?

    Some time ago I posted this: https://community.automic.com/discussion/6927/system-healthcheck

    Not sure if it helps, what you do is probably rather similar (and more advanced :))


    Best regards,
    Antoine


  • 3.  How do you monitor your AE?

    Posted 03-26-2018 06:56 AM
    Hi.

    We monitor:

    • processes on the UNIX server (via Nagios)
    • processes and queue loads via a simple UC4 script I wrote. It only alerts via email if any thresholds are exceeded, like "load over last 15 minutes" or such (SYS_SERVER_ALIVE and friends ...)
    • I monitor key agents via a shell script and UNIX service manager (and restart them if crashed)
    • we monitor actual Job execution, by executing a heartbeat job periodically that writes a file with the time, which then gets verified by Nagios (because SYS_HOST_ALIVE only goes so far - we had agents hang but still report they're alive ...)
    • various additional UC4 scripts by MatthiasSchelp to alert in case of unavailable Java agents (SAP, RA)
    • I monitor changes to the agent list by reading agents from the DB with a shell script, and automatically comparing them against the list of the previous day (using sdiff on Linux: needed because other departments sometimes install agents without telling us, and Automic sadly does not allow full license control purely by the server, so new agents can eat licenses without the Server Admin even allowing them to - bad design!)
    • I monitor the various MQs with a shell script (via SQL), and alert in case of unusually high levels
    • another shell script monitors how many jobs each department has active, and alerts me at unusually high levels, so I can tell SAP to cut it out if they spawn 50000 jobs at once
    • I log (and incidentally analyze) the activation lag of jobs, i.e. time between activation and start of jobs (SQL query)
    • we monitor various DB parameters
    • (in preparation) monitoring for an unusually high number of DB deadlocks with Automic (after recent events)
    • probably some more shell scripts that monitor various things
    • I monitor the automic community via lynx, alerting me of newly found Automic issues by looking for any new posts by FrankMuffke (just kidding, I don't :p )
    Hth,
    Carsten

    p.s. monitoring is like money, old camera lenses and Battlefield 3 experience points: Amass any amount you can think of, it's still never enough.
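
A minimal sketch of the daily agent list comparison from the list above. The post uses sdiff; this sketch swaps in `grep` so that it prints only the names that appeared or disappeared. It assumes each file holds one agent name per line (for example, yesterday's and today's export from the DB). The function names are hypothetical.

```shell
# only_in_second A B: print lines of file B that do not occur in file A.
only_in_second() {
  grep -v -x -F -f "$1" "$2"
  # grep exits 1 when nothing differs; only a status above 1 is a real error.
  [ $? -le 1 ]
}

# new_agents OLD NEW: agents that showed up since the previous list.
new_agents() {
  only_in_second "$1" "$2"
}

# removed_agents OLD NEW: agents that vanished since the previous list.
removed_agents() {
  only_in_second "$2" "$1"
}
```

Possible usage, with made-up file names: `new_agents agents_yesterday.txt agents_today.txt`. Any output means somebody installed an agent without telling you.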


  • 4.  How do you monitor your AE?

    Posted 03-26-2018 06:59 AM
    Hi Wolfgang,

    I'm sure you know about this, but to check whether the AE and agents are alive you can use the SYS_HOST_ALIVE and SYS_SERVER_ALIVE features. In my opinion, though, the more useful approach is not to check IF the components are alive but to get a message when they are NOT running, e.g. using the EXECUTE_ON_END feature in the UC_HOSTCHAR_* variable.

    As for checking whether the client is available and running, I have no idea, I have to admit.


  • 5.  How do you monitor your AE?

    Posted 03-26-2018 07:20 AM
    Just to make myself clearer: we distinguish between monitoring (alert if a component isn't running) and health check (how is my system performing AT THE MOMENT - that's a report only).

    Thanks a lot guys for your replies!

    cheers, Wolfgang


  • 6.  How do you monitor your AE?

    Posted 03-26-2018 08:47 AM
    Oh, one more:

    I also monitor the number of objects in the Transport Case. You know, because someone once managed to put ALL the objects into the Transport Case, and transport them :)


  • 7.  How do you monitor your AE?

    Posted 03-26-2018 10:08 AM
    Carsten_Schmitz_7883

    THX for the input - especially "I log (and incidentally analyze) the activation lag of jobs, i.e. time between activation and start of jobs (SQL query)"

    That's a very good input I may steal from you :-)
    How do you distinguish between AE slowness and heavy scripting load?
    E.g. a workflow with many objects (and a lot of scripting in it) that simply waits a long time, versus a job that really is generated very slowly?

    many thanks, Wolfgang


  • 8.  How do you monitor your AE?

    Posted 03-26-2018 11:24 AM
    "That's a very good input I may steal from you :-)"
    Sure thing.

    "How do you distinguish between AE slowness and heavy scripting load?
    E.g. a workflow with many objects (and a lot of scripting in it) that simply waits a long time, versus a job that really is generated very slowly?"
    Here's what I do (output formatting heavily Larry Ellison specific):

    #!/bin/bash

    mkdir -p /var/log/scripts/activation_lag_analysis

    cat > /tmp/activation_lag_analysis.sql << EOF
    SET PAGESIZE 0
    SET NEWPAGE 0
    SET SPACE 0
    SET LINESIZE 32767
    SET ECHO OFF
    SET FEEDBACK OFF
    SET VERIFY OFF
    SET HEADING OFF
    SET MARKUP HTML OFF SPOOL OFF
    SET COLSEP '|'
    alter session set nls_territory = 'GERMANY';
    ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MM-RR HH24:MI:SS';

    select AH_OH_IDNR as OHID,
           AH_IDNR as Runid,
           AH_TIMESTAMP1 as Activation,
           AH_STATUS as Status,
           AH_HOSTDST as Destination,
           AH_TIMESTAMP2 as Launch,
           to_char(AH_TIMESTAMP2, 'DAY') as Day,
           round((AH_TIMESTAMP2 - AH_TIMESTAMP1) * 24 * 60 * 60,0) as seconds_diff
      from AH,dual
     where AH_OH_IDNR in (select OH_IDNR
                            from OH
                           where OH_NAME like '%JOB%DC1%HEARTBEAT%'
                             and OH_NAME not like '%OLD.%'
                             and OH_NAME not like 'JOBP%')
       and AH_STATUS = 1900;
    EOF

    if [ ! -f /var/log/scripts/activation_lag_analysis/README.txt ] ; then
      cat > /var/log/scripts/activation_lag_analysis/README.txt << EOF2

      files here are vital for determining system performance via
      time between activation and launch of UC4 key jobs. Do NOT
      remove files here, and in any case none which are younger than
      one year.

      Thanks
      --cschmitz (via some script)

    EOF2
    fi

    if [ -z "$LD_LIBRARY_PATH" ] ; then
      export LD_LIBRARY_PATH="/usr/lib/oracle/11.2/client64/lib"
    else
      export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib/oracle/11.2/client64/lib"
    fi

    # The connect descriptor follows the @ and is quoted so that the shell
    # passes user, password and descriptor to sqlplus as one argument.
    cat /tmp/activation_lag_analysis.sql \
      | sqlplus64 -S 'you_wish/my_password@(DESCRIPTION =(ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)(HOST = secret_sauce)(PORT = 1337)))(CONNECT_DATA =(SERVICE_NAME = burger_king_home_delivery_service)(SERVER = DEDICATED)))' \
      | bzip2 \
      > "/var/log/scripts/activation_lag_analysis/$(date +%y-%m-%d).bz2"

    This runs from cron (use UC4? Who am I? :) ) once every week. Every once in a while, I make a really nice Excel spreadsheet from it (you need to de-duplicate the data first, because you will have overlap due to the weekly collection), and it also (I just left that in as a best practice) tells the Linux admins not to touch my sh*t :)

    edit: If you look at the SQL, this only logs the times for a Job called "JOBS.DC1.HEARTBEAT". That's my monitoring job for actual agent operation that runs very often per day. Using just this often-running, but well known job as the basis for statistics also eliminates a lot of uncertainty.

    To answer your question: Engine slowness is when all jobs on all agents are slow. Usually, though, I see individual agents being slow, which points to network issues or heavily loaded agents.

    Hth,
    Carsten
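
A minimal sketch of how the collected files could be evaluated without Excel. It assumes the rows look the way the SQL*Plus settings above suggest: eight '|'-separated columns (OHID, Runid, Activation, Status, Destination, Launch, Day, seconds_diff), possibly padded with blanks. It drops the overlap between the weekly files by keeping each Runid only once, then prints run count, average lag and maximum lag per agent. The function name `lag_per_agent` is hypothetical.

```shell
# lag_per_agent: read '|'-separated rows on stdin and print
# "AGENT RUNS AVG_SECONDS MAX_SECONDS" per destination agent.
lag_per_agent() {
  LC_ALL=C awk -F'|' '
    # Strip the column padding that SQL*Plus adds.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NF >= 8 {
      runid = trim($2); agent = trim($5); secs = trim($8) + 0
      if (seen[runid]++) next          # same run logged in two weekly files
      sum[agent] += secs; runs[agent]++
      if (secs > max[agent]) max[agent] = secs
    }
    END {
      for (a in sum) printf "%s %d %.1f %d\n", a, runs[a], sum[a] / runs[a], max[a]
    }' | LC_ALL=C sort
}
```

Possible usage: `bzcat /var/log/scripts/activation_lag_analysis/*.bz2 | lag_per_agent`. If every agent shows a high average, the engine is the likely suspect; if only one does, look at that agent or its network first.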


  • 9.  How do you monitor your AE?

    Posted 03-26-2018 11:34 AM
    Here. A picture (well, a silly Excel sheet) says more than a thousand words. I am quite fond of the heat map-style colors though :)

    Attachment: act_lag_redacted.xlsx (70K)


  • 10.  How do you monitor your AE?

    Posted 03-26-2018 01:43 PM
    oh wow - many thanks for this input!
    so the heartbeat job is your reference for AE, I understand!

    and many THX for the excel example!

    cheers, Wolfgang


  • 11.  How do you monitor your AE?

    Posted 03-30-2018 07:48 PM
    Hello,

    I thought that we had a lot of monitoring solutions :D

    1. Silly old SNMP agent. A few of the TRAP codes have been ignored (Warm start of an agent?!). - Ticket
    It's too bad that the title for "System error of the UC4 Server" is so generic.
    LINK
    2. At OS level, psmon checks the count of ucsrvwp, ucsrvcp, snmp1 (JWP is currently not included) - Ticket
    3. SYS_HOST_ALIVE every 30 minutes against all OS agents - Email
    4. Simple SAP job (RSUSR000) against all SAP agents every 30 min. MRT terminator at 3 min (I'm too generous) - Email
    5. HEARTBEAT - a Unix job prints a timestamp to a file every 10 min. UXMON checks the file age. Maximum age - 20 min - Ticket
    6. Job failure monitoring - Post-process include. Parses the job name against a static variable. Depending on the priority and the app team, a ticket or an email is generated (OVO_MON.log and Send_Mail)
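
A minimal sketch of the file age test behind the HEARTBEAT check in item 5. It is not how UXMON does it internally; it only illustrates the idea. It assumes GNU `stat` and `date` (Linux). The function name `heartbeat_age_ok` and the return codes (0 for fine, 2 for stale or missing) are choices made for this sketch.

```shell
# heartbeat_age_ok FILE MAX_AGE_SECONDS
# Returns 0 if FILE was modified within MAX_AGE_SECONDS, 2 otherwise.
heartbeat_age_ok() {
  local file="$1" max_age="$2" now mtime age
  if [ ! -f "$file" ]; then
    echo "CRITICAL: heartbeat file $file is missing"
    return 2
  fi
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")   # GNU stat: modification time as epoch seconds
  age=$((now - mtime))
  if [ "$age" -gt "$max_age" ]; then
    echo "CRITICAL: heartbeat is ${age}s old (limit ${max_age}s)"
    return 2
  fi
  echo "OK: heartbeat is ${age}s old"
  return 0
}
```

With the values from item 5 (written every 10 minutes, maximum age 20 minutes) a call could look like `heartbeat_age_ok /path/to/heartbeat.txt 1200`, where the path is a placeholder. Allowing twice the write interval tolerates one missed run before a ticket is raised.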


  • 12.  How do you monitor your AE?

    Posted 03-31-2018 03:31 AM
    Many thanks for your Input!

    cheers, Wolfgang