Automic Workload Automation

Expand all | Collapse all

Collecting AE performance metrics

  • 1.  Collecting AE performance metrics

    Posted 07-10-2018 11:02 AM

    We are interested in setting up a process to collect at regular intervals data relevant to the performance of the Automation Engine. Our hope is that historical performance data could help us identify patterns and trends relevant to UC4 performance.

     

    Here is what I have in mind.

     

    Collect the following information every 5-10 minutes:

    • Server process utilization (B.01, B.10, and B.60 numbers)
    • WP/CP CPU & memory usage as reported by ps command
    • List of hung AE server processes, if any
    • Number & type of users/clients connected
    • Number of rows in the ITL table
    • Number of rows in the MQ tables

     

    In addition, log the time, duration, and size of each UC4 DB unload/load and each XML export/import.

     

    Does anyone do this sort of thing? Do you have any tips or insights?



  • 2.  Re: Collecting AE performance metrics

    Posted 07-11-2018 06:40 AM

    Hi

     

    we differ between monitoring and health check.

     

    Monitoring
    is done continuously via external tools & staff in combination with CALL API and some OS jobs.
    we monitor externally all AE processes, DB, Agent processes, tomcat (for ECC)
    a login check to some target systems, DB, etc as well

     

    the check if all AE processes write into their log file is done via OS job triggered externally via CALL API.

     

     

     

    Healthcheck
    does a quick check of the system (is my UC4 system fine at the moment) and is done twice a day automatically (morning, evening)
    and manually on purpose.

     

    This covers if all AE processes are up and running, all agents are running and connected, all SMGR processes (on AE server) are running
    all AE processes are writing into their log within the last XY minutes,  all PROD clients are in state "GO", a latency check, filesystem usage of
    AE servers, users connected, amount of entries of MQPWP ans MQWP, Amount of activities per client, and average workload of the PWP process.

     

    The latency check (Idea stolen from Carsten_Schmitz) measures the time between activation and start of a couple of dummy jobs.

     

    cheers, Wolfgang



  • 3.  Re: Collecting AE performance metrics

    Posted 07-11-2018 07:41 AM

    Hi.

     

    We have several things in this area. From top of head:

     

    • process monitoring with Nagios (NOT for hung processes, just checking the way "ps" would)
    • "heartbeat" dummy JOBF or JOBS that get executed and the result monitored by Nagios, thus verifying the actual funtion of engine and each individual agent (as FrankMuffke aka Wolfgang Brückler already mentioned)
    • I have a bunch of cron jobs that read from the DB via SQL and monitor MQ (message queues) levels *), the count of active objects by department (to warn me of spikes in usage by individual organisational units), licenses count **) and monitor for changes in the host table (to warn me when someone adds Agents **). If you need some of these SQL queries, shoot me a PM.

      Some of these jobs feed into Zabbix, which is the companies choice for plotting diagrams.

      Booknotes:

      *) queue levels are unfortunately pretty much a binary indicator for system health, as far as Automic is concerned. Queue levels are either "oh, don't worry, an MQMEM of a few thousand items is still totally okay" or "oh, the system stopped". I don't know of useful "official" tresholds thus far beyond which a queue can be considered too full, but I made my own tresholds over time by looking at the "usual" queue levels, and when those do get exceeded, I get an alert and can take a more in-depth look at the system.

      **) licenses have to be monitored because other people in the organisation can install agents without telling me, and agents immediately consume licenses without server admin intervention, which is bad design. So I run the Automic license analyzer query once per day, and have a script that tries to emulate the same data processing that Automic (by my guesses) does with the data, thus giving me a reasonable guesstimate on the number of used licenses that Automic would calculate at the annual audit, to make sure I never exceed that number on any given day.
    • Inside Automic, I have a "Server Monitor" script that monitors queues and load levels, using the Automic functions, warns via email for any violations, and also feeds into text files, which then get farmed by Zabbix for drawing diagrams of queue levels. It also keeps a running total of dead Automic processes, and can notify people if e.g. more processes keep dying. Here's my code (company internals removed) - it probably sucks, I suck at AutomicScript

     

    !Server Monitor with Zabbix Feed: This version is a clone off of the Server
    !Monitor script which runs as a JOBS, and writes counters to files on the
    !central UC4 utility server. The text files with counters can then be processed
    !in Zabbix or any other diagram drawing app.
    !questions? -> cschmitz

    !configure the name of the variable object used for keeping state here.
    !object must exist
    :SET &STATEVAR_NAME# = "VARA.DC1.SERVER_MONITOR_STATE"
    :SET &CONFIGVAR_NAME# = "VARA.DC1.SERVER_MONITOR_CONFIG"

    :SET &MAIL_RCPT# = GET_VAR(&CONFIGVAR_NAME#, "email_recipient")

    !fallback if empty
    :IF &MAIL_RCPT# = ""
    :  SET &MAIL_RCPT# = "go_ahead_send_me_spam@example.com"
    :ENDIF

    :SET &VER_INFO# = SYS_INFO(SERVER, VERSION,ALL)
    :SET &SERVER_OPTS# = GET_UC_SETTING(SERVER_OPTIONS)

    :SET &LOAD_01# = SYS_BUSY_01()
    :SET &LOAD_10# = SYS_BUSY_10()
    :SET &LOAD_60# = SYS_BUSY_60()

    !wow, this is silly
    :SET &LOAD_01_FMT# = FORMAT(&LOAD_01#)
    :SET &LOAD_10_FMT# = FORMAT(&LOAD_10#)
    :SET &LOAD_60_FMT# = FORMAT(&LOAD_60#)

    !PWP is the only one that has load averages for it's MQ
    :SET &MQ_PWP_BUSY_01# = SYS_INFO(MQPWP, BUSY, 1)
    :SET &MQ_PWP_BUSY_01_FMT# = FORMAT(&MQ_PWP_BUSY_01#)

    :SET &MQ_PWP_BUSY_10# = SYS_INFO(MQPWP, BUSY, 10)
    :SET &MQ_PWP_BUSY_10_FMT# = FORMAT(&MQ_PWP_BUSY_10#)

    :SET &MQ_PWP_BUSY_60# = SYS_INFO(MQPWP, BUSY, 60)
    :SET &MQ_PWP_BUSY_60_FMT# = FORMAT(&MQ_PWP_BUSY_60#)


    !Message queues outstanding messages count
    :SET &MQ_PWP_COUNT# = SYS_INFO(MQPWP, COUNT)
    :SET &MQ_WP_COUNT# = SYS_INFO(MQWP, COUNT)
    :SET &MQ_DWP_COUNT# = SYS_INFO(MQDWP, COUNT)
    :SET &MQ_OWP_COUNT# = SYS_INFO(MQOWP, COUNT)
    :SET &MQ_RWP_COUNT# = SYS_INFO(MQRWP, COUNT)

    !Message queue average processing times
    !
    !beware: if you get something like "20877", that's actually a
    !return code printed in place of the result, letting you know
    !that e.g. the period parameter is not accepted!
    !
    :SET &MQ_PWP_AVGTIME# = SYS_INFO(MQPWP, LENGTH, 1)
    :SET &MQ_WP_AVGTIME# = SYS_INFO(MQWP, LENGTH, 1)
    :SET &MQ_DWP_AVGTIME# = SYS_INFO(MQDWP, LENGTH, 1)
    :SET &MQ_OWP_AVGTIME# = SYS_INFO(MQOWP, LENGTH, 1)
    :SET &MQ_RWP_AVGTIME# = SYS_INFO(MQRWP, LENGTH, 1)

    !send email if various values are out of boundaries

    :IF &MQ_PWP_COUNT# >= 30
    :  SET &RETVAL# = SEND_MAIL(&MAIL_RCPT#,,"PWP MQ Count exceeded limit on &$SYSTEM#","Warning: Message queue count of PWP on &$SYSTEM# is currently &MQ_PWP_COUNT#")
    :  IF &RETVAL# <> 0
    :    PRINT "ERROR: Sending Email failed with code &RETVAL#!"
    :  ENDIF
    :ENDIF

    !possibly ten minutes would be the most interesting, with one being to twitchy
    !and the hour too laggy - IF the averages were to work as advertised, which
    !they don't seem to do

    :IF &LOAD_10_FMT# >= 90
    :  SET &RETVAL# = SEND_MAIL(&MAIL_RCPT#,,"UC4 load on &$SYSTEM# exceeded limit over last 10 minute average","Warning: load average over 10 minutes on &$SYSTEM# is currently &LOAD_10_FMT# (which may be benign, since the function to report the averages seems to be buggy ...")
    :  IF &RETVAL# <> 0
    :    PRINT "ERROR: Sending Email failed with code &RETVAL#!"
    :  ENDIF
    :ENDIF

    :PRINT "System Monitor for &$SYSTEM#"
    :PRINT "Version: &VER_INFO#"
    :PRINT "Options: &SERVER_OPTS#"
    :PRINT ""
    :PRINT "Load (system) pct. (1m, 10m, 60m): &LOAD_01_FMT#, &LOAD_10_FMT#, &LOAD_60_FMT#"
    :PRINT "Load (PWP MQ) pct. (1m, 10m, 60m): &MQ_PWP_BUSY_01_FMT#, &MQ_PWP_BUSY_10_FMT#, &MQ_PWP_BUSY_60_FMT#"
    :PRINT ""
    :PRINT "Message queues:"
    :PRINT "Type |   Msg. Queued      |   Avg. Proc. Time    |  Descr.    "
    :PRINT "-----------------------------------------------------------------------------------------"
    :PRINT "PWP  |   &MQ_PWP_COUNT# |   &MQ_PWP_AVGTIME#   |  Primary WP"
    :PRINT "WP   |   &MQ_WP_COUNT# |   &MQ_WP_AVGTIME#   |  Worker Process"
    :PRINT "DWP  |   &MQ_DWP_COUNT# |   &MQ_DWP_AVGTIME#   |  Dialog WP"
    :PRINT "OWP  |   &MQ_OWP_COUNT# |   &MQ_OWP_AVGTIME#   |  Output"
    :PRINT "RWP  |   &MQ_RWP_COUNT# |   &MQ_RWP_AVGTIME#   |  Resources"
    :PRINT ""

    ! check all server processes, report if one (or more) are considered dead
    ! should be changed to parsing an array later

    :SET &SYS_SERVERS_DEAD#=""
    :IF SYS_SERVER_ALIVE("UC4P#CP001") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP001"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#CP002") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP002"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#CP003") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP003"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#CP004") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP004"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP001") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP001"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP002") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP002"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP003") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP003"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP004") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP004"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP005") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP005"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP006") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP006"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP007") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP007"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP008") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP008"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP009") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP009"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP010") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP010"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP011") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP011"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP012") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP012"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP013") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP013"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP014") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP014"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP015") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP015"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP016") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP016"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP017") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP017"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP018") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP018"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP019") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP019"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF SYS_SERVER_ALIVE("UC4P#WP020") <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP020"
    :  PRINT "Fehler in &SYS_SERVERS_DEAD#"
    :ENDIF

    :IF &SYS_SERVERS_DEAD# <> ""
    !  only send an email if something is a) wrong and b) has changed since last email
    :  SET &LAST_DEADPROCESS_LIST# = GET_VAR(&STATEVAR_NAME#, "LAST_DEADPROCESS_LIST")
    :  IF &SYS_SERVERS_DEAD# <> &LAST_DEADPROCESS_LIST#
    :    PRINT "The following processes are reportedly DEAD: &SYS_SERVERS_DEAD#"
    :    SET &RETVAL# = SEND_MAIL(&MAIL_RCPT#,,"Dead Process on &$SYSTEM#","The following server processes are reported DEAD by the server monitoring script: &SYS_SERVERS_DEAD#. You will NOT receive further emails unless the list of dead processes changes!")
    :    IF &RETVAL# <> 0
    :      PRINT "ERROR: Sending Email failed with code &RETVAL#!"
    :    ELSE
    !      only set "acknowledgement" in variable if email sending worked. Otherwise, keep attempting to send mails in case it's a temporary problem.
    :      PUT_VAR &STATEVAR_NAME#,"LAST_DEADPROCESS_LIST", &SYS_SERVERS_DEAD#
    :    ENDIF
    :  ENDIF
    :ELSE
    :  print "All server processes reportedly alive."
    :ENDIF

    ! ----------- no shell script above this line

    set -e

    COLLATE_DIR=/tmp/uc4_servmon_counters

    mkdir -p $COLLATE_DIR

    # put data into files so Zabbix can farm it later

    echo "&LOAD_01_FMT#" | bc        > $COLLATE_DIR/uc4_sys_load_1min.txt
    echo "&MQ_PWP_BUSY_01_FMT#" | bc > $COLLATE_DIR/uc4_mq_pwp_busy_1min.txt
    echo "&MQ_PWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_pwp_count_cur.txt
    echo "&MQ_WP_COUNT#"    | bc     > $COLLATE_DIR/uc4_mq_wp_count_cur.txt
    echo "&MQ_DWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_dwp_count_cur.txt
    echo "&MQ_OWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_owp_count_cur.txt
    echo "&MQ_RWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_rwp_count_cur.txt
    echo "&MQ_PWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_pwp_avgproctime.txt
    echo "&MQ_WP_AVGTIME#"  | bc     > $COLLATE_DIR/uc4_mq_wp_avgproctime.txt
    echo "&MQ_DWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_dwp_avgproctime.txt
    echo "&MQ_OWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_owp_avgproctime.txt
    echo "&MQ_RWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_rwp_avgproctime.txt

    SERVER_PROCESSES_NOT_OK="&SYS_SERVERS_DEAD#"

    if [ ! -z "$SERVER_PROCESSES_NOT_OK" ] ; then
      NUM_SERVER_PROCESSES_NOT_OK=$( echo "$SERVER_PROCESSES_NOT_OK" | wc -w )
      echo $NUM_SERVER_PROCESSES_NOT_OK > $COLLATE_DIR/uc4_sys_server_procs_not_ok.txt
    else
      echo "0" > $COLLATE_DIR/uc4_sys_server_procs_not_ok.txt
    fi

     

    This also prints a mildly human-readable report like this:


    It's mildly entertaining that the queue numbers from inside Automic never seem to match those gotten from the actual DB tables, but it's usually not enough difference to truly matter.

     

    Here's an example of how the data looks in Zabbix after it's been processed into a pile of widgets:

     


     



  • 4.  Re: Collecting AE performance metrics

    Posted 08-20-2018 03:41 AM

    There is also the option to meassure the average transaction times in the AE. Thew way to do this is to activate minimal traces (set trace option 16=1) and leave it running for 10-15 mins. Then upload the AE serverlogs/trace files to Support and have them send you the result.

    The average transaction time is a good indicator for how the AE is performing and it would be really good if this number could be continuously fetched, say once every hour, and without involving Support (sending traces file to support every hour is of course not an option...

     

    The minimal trace should not impact the overall performance so it should be fairly harmless...(so I have been told by Support).

     

    /Keld.



  • 5.  Re: Collecting AE performance metrics

    Posted 10-16-2018 09:03 AM

    Thank you @Carsten Schmitz for sharing the script with everyone. We implemented these checks as well but instead of creating an ARRAY for SYS_SERVER_ALIVE, we used this:

     

    :SET  &WP#=PREP_PROCESS_VAR(VARA.SQL.SERVER_PROCESSES)
    :PROCESS &WP#
    :  SET &WP_NAME# = GET_PROCESS_LINE(&WP#,1)
    :IF SYS_SERVER_ALIVE(&WP_NAME#) <> "Y"
    :  SET &SYS_SERVERS_DEAD#="&WP_NAME#"
    :  PRINT "Error in &SYS_SERVERS_DEAD#"
    :  ENDIF
    :ENDPROCESS

     

    We created VARA.SQL.SERVER_PROCESSES with the following SQL:

    select oh_name from uc4db.oh
    where oh_otype IN ('SERV')

     

    With this set, the VARA picks all our 53 WPs/DWPs/CPs and JWP and the rest of the script checks them line by line to make sure they are up or not.

     

    Hope this helps!



  • 6.  RE: Collecting AE performance metrics

    Posted 09-06-2019 05:48 PM
    Would be great if some of these were built into the GUI.
    Troubleshooting performance issues is tricky with workload.

    • Server process utilization (B.01, B.10, and B.60 numbers). - This one is in the GUI but only 24 hours worth, not enough for trend analysis.
    • Number of rows in the ITL table
    • Number of rows in the MQ tables
    • Stats - how many unique jobs do i have? How many jobs were executed per day/month?

    Not sure if Analytics component tracks any of this long term?


  • 7.  RE: Collecting AE performance metrics

    Posted 09-11-2019 04:50 AM
    Edited by Carsten Schmitz 09-11-2019 04:51 AM
    > Server process utilization (B.01, B.10, and B.60 numbers). - This one is in the GUI but only 24 hours worth, not enough for trend analysis.
    > Number of rows in the ITL table
    > Number of rows in the MQ tables
    > Stats - how many unique jobs do i have? How many jobs were executed per day/month?

    Great idea, I totally second all of that. The question "how many objects/jobs/executions" one per day/week/year has comes up so very often, it should be a running total counter and discoverable in the GUI.

    Additionally, it would also be neat to have similar stats for the agents. E.g. the time to start jobs can be severely impacted on agents with high loads. Currently we monitor those execution timings with SQL queries, would be cool to have that on the admin tab/agent panel, and a counter for number of jobs. Could display the ratio of "overhead" vs. "execution" (similar to "user" vs. "system" times in UNIX). Or also have an "uptime %" counter for agents.

    Unless it's in Analytics (sorry, I wouldn't know ...) this should go into ideation.