Automic Workload Automation


Identifying hung AE server processes

  • 1.  Identifying hung AE server processes

    Posted Jul 04, 2018 07:52 AM

    We have recently encountered a problem that can cause Automation Engine server processes — specifically work processes — to hang. In this case, the WPs are not blocking DB sessions, and they are not consuming CPU time. They’re just doing nothing. They do not respond to requests to quit from the Service Manager. They must be killed with the KILL signal (-9) and restarted. We would like to find a way to identify such hung AE server processes programmatically, so that we can kill and restart them automatically.
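
    The kill step described above could be sketched roughly as follows. This is only an illustration: the function name stop_process and the grace period are assumptions, and restarting the process afterwards is left out because it depends on the Service Manager setup.

```shell
#!/bin/bash
# Hypothetical helper: ask a process to exit, then force it if it does not.
# Usage: stop_process <pid> [grace_seconds]
stop_process() {
    local pid="$1" grace="${2:-10}" waited=0
    # Polite request first; if the PID no longer exists there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            # Still there after the grace period: this is the 'kill -9' case.
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

    For a WP that is known to ignore quit requests, something like "stop_process 17188 5" would wait five seconds and then send the KILL signal.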

     

    In the Java User Interface, we can identify hung processes by the fact that they appear grayed-out in the System Overview.

     

    Initially I thought it might be possible to list the same information using the ServerList Java API class. For instance, in the System Overview, all of the hung WPs do not have a PID, and have a B.60 of 0. Might these two criteria be used to uniquely identify hung processes?

     

    Unfortunately, the answer is no. When we iterate through the server list, ServerListItem.getName() returns no data for hung processes. This means that although we can use this approach to list the number of hung WPs, we cannot use it to identify which WPs are hung. (The other methods of ServerListItem also return empty results for hung WPs.)

     

    I noticed that when WPs hang, they stop writing to their log files.
    $ ls -ltr /var/uc4/server/?Psrv_DEV_log_???_00.txt
    -rw-r----- 1 aedev1 mycompany  3151469 Jun 23 19:14 /var/uc4/server/WPsrv_DEV_log_014_00.txt
    -rw-r----- 1 aedev1 mycompany  2595001 Jun 23 19:22 /var/uc4/server/WPsrv_DEV_log_016_00.txt
    -rw-r----- 1 aedev1 mycompany    44771 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_056_00.txt
    -rw-r----- 1 aedev1 mycompany    45492 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_010_00.txt
    -rw-r----- 1 aedev1 mycompany    44759 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_008_00.txt
    -rw-r----- 1 aedev1 mycompany    45012 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_006_00.txt
    -rw-r----- 1 aedev1 mycompany   431613 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_038_00.txt
    -rw-r----- 1 aedev1 mycompany   371402 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_034_00.txt
    -rw-r----- 1 aedev1 mycompany   386269 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_032_00.txt
    -rw-r----- 1 aedev1 mycompany   381381 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_030_00.txt
    ...

     

    So another possibility would be to set up a log file monitor that periodically looks for files that match the pattern ?Psrv_DEV_log_???_00.txt and have not been modified in the last few hours. E.g.,

     

    check_for_hung_WPs.sh

    #!/bin/bash
    LOG_DIR="/var/uc4/server/"
    ENV="DEV"
    AGE="1"
    for file in $(find "${LOG_DIR}" -name "?Psrv_${ENV}_log_???_00.txt" -mtime +"${AGE}"); do
            echo $file | awk -F'_' '{print $4}'
    done

     

    This appears to work well, and it correctly identifies the hung WPs.

    $  ./check_for_hung_WPs.sh

    016
    014

     

    We still need to find the process IDs though, and the only easy way I know of is to use the Service Manager UI.

     

    Is there a way to look up the PIDs programmatically?



  • 2.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 08:15 AM

    It would be cool if the processes had some way of telling whether they are still alive enough to honor their signal mask. For example, Automic could code them so that they can be sent a USR1 signal and the process echoes back a message.

     

    I don't think they currently have this, so that'd be an Idea at best.

     

    Apart from that, I can only come up with one arcane and rather unorthodox idea: the next time your processes hang, check whether they still do something in strace (strace -f -p <pid>). If they do nothing at all, you could use that for an automated check. If they do little (my hanging processes usually loop in a select() syscall but do nothing else), that could possibly be coded as a criterion as well. Alternatively, you might use /proc/<pid>. For example, there is a file with I/O stats; looking at these stats over a small sample period might be enough to distinguish a working WP from a dead one, if the correct thresholds are found. "status" or "sched" may be other candidate pseudo-files to check for numbers that don't increase.

     

    So I'd start by capturing a few files from /proc for the WP while they work, and then a set of the same files the next time one hangs, and compare those for usable indicators.
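
    A rough sketch of such a /proc comparison, assuming Linux and assuming that accumulated CPU time (utime + stime from the stat file) and wchar from the io file are useful indicators. Both choices, the function names, and the PROC_ROOT override are illustrative assumptions, and requiring the counters to be exactly unchanged is a placeholder for whatever thresholds turn out to work.

```shell
#!/bin/bash
# Sketch only: compare two samples of a process's cumulative counters.
# PROC_ROOT is a hypothetical override so the parsing can be tried on copies.

# Print "<cpu_ticks> <bytes_written>" for a PID.
read_counters() {
    local pid="$1" root="${PROC_ROOT:-/proc}"
    local cpu wchar
    # Strip "pid (comm) " first because the command name may contain spaces;
    # utime and stime are then the 12th and 13th remaining fields.
    cpu=$(awk '{ sub(/^.*\) /, ""); print $12 + $13 }' "${root}/${pid}/stat") || return 1
    # wchar counts bytes the process has asked to write (log output included).
    wchar=$(awk '$1 == "wchar:" { print $2 }' "${root}/${pid}/io") || return 1
    echo "${cpu} ${wchar}"
}

# Exit status 0 if nothing changed during the sample period, 1 if it did,
# 2 if the counters could not be read.
looks_idle() {
    local pid="$1" period="${2:-10}"
    local before after
    before=$(read_counters "$pid") || return 2
    sleep "$period"
    after=$(read_counters "$pid") || return 2
    [ "$before" = "$after" ]
}
```

    Something like "looks_idle 17188 30 && echo possibly hung" would then flag a process whose counters did not move for 30 seconds. Reading another user's io file normally requires the same user or root.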

     

    But yeah, a bit unorthodox ...



  • 3.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 08:31 AM
    Edited by Michael A. Lowry Apr 03, 2023 06:41 AM

    I found a way to find the PIDs of the hung WPs, once the names of the log files are known. This approach uses lsof.

    #!/bin/bash
    LOG_DIR="/var/uc4/server/"
    ENV="DEV"
    AGE="1"
    for file in $(find "${LOG_DIR}" -name "?Psrv_${ENV}_log_???_00.txt" -mtime +"${AGE}"); do
            lsof $file
            pid=$(lsof $file | awk -v file=$file '$9 == file {print $2}')
            echo "PID: $pid"
    done

     

    $ ./check_for_hung_WPs.sh
    COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
    ucsrvwp 17188 aedev1    3w   REG  253,6  2595001 516358 /var/uc4/server/WPsrv_DEV_log_016_00.txt
    PID: 17188
    COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
    ucsrvwp 17204 aedev1    3w   REG  253,6  3151469 516228 /var/uc4/server/WPsrv_DEV_log_014_00.txt
    PID: 17204

    I think this is close enough. If anyone has a better idea, please let me know.
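
    For reporting, the lsof output could also be condensed to one line per hung process. The following is a sketch under two assumptions: lsof prints its default nine-column layout, and the log files follow the ?Psrv_<ENV>_log_<number>_00.txt naming with no underscore in the environment name. The function name summarize_lsof is made up for this example.

```shell
#!/bin/bash
# Sketch: turn lsof output (read from stdin) into "<process> <pid> <log file>".
# Assumes the default nine-column lsof layout and the log naming used above.
summarize_lsof() {
    awk '
        $1 == "COMMAND" { next }          # skip header lines
        {
            n = split($9, path, "/")      # path[n] is the file name
            split(path[n], part, "_")     # e.g. WPsrv, DEV, log, 016, 00.txt
            print substr(path[n], 1, 2) part[4], $2, $9
        }
    '
}
```

    It would be used as a filter, e.g. lsof "$file" | summarize_lsof.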

    Carsten_Schmitz: I like your idea of checking the actual activity using strace. Thanks for the suggestion.



  • 4.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 08:36 AM

    Cool solution

     

    One caveat might be the granularity, though: it seems this will only alert if the log is a day old or older? You could use -mmin instead of -mtime to get alerted more quickly.
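
    A minimal sketch of the -mmin variant, wrapped in a function so that the directory, environment name, and age in minutes are parameters. The function name stale_logs is an assumption made for this example; -mmin is available in GNU and BSD find but is not part of POSIX.

```shell
#!/bin/bash
# Sketch: list server log files not modified for more than N minutes.
# Usage: stale_logs <log_dir> <env> <minutes>
stale_logs() {
    local dir="$1" env="$2" minutes="$3"
    # -mmin +N matches files last modified more than N minutes ago.
    find "$dir" -name "?Psrv_${env}_log_???_00.txt" -type f -mmin +"$minutes"
}
```

    For example, stale_logs /var/uc4/server DEV 120 would list logs untouched for more than two hours.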



  • 5.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 09:10 AM
    Edited by Michael A. Lowry Apr 03, 2023 06:41 AM

    Yeah, good idea. The JWPs aren’t used much in our environment (yet), so they tend to show up erroneously in the list if the time is too short. (See the four logs with the July 3rd modification date in the above listing.) But since we have not had any problems with JWP hangs, I think it’s OK to exclude them for now. This updated script runs lsof just once.

    #!/bin/bash
    LOG_DIR="/var/uc4/server/"
    ENV="DEV"
    AGE="120"
    lsof_out=/tmp/lsof_out_$$.txt
    files=""
    for file in $(find "${LOG_DIR}" -name "?Psrv_${ENV}_log_???_00.txt" -mmin +"${AGE}"); do
            files="$files $file"
    done
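    # With no file arguments lsof lists every open file, so stop if nothing is stale.
    [ -z "$files" ] && exit 0
    lsof $files | awk '($1 != "java") {print $0}' > ${lsof_out}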
    cat ${lsof_out}
    for pid in $(awk '($1 != "COMMAND") {print $2}' ${lsof_out}); do
            echo "PID: $pid"
    done
    rm ${lsof_out}

     

    $ ./check_for_hung_WPs.sh
    COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
    ucsrvwp 17188 aedev1    3w   REG  253,6  2595001 516358 /var/uc4/server/WPsrv_DEV_log_016_00.txt
    ucsrvwp 17204 aedev1    3w   REG  253,6  3151469 516228 /var/uc4/server/WPsrv_DEV_log_014_00.txt
    PID: 17188
    PID: 17204

     

    I’m not quite sure how best to use strace to check for activity programmatically. Do you have experience with this?



  • 6.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 09:51 AM

    I’m not quite sure how best to use strace to check for activity programmatically. Do you have experience with this?

     

    Little, and a long time ago. But in principle, you should just need to do an

     

    strace -f -p <pid> 2>&1

     

    to attach to the process and trace it, and all else depends on you watching this for a working vs. a non-working process and finding the difference.

     

    Say your WPs hang in the same way mine did; then the strace output will consist of nothing but

     

    select(1,

     

    (or something similar).

     

    So if the hallmark of a hanging WP is that it only does "select" syscalls, you could probably add "-e trace=write", which would print out only writes (e.g. to the logfile). Then, if your output has lines in it (wc -l), your process is alive and (in this example) writing log files or some such, and if there is no output, it's a dead process. It's a pity the "-c" option is still limited on Linux, else you could automatically terminate it upon ever reaching "write()", but alas, you might need to wrap it into a shell script that terminates strace after a few seconds (grab the PID, run it in the background for some time, kill that PID; since you're merely attaching strace with "-p", you should be able to kill it while the WP itself is unaffected). Of course, you need to be "root" for all of that.
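
    Such a wrapper might look roughly like this. It is a sketch that uses the coreutils timeout command instead of backgrounding and killing strace by hand; the function name count_writes and the STRACE override variable are assumptions made for this example, and what count separates a live WP from a hung one would still have to be found by observation.

```shell
#!/bin/bash
# Sketch: attach strace to a PID for a few seconds and count write() calls.
# STRACE is a hypothetical override for the tracer command (default: strace).
# Usage: count_writes <pid> [seconds]
count_writes() {
    local pid="$1" seconds="${2:-5}"
    # timeout sends SIGTERM to strace after $seconds. strace reports on
    # stderr, hence 2>&1. grep -c prints the number of matching lines and
    # exits non-zero when that number is 0, so "|| true" keeps the status clean.
    timeout "$seconds" "${STRACE:-strace}" -f -e trace=write -p "$pid" 2>&1 \
        | grep -c 'write(' || true
}
```

    A result of 0 from, say, count_writes 17188 5 would mean that no write() call was seen during the five-second sample.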

     

    My first avenue would still be /proc though. Less costly than a near constant running strace.

     

    And another caveat: the above should work for WPs written in C, e.g. ucsrvwp. The JWP is a different beast: you can't strace it alone, since you'd always strace the whole JVM along with it. strace is not very feasible for the JWP or any Java process (I've done it, but it's not nice).

     

    (edit: oh, I've read your post again and now understood that JWP aren't much of a concern at this time, so yeah, go for strace, but maybe still look at the /proc options first).



  • 7.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 08:56 AM

    Hi Michael,

     

     not sure if this would help, but the Service Manager command line interface can also provide you the PID:

     

    bash-4.1$ ./ucybsmcl -c GET_PROCESS_LIST -h localhost:8871 -n uc4 |grep WP3

    "UC4 WP3" "R" "21236" "2018-06-22 13:52" "12/00:57" "0/00:16:40.00"

    bash-4.1$
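
    If this output is to be used in a script, the PID could be pulled out with an exact match on the name rather than grep, which would also match e.g. WP30 when looking for WP3. This is a sketch that assumes every field is wrapped in double quotes as in the sample line, with the service name first and the PID third; the function name pid_for_service is made up for this example.

```shell
#!/bin/bash
# Sketch: extract the PID for one service from GET_PROCESS_LIST output (stdin).
# Assumes double-quoted fields: the name is the first field, the PID the third.
pid_for_service() {
    # With '"' as the separator, the quoted values land in fields 2, 4, 6, ...
    awk -F'"' -v name="$1" '$2 == name { print $6 }'
}
```

    Usage would be along the lines of ./ucybsmcl -c GET_PROCESS_LIST -h localhost:8871 -n uc4 | pid_for_service "UC4 WP3".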

     

    Regarding the hung AE process:

     

    I can't simulate a hung process, but how is the script function SYS_SERVER_ALIVE reacting in the case of a hung process?



  • 8.  Re: Identifying hung AE server processes

    Posted Jul 04, 2018 09:31 AM

    vicja02: Yes, using the SMgr CLI would be a good idea, but unfortunately it doesn’t work because ucybsmcl does not return the actual process names, but only the names from the SMD file.

     

    As you may know, when AE server processes start up, they are assigned names by the PWP in sequential order. These names may differ from the names listed in the SMD file. It depends on timing.

     

    The Service Manager GUI shows the true name of the process in the Service column. There is another column on the far right called fix Servicename that shows the process name from the SMD file. The column is initially hidden, so you have to drag the rightmost edge of the CPU Time column to reveal it. See the SMgr GUI screenshot in the original question above.

     

    The Service Manager CLI does not return the true process names. It returns only the names as they appear in the SMD file. This means these results cannot be relied upon. In my example, when I grepped the ucybsmcl output for WP14 and WP16, it returned the details of the WPs that are actually WP22 and WP52.

     

    If you know a way to make the Service Manager CLI return the true process names, or if you think the current behavior is a bug, please let me know.

     

    To answer your question, SYS_SERVER_ALIVE returns Y for these hung WPs.



  • 9.  Re: Identifying hung AE server processes

    Posted Jul 06, 2018 09:11 AM

    I’m going to go out on a limb here and make the assumption that the current behavior of ucybsmcl is not the intended behavior. I opened a new case for this.



  • 10.  Re: Identifying hung AE server processes

    Posted Jul 09, 2018 09:49 AM

    This seems like an obvious bug to me, especially considering that the documentation for the Service Manager GUI and the Service Manager CLI refer to the column using exactly the same terminology. But as expected, CA Support replied that the Service Manager CLI is working as designed.

     

    So, here we go again...

    ‘ucybsmcl -c GET_PROCESS_LIST’ should return process names  

     

    If you like the idea, please vote for it.



  • 11.  Re: Identifying hung AE server processes

    Posted Jul 09, 2018 10:23 AM

    replied that the Service Manager CLI is working as designed

     

    For the unusual conditions created on that day, so was Chernobyl ...

     

    *scnr*



  • 12.  RE: Re: Identifying hung AE server processes

    Posted Apr 03, 2023 06:42 AM
    Edited by Michael A. Lowry Apr 03, 2023 06:42 AM

    Broadcom deleted the idea.



  • 13.  Re: Identifying hung AE server processes

    Posted Jul 05, 2018 07:20 AM

    Hi Gentlemen,

     

    We have a similar "monitoring": a periodic job checks whether the logfiles were written within the last 10 minutes:

     

    ls -l $(find /var/log/uc4/automationengine -name "[CW]Psrv_log_0[0-9][0-9]_00.txt" -type f -mmin +10 ) null 2>/dev/null  | awk '{printf ("%s-%s %s %s\n",$6,$7,$8,$9)}'

     

    this alerts our OPS staff via email.

     

    Two weeks ago it saved our lives on PROD: there was a DB issue that caused DB response times of 20 to 30 seconds.

    One WP after another died (stopped writing to its logfile and was greyed out in the System Overview, but still responded with Y to SYS_SERVER_ALIVE and still showed up in ps -ef | grep ....), but we were alerted and had the chance to kill the affected WPs and restart the others cyclically.

     

    For the JWP we have almost no workload, so it is ignored in our alert mechanisms.

     

    cheers, Wolfgang



  • 14.  Re: Identifying hung AE server processes
    Best Answer

    Posted Jul 05, 2018 10:27 AM

    Here’s what I came up with.

    1. UC4.LIST_HUNG_AE_SERVER_PROCS.SCRI →
    2. UC4.BACKEND.RUN_JOB.VARA_EXEC →
    3. UC4.BACKEND.RUN_JOB.SCRI →
    4. UC4.LIST_HUNG_AE_SERVER_PROCS.JOBS

     

    The first SCRI uses PREP_PROCESS_AGENTGROUP  to parse an agent group containing both of the AE servers. For each server it uses PREP_PROCESS_VAR to parse the EXEC VARA and collect the list of hung AE processes. The EXEC VARA is a generalized version of my General purpose EXEC VARA for running arbitrary SQL statements. It runs an SCRI that runs a UNIX job based on the little shell script above. This job figures out which AE processes are hung. The results are passed back via the EXEC VARA to the calling SCRI, which formats the information and prints the results.

     

    Before you run the main script, UC4.LIST_HUNG_AE_SERVER_PROCS.SCRI, you must first configure three script variables:

    Variable      | Description                                                                           | Default
    &Agent_Group# | The agent group (mode all) containing agents running on all of the AE server nodes.   | AE_NODES_BOTH
    &Login#       | A login object with users able to read information about running AE server processes. | UC4.LOGIN
    &Queue#       | The queue where the objects should run.                                               | UC4

     

    Once you have adjusted these values for your environment, just run the script. Here is an example of the output:

    U00020408 Checking for hung AE processes on AE-DEV-1
    U00007000 'UC4.BACKEND.RUN_JOB.SCRI' activated with RunID '0153655126'.
    U00020408 No data available. Reason: <No results returned>
    U00020408
    U00020408 Checking for hung AE processes on AE-DEV-2
    U00007000 'UC4.BACKEND.RUN_JOB.SCRI' activated with RunID '0153652177'.
    U00020408
    U00020408 #1
    U00020408 Hung AE process          : WP016
    U00020408 PID of hung process      : 17188
    U00020408 Log file of hung process : /var/uc4/server/WPsrv_DEV_log_016_00.txt
    U00020408
    U00020408
    U00020408 #2
    U00020408 Hung AE process          : WP014
    U00020408 PID of hung process      : 17204
    U00020408 Log file of hung process : /var/uc4/server/WPsrv_DEV_log_014_00.txt
    U00020408

    The objects are attached.




  • 15.  Re: Identifying hung AE server processes

    Posted Jul 05, 2018 11:16 AM

    I realize that this solution is way more complicated than it needs to be.

     

    I made it this way for a couple of reasons:

    • I wanted to assemble all of the results from multiple AE nodes in one place.
    • I’m thinking longer-term, with an eye to reusing some of these objects in other contexts. 


  • 16.  Re: Identifying hung AE server processes

    Posted Jul 06, 2018 03:47 AM

    I realize that this solution is way more complicated

     

    Or, just reboot the server once a day. This staple of Windows server administration wisdom ca. 2001 can't be wrong, can it?

     

    *scnr*



  • 17.  Re: Identifying hung AE server processes

    Posted Jul 06, 2018 04:13 AM

    Here’s a stripped-down version that does everything in a single UNIX job. It uses an agent group called AE_NODES_BOTH with mode all that contains both of the AE server nodes. Therefore, running the job will actually trigger three tasks: one for the C_HOSTG container, and one for each of the two server nodes. The disadvantage of this approach is that the list of hung AE server processes is in two places rather than one. The advantage is that it’s much simpler.

    Ping: Carsten_Schmitz