Automic Workload Automation


Handling Routine Maintenance

  • 1.  Handling Routine Maintenance

    Posted 20 days ago
    Hi, all.

    It turns out that we had a misconception regarding how shutting down the agents worked.
    We have a high-availability setup, so we have been shutting down one 'half' of our agents earlier on the day that they are scheduled to be patched, and then doing the same with the other half later in the week. Since they are in agent groups, if one agent isn't available, the job goes to the other.

    We had been under the impression that when shutting down an agent via AWI, it would wait until it was finished processing whatever was active before shutting down. I believe we thought this because I know this is how it works when you perform a CAU. Apparently we had been running on sheer dumb luck up until about a week ago, when a client was less-than-impressed with us killing one of their long-running jobs.

    I was thinking about going the route of scheduling a pair of scripts that would remove the affected agents from the agent groups, then put them back in at the end of the maintenance window, but I would like to avoid reinventing the wheel if at all possible.

    TL;DR: Is there a more elegant way to ensure that jobs aren't going to these agents during maintenance windows?

    Thanks!


  • 2.  RE: Handling Routine Maintenance
    Best Answer

    Posted 19 days ago
    Hello,

    :SET_UC_SETTING WORKLOAD_MAX, Agent, 0

    Kind Regards


  • 3.  RE: Handling Routine Maintenance

    Posted 14 days ago
    That looks like it'll do the trick.

    Thank you!


  • 4.  RE: Handling Routine Maintenance

    Posted 11 days ago
    Edited by Harlow 11 days ago
    One issue I've found with this approach is that we use AgentGroups for redundancy (in addition to load balancing). We've had those servers' Windows update schedules split up so that at least one server is available to process a job. The AgentGroup doesn't seem to be smart enough to determine that the agent to be patched is actually unavailable. So if I put the AgentGroup into Next Listed, set the max workload on one agent to 0, and spam a dummy job, I get:

    I can build on my previous failed attempts at scripting this to avoid that, but just saying.

    Thanks again!



  • 5.  RE: Handling Routine Maintenance

    Posted 13 days ago
    ​Hey,

    > it would wait until it was finished processing whatever
    > was active before shutting down.

    It mostly works like this (for OS agents and on all platforms I know of, and as far as I know):

    The engine submits the job, the agent starts it. The job runs, and if the agent dies (or is stopped, which happens immediately upon receiving the respective command), the actual job on the agent machine keeps running in a disassociated state.

    For instance, on SysV Linux that means the actual job process continues to run, but the process is no longer owned by the agent but by init (pid 1) instead. They do things in a similar way on Windows. When the job eventually finishes, the disassociated job process calls the agent binary in a special way (referred to as "job messenger"), and that reports completion of the job back to the engine.

    So when your agent is stopped for any reason, no new jobs will be accepted, existing jobs will finish, but finished jobs will only be reported back if the agent can still reach the engine obviously (e.g. from a network perspective).

    To ensure no jobs are sent to an agent prior to a planned agent shutdown, there are various ways but :SET_UC_SETTING WORKLOAD_MAX, Agent, 0 as already suggested is likely the most viable one.
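
    A minimal sketch of how the drain and restore steps could be prepared for one 'half' of the agents. It only builds the script statements as text; it does not talk to Automic. The agent names, the restore value of 100 and the quoting of the arguments are assumptions for illustration, so check the script reference for your version before using the output.

```python
def workload_statements(agents, max_workload):
    """Return one :SET_UC_SETTING line per agent, setting WORKLOAD_MAX to max_workload."""
    if max_workload < 0:
        raise ValueError("max_workload must not be negative")
    return [
        f':SET_UC_SETTING WORKLOAD_MAX, "{agent}", "{max_workload}"'
        for agent in agents
    ]


# Hypothetical agent names for the half that is patched first.
FIRST_HALF = ["WIN01", "WIN03"]

# Statements to run before the maintenance window: no new jobs for these agents.
drain_script = "\n".join(workload_statements(FIRST_HALF, 0))

# Statements to run afterwards. 100 is a placeholder, not a known default;
# use whatever value the agents had before they were drained.
restore_script = "\n".join(workload_statements(FIRST_HALF, 100))
```

    The two strings could then be pasted into a pair of script objects scheduled around the window, which is close to the "pair of scripts" idea from the first post, but without touching the agent groups themselves.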

    Hth,

    ------------------------------
    These contain very good advice on asking questions and describing supposed bugs (no, you do not need to go to StackExchange for Automic questions, but yes, the parts on asking detailed, useful questions ARE usually relevant):

    http://www.catb.org/~esr/faqs/smart-questions.html

    https://www.chiark.greenend.org.uk/~sgtatham/bugs.html

    I will not respond to PM asking for help unless there's an actual reason to keep the discussion off of the public forums.
    ------------------------------



  • 6.  RE: Handling Routine Maintenance

    Posted 11 days ago
    I have noticed that the job will continue even if the agent is shut down. The only problem with that is that a couple of our clients have monsters of workflows that rely on the status (or other output) of a preceding job before moving forward. In these situations, it's not exactly ideal if a job vanishes into the ether as far as AA is concerned.


  • 7.  RE: Handling Routine Maintenance

    Posted 11 days ago
    And of course now I find the section of the documentation that would have answered my question. Whoops.


  • 8.  RE: Handling Routine Maintenance

    Posted 11 days ago
    Again, this is just my mileage and some gut feeling, but when an agent is unavailable, the engine will try to submit the job for some time.

    So I would think (but I might be wrong) that:

    1. given a JOBP with jobs A, B, and C
    2. and an agent shutting down while working on job A, then

    a) job A is processed to the end on the agent machine, and reported back
    b) the engine tries to submit (and re-submit) job B for quite some time. My colleague even believes it's virtually forever. When the agent comes back, B and C should run.

    Of course this poses a major problem for any job plans that somehow depend on wall time or elapsed time. This is why I keep preaching to our user base to not have hard MRT aborts of only a few minutes, or wall time requirements, unless there are valid, good reasons for having those. A job plan that doesn't care if it runs late should probably survive a downtime unscathed.

    But of course I might be totally wrong.

    Wow, it would be splendid if Automic would describe how that is actually designed in a neat description (or maybe even a flow chart) in the documentation. I hope they don't take my posting privileges some day, but I'm gonna be paging @Elina McCafferty: Elina, is there a possibility this could make it onto somebody's to-do list? :)

    Br,
    Carsten







  • 9.  RE: Handling Routine Maintenance

    Posted 11 days ago
    That was initially our impression, but we ended up with a few jobs going into an inconsistent status (which were then locking up a client's queues with only one slot). The really fun part of that was that these runs weren't showing up in Process Monitoring, so we had to query the database to find jobs using that queue that had a null end time. We could then look up the runID in AWI and kill it there. Not that bad once you figure out what's what in the DB, but still.
    (Sidenote: I had sent that in to support, as I feel like I should have been able to find those in Process Monitoring, but ended up with an infinite loop about how we didn't shut down the system properly).

    We are on the latest version (12.3.1HF1); perhaps something changed?


  • 10.  RE: Handling Routine Maintenance

    Posted 11 days ago
    In the long run, and in an ideal world, Automic should document what should happen (maybe, on a positive note, we have kicked that off now). Any deviation from that behaviour could then be treated as a malfunction and the manufacturer would have to act on it. Unfortunately, getting things established as malfunctions with support is hard at present, especially when they can't be replicated easily.

    I have also seen jobs with weird or null timestamps in the past. We had a major drive with Automic support to clean them from the database once. No idea if those also originated from a supposed "unclean shutdown", but yeah, it can definitely happen.

    Also, but that's really taking things off topic now, I do not accept any arguments from Automic about persistent problems being the operator's fault due to supposed improper shutdowns. Our servers are regularly rebooted unattended for patching. The engine needs to honor the respective signal (SIGTERM) reasonably well and shut itself down cleanly. Our company doesn't pay people to bring down the engine at 2am or something, and with automatic patching being the industry norm now (or so the RedHat admins tell me), Automic should shut down along with the server like any other service with no permanent ill effects.

    But that's my $0.02 really :)

    Cheers,
    Carsten




  • 11.  RE: Handling Routine Maintenance

    Posted 11 days ago
    Hi,

    Just a quick info to think about:

    You have discussed the behavior of OS agents so far, but there's a different one for all JAVA agents (all RA, all DB, SAP, ...):

    As soon as a Java Agent is stopped, all currently running jobs (object status 1550) will crash.
    No way to resume, no way to recover, no way to undo (via Automic).

    We created a script that is started prior to agent shutdown (it's a matter of luck to catch all long runners and prevent a heap of waiting jobs...):
    a) all JAVA agents that have jobs in running state, will be set to resource 0
    b) all JAVA agents that do not have jobs in running state, will be shut down.

    The script has to be repeated if there are agents still active because of a).

    This construct has worked fine for 2 years.

    of course this is only valid for planned downtimes of agents or the whole system :-)
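
    The a)/b) logic above can be sketched roughly like this. It is a plain simulation of the decision, not the actual script: the Agent class, its fields and the agent names are made up for illustration, and in a real system the running-job count, the resource setting and the shutdown would each go through Automic.

```python
class Agent:
    """Hypothetical stand-in for a Java agent as seen by the drain script."""

    def __init__(self, name, running_jobs, max_workload=100):
        self.name = name
        self.running_jobs = running_jobs
        self.max_workload = max_workload  # 100 is a placeholder value
        self.active = True


def drain_pass(agents):
    """Run one pass over the agents and return the names that are still active.

    a) agents with running jobs get their resource set to 0 (no new jobs)
    b) agents without running jobs are shut down
    """
    still_active = []
    for agent in agents:
        if not agent.active:
            continue  # already shut down in an earlier pass
        if agent.running_jobs > 0:
            agent.max_workload = 0
            still_active.append(agent.name)
        else:
            agent.active = False
    return still_active
```

    A scheduler would call drain_pass again after some delay for as long as it returns a non-empty list, which corresponds to repeating the script while agents are still active because of a).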

    cheers, Wolfgang

    ------------------------------
    Support Info:
    If you are using one of the latest versions of UC4 / AWA / One Automation, please get in contact with Support to open a ticket.
    Otherwise update/upgrade your system and check if the problem still exists.
    ------------------------------



  • 12.  RE: Handling Routine Maintenance

    Posted 11 days ago
    ​Great info, thanks!

    (although I am still hoping that some day Automic explains that the Java agents all aren't really canon and are just part of the extended universe ... :D )




  • 13.  RE: Handling Routine Maintenance

    Posted 11 days ago
    Thank you, @Carsten Schmitz. I've asked one of the documentation engineers to look into it. Good input!