Hi,
Just a quick bit of information to think about:
You have discussed the behavior of OS agents so far, but there is a different one for all Java agents (all RA, all DB, SAP, ...):
As soon as a Java agent is stopped, all currently running jobs (object status 1550) crash.
There is no way to resume, no way to recover, no way to undo this (via Automic).
We created a script that is started prior to agent shutdown (it's a matter of luck to catch all long runners and prevent a heap of waiting jobs...):
a) All Java agents that have jobs in a running state are set to resource 0 (so they accept no new work).
b) All Java agents that do not have jobs in a running state are shut down.
The script has to be repeated as long as agents remain active because of a).
This construct has worked fine for two years.
Of course, this is only valid for planned downtimes of agents or of the whole system :-)
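A rough Automic Script sketch of such a pre-shutdown drain might look as follows. This is NOT Wolfgang's actual script: the VARA names and the per-agent job count lookup are placeholders, and the availability of MODIFY_SYSTEM for agent shutdown depends on your AE version — only SET_UC_SETTING with WORKLOAD_MAX_JOB is taken from this thread.

```
! Hedged sketch of a Java-agent drain before planned maintenance.
! Assumptions: VARA.JAVA_AGENTS is a static VARA listing the Java agent names,
! and VARA.RUNNING_JOBS is an SQL VARA returning the count of status-1550
! tasks per agent. Both are hypothetical names.
:SET HND# = PREP_PROCESS_VAR("VARA.JAVA_AGENTS")
:PROCESS HND#
:   SET AGENT#   = GET_PROCESS_LINE(HND#, 1)
:   SET RUNNING# = GET_VAR("VARA.RUNNING_JOBS", &AGENT#)
:   IF &RUNNING# <> "0"
!      a) agent still has running jobs: set its job workload to 0
!         so no new jobs are dispatched to it
:      SET RET# = SET_UC_SETTING(WORKLOAD_MAX_JOB, &AGENT#, 0)
:   ELSE
!      b) agent is idle: shut it down
!         (MODIFY_SYSTEM with a TERMINATE action is assumed here;
!          check whether your AE version supports it)
:      SET RET# = MODIFY_SYSTEM("TERMINATE", &AGENT#)
:   ENDIF
:ENDPROCESS
! Re-run this script until case a) no longer applies,
! then the remaining agents can be shut down safely.
```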
cheers, Wolfgang
------------------------------
Support Info:
If you are using one of the latest versions of UC4 / AWA / One Automation, please get in contact with Support to open a ticket.
Otherwise, update/upgrade your system and check whether the problem still exists.
------------------------------
Original Message:
Sent: 02-13-2020 08:06 AM
From: Carsten Schmitz
Subject: Handling Routine Maintenance
Again, this is my mileage and some gut feeling, but when an agent is unavailable, the engine will try to submit the job for some time.
So I would think (but I might be wrong) that:
1. given a JOBP with jobs A, B, and C
2. and an agent shutting down while working on job A, then
a) job A is processed to the end on the agent machine, and reported back
b) the engine tries to submit (and re-submit) job B for quite some time. My colleague even believes it's virtually forever. When the agent comes back, B and C should run.
Of course this poses a major problem for any job plans that somehow depend on wall time or elapsed time. This is why I keep preaching to our user base not to use hard MRT aborts of only a few minutes, or wall-time requirements, unless there are valid, good reasons for having them. A job plan that doesn't care whether it runs late should probably survive a downtime unscathed.
But of course I might be totally wrong.
Wow, it would be splendid if Automic described how this is actually designed in a neat description (or maybe even a flow chart) in the documentation. I hope they don't take away my posting privileges some day, but I'm gonna page @Elina McCafferty: Elina, is there a possibility this could make it onto somebody's to-do list? :)
Br,
Carsten
------------------------------
These contain very good advice on asking questions and describing suspected bugs (no, you do not need to go to StackExchange for Automic questions, but yes, the parts on asking detailed, useful questions ARE usually relevant):
http://www.catb.org/~esr/faqs/smart-questions.html
https://www.chiark.greenend.org.uk/~sgtatham/bugs.html
I will not respond to PMs asking for help unless there's an actual reason to keep the discussion off the public forums.
Original Message:
Sent: 02-13-2020 07:42 AM
From: Harlow
Subject: Handling Routine Maintenance
I have noticed that the job will continue even if the agent is shut down. The only problem with that is that a couple of our clients have monsters of workflows that rely on the status (or other output) of a preceding job before moving forward. In these situations, it's not exactly ideal if a job vanishes into the ether as far as AA is concerned.
Original Message:
Sent: 02-11-2020 07:17 AM
From: Carsten Schmitz
Subject: Handling Routine Maintenance
Hey,
> it would wait until it was finished processing whatever
> was active before shutting down.
It mostly works like this (for OS agents and on all platforms I know of, and as far as I know):
The engine submits the job and the agent starts it. The job runs, and if the agent dies (or is stopped, which happens immediately upon receiving the respective command), the actual job on the agent machine keeps running in a disassociated state.
For instance, on SysV Linux that means the actual job process continues to run, but the process is no longer owned by the agent; it is re-parented to init (PID 1) instead. Things work in a similar way on Windows. When the job eventually finishes, the disassociated process calls the agent binary in a special way (referred to as the "job messenger"), and that reports completion of the job back to the engine.
So when your agent is stopped for any reason: no new jobs will be accepted, existing jobs will finish, but finished jobs will obviously only be reported back if the agent can still reach the engine (e.g. from a network perspective).
To ensure no jobs are sent to an agent prior to a planned agent shutdown, there are various ways, but :SET_UC_SETTING WORKLOAD_MAX, Agent, 0, as already suggested, is likely the most viable one.
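For a planned window, that call can be wrapped in a pair of small script objects run before and after the maintenance. A minimal sketch — the agent name "WIN01" and the restored value of 100 are placeholders, and WORKLOAD_MAX_JOB is the setting name recent AE versions use for the job workload:

```
! Before the maintenance window: stop new jobs from being
! dispatched to the agent (placeholder agent name)
:SET RET# = SET_UC_SETTING(WORKLOAD_MAX_JOB, "WIN01", 0)

! After the window: restore the previous maximum
! (100 is a placeholder -- use whatever your agent had before)
:SET RET# = SET_UC_SETTING(WORKLOAD_MAX_JOB, "WIN01", 100)
```

Running jobs finish normally in the meantime; only new dispatches are blocked, which is exactly what you want for draining an agent ahead of patching.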
Hth,
------------------------------
Original Message:
Sent: 02-04-2020 09:40 AM
From: Harlow
Subject: Handling Routine Maintenance
Hi, all.
It turns out that we had a misconception regarding how shutting down the agents worked.
We have a high-availability setup, so we have been shutting down one 'half' of our agents early on the day they are scheduled to be patched, and then doing the same with the other half later in the week. Since they are in agent groups, if one agent isn't available, the work goes to the other.
We had been under the impression that when shutting down an agent via the AWI, it would wait until it had finished processing whatever was active before shutting down. I believe we thought this because I know that is how it works when you perform a CAU. Apparently we had been running on sheer dumb luck up until about a week ago, when a client was less than impressed with us killing one of their long-running jobs.
I was thinking about going the route of scheduling a pair of scripts that would remove the affected agents from the agent groups, then put them back in at the end of the maintenance window, but I would like to avoid reinventing the wheel if at all possible.
TL;DR: Is there a more elegant way to ensure that jobs aren't going to these agents during maintenance windows?
Thanks!