Automic Workload Automation


Handling Routine Maintenance

  • 1.  Handling Routine Maintenance

    Posted Feb 04, 2020 09:40 AM
    Hi, all.

    It turns out that we had a misconception regarding how shutting down the agents worked.
    We have a high-availability setup, so we have been shutting down one 'half' of our agents earlier on the day they are scheduled to be patched, and then doing the same with the other half later in the week. Since they are in agent groups, if one agent isn't available, work will go to the other.

    We had been under the impression that when shutting down an agent via AWI, it would wait until it was finished processing whatever was active before shutting down. I believe we thought this because I know this is how it works when you perform a CAU. Apparently we had been running on sheer dumb luck up until about a week ago, when a client was less-than-impressed with us killing one of their long-running jobs.

    I was thinking about going the route of scheduling a pair of scripts that would remove the affected agents from the agent groups, then put them back in at the end of the maintenance window, but I would like to avoid reinventing the wheel if at all possible.

    TL;DR: Is there a more elegant way to ensure that jobs aren't going to these agents during maintenance windows?

    Thanks!


  • 2.  RE: Handling Routine Maintenance
    Best Answer

    Posted Feb 05, 2020 10:12 AM
    Hello,

    :SET_UC_SETTING WORKLOAD_MAX, Agent, 0
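
    If you want to drain a whole agent group at once before a maintenance window, a sketch like the following could work (the agent group name is just an example, and the exact parameters of PREP_PROCESS_AGENTGROUP should be double-checked against your AE version's script documentation):

    ```
    ! Drain every agent of an (illustrative) agent group before maintenance.
    :SET &HND# = PREP_PROCESS_AGENTGROUP("AGENTGROUP.MAINT", "*", "ALL")
    :PROCESS &HND#
    :   SET &AGENT# = GET_PROCESS_LINE(&HND#)
    :   SET_UC_SETTING WORKLOAD_MAX, &AGENT#, 0
    :   PRINT "Set WORKLOAD_MAX of &AGENT# to 0"
    :ENDPROCESS

    ! After the maintenance window, a second script would restore the default,
    ! e.g.:  SET_UC_SETTING WORKLOAD_MAX, &AGENT#, UNLIMITED
    ```

    Note that (as discussed further down in this thread) the setting does not survive an agent restart, so the restore script is mainly there for the case where the agent keeps running.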

    Kind Regards


  • 3.  RE: Handling Routine Maintenance

    Posted Feb 10, 2020 12:02 PM
    That looks like it'll do the trick.

    Thank you!


  • 4.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 12:46 PM
    Edited by Kyle Harlow Feb 13, 2020 12:50 PM
    ​One issue I've found with this approach is that we use AgentGroups for redundancy (in addition to load balancing). We've split those servers' Windows update schedules so that at least one server is always available to process a job. The AgentGroup doesn't seem to be smart enough to determine that the agent to be patched is actually unavailable. So if I put the AgentGroup into Next Listed, set the max workload on one agent to 0, and spam a dummy job, I get:

    I can build on my previous failed attempts at scripting this to avoid that, but just saying.

    Thanks again!



  • 5.  RE: Handling Routine Maintenance

    Posted Feb 11, 2020 07:17 AM
    ​Hey,

    > it would wait until it was finished processing whatever
    > was active before shutting down.

    It mostly works like this (for OS agents, on all platforms I know of, and as far as I know):

    The engine submits the job and the agent starts it. The job runs, and if the agent dies (or is stopped, which happens immediately upon receiving the respective command), the actual job on the agent machine keeps running in a disassociated state.

    For instance, on SysV Linux that means the actual job process continues to run, but it is no longer owned by the agent; it is reparented to init (PID 1) instead. A similar mechanism is used on Windows. When the job eventually finishes, the disassociated job process invokes the agent binary in a special mode (referred to as the "job messenger"), which reports the completion of the job back to the engine.

    So when your agent is stopped for any reason, no new jobs will be accepted and existing jobs will finish, but finished jobs will only be reported back if the job messenger can still reach the engine (e.g. from a network perspective).

    To ensure no jobs are sent to an agent prior to a planned agent shutdown, there are various ways, but :SET_UC_SETTING WORKLOAD_MAX, Agent, 0, as already suggested, is likely the most viable one.

    Hth,


  • 6.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 07:42 AM
    I have noticed that the job will continue even if the agent is shut down. The only problem with that is that a couple of our clients have monsters of workflows that rely on the status (or other output) of a preceding job before moving forward. In these situations, it's not exactly ideal if a job vanishes into the ether as far as AA is concerned.


  • 7.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 07:53 AM
    And of course now I find the section of the documentation that would have answered my question. Whoops.


  • 8.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 08:07 AM
    Again, this is my mileage and some gut feeling, but when an agent is unavailable, the engine will try to submit the job for some time.

    So I would think (but I might be wrong) that:

    1. given a JOBP with jobs A, B, and C
    2. and an agent shutting down while working on job A, then

    a) job A is processed to the end on the agent machine, and reported back
    b) the engine tries to submit (and re-submit) job B for quite some time. My colleague even believes it's virtually forever. When the agent comes back, B and C should run.

    Of course this poses a major problem for any job plans that somehow depend on wall time or elapsed time. This is why I keep preaching to our user base not to have hard MRT aborts of only a few minutes, or wall time requirements, unless there are valid, good reasons for having those. A job plan that doesn't care whether it runs late should probably survive a downtime unscathed.

    But of course I might be totally wrong.

    Wow, it would be splendid if Automic would describe how that is actually designed in a neat description (or maybe even a flow chart) in the documentation. I hope they don't take my posting privileges away some day, but I'm going to page @Elina McCafferty: Elina, is there a possibility this could make it onto somebody's to-do list? :)

    Br,
    Carsten





  • 9.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 08:44 AM
    That was initially our impression, but we ended up with a few jobs going into an inconsistent status (which were then locking up a client's queues that had only one slot). The really fun part was that these runs weren't showing up in Process Monitoring, so we had to query the database to find jobs using that queue that had a null end time. We could then look up the RunID in AWI and kill it there. Not that bad once you figure out what's what in the DB, but still.
    (Side note: I had sent that in to support, as I feel I should have been able to find those in Process Monitoring, but ended up in an infinite loop about how we didn't shut down the system properly.)

    We are on the latest version (12.3.1HF1), perhaps something changed?


  • 10.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 09:32 AM
    In the long run, and in an ideal world, Automic should document what should happen (on a positive note, maybe we kicked that off now). Any deviation from that behaviour could then be treated as a malfunction, and the manufacturer would have to act on it. Unfortunately, getting things established as malfunctions with support is hard at present, especially when they can't be replicated easily.

    I have also seen jobs with weird or null timestamps in the past. We had a major drive with Automic support to clean them from the database once. No idea if those also originated from a supposed "unclean shutdown", but yeah, it can definitely happen.

    Also, though this is really taking things off topic now, I do not accept any arguments from Automic about persistent problems being the operator's fault due to supposedly improper shutdowns. Our servers are regularly rebooted unattended for patching. The engine needs to honor the respective signal (SIGTERM) reasonably well and shut itself down cleanly. Our company doesn't pay people to bring down the engine at 2 am or something, and with automatic patching being the industry norm now (or so the Red Hat admins tell me), Automic should shut down along with the server like any other service, with no permanent ill effects.

    But that's my $0.02 really :)

    Cheers,
    Carsten


  • 11.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 08:51 AM
    Hi,

    Just a quick info to think about:

    You discussed the behavior of OS agents so far, but there is different behavior for all Java agents (all RA, all DB, SAP, ...):

    As soon as a Java agent is stopped, all currently running jobs (object status 1550) will crash.
    There is no way to resume, no way to recover, no way to undo (via Automic).

    We created a script that is started prior to agent shutdown (it's a matter of luck to catch all long runners and prevent a heap of waiting jobs...):
    a) all Java agents that have jobs in a running state are set to resource 0
    b) all Java agents that do not have jobs in a running state are shut down.

    The script has to be repeated if agents are still active because of a).
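
    In outline, the logic looks roughly like this (this is not our production code; the agent group name and the SQLI VARA for counting active tasks are made up, and the actual "stop agent" step depends on how you run your Java agents):

    ```
    ! Sketch of the pre-shutdown drain logic described above (names are illustrative).
    :SET &HND# = PREP_PROCESS_AGENTGROUP("AGENTGROUP.JAVA.ALL", "*", "ALL")
    :PROCESS &HND#
    :   SET &AGENT# = GET_PROCESS_LINE(&HND#)
    !   Hypothetical SQLI VARA returning the number of active tasks on an agent
    :   SET &RUNNING# = GET_VAR("VARA.SQLI.ACTIVE_TASKS", &AGENT#)
    :   IF &RUNNING# > 0
    :      SET_UC_SETTING WORKLOAD_MAX, &AGENT#, 0
    :   ELSE
    !      Idle: stop the agent here (mechanism depends on your setup,
    !      e.g. via the ServiceManager or an OS-level service stop)
    :   ENDIF
    :ENDPROCESS
    ! Re-run this script until all Java agents are down.
    ```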

    This construct has worked fine for 2 years.

    Of course, this is only valid for planned downtimes of agents or of the whole system :-)

    cheers, Wolfgang

    ------------------------------
    Support Info:
    if you are using one of the latest version of UC4 / AWA / One Automation please get in contact with Support to open a ticket.
    Otherwise update/upgrade your system and check if the problem still exists.
    ------------------------------



  • 12.  RE: Handling Routine Maintenance

    Posted Feb 13, 2020 09:33 AM
    ​Great info, thanks!

    (although I am still hoping that some day Automic explains that the Java agents all aren't really canon and are just part of the extended universe ... :D )


  • 13.  RE: Handling Routine Maintenance

    Broadcom Employee
    Posted Feb 13, 2020 09:16 AM
    Thank you, @Carsten Schmitz. I've asked one of the documentation engineers to look into it. Good input!​


  • 14.  RE: Handling Routine Maintenance

    Posted Feb 26, 2020 05:01 AM
    We have recognized an additional unpleasant behaviour regarding ":SET_UC_SETTING WORKLOAD_MAX, Agent, 0". We used this functionality before restarting a server, and we set the parameter very early because we wanted to avoid aborting jobs. If a warmstart of the agent happens between setting the resources to 0 and the restart of the server, the resources are set to unlimited again. Support said it "works as designed" :-(.
    With this new knowledge, this functionality unfortunately can no longer be used for our use case.
    Regards, Kordula

    ------------------------------
    Kordula
    ------------------------------



  • 15.  RE: Handling Routine Maintenance

    Posted Feb 26, 2020 05:12 AM
    Hi

    Thanks for the information, we use this functionality as well.

    It seems this was an incompatible change; at least in V11.2 it was working as you described...

    @David Ainsworth was this really changed? How to proceed in the future - especially with JAVA based agents?

    KR & THX Wolfgang






  • 16.  RE: Handling Routine Maintenance

    Posted Feb 26, 2020 05:45 AM
    > a warmstart of the agent is happening, the resources are set to unlimited again

    I kind of would have expected this: after all, trace flags for agents also get reset on an agent restart if set via the engine. Unlike trace flags, though, I can't find a way to pin the resource limit via the INI file.

    Probably another "idea" for "ideation". Until then, I'd probably set up a time event that resets the workload to zero every minute or so. Crude hack, I know :)
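
    For the record, the body of such a time event (firing once per minute during the maintenance window) would just be a one-liner per agent; AGENT01 here is a placeholder:

    ```
    ! !Process tab of a time event, executed once per minute:
    ! re-pin the workload to 0 in case a warmstart reset it to UNLIMITED.
    :SET_UC_SETTING WORKLOAD_MAX, AGENT01, 0
    ```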

    Best,
    Carsten


  • 17.  RE: Handling Routine Maintenance

    Posted Feb 26, 2020 05:46 AM
    Hello,

    That's right: in UC_HOSTCHAR_*, WORKLOAD_MAX_FT and WORKLOAD_MAX_JOB are probably set to UNLIMITED. After a restart, these parameters are read. The idea would be to create a UC_HOSTCHAR_AGENT set with 0 for both parameters.
    Of course, there are some risks.
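
    As a sketch, the relevant keys in such a UC_HOSTCHAR_AGENT variable would look like this (the key names are taken from the defaults mentioned above; verify them against your UC_HOSTCHAR_DEFAULT before relying on this):

    ```
    Key                 Value
    WORKLOAD_MAX_JOB    0
    WORKLOAD_MAX_FT     0
    ```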

    Regards
    Markus


  • 18.  RE: Handling Routine Maintenance

    Posted Feb 26, 2020 06:33 AM
    Hi Markus, 
    okay, but then after an agent restart the resources are 0, too! I cannot see a benefit!
    Regards Kordula

    ------------------------------
    Kordula
    ------------------------------



  • 19.  RE: Handling Routine Maintenance

    Posted Feb 26, 2020 07:12 AM
    Hi Kordula,

    This topic has been covered here:

    https://community.broadcom.com/enterprisesoftware/communities/community-home/digestviewer/viewthread?MessageKey=09c06420-e86f-4fe4-bd73-62f24a9fad99&CommunityKey=2e1b01c9-f310-4635-829f-aead2f6587c4&tab=digestviewer#bm09c06420-e86f-4fe4-bd73-62f24a9fad99

    In V12.3, and I'm pretty sure in all older versions as well, the agent resources are always set to "Unlimited" after an agent restart, regardless of whether you set the resources via AWI or :SET_UC_SETTING. The only way to set the resources to a value other than "Unlimited" directly after the agent starts is the one I described in my post (see the link above).

    Cheers
    Christoph

    ------------------------------
    ----------------------------------------------------------------
    Automic AE Consultant and Trainer since 2000
    ----------------------------------------------------------------
    ------------------------------



  • 20.  RE: Handling Routine Maintenance

    Broadcom Employee
    Posted Mar 18, 2020 08:23 AM
    Hi Carsten,
    I'll ask a tech writer to review this and decide if/where to add it.
    Thank you for this input!
    Elina


  • 21.  RE: Handling Routine Maintenance

    Posted Mar 18, 2020 08:34 AM
    Hi all,

    I think the terms agent restart and agent warmstart were mixed up in this thread multiple times!
    That could be the cause of some misunderstandings...

    As far as I am informed:

    Agent warmstart, a.k.a. reconnect (the agent process keeps running): no change of resources takes place.

    Agent start, a.k.a. stopping & starting the agent: resources will be reset to the default settings.

    cheers, Wolfgang
