Idea Details

Autosys System Agent Shutdown

Last activity 06-03-2019 08:10 PM
11-06-2018 06:39 AM

We would like the agent process/service to not shut down due to low disk space. Ideally it would hibernate, stop accepting new workload, and periodically check disk space before automatically resuming.

 

In a big organisation where we have agents running on 4K+ machines in non-production, we regularly run out of disk space. This becomes a nightmare for the Admin team to manage, as a manual process is required to restart the agent once disk space becomes available.
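Until something like this exists in the agent, an external watchdog is the only way to take the manual restart off the Admin team. A minimal sketch of that workaround in Python, assuming a hypothetical agent file system path, free-space threshold, and start command (none of these are AutoSys features or documented values; adjust for your own install):

import shutil
import subprocess
import time

AGENT_FS = "/opt/CA/SystemAgent"        # hypothetical agent file system; use your install path
MIN_FREE = 50 * 1024 * 1024             # illustrative: want 50 MB free before restarting
START_CMD = ["./start_agent.sh"]        # placeholder: replace with however you start the agent at your site

def agent_running():
    # Placeholder liveness check (process name assumed); substitute your own status command.
    return subprocess.run(["pgrep", "-f", "cybAgent"], capture_output=True).returncode == 0

while True:
    if shutil.disk_usage(AGENT_FS).free >= MIN_FREE and not agent_running():
        subprocess.run(START_CMD, check=False)   # bring the agent back once space is available again
    time.sleep(60)                               # poll once a minute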


Comments

11-16-2018 05:17 PM

Keith,

Reading your comment, this sentence caught my attention: "This is often a crash / core dump of a process, which exceeds the disk space and then is instantly deleted because it is incomplete."

This is precisely the problem that I have solved at one of the shops, by recommending that the agent file system be exclusive to itself.

 

As Mike has said, "for the hibernate suggestion the feature is already there".

Ref your other comment: "In a world where it is normal to have multiple gigabytes available on a disk at any time, there is no practical difference between 20mb and 10mb". The only question is whether these current threshold numbers are adjustable; if so, you can tune them to match your requirement.
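For illustration only, the tuning would go in the agent's agentparm file (where, to my understanding, these settings live), using the parameter names Mike quotes below; the values here are made up to show GB-scale headroom, not recommendations, so check the agent documentation for what is supported:

agent.resourcemon.threshold.disk.warning.notice=1024M
agent.resourcemon.threshold.disk.warning.severe=512M
agent.resourcemon.threshold.disk.critical=256M

The wider the gap between the severe and critical values, the longer the machine sits BLOCKED (with jobs in PEND_MACH) before the agent ever reaches its shutdown point.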

 

Hope this helps.

Best,

Chris <CJ>

11-16-2018 02:14 PM

Hi Mike,

 

I have the default settings as above.  But reality and these settings are too far apart.  In a world where it is normal to have multiple gigabytes available on a disk at any time, there is no practical difference between 20mb and 10mb.

 

By the time the agent's first alert is hit, something is often critically consuming the disk space. This is often a crash / core dump of a process, which exceeds the disk space and then is instantly deleted because it is incomplete. Monitoring does not show the event because it is polling at 1+ minute intervals. Unfortunately, the agent sees this momentary lack of space and shuts down.

 

What is being asked for is for the agent to be more tolerant. A lack of space for less than a minute should not cause the failure of the agent, especially when it isn't even doing anything or writing any logs. Far too often, an agent that is actually idle shuts down, and then we need someone to go onto the server to fix it in the middle of the night.
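To make the ask concrete: shut down only if space is still below the critical threshold after several consecutive checks, not on the first sample. A rough sketch of that grace-period logic in Python, purely illustrative and not anything the agent does today (the path, the interval, and the shut_down_agent stub are all made up):

import shutil
import time

CRITICAL_FREE = 10 * 1024 * 1024   # e.g. the 10M critical threshold
CHECK_INTERVAL = 10                # seconds between samples (made up)
GRACE_CHECKS = 6                   # tolerate roughly a minute of low space

def shut_down_agent():
    # Stub for the example; a real agent would stop its own processing here.
    raise SystemExit("disk space critically low for a sustained period")

low_count = 0
while True:
    if shutil.disk_usage("/opt/agent").free < CRITICAL_FREE:   # illustrative path
        low_count += 1
        if low_count >= GRACE_CHECKS:
            shut_down_agent()      # only after sustained exhaustion, not a momentary dip
    else:
        low_count = 0              # space came back (e.g. the core dump was removed); keep running
    time.sleep(CHECK_INTERVAL)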

 

One thing you may note from this discussion and others like it is that when this happens, system admins often log in to find that there is no disk space problem even a few minutes later.  Other applications seem to survive just fine and only AutoSys seems unable to handle the lack of space when it isn't even asking for any space to be allocated.

11-16-2018 02:03 PM

Hi All,

 

I think for the hibernate suggestion the feature is already there.  

The agent has three disk threshold settings already: 

agent.resourcemon.threshold.disk.warning.notice=20M
agent.resourcemon.threshold.disk.warning.severe=18M
agent.resourcemon.threshold.disk.critical=10M

 

When the first is hit, a message and alarm are sent out, but processing continues.

When the second is hit, the agent stops accepting new requests, an alarm is sent out, and the machine is set to a BLOCKED state:

   "CAUAJM_I_40245 EVENT: ALARM            ALARM: MACHINE_DISKTHRESHOLD MACHINE: Mike-3.ca.com TEXT: <Disk resource below threshold. Machine <Mike-3.ca.com> has blocked communication. Status(BLOCKED) DiskSpaceCurrent(zM) DiskSpaceThreshold(xM)>"

 

At this point jobs will go to PEND_MACH (or whatever state you are configured for) until the disk space frees up and the machine is marked back online.

 

Only when the third setting is hit will the agent shut down. As mentioned in other posts, completely running out of disk presents problems with logging and updating status, so zero space is not a good place to end up.
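Put another way, the behaviour described above boils down to something like this simplified model (not the agent's actual code; the thresholds are the defaults listed earlier):

def disk_state(free_bytes, notice=20*2**20, severe=18*2**20, critical=10*2**20):
    # Simplified model of the three-threshold behaviour, for illustration only.
    if free_bytes <= critical:
        return "SHUTDOWN"    # agent stops
    if free_bytes <= severe:
        return "BLOCKED"     # alarm sent, no new work accepted, jobs go to PEND_MACH
    if free_bytes <= notice:
        return "WARNING"     # message and alarm sent, processing continues
    return "NORMAL"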

 

Is this what you're looking for?

 

Regards,

Mike

11-13-2018 04:08 PM

+1 for what Steve has said...

1. The nohup file gets cleaned out with the restart, and extracting the cause of the crash requires going through the files in the agent log directory.

2. Tracing out the failure of a job also requires a diligent scan of the files in the spool directory.

In summary, getting to the root cause takes some digging - it should be easier.

Requests for enhancement/ideation have been submitted before; I am not sure of their status.

Best,

Chris <CJ>

11-13-2018 03:25 PM

I had one as it related to common attributes: "CA WAAE jobtypes need to support standard optional attributes".

11-13-2018 01:17 PM

Is there already an idea for that request that I can upvote?

11-13-2018 12:09 PM

Keith, I have been asking for that forever! It goes to spool/agent, which gets cleaned up on success, which is horrendous. Keeping all those files under the software directory is a no-can-do.

It's because of the way Cybermation did its batch, and nothing else...

 

But yes, std_out/err should be there for all job types!!

 

just my 3 cents -- 

 

Steve C.

11-12-2018 02:26 PM

I agree, although that does bring up the other item that it would be nice if built-in job types actually had an option to create 'normal' log files as other jobs do. Trying to find root cause in a shared log/event file is a real pain. Plus, being able to create job-specific log files for all jobs would make it easier for debugging and also allow us to specify where the logs are placed.

11-12-2018 02:08 PM

Hi, 

For the SCP job type that was mentioned.... 

There needs to be a record (log) of  any transaction.  

If it failed and there is no log, how does the root cause get determined?

The application team and audit group should also have an issue with running batch jobs without any records.

 

2¢ 

If I give the bank 500 dollars, I want the record of that transaction...

11-12-2018 01:44 PM

I agree with Steve in that it would be best to have the option of either having the existing behavior of the agent shutting down when its space threshold is exceeded or having the agent remain up but suspend operations until space is restored.

 

Personally, I would like it to 'act' like it is down when space is exhausted but then either resume or restart automatically once it is resolved (assuming you pick that mode).

 

But what about built-in functions/job types? Should these be terminated simply because the agent's space is exhausted, when there may be no issue with the work they are performing? Should an SCP job terminate because of this? That is why I prefer it not actually shut down but simply stop accepting any new requests.

11-08-2018 03:01 PM

With Steve's proposal, we have a combo that could work!

1. The configuration choice:

hybernate.threshold

shutdown.threshold

2. An error-specific CAUAJM_E message in the EP log and a related MIB.

 

I have not voted up or down for this idea, because it band-aids the issue of disk space.

Since the introduction of the Cybermation Agent, I have seen the Agent shut down for lack of disk space only when the Agent installation directory is on a common drive [and the reasons for this are beyond the scope of this discussion].

Allocate an agent-exclusive drive/mount point with 10 GB of space, adjust the archive settings in the agentparm file, and you will be immune to an agent shutdown due to disk space constraints.

Trust me, I have proved this at more than one client's shop [at the current "shop", they went cheap on me and gave the agent an exclusive disk with only 5 GB, but this too will work].

 

Best,

Chris <CJ>

11-08-2018 02:46 PM

Ramon,

 

This was not the ask. They do not want the agent to process, nor do they want it to shut down.

My thought is: when you reach a disk-full condition, you either hibernate or shut down.

However, this is a catch-22, because if there's no disk space you can't log the issue.

This is why I am voting neither up nor down on the ask, but the ask should be something different, as I proposed above.

 

Thank you

Steve C.

11-08-2018 02:38 PM

You can disable the space check so that the agent does not shut down. In the agentparm file there is an extra line to disable this option. The agent will then continue running, and your jobs will be stuck in starting or fail mode.

 


11-08-2018 11:01 AM

AutoSys itself shuts down if out of disk space. I think it makes sense that the agents do as well. 

Perhaps the ask should be: can this be made a configuration choice?

 

e.g.

hybernate.threshold

shutdown.threshold
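That is, something along these lines in the agentparm file (these parameters do not exist today; the names are the proposal and the values are just examples):

hybernate.threshold=50M
shutdown.threshold=10M

Above the first value the agent runs normally; between the two it would hibernate (stay up but accept no new work); only below the second would it actually shut down.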

 

This I would stand behind...

 

just my 3 cents

 

Steve C.

11-07-2018 09:09 AM

I agree with Lisa regarding the convenience of the "hibernation".

Ramon has a very valid point: the "hibernation" will leave you guessing as to the root cause. It could be disk space, CPU contention, long running jobs... so many possible causes.

Let's have the "hibernation" plus an error-specific CAUAJM_E message in the EP log and a related MIB for those using SNMP.

Then we can have our cake and eat it too!

Chris <CJ>

11-07-2018 08:46 AM

How do you monitor the problems when the machine runs out of space? With the hard failure, you will notice something is bad right away.

11-06-2018 09:52 AM

I like the hibernation idea as well. If the agent could go into a hibernation state rather than shutting down when disk space has reached an indicated threshold, and then resume when the space issue has been addressed, it would be far less impactful to our AutoSys users, and to the administrators who are otherwise faced with having to start the agent manually. Great idea!