AutoSys Workload Automation

  • 1.  boxes for multi-part transactions

    Posted Oct 24, 2014 12:58 AM

    My company is in the midst of a philosophical debate over how best to break up multi-step processes.  Here are two ETL examples:

    1. Download and ingest data:

    1.a. Check to see whether the file exists on the remote site.

    1.b. Run an FTP job to transfer the file locally.

    1.c. Load the data from the file into a staging table.

    1.d. Run validation against the data.

    1.e. Transfer the data to the production database.

    1.f.  Clean out the staging table.

     

    2. Transfer data from one database to another.

    2.a. SQL trigger watches for  a new record in DB1.

    2.b. SQL job inserts the new record into a staging table in DB2.

    2.c. Perform validation against the data in the stagng table.

    2.d. Transfer the data to the production tables in DB2.

    2.e. Clear out the staging table.

     

    Historically, we've often done this by putting all the steps in a single program or perl script.  Everyone agrees that this is a terrible way to go.  If 1.c fails, the only way to move forward after you've fixed whatever environmental issue caused the failure is to make a copy of the program and rewrite it to skip 1.a and 1.b..

     

    Our initial thought was that we would simply break each job down into its constituent parts, and each stage would get its own Autosys job.  The jobs would go into a box, and that box would replace the original job.  Now if 1.3 fails, you fix the environmental issue, force_start job #3 and let the process continue on its way.

     

    Others have said that Autosys doesn't do a particularly good job at this sort of thing and they would prefer that we use a dedicated ETL tool like SSIS or Informatica that can be set up to break each "task" into "sub-tasks."  If 1.3 fails, you fix the environmental issue and then just re-run the SSIS task #1 as a whole.  It's smart enough to realize that 1.a and 1.b have already run and can be skipped.

     

    I can see the argument - if job 2.c failed and box #2 were force_started, then 2.a and 2.b would run a second time, with potentially catastrophic results if the staging table had never been cleared out (though in theory it's the developers' job to make sure that any error situation returns to the pre-transaction state).  My concern is visibility:  if the job fails, the only way to identify which step failed might be to launch a separate SSIS console, which strikes me as a less supportable model (but maybe SSIS can be configured to write to stdout after each sub-task was completed; problem solved).

     

    I'd like to open this up and see what the community thinks.



  • 2.  Re: boxes for multi-part transactions

    Posted Oct 27, 2014 01:23 PM

    This is a common dilemma - how much logic goes into the AutoSys workflow, versus how much goes into the job itself.  I don't think there is a single answer.

     

    > "... putting all the steps in a single program or perl script.  Everyone agrees that this is a terrible way to go ..."

     

    I don't agree this is terrible.  It depends on whether your script writers are up to the task.

     

    We have similar challenges with SAP Business Warehouse, where one could create process chains that perform multiple steps outside of AutoSys control.  The SAP application team then needs to connect to that system and diagnose the fault.  The primary downside we've seen to "all the logic in one job" is when an initial step hits a new bug and stalls.  If that logic was in one Autosys job with a max run alarm, we'd know about it sooner.  If multiple steps are in one job, we must set the max run to be a sum of all worst case scenarios, meaning we would not find out something was wrong until much later.

     

    > "... Others have said that AutoSys doesn't do a particularly good job at this sort of thing..."

     

    Exception processing may be simpler with other scheduler tools.  AutoSys has "one-time overrides" for jobs, which can do a world of good if properly managed.

     

    Some ways we have tried to contain failures:

     

    • Have good "operations documentation" so operators understand what they can safely re-run and what they must escalate.
    • Write scripts that exit with specific codes depending on the condition.  Always exiting with a "1" for failure leaves doubt as to the cause, and to the cure.
    • Document how to execute and clean up one-time overrides.
    • Use Max Run (and Min Run) alarms for time-sensitive jobs, with clear escalation procedures.
    • And training. Yes, training...

     

    Sometimes, the application team needs to fix things "manually."  Then they need to communicate with the operators such that the workflow is resumed at the correct place.  Leaving a job failure in a box might lead to the next run not occurring when scheduled (compounding the original damage).