Automic Workload Automation

Dynamic Adaptive ERT

  • 1.  Dynamic Adaptive ERT

    Posted 05-23-2018 08:04 AM

    Hello,

     

    I am trying to figure out how this works. Here is the scenario/test I have set up.

     

    I have one job that has this in the post process.

     

     

    The variable object looks like this:

     

     

    The runtime looks like this:

     

    The variable declaration at job level looks like this:

     

     

    There is a workflow with 2 instances of the job in it:

     

    The other job has RUN_WHAT set to SLOW

     

    The runtime tab is set like:

     

    I have the following row in UC_CLIENT_SETTINGS:

     

    I have executed the workflow more than 30 times to get enough stats.

     

    I expected that if I amended the FAST value in the variable object from 1 to 30, the runtime deviation would trigger and activate the job JOBS.UNIX.TEST_AI_OVERRUN.

     

    Should this be the expected behavior?  Are there any settings/options I am missing?

     

    Apart from the variables used by the job's execution, is it documented anywhere what else the dynamic adaptive option uses?  For example, if a job normally took seconds to run but took an hour on the last Friday of every month, is it clever enough to identify this pattern?

     

    Also, does it get logged anywhere what ERT calculation was used during the execution?

     

    Cheers,

     

    Dan



  • 2.  Re: Dynamic Adaptive ERT

    Posted 05-23-2018 10:14 AM

    Hello,

     

    Found some logging in the JWP report ....

     

     

    Cheers,

     

    Dan



  • 3.  Re: Dynamic Adaptive ERT

    Posted 05-23-2018 11:07 AM

    Update ...

     

    Seemed to get the job to report an error now ...

     

    I had the jobs set to FAST = 1 second and SLOW = 30 seconds:

     

    It was showing this in the JWP log which I guess is as expected:

     

    20180523/151304.640 - U00045019 The estimated runtime for RunID '2514404' is '31' seconds.
    20180523/151304.859 - U00045018 Calculate estimated runtime for RunID '2514405'.
    20180523/151304.859 - U00045019 The estimated runtime for RunID '2514405' is '2' seconds.
    20180523/151306.187 - U00045020 Got feedback for the runtime estimation of RunID '2514405': estimated '2' seconds, actual runtime was '2' seconds.
    20180523/151334.984 - U00045020 Got feedback for the runtime estimation of RunID '2514404': estimated '31' seconds, actual runtime was '30' seconds.
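
    To compare estimates with actual runtimes across many runs, log lines like the ones above can be collected with a short script. This is only a sketch: the patterns assume the U00045019 and U00045020 message layout exactly as quoted in this post, and the function name parse_ert_log is made up for illustration.

```python
import re

# Patterns assume the message layout quoted in this thread (an assumption,
# not an official log format specification).
ESTIMATE_RE = re.compile(
    r"U00045019 The estimated runtime for RunID '(\d+)' is '(\d+)' seconds"
)
FEEDBACK_RE = re.compile(
    r"U00045020 Got feedback for the runtime estimation of RunID '(\d+)': "
    r"estimated '(\d+)' seconds, actual runtime was '(\d+)' seconds"
)


def parse_ert_log(text):
    """Return {run_id: {"estimated": int, "actual": int or None}}."""
    runs = {}
    # Estimates are logged when the task starts.
    for match in ESTIMATE_RE.finditer(text):
        runs[match.group(1)] = {"estimated": int(match.group(2)), "actual": None}
    # Feedback lines arrive once the task has finished.
    for match in FEEDBACK_RE.finditer(text):
        entry = runs.setdefault(
            match.group(1), {"estimated": int(match.group(2)), "actual": None}
        )
        entry["actual"] = int(match.group(3))
    return runs


sample = """\
20180523/151304.640 - U00045019 The estimated runtime for RunID '2514404' is '31' seconds.
20180523/151304.859 - U00045018 Calculate estimated runtime for RunID '2514405'.
20180523/151304.859 - U00045019 The estimated runtime for RunID '2514405' is '2' seconds.
20180523/151306.187 - U00045020 Got feedback for the runtime estimation of RunID '2514405': estimated '2' seconds, actual runtime was '2' seconds.
20180523/151334.984 - U00045020 Got feedback for the runtime estimation of RunID '2514404': estimated '31' seconds, actual runtime was '30' seconds.
"""

for run_id, values in sorted(parse_ert_log(sample).items()):
    print(run_id, values)
```

    Because the script searches the whole text rather than line by line, it also copes with log output that has been pasted as one long line.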

     

    I then changed SLOW to 40 seconds. This caused the SLOW job to report an overrun, so all good.

     

    However, it then recalculated the stats as:

     

    20180523/152334.704 - U00045021 Based on the feedback the ERT model for RunID '2516324' will be recalculated.
    20180523/152335.188 - U00045018 Calculate estimated runtime for RunID '2514463'.
    20180523/152335.204 - U00045019 The estimated runtime for RunID '2514463' is '1' seconds.
    20180523/152335.376 - U00045018 Calculate estimated runtime for RunID '2514464'.
    20180523/152335.376 - U00045019 The estimated runtime for RunID '2514464' is '1' seconds.

     

    So then all SLOW jobs got reported as an overrun.  Not sure why they both got estimated with a run time of 1 second?

     

    who knows

     

    Cheers,

     

    Dan.



  • 4.  Re: Dynamic Adaptive ERT

    Posted 05-24-2018 06:08 AM

    Hello,

     

    Further update ... am not sure I can get this working at all....

     

    I added a MEDIUM job as well so I had

     

     

    I ran the job plenty of times and it recalculated the ERT on a few occasions:

     

    20180524/093146.554 - U00045019 The estimated runtime for RunID '2512198' is '2' seconds.
    20180524/093146.710 - U00045018 Calculate estimated runtime for RunID '2512199'.
    20180524/093146.710 - U00045019 The estimated runtime for RunID '2512199' is '30' seconds.
    20180524/093146.867 - U00045018 Calculate estimated runtime for RunID '2512200'.
    20180524/093146.867 - U00045019 The estimated runtime for RunID '2512200' is '1' seconds.

     

    20180524/100011.699 - U00045019 The estimated runtime for RunID '2516450' is '31' seconds.
    20180524/100011.902 - U00045018 Calculate estimated runtime for RunID '2516451'.
    20180524/100011.917 - U00045019 The estimated runtime for RunID '2516451' is '2' seconds.
    20180524/100012.120 - U00045018 Calculate estimated runtime for RunID '2516452'.
    20180524/100012.120 - U00045019 The estimated runtime for RunID '2516452' is '1' seconds.

    20180524/101507.872 - U00045019 The estimated runtime for RunID '2517306' is '2' seconds.
    20180524/101508.060 - U00045018 Calculate estimated runtime for RunID '2517307'.
    20180524/101508.075 - U00045019 The estimated runtime for RunID '2517307' is '31' seconds.
    20180524/101508.232 - U00045018 Calculate estimated runtime for RunID '2517308'.
    20180524/101508.247 - U00045019 The estimated runtime for RunID '2517308' is '2' seconds.

    20180524/103430.671 - U00045019 The estimated runtime for RunID '2514747' is '31' seconds.
    20180524/103430.875 - U00045018 Calculate estimated runtime for RunID '2514748'.
    20180524/103430.890 - U00045019 The estimated runtime for RunID '2514748' is '43' seconds.
    20180524/103431.109 - U00045018 Calculate estimated runtime for RunID '2514749'.
    20180524/103431.109 - U00045019 The estimated runtime for RunID '2514749' is '2' seconds.

    The ERT for the SLOW job seems to be increasing but has still not got to the 60 seconds that it will take.

    Does anyone out there use this feature?  If not, how do people generally flag overruns?  We cannot use fixed values because the same job can take minutes against one environment and an hour against another.

     

    Cheers,

     

    Dan



  • 5.  Re: Dynamic Adaptive ERT

    Posted 05-25-2018 08:45 AM

    Hello Dan,

     

    Thanks for your very interesting experiment with our Adaptive ERT feature! I'm sorry that you couldn't achieve the behavior you wanted, but I have one hint for your problem: have you tried executing the SLOW/MEDIUM/FAST instances of your job within different parent workflows?

     

    As you correctly suggested, the Adaptive ERT algorithm must not lump all runtimes of a job together to predict its ERT. Instead, it has to be aware of different execution contexts ("same job takes minutes against one environment and hours against another") and consider only runtimes from similar contexts for the prediction.

     

    But what are the features that constitute a distinct execution context? Agent name? Parent workflow? Host? Or all together? Good news: you can define this by yourself in the UC_CLIENT_SETTINGS property ERT_ADAPTIVE_DEFAULT_CONTEXT. There you simply enumerate those object properties that compose your execution context.

     

    Now, why did I ask you to run the SLOW/MEDIUM/FAST instances of your job under different parent workflows? Because by default ERT_ADAPTIVE_DEFAULT_CONTEXT uses the name/alias of the job's parent to assign its runtime to a certain context. (Note that it is not possible to use VARA values for defining contexts. I assume your tests failed because Adaptive ERT was using just one context internally.)

     

    Thus, I expect the ERT for your SLOW jobs to converge towards 60 after some executions under a SLOW.JOBP. In case of further questions, just reply to this post.
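
    The effect of separating contexts can be illustrated with a small sketch. It is not the Adaptive ERT algorithm: the median is used purely as a stand-in estimator, the runtimes are made-up numbers, and FAST.JOBP is a hypothetical workflow name chosen by analogy with SLOW.JOBP.

```python
from collections import defaultdict
from statistics import median

# Hypothetical history: (job name, parent workflow, runtime in seconds).
history = [
    ("JOB1", "FAST.JOBP", 1),
    ("JOB1", "FAST.JOBP", 2),
    ("JOB1", "FAST.JOBP", 1),
    ("JOB1", "SLOW.JOBP", 60),
    ("JOB1", "SLOW.JOBP", 61),
    ("JOB1", "SLOW.JOBP", 60),
]


def estimates_by_context(rows):
    """Group runtimes by (job, parent) and estimate each group separately."""
    grouped = defaultdict(list)
    for job, parent, seconds in rows:
        grouped[(job, parent)].append(seconds)
    # Median is only a stand-in for whatever model the product really uses.
    return {context: median(values) for context, values in grouped.items()}


per_context = estimates_by_context(history)

# With a single context, fast and slow runs are mixed into one estimate
# that fits neither kind of run.
single_context = median(seconds for _, _, seconds in history)

print(per_context)
print(single_context)
```

    With one shared context the estimate lands between the fast and the slow runs, which would match the symptom of estimates jumping around between instances of the same job.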

     

    Best,

     

    Franz 



  • 6.  Re: Dynamic Adaptive ERT

    Posted 05-25-2018 10:31 AM

    Hello Franz,

     

    Thanks very much for your reply. I have now amended the flow so that I have one master workflow with 3 sub-workflows (SLOW, MEDIUM, FAST).  In each of the sub-workflows there is only one job (the same job in each one, using a different variable value).

    Here are the results:

     

    All jobs are using the max adaptive ERT with a tolerance of 10%

     

    RUN 1:

     

    FAST = 1
    MEDIUM = 30
    SLOW = 60

     

    20180525/114531.187 - U00045021 Based on the feedback the ERT model for RunID '2508299' will be recalculated.
    20180525/114654.391 - U00045018 Calculate estimated runtime for RunID '2508302'.
    20180525/114654.406 - U00045019 The estimated runtime for RunID '2508302' is '30' seconds.
    20180525/114654.609 - U00045018 Calculate estimated runtime for RunID '2508303'.
    20180525/114654.609 - U00045019 The estimated runtime for RunID '2508303' is '61' seconds.
    20180525/114654.812 - U00045018 Calculate estimated runtime for RunID '2508304'.
    20180525/114654.812 - U00045019 The estimated runtime for RunID '2508304' is '2' seconds.

    20180525/141148.838 - U00045020 Got feedback for the runtime estimation of RunID '2518183': estimated '2' seconds, actual runtime was '1' seconds.
    20180525/141217.463 - U00045020 Got feedback for the runtime estimation of RunID '2518181': estimated '30' seconds, actual runtime was '31' seconds.
    20180525/141247.666 - U00045020 Got feedback for the runtime estimation of RunID '2518182': estimated '61' seconds, actual runtime was '61' seconds.

    All looks ok.

     

    RUN 2:

     

    FAST = 10
    MEDIUM = 30
    SLOW = 60

    20180525/142401.355 - U00045021 Based on the feedback the ERT model for RunID '2516592' will be recalculated.
    20180525/142432.199 - U00045018 Calculate estimated runtime for RunID '2516596'.
    20180525/142432.199 - U00045019 The estimated runtime for RunID '2516596' is '33' seconds.
    20180525/142432.386 - U00045018 Calculate estimated runtime for RunID '2516597'.
    20180525/142432.386 - U00045019 The estimated runtime for RunID '2516597' is '61' seconds.
    20180525/142432.527 - U00045018 Calculate estimated runtime for RunID '2516598'.
    20180525/142432.527 - U00045019 The estimated runtime for RunID '2516598' is '2' seconds.

    20180525/142442.949 - U00045020 Got feedback for the runtime estimation of RunID '2516598': estimated '2' seconds, actual runtime was '10' seconds.
    20180525/142502.652 - U00045020 Got feedback for the runtime estimation of RunID '2516596': estimated '33' seconds, actual runtime was '30' seconds.
    20180525/142532.933 - U00045020 Got feedback for the runtime estimation of RunID '2516597': estimated '61' seconds, actual runtime was '60' seconds.

    The FAST job now takes 10 seconds against an estimate of 2 seconds, but it is not reporting an overrun.
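
    For reference, this is the check I would have expected to fire. It is only a sketch of one plausible reading of "max adaptive ERT with a tolerance of 10%"; how the deviation is really evaluated internally is not confirmed anywhere in this thread, and the function name is_overrun is made up.

```python
def is_overrun(actual_seconds, estimated_seconds, tolerance_pct=10):
    """Return True if the actual runtime exceeds the estimate plus tolerance.

    Assumption: the tolerance is a percentage added on top of the estimate.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return actual_seconds * 100 > estimated_seconds * (100 + tolerance_pct)


# Estimated/actual pairs taken from the RUN 2 feedback lines above.
run2 = {
    "2516598": (2, 10),
    "2516596": (33, 30),
    "2516597": (61, 60),
}

for run_id, (estimated, actual) in run2.items():
    print(run_id, is_overrun(actual, estimated))
```

    Under this reading only RunID 2516598 would count as an overrun.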

     

    RUN 3:


    FAST = 10
    MEDIUM = 30
    SLOW = 60

     

    20180525/143710.809 - U00045021 Based on the feedback the ERT model for RunID '2517362' will be recalculated.
    20180525/143741.731 - U00045018 Calculate estimated runtime for RunID '2514971'.
    20180525/143741.731 - U00045019 The estimated runtime for RunID '2514971' is '11' seconds.
    20180525/143741.888 - U00045018 Calculate estimated runtime for RunID '2514972'.
    20180525/143741.888 - U00045019 The estimated runtime for RunID '2514972' is '60' seconds.
    20180525/143742.060 - U00045018 Calculate estimated runtime for RunID '2514973'.
    20180525/143742.060 - U00045019 The estimated runtime for RunID '2514973' is '10' seconds.

    20180525/143752.482 - U00045020 Got feedback for the runtime estimation of RunID '2514973': estimated '10' seconds, actual runtime was '11' seconds.
    20180525/143812.153 - U00045020 Got feedback for the runtime estimation of RunID '2514971': estimated '11' seconds, actual runtime was '31' seconds.
    20180525/143842.341 - U00045020 Got feedback for the runtime estimation of RunID '2514972': estimated '60' seconds, actual runtime was '61' seconds.


    It looks like it has changed the estimate of the FAST job correctly but has also changed the MEDIUM estimate to 11 seconds, which now throws the overrun exception.

     

    For information, this is being run on Automic Web Interface 12.0.0.HF03-346.

     

    What we are trying to achieve is this:

     

    We have one master workflow that will run against different environments (ENV1 and ENV2).

     

    Within the master workflow we have sub workflows that will run the same job against separate data sets (SET1, SET2, SET3).

     

    So I guess what we would like the ERT to take into account is the environment/data set the job is running against, e.g. JOB1 needs to have its own estimate calculated for the combinations:

     

    Environment   Data Set
    ENV1          SET1
    ENV1          SET2
    ENV1          SET3
    ENV2          SET1
    ENV2          SET2
    ENV2          SET3

     

    Also when we first execute this job in a LIVE environment I am guessing the adaptive ERT will be inaccurate/not there due to lack of previous runs/data?  How do we handle this as we do not want overrun alerts being sent out without reason?

     

    And another thing ... there will be certain jobs that throughout the month will take seconds to run and then for example on the last Friday of the month will take an hour...

     

    If this is all achievable please can you let us know what settings we need to set and anything else we need to do?

     

    Hope all this makes sense.

     

    Cheers,

     

    Dan



  • 7.  Re: Dynamic Adaptive ERT

    Posted 05-28-2018 05:05 AM

    Hello Dan,

    Nice to see that you are making progress on this! First of all, I'd recommend that you execute the FAST, MEDIUM and SLOW jobs in three distinct workflows (not sub-workflows), because parent alias/name always refers to the topmost workflow container (which would be the same again in your case).

     

    Then you have to make sure to trigger enough executions that Adaptive ERT actually kicks in (25 executions by default; with fewer executions the ERT fallback method is used, which is linear regression).

     

    To solve this issue in a production environment, I recommend configuring Adaptive ERT with Fixed ERT as the fallback method in the respective objects (i.e. ERT Calculation Method: Adaptive on the Runtime tab, with a Fallback ERT value). This means that if there are not enough executions for Adaptive ERT, the fixed fallback value is used.
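
    The selection logic described here can be sketched roughly as follows. This is an illustration only, not product code: the function name choose_ert is made up, the threshold of 25 is the default mentioned above, and the median merely stands in for the adaptive model.

```python
from statistics import median


def choose_ert(runtimes, fixed_fallback, min_runs=25):
    """Pick the ERT source for a task based on how much history exists.

    runtimes:       previous runtimes in seconds for this context
    fixed_fallback: the Fallback ERT value configured on the object
    min_runs:       executions needed before the adaptive estimate is used
    """
    if len(runtimes) < min_runs:
        # Not enough history yet: use the configured fixed value.
        return "fixed", fixed_fallback
    # Stand-in for the adaptive model, which is not described in this thread.
    return "adaptive", median(runtimes)


print(choose_ert([60] * 10, fixed_fallback=90))
print(choose_ert([60] * 25, fixed_fallback=90))
```

    A generous fixed fallback keeps a newly deployed job from raising overrun alerts before enough executions have been recorded.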

     

     

    Best,

     

    Franz