Is there a CA standard procedure for restarting task processing after the engine has been down for a period of time?
For example, my engine servers are taken down for OS patching for 4 hours, from 0800 to 1200.
I then restart the engine.
I want every job that should have run during that 4-hour window to run once the engine is back.
I can't find much documentation on this.
I would think it should come out of the box.
How does everyone manage this?
To my knowledge, the product does not have a ready-made solution. I would think every data center could have different recovery use cases.
Have you been able to run a forecast to review what is scheduled to run during the 4 hour window? We run a forecast, use Excel to filter it down to just those objects that will need to be recovered after the outage, and then manually recover them after the outage. But we are a small shop so we find this to be manageable. (And if you have objects that dynamically start other objects, a forecast would probably not be aware of them.)
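The Excel filtering step can also be scripted. Here is a minimal sketch, assuming the forecast has been exported to CSV with two hypothetical columns, `object_name` and `planned_start` (ISO 8601 timestamps). The real export layout depends on your version and settings, and the object names below are made up.

```python
import csv
import io
from datetime import datetime


def tasks_in_window(csv_text, window_start, window_end):
    """Return (object_name, planned_start) pairs planned inside [window_start, window_end)."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        planned = datetime.fromisoformat(row["planned_start"])
        if window_start <= planned < window_end:
            hits.append((row["object_name"], planned))
    # Oldest first, so recovery can follow the original run order.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical forecast export; column names and objects are assumptions.
SAMPLE = """object_name,planned_start
JOBP.DAILY.LOAD,2024-01-15T07:45:00
JOBS.FTP.PULL,2024-01-15T08:00:00
JOBP.REPORTS,2024-01-15T10:30:00
JOBS.CLEANUP,2024-01-15T12:00:00
"""

outage_start = datetime(2024, 1, 15, 8, 0)
outage_end = datetime(2024, 1, 15, 12, 0)
for name, planned in tasks_in_window(SAMPLE, outage_start, outage_end):
    print(planned.strftime("%H:%M"), name)
```

The window is half-open, so a task planned for exactly 12:00 is left out on the assumption that the engine is back by then. As noted above, objects started dynamically by other objects will not appear in a forecast, so the list is only a starting point.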
We also pass a lot of date parameters into our jobs via promptsets. When recovery takes place past midnight, then those date parameters have to be adjusted during recovery.
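To illustrate the midnight problem: if a date parameter defaults to "today" at run time, a task scheduled for 23:30 but recovered at 00:40 picks up the wrong day. A minimal sketch of the check, with a hypothetical helper name and date format (neither is an Automic feature), derives the value from the originally scheduled start instead:

```python
from datetime import datetime


def date_param_for_recovery(scheduled_start, recovered_at, fmt="%Y-%m-%d"):
    """Return (intended_value, needs_override) for a date parameter.

    intended_value is based on when the task was supposed to run;
    needs_override is True when a run-time default taken at recovery
    would produce a different value.
    """
    intended = scheduled_start.strftime(fmt)
    default_at_recovery = recovered_at.strftime(fmt)
    return intended, intended != default_at_recovery


# Scheduled before midnight, recovered after midnight: override needed.
value, override = date_param_for_recovery(
    datetime(2024, 1, 15, 23, 30), datetime(2024, 1, 16, 0, 40)
)
print(value, override)
```

When `needs_override` is true, the intended value is what would have to be entered in the promptset by hand during recovery.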
Another consideration: will there be CALLAPI requests trying to run against your AE while it is down, and do you care?
I like to use the 'reset task' on the schedule object monitors for any task that was supposed to execute during that time frame. By using the reset task it will take into account calendar conditions and anything else that might be set on the task properties of the schedule for that specific object.
Everything else remains out in the activities window in a stopped queue and resumes when the AE comes back online and I start up the queues.
For short maintenance windows we usually stop all activities recursively, including schedules, and resume them once the maintenance is complete. However, as you said, I don't think there is any straightforward UC4 functionality to start the objects that missed their start time.
One setup that could help with such issues: all the workflows are activated from the schedule at the same time, when the schedule loads, and wait in the activity queue for their start time.
The start time is set at the workflow level, so in such scenarios you just have to hold your activities and resume them after the maintenance. This has its own cons as well.
So maybe you have a list of very important jobs for which you can do this setup with a separate schedule, unless you think this is an elegant solution.
Of course, there will be a lot of load on the system when all the workflows are activated at the same time, which may cause a slowdown at that point.
Ultimately we end up doing the 'reset task' option that Michael_Pirson mentioned earlier. In our case, that's still a challenge because we have over 100 distinct schedules in use, though.
Here's a query we've developed to identify all the tasks that were skipped during an outage so that we'll know which schedules to look at and which tasks to reset:
-- Version 4: catch skipped tasks from a schedule due to an outage
-- * Catches tasks that were skipped in an active schedule, or in a schedule that just turned around in the last hour (configurable: the "1 hour" intervals below).
-- * Filters out tasks from active schedules that are stopped (STOP - Automatic processing has been stopped).
-- * Filters out tasks from schedules that just turned around if the active instance of that same schedule is stopped (STOP - Automatic processing has been stopped).
-- * Lists only tasks that were skipped within the last hour (configurable: the "1 hour" intervals below).
select ah_client as client,
       ah_name as schedule,
       eh_status||' (ACTIVE)' as sched_status,
       ejpp_object as skipped_task,
       substr(varchar_format(EJPP_STARTTIME + CURRENT TIMEZONE,'MM-DD-YYYY HH24:MI:SS'),1,24) as start_time,
       substr(varchar_format(EJPP_ENDTIME + CURRENT TIMEZONE,'MM-DD-YYYY HH24:MI:SS'),1,24) as end_time
from ah, ejpp, eh
where ah_otype='JSCH'
  and ejpp_status=1941
  and ah_idnr=ejpp_ah_idnr
  and eh_otype='JSCH'
  and eh_ah_idnr=ah_idnr
  and eh_status<>1563
  and (ejpp_starttime + current timezone) > current timestamp - 1 hour
union all
select ah_client as client,
       ah_name as schedule,
       ah_status||' (ended)' as sched_status,
       ajpp_object as skipped_task,
       substr(varchar_format(AJPP_STARTTIME + CURRENT TIMEZONE,'MM-DD-YYYY HH24:MI:SS'),1,24) as start_time,
       substr(varchar_format(AJPP_ENDTIME + CURRENT TIMEZONE,'MM-DD-YYYY HH24:MI:SS'),1,24) as end_time
from ah, ajpp
where ah_otype='JSCH'
  and ah_idnr in (select ah_idnr
                  from ah, eh
                  where ah_otype='JSCH'
                    and eh_otype='JSCH'
                    and ah_name=eh_name
                    and ah_client=eh_client
                    and eh_status<>1563
                    and ah_timestamp4 > current timestamp - 1 hour)
  and ah_name not in (select ah_name from ah)
  and ah_idnr=ajpp_ah_idnr
  and ajpp_status=1941
  and (ajpp_starttime + current timezone) > current timestamp - 1 hour
ORDER BY 1,5,2,4;
- We're using a DB2 database; you may need to tweak some of the date-related syntax in this query if you're using Oracle or SQL Server.
- We developed this in v9, as we're not yet up to v12. Hopefully someone here can validate if the query still works.
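With over 100 schedules, it can help to turn the query output into a per-schedule checklist before starting the 'reset task' work. Here is a minimal sketch, assuming the rows have already been fetched with whatever database driver you use and arrive as tuples in the query's column order (client, schedule, sched_status, skipped_task, start_time, end_time). The function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Matches the 'MM-DD-YYYY HH24:MI:SS' format produced by the query.
START_FORMAT = "%m-%d-%Y %H:%M:%S"


def reset_checklist(rows):
    """Group skipped tasks by (client, schedule), ordered by planned start time."""
    grouped = defaultdict(list)
    for client, schedule, _status, task, start_time, _end_time in rows:
        started = datetime.strptime(start_time.strip(), START_FORMAT)
        grouped[(client, schedule)].append((started, task))
    return {
        key: [task for _, task in sorted(items)]
        for key, items in sorted(grouped.items())
    }


# Made-up rows in the shape the query returns.
rows = [
    (100, "JSCH.FINANCE", "1550 (ACTIVE)", "JOBP.LEDGER", "01-15-2024 10:30:00", "01-15-2024 10:30:00"),
    (100, "JSCH.FINANCE", "1550 (ACTIVE)", "JOBP.INVOICES", "01-15-2024 09:00:00", "01-15-2024 09:00:00"),
    (100, "JSCH.BACKUP", "1550 (ACTIVE)", "JOBS.DB.DUMP", "01-15-2024 08:15:00", "01-15-2024 08:15:00"),
]

for (client, schedule), tasks in reset_checklist(rows).items():
    print(client, schedule, tasks)
```

Parsing the start time rather than sorting the raw strings keeps the order correct even if the window crosses a year boundary, since the query formats dates month-first.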
FYI - CA does have a solution for this. It's called the DRT or downtime recovery tool. It is a fancy little tool that basically allows you to recover from an outage and use the "reset task" on hundreds of schedules at a time. Or just one. I found this more helpful at my prior employer because we set the standard for every application / team - all schedules were reset. But at my current location, each team / application wants to do their own thing. You can still use the DRT, but it's a little more setup to get it configured. It's an add-on tool. You'll have to pay for it.