Idea Details

Service Level Objective Maturity

Scott.Hughes
10-23-2018 04:26 PM

An SLO cannot currently be applied to a task whose deadline spans midnight.  SLOs are great because they give a single, consistent way to monitor executions, until things get complicated.  Before SLOs, the designer could choose to use post processing to monitor all jobs, or use task evaluation per task on the JOBP.  Runtime monitoring was on a different panel and could have an else action.  There were lots of discordant ways to monitor the batch.

 

SLOs were introduced and appear to be a single “pane of glass” for monitoring.  They can be set to check that an object is activated before a certain time, that an object ends by a certain time, that an object ends with a specified status, or that a specified runtime is achieved.  You can even put multiple end-time SLOs on a single task.  Great news: we can consolidate all the old ways to monitor, and we even get a process monitoring perspective and an analytics data stream from it.  In practice, two gaps get in the way:

 

  • An SLO cannot be made to span midnight
    • If a job is expected to end at 23:30 local time and would be late at 00:30 the next day, we cannot cover that directly with an SLO.
    • There is one known workaround:
      • The object must be a task in a workflow, and the workflow checkpoint feature must be used, since the checkpoint, like almost every pre-AWI date feature, has a +day setting.
      • Set the checkpoint to execute a script at d+1 h+00:30.
      • Have that script monitored by an SLO that is defined to fail.
      • Have the SLO perform its normal violation response.
    • The workaround has the following issues:
      • Poor adoption
        • Customers are confused by all the methods and objects involved in monitoring.  This runs counter to the strength of the SLO as a “single pane of glass” for monitoring.
      • No direct feed to process monitoring
        • The checkpoint script can activate a dummy script that is monitored by an SLO in order to report to process monitoring and analytics.  This becomes quite onerous to keep track of after more than 2 or 3 uses.
      • The script must activate a dummy script in the future if you want more than one late alarm (e.g. a warning at one hour late and an alarm with a ticket at two hours late).
        • If an object should finish at 23:30, with a warning at 00:30 and late at 01:30, the following must be done to keep process monitoring accurate:
          • The script that is activated by the checkpoint must also run activate_uc_object with a future start time of 01:30, passing the runid of the current task to be monitored.
          • The script that is activated at 01:30 is written to check for completion of the passed runid and take appropriate action: if the task is still running, activate a script to be monitored by an SLO; if it is complete, exit 0.
  • An SLO can only be restricted by the seven days of the week, not by a calendar event
    • When monitoring a service that has an irregular run schedule due to calendar conditions, the following is necessary:
      • Set the auto forecast settings in the client.
      • Write and schedule a script to generate the auto forecast.
      • Write a vara_sec_sqli that validates that the current task is contained in the forecast.
      • In the script that the SLO activates, insert an if block that exits if the object is not found in the forecast.
    • Drawbacks
      • Extra success and failure data on days when the job is not set to run but the SLO still evaluates
      • An excessive number of objects and scripts, making adoption difficult
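
To make the midnight case concrete, the two-stage late check from the workaround above boils down to the arithmetic below. This is only a sketch in Python, not Automic script: the names deadline and late_check, the (day offset, time) pairs and the "ok" / "warning" / "late" labels are all made up for illustration, and it assumes that ending exactly on a threshold still counts as on time.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional, Tuple

# Hypothetical representation of a threshold: (day offset, wall-clock time).
Threshold = Tuple[int, time]

def deadline(logical_date: date, day_offset: int, at: time) -> datetime:
    """Absolute deadline: the run's logical date plus a day offset, at a clock time."""
    return datetime.combine(logical_date + timedelta(days=day_offset), at)

def late_check(logical_date: date, now: datetime,
               finished_at: Optional[datetime],
               warn: Threshold, late: Threshold) -> str:
    """Classify one run as 'ok', 'warning' or 'late'.

    finished_at is None while the task is still running.
    """
    warn_at = deadline(logical_date, *warn)
    late_at = deadline(logical_date, *late)
    # Judge a finished task by its end time, a running task by the current time.
    reference = finished_at if finished_at is not None else now
    if reference > late_at:
        return "late"
    if reference > warn_at:
        return "warning"
    return "ok"

# Job belongs to 23 Oct, warning at d+1 00:30, late at d+1 01:30.
run_day = date(2018, 10, 23)
warn, late = (1, time(0, 30)), (1, time(1, 30))
still_running = late_check(run_day, datetime(2018, 10, 24, 0, 45), None, warn, late)
print(still_running)  # the task is past the warning threshold but not yet late
```

Without a day offset, 00:30 would be read as 00:30 on the 23rd and the job would look late before it even started, which is the gap the checkpoint workaround papers over.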

 

Suggestions:

  • Add the +day feature to SLOs; it already exists in almost every other area of the product: schedules, task preconditions, task post conditions, external conditions, breakpoints, etc.
  • Add calendar events to the SLO definition, or fine-tune the SLO engine to ignore jobs skipped by a calendar condition.
    • Testing "finish by hh:mm" with “any_ok” did not achieve this.
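
As a rough illustration of the second suggestion, this is the behavior being asked for: on a day the calendar condition excludes, the SLO reports nothing instead of a violation. It is a sketch in Python, not product code. The "last business day of the month" rule stands in for any calendar event, and the function names and the "skipped" / "fulfilled" / "violated" labels are assumptions made for the example.

```python
import calendar
from datetime import date, timedelta

def last_business_day(year: int, month: int) -> date:
    """Last Monday-to-Friday day of the month (public holidays are ignored here)."""
    day = date(year, month, calendar.monthrange(year, month)[1])
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day

def evaluate_slo(logical_date: date, ended_on_time: bool) -> str:
    """Evaluate only on days the calendar event applies; otherwise record nothing."""
    if logical_date != last_business_day(logical_date.year, logical_date.month):
        return "skipped"
    return "fulfilled" if ended_on_time else "violated"

print(evaluate_slo(date(2018, 9, 30), ended_on_time=False))  # a Sunday, not a run day
```

With only a day-of-week restriction, the Sunday in this example would either be evaluated every week or never, and neither matches a month-end schedule.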

 

The Service Level Governor had these features.