Release Automation

Resuming Process Run After Action Failure in Production

  • 1.  Resuming Process Run After Action Failure in Production

    Posted 07-22-2016 06:26 PM

    ChipRab, MikeLong, and I began a discussion offline a little while ago that I thought should be brought out to the Communities. Gents, please point out if I misrepresent anything that we discussed.

     

    Basically, it's a discussion about the legitimacy of resuming a process run after an action has failed in CARA during a production deployment.

    Their contention was that this is a development feature only and should never be used when running a deployment in production. Their principal points (that I remember) were:

    • In general, all actions should have their "pause on failure" setting disabled when used in production.
    • Any action that might fail should (ideally) be handled with subsequent actions that do something useful/intelligent, i.e. error handling.
    • Any action that does fail during a production deployment should be considered a catastrophic deployment failure and no attempt to intervene and allow the deployment to continue should be made.

     

    While I see value in those arguments, upon further consideration and discussion with other colleagues, I feel there are counterarguments to be made...

     

    1. Firstly, my Operations folks specifically want the option to intervene in a paused deployment if that is possible/practical to do, e.g. when a transient network issue causes a REST call action to fail.
    2. I know of no practical way to disable the "pause on failure" setting for hundreds of actions when preparing to use them in production.
    3. For transient problems such as network interruptions, resuming a deployment may be all that is needed.
    4. WRT the "resume" and "pass manually" feature, if we acknowledge that it is highly unlikely that all possible failure conditions will be accounted for during the development of our flows and processes, we must consider the circumstance where an action fails in production and we missed a case that needed error handling.
      • Consider the following scenario...
        • We have a flow that is used in the deployment of many of our company's applications.
        • A failure in that flow was caused by a bug in the flow, e.g. logic error, incorrect action parameter value, etc.
        • An Operations Engineer could intervene and manually complete the task that the failed action was attempting, e.g. disable load-balancing of a machine.
      • If we state that the "resume" or "pass manually" features should *never* be used in that scenario, it means...
        • the deployment must be aborted/cancelled.
        • the flow or process must be fixed
        • the fix must be tested
        • the fixed process must be deployed to Production

     

    … before any other deployments using that process can be run again. Effectively, we would be saying that the deployment pipeline is completely blocked until a bug fix is deployed - no workarounds available.

     

    It seems to me that adds brittleness to the whole proposition of using CARA for deployments from an operations point of view.

     

    However, if a user runs into a case during a deployment where there is insufficient error handling, but they have the ability to both manually pass an action *and* populate missing parameter values at run-time, they then have a workaround available to them. Instead of the pipeline being halted, it would merely be slowed. A slowed pipeline would still be motivation enough to get the missing error handling fixed.
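To illustrate the transient-failure case from point 3, here is a minimal Python sketch of retrying an action a few times before letting it pause or fail. This is not CA RA code: `TransientError`, `run_with_retry`, and all of the parameters are hypothetical names made up for the illustration, and the linear backoff is just one possible choice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a dropped connection."""


def run_with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on TransientError; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                # Out of retries: hand the failure back so the run can
                # pause or fail according to whatever policy is in place.
                raise
            sleep(delay_s * attempt)  # linear backoff between attempts
    raise ValueError("attempts must be at least 1")
```

Only errors explicitly marked as transient are retried; anything else propagates immediately, so a genuine logic error in a flow would still surface on the first attempt.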



  • 2.  Re: Resuming Process Run After Action Failure in Production

    Posted 07-25-2016 02:52 AM

    nice discussion you had there, I consider myself in between both sides:

     

    build in error handling and fail the deployment if the step is a crucial part of the deployment that can't quickly be done manually. on the other hand, pause on failure if it is something small that can easily be done manually during the deployment, so the flow can then continue.

     

    so, for example, we usually pause, unless the project itself specifies that we should always fail the deployment on any error. an example of a recent error is the following: CA RA wasn't able to stop a website. the reason behind it is unclear (it might have been a timing issue), but this is something that does not happen very often, so I logged on to the server, stopped the website manually and resumed the deployment.

     

    if you chose not to do that, the consequences would be:

    - deeply analyzing why the website couldn't be stopped, although the action was successful thousands of times before

    - fix the flow and test it on the "latest" tag

    - if tests are successful publish with a new tag

     

    but here comes the next part, which isn't mentioned above: what do you do after the publish? do you create a new plan? or do you adapt the existing one and just change the tags there?

     

    also, this deployment would need to go through all pre-prod environments again, which can take several weeks because of organizational processes. this would lead to a lot more cost, so believe me, every project would be kinda pissed if they needed to pay a lot more just because one small step, like stopping a website or a windows service or maybe just a clean up, failed.

     

    I support the pause on failure and manual passing of steps that can and have been done manually to continue the flow. only in very crucial moments (e.g. a database update or the running of a silent installer) it should fail completely.

     

    best regards

    michael



  • 3.  Re: Resuming Process Run After Action Failure in Production

    Posted 07-27-2016 09:26 AM

    MichaelGebhardt, thank you for your contribution to the discussion. I believe we are in agreement. I think it's worth pointing out that this topic leads to something I believe ChipRab has been arguing in favor of for some time, which is better built-in error handling. Such improvements, I can imagine, would include:

    1. adding options to the "pause on failure" flag that allow for customizable error information to be presented/logged when an action fails.
    2. adding options to specify if the failure should be considered fatal

     

    Currently, to do something like that I have to add a "User input" action that executes on failure of a predecessor, with the parameterized and customized information in it to display to the user. It directs the user on how to proceed. That action is followed by a "terminate" action if I want the predecessor failure to be fatal. The terminate action is really just a compare action that is hard coded to fail - very hacky.
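As a rough sketch of what the two proposed options might look like if they lived on the action itself, here is a small Python model. To be clear, nothing here exists in CA RA: `FailurePolicy`, `Outcome`, and `on_action_result` are hypothetical names, and the placeholder syntax in the message is simply Python's `str.format`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Outcome(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"          # wait for an operator to resume or pass manually
    TERMINATE = "terminate"  # treat the failure as fatal to the run


@dataclass(frozen=True)
class FailurePolicy:
    """Hypothetical per-action settings; not an existing CA RA option."""
    message: str         # shown/logged on failure; may contain {placeholders}
    fatal: bool = False  # True means the run should not be resumable


def on_action_result(
    succeeded: bool, policy: FailurePolicy, **context: str
) -> Tuple[Outcome, str]:
    """Map an action result to what the run does next, plus operator text."""
    if succeeded:
        return Outcome.CONTINUE, ""
    text = policy.message.format(**context)
    return (Outcome.TERMINATE if policy.fatal else Outcome.PAUSE), text
```

With something like this, the "User input" plus hard-coded-to-fail compare action pair would collapse into one declaration per action, e.g. `FailurePolicy("Stop the site on {server} by hand, then resume.")` for a step that is safe to pass manually.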



  • 4.  Re: Resuming Process Run After Action Failure in Production

    Posted 07-27-2016 09:33 AM

    the thing I would appreciate would be a third output state, which really indicates an error. because at the moment, when I'm checking a condition (like whether a string is "true"), I always need to remove the "pause on failure" flag, which means the "failure" case is either an error (of whatever kind) or simply that the string is "false"
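A minimal Python sketch of the three-state idea, to make the distinction concrete. The names `CheckResult`, `check_is_true`, and `should_pause` are hypothetical, and treating any value other than "true"/"false" as an error is an assumption made for the example, not how CA RA's compare actions behave.

```python
from enum import Enum


class CheckResult(Enum):
    TRUE = "true"
    FALSE = "false"
    ERROR = "error"  # the check itself could not be evaluated


def check_is_true(value: object) -> CheckResult:
    """Compare a value with the string "true", reporting bad input as ERROR."""
    if not isinstance(value, str):
        return CheckResult.ERROR  # e.g. the parameter was never populated
    normalized = value.strip().lower()
    if normalized == "true":
        return CheckResult.TRUE
    if normalized == "false":
        return CheckResult.FALSE
    return CheckResult.ERROR  # assumption: anything else is unexpected input


def should_pause(result: CheckResult) -> bool:
    """Pause only on a real error; a plain FALSE just takes the other branch."""
    return result is CheckResult.ERROR
```

With a separate ERROR state, "pause on failure" could stay enabled on condition checks, because a legitimate "false" would no longer be reported the same way as a broken check.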



  • 5.  Re: Resuming Process Run After Action Failure in Production

    Posted 08-01-2016 03:26 AM

    Totally agree with Michael. When you check a condition, you must remove "pause on failure", and then it's impossible to pause if an error is raised for whatever reason.