ChipRab, MikeLong, and I began a discussion offline a little while ago that I thought should be brought out to the Communities. Gents, please point out if I misrepresent anything that we discussed.
Basically, it's a discussion about the legitimacy of resuming a process run after an action has failed in CARA during a production deployment, e.g.
Their contention was that this is a development feature only and should never be used when running a deployment in production. Their principal points (that I remember) were:
While I see value in those arguments, upon further consideration and discussion with other colleagues, I feel there are counterarguments to be made...
… before any other deployments using that process can be run again. Effectively, we would be saying that the deployment pipeline is completely blocked until a bug fix is deployed - no workarounds available.
It seems to me that adds brittleness to the whole proposition of using CARA for deployments from an operations point of view.
However, if a user runs into a case during a deployment where there is insufficient error handling, but they have the ability to both manually pass an action *and* populate missing parameter values at run-time, they then have a workaround available to them. Instead of the pipeline being halted, it would merely be slowed. A slowed pipeline would still be motivation enough to get the missing error handling fixed.
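To make the workaround concrete, here is a minimal sketch in Python. It is not CARA's API; `Action`, `ProcessRun`, `resume`, and `manually_pass` are hypothetical names I made up to illustrate a run that pauses on a failed action and lets the operator pass that action by hand while supplying the parameter values it would have produced.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    # Hypothetical action: run() returns output parameters or raises on failure.
    name: str
    run: Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class ProcessRun:
    # Hypothetical process run that pauses instead of aborting.
    actions: List[Action]
    params: Dict[str, str] = field(default_factory=dict)
    position: int = 0
    paused_on: Optional[str] = None

    def resume(self) -> bool:
        """Run from the current position. Return True if finished, False if paused."""
        self.paused_on = None
        while self.position < len(self.actions):
            action = self.actions[self.position]
            try:
                self.params.update(action.run(self.params))
            except Exception:
                # Pause on failure: remember where we stopped, keep all state.
                self.paused_on = action.name
                return False
            self.position += 1
        return True

    def manually_pass(self, supplied: Dict[str, str]) -> None:
        """The operator did the step by hand: skip it and fill in its outputs."""
        if self.paused_on is None:
            raise RuntimeError("run is not paused")
        self.params.update(supplied)
        self.position += 1
        self.paused_on = None


def stop_website(params: Dict[str, str]) -> Dict[str, str]:
    # Simulates the failing action.
    raise RuntimeError("could not stop website")


def deploy(params: Dict[str, str]) -> Dict[str, str]:
    # Needs a parameter the failed action would normally have produced.
    return {"deployed_to": params["site_path"]}


run = ProcessRun([Action("stop_website", stop_website), Action("deploy", deploy)])
finished = run.resume()                            # pauses on stop_website
run.manually_pass({"site_path": "/srv/www/app"})   # operator supplies the value
finished = run.resume()                            # continues with deploy
```

The point of the sketch is that the run keeps its position and parameters across the pause, so the pipeline is slowed rather than blocked.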
Nice discussion you had there. I consider myself somewhere in between both sides:
Build in error handling and fail the deployment when the failing step is a crucial part of the deployment that can't be done manually and quickly. On the other hand, pause on failure when it is something small that can easily be done manually during the deployment, and then continue the flow.
So, for example, we usually pause, unless the project itself specifies that the deployment should always fail on any error. A recent example: CA RA wasn't able to stop a website. The reason is unclear (it might have been a timing issue), but this doesn't happen very often, so I logged on to the server, stopped the website manually, and proceeded with the deployment.
If I had chosen not to do that, the consequences would have been:
- deeply analyzing why the website couldn't be stopped, although the action had succeeded thousands of times before
- fixing the flow and testing it on the "latest" tag
- if the tests are successful, publishing with a new tag
But here comes the next part, which isn't mentioned above: what do you do after the publish? Do you create a new plan? Or do you adapt the existing one and just change the tags there?
Also, this deployment would need to go through all pre-prod environments again, which can take several weeks because of organizational processes. This would lead to a lot more cost, so believe me, every project would be kinda pissed if they had to pay a lot more just because one small step, like stopping a website or a Windows service, or maybe just a cleanup, failed.
I support pause on failure and manually passing steps that can be, and have been, done manually in order to continue the flow. Only in very crucial moments (e.g. a database update or the running of a silent installer) should it fail completely.
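Roughly, the rule I have in mind looks like the Python sketch below. This is only an illustration, not a CARA feature: the action type names, `CRITICAL_ACTIONS`, and `on_failure_policy` are hypothetical, and each project would define its own list of crucial steps.

```python
from enum import Enum


class OnFailure(Enum):
    PAUSE = "pause"  # operator may fix the step by hand and continue
    FAIL = "fail"    # deployment is aborted


# Assumption: steps that cannot be safely redone by hand mid-deployment.
CRITICAL_ACTIONS = {"database_update", "silent_installer"}


def on_failure_policy(action_type: str, project_always_fails: bool = False) -> OnFailure:
    """Decide what to do when an action of the given type fails."""
    if project_always_fails or action_type in CRITICAL_ACTIONS:
        return OnFailure.FAIL
    return OnFailure.PAUSE


# A small step pauses, a crucial step fails, and a project can force failure.
print(on_failure_policy("stop_website"))
print(on_failure_policy("database_update"))
print(on_failure_policy("stop_website", project_always_fails=True))
```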
MichaelGebhardt, thank you for your contribution to the discussion. I believe we are in agreement. I think it's worth pointing out that this topic leads to something I believe ChipRab has been arguing in favor of for some time, which is better built-in error handling. Such improvements, I imagine, would include:
Currently, to do something like that I have to add a "User input" action that executes on failure of a predecessor, with parameterized and customized information in it to display to the user. It directs the user on how to proceed. That action is followed by a "terminate" action if I want the predecessor failure to be fatal. The terminate action is really just a compare action that is hard-coded to fail - very hacky.
The thing I would appreciate is a third output state that really indicates an error. At the moment, when I'm checking a condition (like whether a string is "true"), I always need to remove the "pause on failure" flag, because the "failure" case can mean either an error (of whatever kind) or simply that the string is "false".
Totally agree with Michael. When you check a condition, you must remove "pause on failure", and then it's impossible to pause if an error is raised for whatever reason.
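As a rough illustration of the third output state being asked for, here is a small Python sketch. It does not reflect how CARA implements condition actions; `Outcome`, `check_condition`, and `should_pause` are hypothetical names. The idea is that a legitimate "false" result and a genuine error become distinguishable, so pausing can be tied to the error case only.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    TRUE = "true"    # condition held
    FALSE = "false"  # condition did not hold, which is a normal result
    ERROR = "error"  # the check itself could not be performed


def check_condition(read_value: Callable[[], str], expected: str = "true") -> Outcome:
    """Compare a value to the expected string, reporting errors separately."""
    try:
        value = read_value()
    except Exception:
        return Outcome.ERROR
    return Outcome.TRUE if value == expected else Outcome.FALSE


def should_pause(outcome: Outcome) -> bool:
    """Pause only on a real error, never on a plain 'false' branch."""
    return outcome is Outcome.ERROR


def unreachable_source() -> str:
    # Simulates a check that fails for a reason unrelated to the condition.
    raise ConnectionError("agent unreachable")


print(check_condition(lambda: "true"))
print(check_condition(lambda: "false"))
print(check_condition(unreachable_source))
```

With two states only, the last two cases collapse into the same "failure" output, which is why "pause on failure" has to be switched off for condition checks today.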