nice discussion you had there. I see myself somewhere between both sides:
build in error handling and fail the deployment if the failing step is a crucial part of it that can't be done manually in a short time. on the other hand, pause on failure if it is something small that can easily be done by hand during the deployment, and then continue the flow.
so, for example, we usually pause, unless the project itself defines that the deployment must always fail on any error. a recent example: CA RA wasn't able to stop a website. the reason is unclear, it might have been a timing issue, but it happens very rarely, so I logged on to the server, stopped the website manually and continued the deployment.
if you chose not to do that, the consequences would be:
- deeply analyze why the website couldn't be stopped, although the action was successful thousands of times before
- fix the flow and test it on the "latest" tag
- if the tests are successful, publish with a new tag
but here comes the next part, which isn't mentioned above: what do you do after the publish? do you create a new plan? or do you adapt the existing one and just change the tags there?
also, this deployment would need to go through all pre-prod environments again, which can take several weeks because of organizational processes. this would lead to a lot more cost, so believe me, every project would be kinda pissed if they had to pay a lot more just because one small step like stopping a website or a windows service, or maybe just a clean-up, failed.
I support pausing on failure and manually completing steps that can be (and have been) done by hand, so the flow can continue. only in very crucial moments (e.g. a database update or running a silent installer) should it fail completely.
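
to make the idea a bit more concrete, here is a rough python sketch of that policy. this is not CA RA code or its API; the names Step, run_flow, confirm_manual_fix and always_fail are made up for illustration, and it assumes a step signals failure by raising an exception.

```python
from dataclasses import dataclass
from typing import Callable, List


class DeploymentFailed(Exception):
    """Raised when a step fails and the flow must not continue."""


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # hypothetical: raises an exception on failure
    crucial: bool = False       # True: can't be done by hand quickly (e.g. db update)


def run_flow(
    steps: List[Step],
    confirm_manual_fix: Callable[[str, Exception], bool],
    always_fail: bool = False,  # project rule: fail the deployment on any error
) -> List[str]:
    """Run the steps in order and return a short log of what happened."""
    log = []
    for step in steps:
        try:
            step.action()
            log.append(f"{step.name}: ok")
        except Exception as err:
            if step.crucial or always_fail:
                # crucial step or strict project: stop the whole deployment
                raise DeploymentFailed(f"{step.name}: {err}") from err
            # small step: pause and let the operator do it by hand
            if not confirm_manual_fix(step.name, err):
                raise DeploymentFailed(f"{step.name}: {err}") from err
            log.append(f"{step.name}: done manually")
    return log


def failing_stop_website() -> None:
    # stand-in for the rare case where the website could not be stopped
    raise RuntimeError("could not stop website")


steps = [
    Step("stop website", failing_stop_website),
    Step("update database", lambda: None, crucial=True),
]

# the operator confirms the website was stopped by hand, so the flow continues
result = run_flow(steps, confirm_manual_fix=lambda name, err: True)
print(result)
```

in a real setup confirm_manual_fix would block until somebody has fixed the step on the server and resumed the flow. with always_fail=True the same failure in "stop website" ends the deployment instead of pausing.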
best regards
michael