I have since circled back to the SSMv3 retry logic. In the meantime I have been working through the tailoring our site requires and testing various pieces of SSMv3 before going live with it on our initial test system. One of those tests, of course, was the retry logic.

I have some dummy tasks set up that generate WTOs from a REXX, which is called via their start or stop actions. The state I pass to the REXX determines what state the task will "end" in. To invoke the retry logic, I set the current state to FAILED. What I wanted to test was which conditions have to be met, once the retry counter has reached its maximum, before the task can be started again.

After my testing and some head scratching I came back here. I think we need to clarify when the restart counter is reset. I was under the impression that at some point the restart counter (column SSM#RSTCNT) is set back to 0 so that you can try again. Based on my testing, however, this counter is never reset (unless done manually), so even once the SSM#INITTIME has expired the task cannot be restarted. Basically, this is the same behavior as SSMv2, just more "under the covers", so to speak.

Having to go back and reset this counter manually is exactly what my coworkers complained about, especially when they had, say, fat-fingered a JCL statement in a proc, realized it quickly, fixed it, and were ready to try again within seconds of the task failing X times. All they want is to be able to start that task as soon as they are ready. Yes, I agree that if they are working with a task in a way that may make it crash they should use PASSIVE mode, but I still think the restart counter should get reset at some point. I checked, and it does not even get reset once the dynamic timeout rule has terminated. This means that even if the task eventually started successfully but took X tries to do it, the counter will prevent you from shutting the task down and then starting it back up.
This is the behavior I am seeing:
CSTATE_DSTATE_RESTARTS (MAX RESTARTS=2)
DOWN_DOWN_0
DOWN_UP_0
STARTING_UP_1
FAILED_UP
DOWN_UP
STARTING_UP_2
UP_UP
SOMETIME LATER, Could be days later.
UP_DOWN_2
STOPPING_DOWN_2
DOWN_DOWN_2
DOWN_UP_2
FAILED_UP_2 < Task will fail because the restart counter is still at the max.
If this is the behavior that is supposed to occur, then I had a couple of thoughts on how to fix it:
1. Add a reset to the dynamic timeout rule so that when it terminates it sets the counter back to zero.
2. Reset the restart counter when the task reaches the DOWN_DOWN state.
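To make the difference concrete, here is a small Python model of the trace above. It is only a sketch of the behavior I observed, not SSM code: the `Task` class, the `reset_on_down_down` switch, and the `replay` helper are all made up for illustration, and the model skips the intermediate STARTING and STOPPING states. The `restarts` field stands in for SSM#RSTCNT, and the switch represents the second idea (reset at DOWN_DOWN).

```python
from dataclasses import dataclass
from typing import List

MAX_RESTARTS = 2  # same limit as in the trace above


@dataclass
class Task:
    current: str = "DOWN"
    desired: str = "DOWN"
    restarts: int = 0                 # stands in for the SSM#RSTCNT column
    reset_on_down_down: bool = False  # hypothetical switch for idea 2

    def label(self) -> str:
        return f"{self.current}_{self.desired}_{self.restarts}"

    def request(self, desired: str) -> None:
        self.desired = desired

    def start(self, succeeds: bool) -> str:
        # Refuse the start outright when the counter is already at the max.
        if self.restarts >= MAX_RESTARTS:
            self.current = "FAILED"
            return self.label()
        self.restarts += 1
        self.current = "UP" if succeeds else "FAILED"
        return self.label()

    def stop(self) -> str:
        self.current = "DOWN"
        # Idea 2: clear the counter once the task is DOWN and meant to be DOWN.
        if self.reset_on_down_down and self.desired == "DOWN":
            self.restarts = 0
        return self.label()


def replay(reset_on_down_down: bool) -> List[str]:
    """Replay the scenario: one failed start, one good start, stop, start again."""
    task = Task(reset_on_down_down=reset_on_down_down)
    trace = []
    task.request("UP")
    trace.append(task.start(succeeds=False))
    trace.append(task.start(succeeds=True))
    task.request("DOWN")
    trace.append(task.stop())
    task.request("UP")
    trace.append(task.start(succeeds=True))
    return trace


print(replay(False))  # counter never reset: last start is refused
print(replay(True))   # counter reset at DOWN_DOWN: last start goes through
```

Without the reset, the last entry is FAILED_UP_2 even though nothing is wrong with the task any more. With the reset, the counter is back at 0 at DOWN_DOWN and the task comes up normally.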
Another question I had: what exactly determines that a task has failed? The SSMEOM rule only sets the term state, but in the following scenario something has to stop the loop.
DOWN_DOWN_0
DOWN_UP
STARTING_UP_1
EOM
DOWN_UP
STARTING_UP_2
EOM
DOWN_UP
STARTING_UP_?
Is there an underlying piece that recognizes that the DOWN_UP action is occurring and then checks the restart counter?
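To show why I think such a piece has to exist, here is a tiny Python sketch of the EOM loop above. Again, this is not SSM code: `count_starts`, `guard_enabled`, and `safety_cap` are names I made up, and the guard is only my assumption about what a counter check at DOWN_UP might look like.

```python
def count_starts(max_restarts: int, guard_enabled: bool, safety_cap: int = 10) -> int:
    """Count start attempts for a task that always ends in EOM."""
    restarts = 0
    attempts = 0
    while attempts < safety_cap:
        # DOWN_UP: the assumed guard checks the counter before starting again.
        if guard_enabled and restarts >= max_restarts:
            break  # this is where the task would be marked FAILED
        restarts += 1  # STARTING_UP_n
        attempts += 1
        # EOM: current state drops back to DOWN while desired stays UP,
        # so the loop comes around to DOWN_UP again.
    return attempts


print(count_starts(2, guard_enabled=True))   # stops after the max restarts
print(count_starts(2, guard_enabled=False))  # only the safety cap stops it
```

If nothing checks the counter at DOWN_UP, only the artificial safety cap ends the loop, which is why I assume something beyond the SSMEOM rule is doing that check.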