I am using the SSMRETRY rule in Stateman and have had several instances where tasks have reached their retry limit. I looked at the rule and found that the counter resets when the action type changes. But what if a task goes down, Stateman tries to restart it, and it keeps crashing? Eventually it hits the retry limit. At that point, aside from editing the task via 4.11.2 or updating the column in the table directly, how do you reset the counter for the retry limit?
Here are the basics of my scenario:
TASKABC is running.
User issues command: C TASKABC (not correct for them to do so but that is another topic)
TASKABC comes down, State of TASKABC becomes TERM_UP
Stateman takes action, Creates SSMRETRY, issues the S TASKABC command and waits.
Task starts but fails in 4 secs.
Stateman tries again
Task again starts but fails in 3 secs.
TASKABC has now been started 3 times, the next thing Stateman does is issue a message that the task has reached its retry limit and will not be restarted. This is the "Something is wrong with this task. Fix it and try again." message.
OK, so I fix whatever is wrong and go to start the task again. I get the same retry message.
As far as I can tell there is nothing that resets the counter even after the time limit has expired on SSMRETRY. When I look at the STCTBL table, the column with the counter is always >1 for all tasks which use the SSMRETRY logic (I think all of mine do). How would you handle this scenario?
I'm Kraig. I work in OPS/MVS Support here at CA Technologies. You are correct about the retry limit. We can offer you this advice regarding the SSMRETRY:
1) You can 'manually' go into the SSM STC resource definition and set the RESACT_COUNT column, which SSMRETRY uses as the retry counter, back to zero. I see you already know this one.
2) You can change the RESACT_COUNT column programmatically when the OPS7940O message appears by setting up an AOF MSG rule to do it.
3) You could create a command rule that performs the 'manual' reset of the retry counter column if you ever hit the limit.
It could be something similar to:
UPDATE STCTBL SET RESACT_COUNT = 0
Important: Be careful not to produce a loop; you will need to implement something to prevent it.
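One reading of the loop warning is that an automatic reset (suggestion 2) could keep re-arming the retry logic for a task that is still broken, so it restarts forever. The Python sketch below only models a possible guard: it caps how many automatic resets a task gets before an operator has to step in. It is not OPS/MVS rule code, and the class, method names, and cap value are all hypothetical.

```python
class AutoResetGuard:
    """Hypothetical guard: limits automatic resets of a task's retry counter."""

    def __init__(self, max_auto_resets: int):
        self.max_auto_resets = max_auto_resets
        self.resets_done = {}  # task name -> automatic resets performed so far

    def should_reset(self, task: str) -> bool:
        """Return True if one more automatic reset is allowed for this task."""
        done = self.resets_done.get(task, 0)
        if done >= self.max_auto_resets:
            return False  # stop here; leave the task for an operator
        self.resets_done[task] = done + 1
        return True

    def operator_cleared(self, task: str) -> None:
        """Forget the history once someone has actually fixed the task."""
        self.resets_done.pop(task, None)


guard = AutoResetGuard(max_auto_resets=2)
decisions = [guard.should_reset("TASKABC") for _ in range(3)]
```

With a cap of 2, the third automatic reset for TASKABC is refused, which is what breaks the loop in this model.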
If you are experiencing problems implementing any of these suggestions, please open a case with us to pursue it.
In the SSMRETRY code there is a check against the RESACT_TIME column to see if the current event is beyond that many seconds from the event that set the RESACT_COUNT to 1.
For example, if your table entry is defined as the following:
EVRULE("SSMRETRY &SSMTABLE &NAME,5,120,START UP");
This task will only start 5 times in 120 seconds (2 minutes). Any attempt to start the task more than 5 times in that window will not work. However, if you wait 2 minutes (from the first start attempt), the RESACT_COUNT field is reset back to 1 and the start is allowed. You can tell when the time limit has expired by waiting for the CURRENT_STATE to be set to TIMEOUT by the dynamic TOD rule that SSMRETRY creates the first time a task is started.
The only way to bypass this time limit is to reset the RESACT_COUNT column to 0, as explained in the previous comment.
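To make the counting behavior concrete, here is a small Python model of the window logic as described in this thread. It is only a sketch of the described behavior, not the actual SSMRETRY code: the class and field names are hypothetical, with count standing in for RESACT_COUNT and window for the RESACT_TIME value.

```python
from dataclasses import dataclass


@dataclass
class RetryState:
    limit: int                # starts allowed per window (5 in the example above)
    window: float             # window length in seconds (120 in the example above)
    count: int = 0            # stands in for RESACT_COUNT
    first_event: float = 0.0  # time of the event that set count to 1

    def try_start(self, now: float) -> bool:
        """Return True if a start is allowed at time `now` (in seconds)."""
        # Counter was cleared, or the current event is beyond the window:
        # this event opens a new window and the counter goes back to 1.
        if self.count == 0 or now - self.first_event > self.window:
            self.count = 1
            self.first_event = now
            return True
        # Still inside the window: allow the start only while under the limit.
        if self.count < self.limit:
            self.count += 1
            return True
        return False

    def manual_reset(self) -> None:
        """Models setting RESACT_COUNT back to 0 by hand."""
        self.count = 0


state = RetryState(limit=5, window=120)
inside_window = [state.try_start(t) for t in (0, 4, 7, 10, 13, 20)]
after_window = state.try_start(121)
```

In this model the sixth attempt at 20 seconds is refused, the attempt at 121 seconds is allowed because the window measured from the first start has passed, and calling manual_reset() allows an immediate start without waiting.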
I decided that two ideas would be useful. First, I have created a command to reset the counter manually when needed. Second, we are going to reduce our timers from 300 seconds (default) to 120 seconds. This way IF you do have to wait, you won't wait long to be able to retry. Thanks for all your help.