  • 1.  Reset Retries in Stateman

    Posted Jul 30, 2015 02:07 PM

    I am using the SSMRETRY rule in Stateman and have had several instances where tasks reached their retry limit. I looked at the rule and found that the counter resets when the action type changes. But what if a task goes down, Stateman tries to restart it, and it keeps crashing until it hits the retry limit? At that point, aside from editing the task via 4.11.2 or updating the column in the table directly, how do you reset the retry counter?


    Here are the basics of my scenario:


    TASKABC is running.

    User issues command: C TASKABC (not correct for them to do so but that is another topic)

    TASKABC comes down, State of TASKABC becomes TERM_UP

    Stateman takes action, Creates SSMRETRY, issues the S TASKABC command and waits.

    Task starts but fails in 4 secs.

    Stateman tries again

    Task again starts but fails in 3 secs.

    Stateman tries again

    Task again starts but fails in 3 secs.

    TASKABC has now been started 3 times, the next thing Stateman does is issue a message that the task has reached its retry limit and will not be restarted. This is the "Something is wrong with this task. Fix it and try again." message.

    Ok so I fix whatever is wrong, and go to start the task again. Same retry message.


    As far as I can tell there is nothing that resets the counter even after the time limit has expired on SSMRETRY. When I look at the STCTBL table, the column with the counter is always >1 for all tasks which use the SSMRETRY logic (I think all of mine do). How would you handle this scenario?

  • 2.  Re: Reset Retries in Stateman
    Best Answer

    Posted Jul 30, 2015 06:13 PM

    Hello Travi,


    I'm Kraig. I work in OPS/MVS Support here at CA Technologies. You are correct about the retry limit. We can offer you this advice regarding the SSMRETRY:


    1) You can manually go into the SSM STC resource definition and set the RESACT_COUNT column, which SSMRETRY uses as the retry counter, back to zero. I see you already know this.


    2) You can change the RESACT_COUNT column programmatically by setting up an AOF MSG rule that fires when message OPS7940O appears.


    3) You could create a command rule to perform the 'manual' reset of the retry counter column if you ever hit the limit.


      It could be something similar to:

          )CMD RESETCNT
          ADDRESS SQL
          UPDATE STCTBL SET RESACT_COUNT = 0


      Important: be careful not to produce a loop; you will need to implement something to prevent one.
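Note that the UPDATE in the sample rule has no WHERE clause, so as written it would reset the counter on every row of STCTBL. The sketch below illustrates a reset scoped to one task. It is only a model: an in-memory SQLite table stands in for the OPS/MVS relational table, the `NAME` key column and the `reset_retry_count` helper are assumptions for illustration, and the real rule would issue the statement through ADDRESS SQL instead.

```python
import sqlite3

# In-memory SQLite stands in for the OPS/MVS relational table (model only).
conn = sqlite3.connect(":memory:")
# NAME as the key column is an assumption; only RESACT_COUNT appears in the thread.
conn.execute("CREATE TABLE STCTBL (NAME TEXT PRIMARY KEY, RESACT_COUNT INTEGER)")
conn.executemany(
    "INSERT INTO STCTBL VALUES (?, ?)",
    [("TASKABC", 3), ("TASKXYZ", 2)],
)


def reset_retry_count(conn, task_name):
    """Reset the retry counter for one task only; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE STCTBL SET RESACT_COUNT = 0 WHERE NAME = ?", (task_name,)
    )
    conn.commit()
    return cur.rowcount


changed = reset_retry_count(conn, "TASKABC")
counts = dict(conn.execute("SELECT NAME, RESACT_COUNT FROM STCTBL"))
print(changed, counts)
```

Only TASKABC is reset here; the counter for TASKXYZ keeps its value, so a task that is still inside its own retry window is not affected.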


      If you are experiencing problems implementing any of these suggestions, please open a case with us to pursue it.
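One possible way to guard against the loop mentioned above, if the reset is automated as in option 2, is to cap how many automatic resets a task may receive in a given period and leave it alone after that. The following is a minimal sketch of that idea in Python, not OPS/MVS rule code; the `AutoResetGuard` class, its parameters, and the chosen limits are all hypothetical.

```python
from collections import defaultdict


class AutoResetGuard:
    """Allow at most `max_resets` automatic counter resets per task per `period` seconds."""

    def __init__(self, max_resets, period):
        self.max_resets = max_resets
        self.period = period
        self._history = defaultdict(list)  # task name -> times of automatic resets

    def allow_reset(self, task, now):
        # Keep only the resets that are still inside the period.
        recent = [t for t in self._history[task] if now - t < self.period]
        if len(recent) >= self.max_resets:
            self._history[task] = recent
            return False  # stop here and leave the task for a person to look at
        recent.append(now)
        self._history[task] = recent
        return True


# Hypothetical limits: two automatic resets per task per hour.
guard = AutoResetGuard(max_resets=2, period=3600)
decisions = [guard.allow_reset("TASKABC", t) for t in (0, 30, 60, 4000)]
print(decisions)  # [True, True, False, True]
```

The third request is refused because two resets already happened within the hour; by the fourth request the earlier resets have aged out, so it is allowed again.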

  • 3.  Re: Reset Retries in Stateman

    Broadcom Employee
    Posted Jul 31, 2015 08:47 AM

    In the SSMRETRY code there is a check against the RESACT_TIME column to see if the current event is beyond that many seconds from the event that set the RESACT_COUNT to 1.


    For example, if your table entry is defined as the following:



    This task will start at most 5 times in 120 seconds (2 minutes). Any attempt to start the task more than 5 times within that window will fail. However, if you wait 2 minutes (from the first start attempt), the RESACT_COUNT field is reset back to 1 and the start is allowed. You can tell when the time limit has expired by waiting for CURRENT_STATE to be set to TIMEOUT by the dynamic TOD rule that SSMRETRY creates the first time a task is started.


    The only way to bypass this time limit is to reset the RESACT_COUNT column to 0, as explained in the previous comment.
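The behavior described in this post can be modeled roughly as follows. This is a sketch of the logic as explained here, not the actual SSMRETRY code: the `RetryState` class, the `limit` field, and the strict "greater than" comparison on the window are assumptions, and times are plain seconds supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class RetryState:
    limit: int                 # hypothetical name: maximum starts allowed per window
    resact_time: float         # window length in seconds (models RESACT_TIME)
    resact_count: int = 0      # models RESACT_COUNT
    window_start: float = 0.0  # time of the event that set the count to 1

    def try_start(self, now):
        """Return True if a start is allowed at time `now` (in seconds)."""
        if self.resact_count == 0 or now - self.window_start > self.resact_time:
            # First attempt, or the window has expired: the count restarts at 1.
            self.resact_count = 1
            self.window_start = now
            return True
        if self.resact_count < self.limit:
            self.resact_count += 1
            return True
        return False  # limit reached inside the window

    def manual_reset(self):
        """Model of setting RESACT_COUNT back to 0 by hand."""
        self.resact_count = 0


state = RetryState(limit=5, resact_time=120)
results = [state.try_start(t) for t in (0, 4, 8, 12, 16, 20, 121)]
print(results)  # [True, True, True, True, True, False, True]
```

The sixth attempt at 20 seconds is refused, while the attempt at 121 seconds falls outside the 120 second window measured from the first start, so the count goes back to 1 and the start is allowed. Calling `manual_reset()` has the same effect without waiting.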


    Mike Kiehl

  • 4.  Re: Reset Retries in Stateman

    Posted Jul 31, 2015 02:14 PM

    I decided that two of the ideas would be useful. First, I have created a command to reset the counter manually when needed. Second, we are going to reduce our timers from 300 seconds (the default) to 120 seconds. This way, if you do have to wait, you won't wait long before you can retry. Thanks for all your help.