I just got done setting up OPS 12.3 on our maintenance system and am playing with SSMv3. This is a perfect time to play since we are building up this new maintenance system and adding tasks as we go. I am really liking the Policy Manager. Anyway, my question is in regard to the SSMRETRY replacement in SSMv3. I see there is still a counter and timer involved but no rule. The RESACT_* columns also are not in my resource table, though they still show on the edit screen. I did notice that the SSM#* columns that were added include the information I input for the task: the time to wait and the number of retries. In the past we have had issues with support staff restarting a task within the timer and being rejected (see my other posts about SSMRETRY). To get around this, we just reset the retry counter when needed and tried again. This worked 99% of the time. So how do we handle this same situation in SSMv3? Do we just modify the SSM#RESTARTS or SSM#RSTCNT column? If so, it is kind of buried on the edit screen that most of my co-workers are used to. I do have a command that they can use, but I am not sure many of them are aware of it despite my repeated e-mails. Many of them are much better with panels than commands.
Definitely need to take a 'TIMEOUT' here and put together the best possible solution for 1) customers that will be using only PM to maintain/insert resources, 2) customers that will continue with base SSM and any user-created tools (no PM), and 3) customers that may have one set of systems using SSMRETRY and possibly others that were configured to use PM. You will also see other differences between SSMRETRY and SSMRSTR (v3's retry REQ rule): for example, SSMRSTR does not front-end the DOWN_UP action as SSMRETRY does, which causes some execution differences. We need an overall clear and effective method that avoids the problems and confusion you pointed out. I need to discuss this subject with the development team. We will update this posting with a better picture of how to move forward with retry logic if choosing to use PM, including more detailed and specific documentation.
Glad to hear you are testing the new SSM Policy Manager.
Before I address your questions about the AOF request rule SSMRETRY I wanted to share with you this link:
After reviewing this section of the documentation and if you decide to continue using the SSM Policy Manager in conjunction with an existing SSM environment then look next at this part of the documentation:
What we are recommending (optional) is to replace the use of the AOF request rule SSMRETRY with SSM Policy data. I have run a few tests myself in our Customer Support labs, and we use new AOF request rules to implement what SSMRETRY does under SSMv2.
I also wanted to share that in an environment where SSM Policy data is used with SSMv3, a manual change to either of the two new columns, SSM#RESTARTS or SSM#RSTCNT, in the RDF SSM STCTBL table entry for a given SSM monitored resource does not produce any automation results.
Hope this answer and the documentation references help for starters.
Let this one slip away... sorry for the delay. First, a comment on the original SSMRETRY logic for those users that will not go to the Policy Manager and will continue to use its logic: within your focal SSM failure routine (such as the SSMFAIL out-of-the-box sample) that triggers when CS=FAILED DS=UP and sends an alert msg/email/text, etc., you can add logic to update RESACT_COUNT = 0 for the detected failed resource. This of course will prepare the resource to be restarted if fixed within the original max time limit from a S jobname/SSM SJ=jobname/OPSVIEW 4.11.2/some other user SSM start mechanism. Thus, no end user has to reset the column to restart once it is fixed. I'll reach out to the dev folks that can address the Policy Manager retry logic and ensure that you get the answers to your comments/concerns.
So I have since circled back to the SSMv3 Retry Logic. In the meantime I have been working through the tailoring that our site requires and testing various pieces of SSMv3 before going live with it on our initial testing system. One of the tests of course was the retry logic. I have some dummy tasks set up that generate WTO's from a REXX which is called via their start or stop actions. Based on the state I pass to the REXX this will determine what state the task will "end" in. Anyway, I was setting the current state to FAILED to invoke the retry logic. What I wanted to test was once the retry counter was reached what conditions would need to be met to be able to attempt to start the task again. After my testing and some head scratching I came back here. I think we need to clarify the reset of the restart counter. I was under the impression that at some point the restart counter (column SSM#RSTCNT) was set back to 0. This way you could try again. Based on my testing however, this counter is never reset (unless done manually) and thus even once the SSM#INITTIME has expired, the task cannot be restarted. Basically, this is the same action as SSMv2 just more "under the covers" so to speak. The need to go back and reset this counter manually is what my coworkers complained about. Especially if they say fat fingered a JCL statement on a proc, realized it quickly, fixed it and were ready to try again within seconds of the task failing X times. All they want to do is to be able to start that task as soon as they are ready. Yes, I agree, if they are working with a task in a way that it may crash, they should use PASSIVE mode but I still think the restart counter should get reset at some point. I checked and it doesn't even get reset once the dynamic timeout rule has terminated. This means that even though the task may have started successfully but took X tries to do it, the counter will prevent you from shutting the task down and then starting it back up.
This is the behavior I am seeing:
CSTATE_DSTATE_RESTARTS (MAX RESTARTS=2)
SOMETIME LATER, Could be days later.
FAILED_UP_2 < Task will fail because the restart counter is still at the max.
If this is the behavior that is supposed to occur then I had a couple thoughts to fix it. 1. You could add a reset to the dynamic timeout rule so when it terminates it sets the counter back to zero. 2. When the task reaches the DOWN_DOWN state reset the restart counter.
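To make the two suggestions above concrete, here is a minimal Python sketch of where each reset could hook in. It is only a model for discussion, not OPS/MVS code: the `Resource` class and the functions `on_state_change` and `on_timeout_rule_end` are hypothetical names, and `restart_count` and `max_restarts` merely stand in for the SSM#RSTCNT and SSM#RESTARTS columns.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for one row of the SSM resource table."""
    name: str
    current: str = "DOWN"
    desired: str = "DOWN"
    restart_count: int = 0  # models SSM#RSTCNT (assumption)
    max_restarts: int = 2   # models SSM#RESTARTS (assumption)


def on_timeout_rule_end(res: Resource) -> None:
    """Suggestion 1: reset the counter when the dynamic timeout rule terminates."""
    res.restart_count = 0


def on_state_change(res: Resource, current: str, desired: str) -> None:
    """Suggestion 2: reset the counter when the resource reaches DOWN_DOWN."""
    res.current, res.desired = current, desired
    if current == "DOWN" and desired == "DOWN":
        res.restart_count = 0


# A task that needed two restarts to come up is later shut down cleanly.
task = Resource("TASKA", current="UP", desired="UP", restart_count=2)
on_state_change(task, "DOWN", "DOWN")
print(task.restart_count)
```

With either hook in place, a task that used up its restarts would be startable again after a clean shutdown or after the timeout window closes, without anyone editing the counter by hand.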
Another question I had is what exactly is it that determines if a task has failed? The SSMEOM rule only sets the term state but if you have the following scenario something has to stop it.
Is there an underlying piece that recognizes that the DOWN_UP action is occurring and then checks the restart counter?
So I have read the sections but I don't think they adequately answer my question. I also dug into the rules to see if I could find answers there but nothing seemed to give me the answer I am looking for. I did get a better understanding of the new logic but am not sure if I understand it properly. Here is what I have discerned from my analysis of the rules and actions:
Taskname = TASKA
Max Restarts = 2
Init Time = 120s
00:00 Start TASKA DOWN_UP
00:01 SSMTMOUT STARTING_UP
00:02 TASKA FAILS FAILED_UP
00:03 SSMRSTRT UPDATE CS=DOWN RSTCNT=1 Disable Dynamic TOD
00:04 Start TASKA DOWN_UP
00:05 SSMTMOUT STARTING_UP
00:06 TASKA FAILS FAILED_UP
00:07 SSMRSTRT UPDATE CS=DOWN RSTCNT=2 Disable Dynamic TOD
00:08 Start TASKA DOWN_UP
00:09 SSMTMOUT STARTING_UP
00:10 TASKA Fails FAILED_UP
00:11 SSMRSTRT RSTCNT=RSTMAX CALL SSMNOTFI Error 502
What happens with the TOD rule for a time out when a task reaches a 502? It does not seem to be disabled if the resource ends in a FAILED_UP state. What if the resource sits in a FAILED_UP state for, in this case, over 120s? Does the Dynamic TOD rule set the CS to TIMEOUT generating a message?
00:12 User fixes failure.
01:10 User starts TASKA
What happens next? The RSTCNT was never reset, so the SSMRSTRT rule will immediately kill any attempt to start it, correct? Or am I missing something?
A reset back to zero in column SSM#RSTCNT is achieved when you change the current state of the monitored resource from either failed or timeout to down. The process then starts over again until this counter is reached one more time.
A series of dynamic TOD rules are created during each attempt to restart the task.
When the counter is met this message is posted:
OPS7944O STCTBL.CASPOOL FAILED. POLICY AUTO RESTART ATTEMPTS OF 3 EXCEEDED
After this there is no further attempt to restart the monitored resource unless the current state is set back to down.
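The counter behavior described above can be modeled with a short Python sketch. This is an illustration of the description only, not the SSMv3 engine: the class `PolicyRetryModel` and its methods are hypothetical, `restart_count` and `max_restarts` stand in for SSM#RSTCNT and SSM#RESTARTS, and the assumption that only an operator-driven change from FAILED or TIMEOUT to DOWN clears the counter (while the engine's own restart cycle does not) is inferred from this thread.

```python
class PolicyRetryModel:
    """Toy model of the policy retry counter for one monitored resource."""

    def __init__(self, name: str, max_restarts: int) -> None:
        self.name = name
        self.max_restarts = max_restarts  # models SSM#RESTARTS (assumption)
        self.restart_count = 0            # models SSM#RSTCNT (assumption)
        self.current_state = "DOWN"
        self.messages: list[str] = []

    def task_failed(self) -> bool:
        """Record a failure; return True if an automatic restart would follow."""
        self.current_state = "FAILED"
        if self.restart_count >= self.max_restarts:
            # Text modeled on the OPS7944O message quoted in this thread.
            self.messages.append(
                f"{self.name} FAILED. POLICY AUTO RESTART ATTEMPTS OF "
                f"{self.max_restarts} EXCEEDED"
            )
            return False
        self.restart_count += 1
        self.current_state = "DOWN"  # engine-driven restart keeps the count
        return True

    def operator_set_down(self) -> None:
        """Operator moves the resource from FAILED or TIMEOUT back to DOWN."""
        if self.current_state in ("FAILED", "TIMEOUT"):
            self.restart_count = 0
        self.current_state = "DOWN"


model = PolicyRetryModel("STCTBL.TASKA", max_restarts=2)
attempts = [model.task_failed() for _ in range(3)]
print(attempts, model.restart_count, model.current_state)
model.operator_set_down()
print(model.restart_count, model.current_state)
```

In this model the third failure leaves the resource in FAILED with the counter at its maximum, and nothing restarts it until the operator sets the current state back to DOWN, at which point the cycle can begin again.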
Keep in mind, Travis, that this behavior is seen when using pure SSM Policy data and SSMv3.
What would be the behavior you would like to see next and what would be the criteria we should use?
Cesar Molina wrote: A reset back to zero in column SSM#RSTCNT is achieved when you change the current state of the monitored resource from either failed or timeout to down. The process then starts over again until this counter is reached one more time.
This is the piece that I couldn't find in the documentation or the comments of the rules. It makes sense and I kind of suspected as much since there was a move to place the retry logic into the background/SSMV3 Engine.
Cesar Molina wrote: What would be the behavior you would like to see next and what would be the criteria we should use?
I am still a little unclear on the TIMEOUT portion of things. Based on my analysis of the rules, I see nowhere that the Dynamic TOD rule of the last start attempt of a task is ever disabled if the task stays in a failed state. So in my example above, if no user ever fixed the issue, or did not do so within 120s, would the Dynamic TOD rule set the current state to TIMEOUT when it executed, or is the Dynamic TOD rule disabled by the SSMv3 engine when the counter is maxed out? In other words, if the task sat in a FAILED_UP state for more than 120s, would the Dynamic TOD rule change the current state?
You are correct that the reset of the SSM#RSTCNT counter is not covered in the documentation.
I have to run more tests to simulate the timeout condition for the current state. So far I can only confirm that once the current state is set to failed, there are no further actions to attempt to restart the monitored resource, Travis.
The resetting of the RSTCNT field is probably something that should be included in the documentation. I'd actually like to see an entire explanation of the process in the documentation. This was one of the hardest things to explain in V2 to those who didn't work with SSM on a regular basis. SSMV3 has simplified the process but now has hidden some of the logic so maybe an explanation in the documentation might be handy. There may also be an opportunity here to expand the SSM section into its own guide book or section in the online world with a little more in depth explanations of how things work.
To simulate the timeout current state condition I have created a situation where a task, for another product I support at CA Technologies, issues a WTOR that needs to be replied to in order to continue processing.
This is an excerpt from the OPSLOG of all the events taking place:
OPD7914T SSMv2 AUDIT: STCTBL.CASPOOL UPDATED by MOLCE01 STATESET DESIRED_STATE=DOWN
OPD7914T SSMv2 AUDIT: STCTBL.CASPOOL UPDATED by MOLCE01 STATESET CURRENT_STATE=DOWN
OPD7914T SSMv2 AUDIT: STCTBL.CASPOOL UPDATED by MOLCE01 STATESET DESIRED_STATE=UP
OPD7902H STATEMAN ACTION FOR STCTBL.CASPOOL: EVRULE=SSMTMOUT STCTBL.CASPOOL 5 UP
OPD4320H OPSC3MN OPSD *LOCAL* AOF verb ENABLE command ENABLE *DYNAMIC.#V300001
OPD1000I SSMTMOUT: OPD3900O RULE *DYNAMIC.#V300001 FOR TOD 2017/02/07 09:41:17 NOW ENABLED
OPD7902H STATEMAN ACTION FOR STCTBL.CASPOOL: DOWN_UP MVSCMD=START CASPOOL
OPD1181H OPSC3MN (*Local*) MVS N/A OPSYSTZS START CASPOOL START CASPOOL START CASPOOL
OPD7914T SSMv2 AUDIT: STCTBL.CASPOOL UPDATED by OPSC3MN STATEMAN CURRENT_STATE=STARTING
OPD7902H STATEMAN ACTION FOR STCTBL.CASPOOL: STARTING_UP=NO ACTION FOUND
0006 ESF053 CHKPTDS1 REPLY Y OR N TO CONFIRM CHECKPOINT RECORD CHANGE
OPD3900O RULE *DYNAMIC.#V300001 FOR TOD 2017/02/07 09:41:17 NOW DISABLED
OPD7914T SSMv2 AUDIT: STCTBL.CASPOOL UPDATED by OPSC3MN *DYNAMIC.#V300001 CURRENT_STATE=TIMEOUT
OPD3916I TOD rule *DYNAMIC.#V300001 has been DISABLEd - all time criteria have expired
OPD7902H STATEMAN ACTION FOR STCTBL.CASPOOL: TIMEOUT_UP RULE=SSMNOTFY STCTBL.CASPOOL TIMEOUT UP
OPD1370H OPSC3MN X'0000' X'4000' X'0000' GRP900 300 OPD7943O STCTBL.CASPOOL FAILED TO INITIALIZE WITHIN THE POLICY ACTIVA
OPD7943O STCTBL.CASPOOL FAILED TO INITIALIZE WITHIN THE POLICY ACTIVATION TIME OF 5 SECONDS
The only difference is that during a failed current state condition multiple attempts to restart the task are driven and column SSM#RSTCNT keeps track of the failed attempts. We stop trying once the value stored in the column SSM#RESTARTS is reached. In this case, we try starting the resource only one time and SSM#RSTCNT is not updated.
In both scenarios the final outcome is the same: the task remains down in a timeout or failed current state, and no further attempts are made to bring it up.
Travis, as I mentioned in my prior update, this is all happening when the new Policy data is used in conjunction with SSMv3 and the AOF SSMRETRY REQ type rule has been disabled, so the new SSMv3 AOF REQ rules are taking control.
Thanks for the feedback on the documentation.