Datacom

  • 1.  STOP_LOOP_HANG_1 / _2 usage

    Posted Dec 04, 2020 04:59 AM

    Hi Community,

    we have just applied the necessary hipers to be able to use the STOP_LOOP_HANG_1 and STOP_LOOP_HANG_2 parameters.

    Has anyone had any actual experience with them already? I would love to hear your thoughts on that.

    Also, these parameters seem to do an 'overall' job, i.e. apply MUF-wide?

    Is there any way to be more specific, such as making a distinction for a certain job (name/mask)? Or would that be something for the near future, perhaps?

    Thanks.

    Peter.



  • 2.  RE: STOP_LOOP_HANG_1 / _2 usage

    Posted Dec 06, 2020 02:57 PM

    Hi Peter,

     

    We were validation partners for this feature and have implemented it in all DB and AD MUF environments now.

     

    In non-production environments we have…

    MESSAGE_TYPE_OVERRIDE W,00823  STOP_LOOP_HANG WARN/FAIL/REQABORT

    STOP_LOOP_HANG_1_WARN 5        TCB/SRB HAS NOT CALLED DISPATCHER

    STOP_LOOP_HANG_1_FAIL 9        TCB/SRB HAS NOT CALLED DISPATCHER

    STOP_LOOP_HANG_2_WARN 15       LONG-RUNNING REQUEST            

    STOP_LOOP_HANG_2_FAIL 30       LONG-RUNNING REQUEST            

     

    In Production environments we have…

    MESSAGE_TYPE_OVERRIDE W,00823  STOP_LOOP_HANG WARN/FAIL/REQABORT

    STOP_LOOP_HANG_1_WARN 5        TCB/SRB HAS NOT CALLED DISPATCHER

    STOP_LOOP_HANG_2_WARN 15       LONG-RUNNING REQUEST            

     

    Here's why we chose the above settings…

     

    We elevate the message type to W so that it is picked up by our CA OPS/MVS message rule and an email is sent immediately to the DBAs.

     

    STOP_LOOP_HANG_1 is really there to catch bugs in the CA Datacom code itself, where a single request is causing a CPU loop within the Datacom code. Thankfully these days, that is a very rare event, but it's great to have something to catch it when it does happen.

    • In non-Production environments we are happy to let Datacom cancel the MUF (STOP_LOOP_HANG_1_FAIL) after 9 minutes and thereby capture the necessary dump for analysis by support (the Shadow MUF takes over, so there is no interruption of service). Having it cancel the MUF after 9 minutes ensures that it is not left looping for too long, limiting the performance impact on other regions and/or the 4hr CPU cap.
    • In Production environments we do not code STOP_LOOP_HANG_1_FAIL, as we would need to investigate the loop and its business impact before deciding whether to cancel the MUF. For example, some years ago we had a CPU loop that was causing a single zIIP processor to spin at 100%. This occurred during the online day. Since we have multiple SMP tasks (and a second zIIP processor), we were able to continue processing until the end of the online day, and then schedule a MUF cancel/restart prior to the start of the main nightly batch.

     

    STOP_LOOP_HANG_2 is there to catch long-running application requests. These are far less likely to be a CA Datacom bug, and more likely to be an application coding error (such as a poorly-coded ad-hoc SQL request).

    • In non-Production environments these are more common, as new/changed applications are being developed and tested. After 30 minutes the request is automatically cancelled by REQABORT due to STOP_LOOP_HANG_2_FAIL 30.
    • Again, in Production we do not code STOP_LOOP_HANG_2_FAIL, for the same reasons that we don't code STOP_LOOP_HANG_1_FAIL. The DBAs are notified by email at the 15-minute warning, and would therefore be monitoring the query. If a REQABORT is warranted, it is issued by the DBAs after consultation with the offending application's owner/user.
    • We do have a couple of Production applications that regularly trigger the STOP_LOOP_HANG_2_WARN 15 alert. These are genuine long-running SQL searched-DELETE statements deleting several million rows in a single request. Since the job name and relative step number are included in the DB00823 message, it is easy to filter out these job steps in the CA OPS/MVS message rule so that the email alert is sent for these exceptional steps only if the RUN_TIME value exceeds a higher threshold (e.g. 30 mins).
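    To illustrate the filtering idea (this is a hypothetical Python sketch of the decision logic, not actual OPS/MVS rule code): job steps matching a known exception mask only alert past a higher run-time threshold, while everything else alerts at the standard 15-minute warning. The job-name mask and helper name are made up for the example; the real rule would work from the DB00823 message text.

    ```python
    import fnmatch

    # Hypothetical exception list: job-name masks whose long-running SQL is
    # expected, paired with a higher alert threshold in minutes.
    EXCEPTIONS = {
        "PRDDEL*": 30,   # illustrative mask for the mass searched-DELETE jobs
    }

    DEFAULT_THRESHOLD = 15  # matches STOP_LOOP_HANG_2_WARN 15

    def should_alert(jobname: str, run_time_minutes: int) -> bool:
        """Decide whether a DB00823 warning for this job step merits an e-mail."""
        for mask, threshold in EXCEPTIONS.items():
            if fnmatch.fnmatch(jobname, mask):
                # Known long-runner: only alert past its higher threshold
                return run_time_minutes >= threshold
        # Everything else alerts at the standard warning threshold
        return run_time_minutes >= DEFAULT_THRESHOLD
    ```
    
    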

     

    For us, the STOP_LOOP_HANG Datacom feature delivers enough for us to alert and/or automate responses exactly where appropriate.
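    Pulling the above together, here is a hypothetical Python sketch of the warn/fail escalation ladder we run (illustrative only — nothing like this ships with Datacom). The threshold values mirror the MUF startup options quoted earlier in this post; the data structure and function names are invented for the example.

    ```python
    # Warn/fail thresholds in minutes, per environment and hang type,
    # mirroring the STOP_LOOP_HANG_n_WARN/_FAIL settings shown above.
    THRESHOLDS = {
        "nonprod": {
            "HANG_1": (5, 9),    # TCB/SRB has not called dispatcher
            "HANG_2": (15, 30),  # long-running request
        },
        "prod": {
            "HANG_1": (5, None),  # warn only: DBAs investigate before cancelling
            "HANG_2": (15, None),
        },
    }

    def action_for(env: str, hang_type: str, elapsed_minutes: int) -> str:
        """Return the action that would fire at this elapsed time."""
        warn, fail = THRESHOLDS[env][hang_type]
        if fail is not None and elapsed_minutes >= fail:
            # HANG_1_FAIL cancels the MUF; HANG_2_FAIL issues a REQABORT
            return "FAIL (cancel MUF)" if hang_type == "HANG_1" else "FAIL (REQABORT)"
        if elapsed_minutes >= warn:
            return "WARN (DB00823, e-mail DBAs)"
        return "none"
    ```
    
    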

     

    Cheers,

    Owen

    ------------------------------
    Technical Consultant
    Redcentric PLC
    ------------------------------



  • 3.  RE: STOP_LOOP_HANG_1 / _2 usage

    Posted Dec 08, 2020 01:40 AM

    Hi Owen,

    Thanks very much for your quick reply.
    I now have a better understanding of the meaning/purpose of these parameters.

    So indeed they could be useful for those rare occasions, as you said.

    But unfortunately they are not meant for what we initially had in mind.

    But we will have a closer look at it anyway, even if it's only as a safeguard.

     

    Best regards

    Peter

     






  • 4.  RE: STOP_LOOP_HANG_1 / _2 usage

    Posted Dec 08, 2020 07:31 AM
    Hi Peter,

    I am curious what you "had initially in mind"?
    If you are looking to identify/auto-cancel jobs that are consuming more resource than expected (i.e. "looping" while issuing millions of requests), then your monitoring tool (e.g. CA Sysview THRESHold and/or IMODS) should cater for that (including jobname masking/filtering). 
    If you are looking to identify/auto-cancel scheduled jobs that are taking longer to execute than expected, then your scheduling package (e.g. CA 7) should cater for that.
    Traditional tools such as the Datacom Accounting Facility are still very useful for monitoring resource usage within the MUF, and also allow jobname masking/filtering.

    Beyond the safeguard aspect, we have found STOP_LOOP_HANG_2_WARN to be useful in identifying long-running requests that had previously flown under the radar (including SQL and CA Ideal/MetaCOBOL FOR-construct SELFR/SELNR requests performing full table scans due to the lack of a suitable key and/or poor selection criteria or optimization choices). It is a useful addition to the tuning toolbox.

    Cheers,
    Owen

    ------------------------------
    Technical Consultant
    Redcentric PLC
    President of the CADRE community/user group
    ------------------------------