Idea Details

Resolve hang issue -- CA Disk DMSAR

Last activity 05-29-2019 11:17 PM
aw-bmw
08-08-2016 09:16 AM

Due to an issue (high business impact) we raised a ticket to CA in order to solve a bug we discovered during post mortem analysis.

 

Unfortunately, after long discussions, CA development stated that this behaviour is by design and that our scenario is not supported.

 

As we're convinced that no one should encounter issues that might lead to production problems, we're going to explain the situation and solution below.

 

If you're not familiar with the technical details, just quickly jump to "5. Real life analogue" which gives you an abstract description that is very easy to understand.

 

    1. Background and Impact Summary
    2. Management Summary
    3. Technical Summary
    4. Problem Solving
    5. Real life analogue -- read this for a quick understanding

 


1. Background and Impact Summary

 

One day we had system critical problems (enqueue, “master catalog”) that caused the whole Sysplex to get into trouble. Besides "ordinary jobs" lots of DMSARs (on demand restore jobs) were also affected.

 

While one of the systems in the Sysplex became almost unresponsive (commands entered on the console to cancel jobs were no longer processed), it was still possible to kill jobs from another Sysplex member via Mainview -- which we did.

 

This saved us -- and the bad LPAR became responsive again. Without this measure we would have had to IPL at least one of the LPARs (which would have caused very high business impact).

 

So far so good, but the big trouble began:

 

As DMSAR jobs had also been removed from the system _all_ dependent jobs started hanging; users waiting for restores were blocked and lost their unsaved data.

 

Lots of jobs were still waiting for a dataset restore, but their related DMSAR no longer existed. During the next batch run all new jobs were blocked, as the old ones were still in the system, waiting forever and doing nothing…

 

This caused a real business impact as thousands of jobs needed to be investigated for orphaned DMSAR relations – because the CA Disk (client) did not terminate.

 

 

2. Management Summary

 

CA Disk is a classic client-server architecture managing disk/tape data. If a user wants to read an archived dataset, the "CA Disk client" starts a DMSAR for backend retrieval. So far, this works without hassle.

 

But as soon as the server process (DMSAR) has any problem, or you need to kill it, the dependent clients (that is, any batch job and any user session) will hang forever! They don't realise the server has died. As a consequence (both happened to us), you'll either run into serious batch problems or users will lose data.

 

If the "CA Disk client" talked(!) to the server process (instead of just "fire and forget") -- which is, by the way, common practice in every client-server constellation -- no business impact due to endless waiting would occur.

 

Please keep in mind, we're not talking about a single system with 10 users. We have several Sysplex systems with DB2 (several thousand databases), IMS, CICS, millions of batch jobs etc. -- all business critical processes. It’s clear that restarting or shutting down such a system should be avoided as far as possible.

 

Perhaps the "real life analogue" in section 5 helps to understand the problem in a more abstract way. Probably nobody would accept the behaviour there.

 

 

3. Technical Summary

 

As described above, one day we ran into system problems -- please read that section first for the general background.

 

We used the Kill command, which terminates an address space through MEMTERM. As Sysplex activity was endangered and one system was not responding (and everybody will clearly understand that an IPL of a production system during prime time should be avoided as far as possible), this is and was the only way to remove the address spaces.

 

The problem is that the "CA Disk client" does not realise when the corresponding DMSAR address space has been removed. Every time DMSAR is not responding for any reason (removed, hanging etc.) you will end up with hanging clients. And as they do not check on the serving process, they will wait endlessly.

 

They do not terminate! Never ever.

 

This becomes problematic for users (as their session is lost, along with all unsaved data) and for batch jobs (as new ones will not start -- the old job is still "running"). If you only have one or two jobs, you can handle this manually, but if you have a high load batch system, you're in big trouble.

 

CA stated that the current design does not support MEMTERM, only force/arm. Looking at the situation above, one will clearly realise that this was not an option.

 

The "real life analogue" (see below) might give a good understanding of how unacceptable that is in the real world.

 

 

4. Problem Solving

 

As so often, it's unbelievable how simple a solution can be that prevents high impact issues.

 

The following is not just our idea; it's a common solution implemented millions of times in program code across all platforms.

 

After firing the "retrieve dataset from tape" request, and while waiting for the answer from the server, the “CA Disk client” periodically polls the DMSAR server process: "are you there?" -- a classic keep alive.

 

If the keep alive is not answered, the “CA Disk client” treats the DMSAR server as dead and terminates with a corresponding error…

 

Thus the client will terminate(!) and does not hang forever if the server has any problem or simply no longer exists.

 

Needless to say, in this case:

    • The operations team can handle all jobs in error according to their instructions
    • Users will not lose any data, as their session will not be blocked endlessly by the “CA Disk client”
    • Batch jobs will not hang, and new ones will not collide with old hanging ones

 

The details of the polling algorithm are (of course) open to discussion (frequency, timeout, combination etc.), but any implementation of a keep alive mechanism is better than what we have today.
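
To make the idea concrete, here is a minimal sketch of such a keep alive loop in Python. It is not CA Disk code and does not reflect how the product communicates internally: `wait_for_restore`, `RestoreServerGone`, `is_done`, `is_alive`, `poll_interval` and `max_missed` are all hypothetical names chosen for illustration. The two callables stand in for "has the restore completed?" and "did the server answer the keep alive?".

```python
import time


class RestoreServerGone(Exception):
    """Raised when the (hypothetical) restore server stops answering keep alives."""


def wait_for_restore(is_done, is_alive, poll_interval=1.0, max_missed=3,
                     sleep=time.sleep):
    """Wait for a restore to finish, but give up if the server disappears.

    is_done:       callable returning True once the restore has completed
    is_alive:      callable returning True if the server answered the probe
    poll_interval: seconds between two probes (assumed value)
    max_missed:    consecutive unanswered probes tolerated before giving up
    """
    missed = 0
    while not is_done():
        if is_alive():
            missed = 0          # server answered, reset the counter
        else:
            missed += 1         # no answer to this probe
            if missed >= max_missed:
                # Terminate with an error instead of waiting forever
                raise RestoreServerGone(
                    f"no keep alive answer after {missed} consecutive probes"
                )
        sleep(poll_interval)
    return True
```

With something along these lines, a batch job could end with an error that operations can act on, and an interactive session could get control back, instead of both waiting for a server that no longer exists. Tolerating a few missed probes before giving up is one possible choice to avoid failing on a server that is merely slow.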

 

Simple, clean, highly efficient.

 

 

5. Real life analogue

 

Imagine, you’re visiting a butcher shop.

 

You enter the store and tell the sales person that you would like to buy some minced meat.

 

The sales person tells you that she has nothing fresh on the counter and it will need to be prepared – “Would you like to wait until it is ready?”, she asks.

 

“Yes, of course”, you answer.

 

The sales person at the counter calls the assistant in the background: “Please prepare 500 g of minced meat.”

 

And then you wait for the meat to be prepared… meanwhile the sales person grabs your arms, holds them tight and does not move (waiting for the assistant’s response).

 

In the meantime the assistant in the background has a problem. Maybe they have fainted, cut their finger, been kidnapped etc. Whatever the reason may be, they are not able to talk to the sales person anymore.

 

And the sales person is waiting… (for a response from the assistant) … still holding your arms, still not moving.

 

And you are waiting… (for the sales person)

 

What would you do in real life? Would you wait? If so, for how long?

 

Ah, of course, you would ask the sales person -- and the sales person would ask the assistant.

 

****, the sales person is not responding… (and still holding your arm so that you cannot leave the store without cutting it off!)

 

 

CA is now saying that this butcher shop "works as designed". They do not support fainted assistants (or kidnappings etc.), but(!) only friendly kidnappers who inform the sales person about the kidnapping.

 

And the moral of the story:
If the sales person were empowered to talk to the assistant and ask whether they're still busy, the customer would not need to cut off their arm.


Comments

08-08-2017 02:07 AM

Thanks for the update. That's a step in the right direction, and interactive users will be glad about it.

 

Please bear in mind, though, that non-interactive processes (batch) are the critical ones (and thus very important). Resolving a hang situation is very complicated and time consuming (as described above).

08-04-2017 12:43 PM

We are adding this request to our Wish List for CA Disk. Our plan is to first implement the ATTN key support for TSO, since that is well defined. The request to have some way to monitor DMSARs and associated jobs/users will require some research into two-way communication in order to determine when a DMSAR is not responding or has gone away, and the ARM GUI needs to be able to manage the requests when DMSARs have failed. This is a much longer term project. It has been added to our Agile backlog for review and prioritization.

Thanks,

Marjory Montgomery Principal Architect (CA)

 

06-09-2017 03:18 AM

Dear colleagues, is there any news?

Thank you!

10-13-2016 08:37 PM

Hello,

We are continuing to monitor this idea request and noticed your update. Yes, abends where the DMSAR cannot get control back could also result in a job or TSO user hanging in a wait for DMSAR that is gone. We are investigating some new ideas on server type processing and will have a better answer soon.

 

Marjory Montgomery

Principal Architect

10-11-2016 08:35 AM

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Update ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

This is just to clarify the magnitude of the program code issue:

 

I keep reading only about cancel and kill. However, I clearly stated during the discussions around this case that one will probably also run into this problem if the DMSAR terminates uncontrolled for any other reason!

 

Here's the proof:

 

13.59.50 STC02938 ---- TUESDAY,   11 OCT 2016 ----
13.59.50 STC02938  IEF695I START DMSAR    WITH JOBNAME DMSAR    IS ASSIGNED TO USER zzz
13.59.50 STC02938  $HASP373 DMSAR    STARTED
13.59.50 STC02938  IEF403I DMSAR - STARTED - TIME=13.59.50
13.59.50 STC02938  +DMS3183 WARNING: THIS PROGRAM IS RUNNING NON-APF AUTHORIZED; UNPREDICTABLE RESULTS MAY FOLLOW
13.59.50 STC02938  +DMS3754 A SUBSYSTEM RECALL REQUEST FOR DSN = ***
13.59.50 STC02938  +DMS3754 IS IN PROGRESS FOR JOBNAME = yyy
13.59.50 STC02938  +DMS3891 COMPLETE REQUEST SENT FOR DSN = ***
13.59.50 STC02938  +PET204I SYSTEM ABEND(S047) AT PSW(078D0000 00086412).
13.59.50 STC02938  +PET206I ABEND OCCURRED AT PGM(SVC107.SVC107)
13.59.50 STC02938  +*** SUPERVISOR SERVICE REGISTERS ORIGINATING PSW(078D0000 00086412).
13.59.50 STC02938  +GPR0-3... 800396D8 0004D220 00000000 7F5C1CF0
13.59.50 STC02938  +GPR4-7... 7F6191D0 009FD4F8 00000010 0004D220
13.59.50 STC02938  +GPR8-11.. 05BF20BE 05BF10BF 85BF00C0 00000001
13.59.50 STC02938  +GPR12-15. 7F5C210C 0004D8C0 7F5C1D10 05BF2DA9
13.59.50 STC02938  +PET205I ABEND OCCURRED FROM PGM(SVC012.SVC012)
13.59.50 STC02938  +*** SYNCHRONOUS EXIT REGISTERS ORIGINATING PSW(070C1000 85BF26A0).
[…]

 

In this case the DMSAR terminates uncontrolled due to missing APF authorisation. But that is beside the point. The important thing is the evidence. _Without_ using any kill etc., an uncontrolled termination of DMSAR will cause a hang/lock of the client processes (batch job waiting forever, user session hang etc. -- see above).

 

If there's a problem where DMSAR is failing quickly/often, you will have to recover thousands of clients -- have fun (if you ever find out who's affected, as no DMSAR exists anymore).

 

So the only correct solution is the one we suggested above.

09-16-2016 05:11 PM

Our documented procedure for stopping a DMSAR is to use the MVS STOP command, as it allows CA Disk to recover the user data set back to the condition it was in when the DMSAR started. Customers who had previously used an MVS CANCEL found that a partial data set may have been left on DASD and that use of that partial data set would result in data loss. The same would apply if using the MAINVIEW Kill command, as terminating the DMSAR could also leave a partial data set. We understand customers may have the need to terminate jobs and hope the STOP command could be used; however, functions like the Kill may be needed. As such we have added a story to our Agile backlog to review and prioritize the following ideas:

    • For TSO sessions, the ATTN key should be accepted to exit the WAIT routine for a restore. If the restore really takes a long time, the TSO user should be able to continue working, even if he has answered ‘WAIT’ during the start of the DMSAR task.
    • For Batch, a method to fail the job if there is a problem with the restore that leaves the batch job hanging.

 

Failing the batch job, however, will not clean up the data set being restored. That would have to be done manually.

 

Robert Hurwitz

Director Software Engineering (MF Storage)

08-30-2016 09:18 AM

08-11-2016 02:33 PM

Yes, I agree.