IDMS


Longer run times after consolidating batch to single LPAR

  • 1.  Longer run times after consolidating batch to single LPAR

    Posted Dec 05, 2017 04:09 PM

    Hello all,

    First, I'm an MVS guy very new to IDMS, so please excuse my lack of IDMS experience/knowledge.  I'm running with this since our IDMS folks are busy and I'm impatient.  Plus, I have a bad feeling I'm the cause.  :-)

     

    Traditionally, our batch IDMS jobs have been split across two LPARs; SYSA and SYSB.  Both LPARs are participants in a sysplex and share DASD.  SYSA is designated as production and SYSB is designated as development.  The split workload is leftover from a time near-forgotten when there were two physical mainframes on the floor (today there is only one).  The production IDMS database lives on SYSA and all jobs running in CV mode ran on SYSA.  All jobs running in LOCAL mode ran on SYSB.

    We've been pushing the capacity of our box, so, as an MVS'er, I wanted the ability to manipulate LPAR weighting to give production workloads more system resources as needed.  Can't do that when production is also running on the dev LPAR.  Technically I could, but I didn't want to squash the prod work on SYSB.  So I convinced the team to consolidate all production batch to a single LPAR (SYSA) which, of course, includes the previously mentioned IDMS jobs.  

    Since the change, some IDMS batch jobs (not all) are running longer and experiencing greater variation in overall run time.  Job stats show CPU and I/O are nearly the same as before.  Only the elapsed time appears to be significantly affected, resulting in a noticeably longer batch cycle (long enough to generate complaints from the app folks).  This longer elapsed time does not appear to correlate with CV mode jobs or LOCAL mode jobs.  Other than being a 'database job', I can't find any commonality.  The job schedule hasn't changed; jobs that ran concurrently before are running concurrently now (JES initiators were adjusted accordingly).

     

    Any thoughts?  Suggestions?

     

    If it's not CPU or IO, it leads me to suspect WAIT time.  But where is the wait time?  I don't notice any significant ENQs with RMF.  DASD response time is mostly sub-millisecond.
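    To frame the hunt: elapsed time that isn't CPU or I/O has to be some kind of wait.  A rough sketch of that bookkeeping (the numbers and the helper name are invented for illustration, not actual SMF fields):

```python
# Rough framing: elapsed time not explained by CPU or I/O service must
# be some form of wait (dispatch delay, ENQ, device queueing, zIIP wait).
# The numbers below are invented for illustration.

def unaccounted_wait(elapsed_s, cpu_s, io_s):
    """Elapsed seconds not explained by CPU or I/O time."""
    return max(0.0, elapsed_s - cpu_s - io_s)

# Same CPU and I/O before and after consolidation, longer elapsed time:
before = unaccounted_wait(600, 120, 300)   # 180 seconds of wait
after = unaccounted_wait(900, 120, 300)    # 480 seconds of wait
print(before, after)
```

    If CPU and I/O really are unchanged, the entire elapsed-time growth lands in that third bucket, which is why the investigation below keeps circling back to where the wait lives.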

     

    Talking with our IDMS folks, there is awareness of some index, buffer, and page size optimizations that need to be made.  However this was the case before the batch consolidation.  Could batch consolidation have exacerbated the aforementioned issues?

     

    Thanks in advance!



  • 2.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 06, 2017 07:21 AM

    Have you adjusted WLM settings since you merged the LPARs?

    Are you running zIIP at all?



  • 3.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 06, 2017 08:32 AM

    Regarding WLM, IDMS is lumped in with other online work (CICS & MQ).  We've been considering adding a new service class for IDMS by itself.

    Yes, we have a single zIIP and IDMS is utilizing it.



  • 4.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 06, 2017 09:57 AM

    How about the batch jobs in WLM?



  • 5.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 06, 2017 11:14 AM

    Batch job priorities in WLM did not change as part of the batch consolidation.  We have four (4) classifications for production batch; HOTBATCH, PRDBATHI, PRDBATMD, and PRDBATLO.  We do not break out database jobs specifically.  The priority classifications are based on job dependencies, critical path, and the overall job flow.



  • 6.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 07, 2017 05:02 AM

    In my experience, IDMS/DB-DC, as a backend system, must be in a higher WLM priority class than CICS.  Also check the z/OS-level dispatching of the IDMS central version with the DCMT command that displays Effectiveness; Effectiveness should be higher than 90%.  These two points relate to the IDMS/DB-DC region.

    Also check whether other batch jobs that do not use IDMS are impacted by the consolidation.



  • 7.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 07, 2017 07:14 AM

    Morning!

    I have an open ticket with Support and they also mentioned the Effectiveness.  That data collection is in progress.  

    We can certainly pursue the WLM priority change.  As mentioned, we were already looking into isolating IDMS into its own class.

    Other non-database batch jobs, for the most part, appear to be running better on the single LPAR.  Granted, I didn't look at every single job individually.  I took a sampling from certain time frames, and their resource usage and elapsed times were mildly to moderately improved.



  • 8.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 07, 2017 09:29 AM

    If other workload has seen improved run-times, it is definitely a wait issue impacting just IDMS.

     

    What could impact just IDMS?

     

    You stated that CPU and I/O were the same; is that total CPU (zIIP + CP)?  If zIIP and CP times have changed (more zIIP, less CP), then check IIPHONORPRIORITY.  If SYSA has NO and SYSB has YES, then there may now be more wait for zIIP, as all zIIP-eligible work on SYSA will be forced to run on the zIIP.

     

    Second thing to look at is whether the slowdown is when there is a lot of IDMS activity. If it is not, then I have no clue.

    If you have IDMS statistics going back far enough, you can look at the I/O wait times in IDMS (basically the User + System Time vs. Total Time) to check if there is an I/O bottleneck. 

    I/O channels – something I know very little about, but that does not keep me from trying to comment on it. 

    1. Are your IDMS database files on dedicated IDMS volumes/devices?
    2. Do you have dedicated channels and subchannels to those devices? 
    3. Do you have subchannels dedicated to the LPARs?

    My understanding is that subchannels and channels can be shared by LPARs, but not necessarily; so if you had subchannels dedicated to SYSB and SYSA separately, it could be that the channels serving SYSA cannot handle the now-increased I/O volume, and the solution may be to add channels or subchannels.

     

    I hope this makes sense.



  • 9.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 07, 2017 10:16 AM

    IIPHONORPRIORITY is set to YES on both LPARs.

    Yes, IDMS database files/datasets are on dedicated volumes.

    No, we do not dedicate channels to specific storage devices (or LPARs).

     

    Channel utilization/saturation wasn't flagged, but I will definitely go run some reports to see what the percent utilization was.

     



  • 10.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 07, 2017 04:10 PM

    Revisited the DASD response time report and can confirm that there were no issues.  The worst-performing database volume had an average response time of 1.9 milliseconds.  (Horrible, I know.)  :-)

    Likewise, the channel reports don't show any issues.  There were a few intervals where 'channel busy' exceeded 10%, but that was it.

     

    I'll narrow my focus to CPU / dispatch times and run some more reports tomorrow.  I also planned WLM changes to put IDMS in its own service class and bump its priority.  I'll hold off on activating the policy until CA Support provides their feedback.

     

    Thank you all for your responses so far!  If anyone can provide any other suggestions and/or insights, please feel free.  I appreciate the input.  If anything, this is quite the learning experience.  :-)



  • 11.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 07, 2017 06:15 PM

    Peter,

     

    One thing I have not seen mentioned here is memory.  You basically doubled the load on this LPAR.

     

    How much paging is going on in your SYSA LPAR after this consolidation ?

     

    Did you also steal memory from the other LPAR ?  That would be appropriate, especially if that LPAR is to be decommissioned.

     

    2 cents

     

    Claude F.



  • 12.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 09:33 AM

    Peak paging rate was .03 per second.  Peak faults and paging demand were approximately the same.

     

    I had some questions about locking and sharing; Brian Brendlinger called and we talked.  Of all the questions, we kept circling back to CPU, wait times, and ultimately latent demand.  Multi-Tasking did come up since that is something we currently do not utilize.  We will look into that one.

    I've sent off a whole slew of SMF records to our IBM business partner for analysis.  

     

    I was merely curious before, but now it is starting to bother me.   Why does the same workload on the same physical hardware not perform as well on a single LPAR as it did on two LPARs?  What is it about an LPAR, a virtual system, that would permit more throughput?

     

     

     

    Need to go think some more....



  • 13.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 09:54 AM

    You mentioned that your problem jobs are running in CV mode, so I assume they are updaters; have you checked the CV system shutdown stats for any waits?  Since memory is not a problem these days, you could put all buffers and areas in cached memory, virtually eliminating physical I/Os.  In my shutdown job (our CVs are recycled once every night) you could add some buffer displays to see what their stats are.  My 2 cents' worth..



  • 14.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 11:43 AM

    The jobs that are running longer are not all CV mode jobs.  If I implied otherwise, I apologize.

    From what I understand, the general practice here is CV mode for update jobs and LOCAL mode for read jobs.  However that is not necessarily what is in play.  Just this morning, a particular job was identified as reading only, but had a SYSCTL DD coded.

    Stats do get displayed at IDMS recycle time.  Technically, there are three (3) recycles a night (don't ask).  CA Support was asking about TSKWAIT, but we didn't have it on.  We updated the parm to start capturing TSKWAIT info.  We are going to let it go over the weekend and check Monday to see what TSKWAIT looks like across several batch windows.



  • 15.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 12:11 PM

    There are indeed reasons to run READ-ONLY run-units in CV mode – many times it is set up so that the batch process has real-time access to any in-buffer but not-yet-externalized data modifications – which would prevent the infamous phantom broken chain ….

     

    Chris Hoelscher

    Technology Architect, Database Infrastructure Services

    Technology Solution Services

     

    123 East Main Street

    Louisville, KY 40202

    Humana.com

    (502) 476-2538 or 407-7266



  • 16.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 12:15 PM

    Sorry, I don't follow; "phantom broken chain"?



  • 17.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 12:31 PM

    Phantom broken Chains:

     

    In CV mode, when an update is made to a record (for example, connecting a record to an owner via a set), there is no guarantee that all the changes to the pointers will get written to disk at the same time.  They could be in different buffers and, for many reasons, could get flushed to disk at different times.  All the data is consistent and available to any process in the CV via the buffers, but for a period the disk could contain the update to the owner's set pointers and not the member's set pointers.

     

    A local-mode job that reads these records reads only the disk.  Therefore it could read the owner's pointers (written to disk) and the member's pointers (not yet written to disk and still holding the old values) and, seeing that the pointers do not match, interpret that as a broken chain, producing error codes and horrible dumps in the local-mode job.  These are the phantom broken chains.  (E.g., if you see 0317 status codes.)
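    A toy model of this scenario might help (illustrative Python, not real IDMS structures; the page and field names are invented): the CV's buffers hold the consistent picture while the disk image is only partially flushed, and a local-mode reader sees only the disk.

```python
# Toy model of a phantom broken chain. The CV's buffers hold a consistent
# owner/member set, but only the owner page has been flushed to disk.
# Page and field names are invented for illustration.

buffers = {"owner_page":  {"first_member": "M1"},   # current, in CV storage
           "member_page": {"owner_ptr": "O1"}}      # current, in CV storage

disk = {"owner_page":  {"first_member": "M1"},      # flushed to disk
        "member_page": {"owner_ptr": None}}         # NOT flushed yet

def cv_read(page, field):
    return buffers[page][field]     # CV run-units see the buffers

def local_read(page, field):
    return disk[page][field]        # local mode reads the disk image only

# CV-mode reader: pointers match, the chain looks intact.
assert cv_read("owner_page", "first_member") == "M1"
assert cv_read("member_page", "owner_ptr") == "O1"

# Local-mode reader: the owner points at M1, but M1's owner pointer is
# stale on disk, which a local job would report as a broken chain.
looks_broken = local_read("member_page", "owner_ptr") != "O1"
print(looks_broken)
```

    Nothing is actually broken; once the member page is flushed, the disk image is consistent again, which is why the chain is "phantom" broken.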

     

    Hope that helps.

     

    Steve



  • 18.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 04:06 PM

    Good info!  Thank you!

    We haven't seen 0317 codes, but I'll definitely keep that in mind.

     

    I ran the wait % and wait time reports, and they are ugly for December 1st.  There are wait-time peaks measured in whole seconds instead of milliseconds.  Ouch.  I reran the reports for October 1st (well before the batch consolidation) and they showed high wait times (not as high, but still in seconds) on SYSB.  So it looks like a wait issue that already existed on SYSB moved to SYSA with the batch consolidation and was intensified.

    I still want to wait for the TSKWAIT info from over the weekend and CA Support's analysis of the results, but I'm thinking of moving forward with the WLM change (giving IDMS its own class and increasing its priority).  Folks are also reviewing some of the database jobs to make some scheduling tweaks; waiting to hear back on this front.



  • 19.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 10:52 AM

    If PerfMon is installed, you can activate a trace on one of the slow-running batch jobs with the PMAM task.  PMAM will show the types of waits inside IDMS/DB-DC.  That would be a good starting point for your investigation.  Another 2 cents' worth..



  • 20.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 11:44 AM

    I will definitely look into the PerfMon trace.  I didn't know it could do that.



  • 21.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 08, 2017 11:58 AM

    Hi,

     

     

    Perfmon is only for central-mode batch jobs and online transactions, of course.

     

    For local mode jobs, you may play with SYSIDMS parms like Buffer Trace and/or QSAMTrace (if QSAM is activated and helpful).

     

    Regards

     

    Philippe Jacqmin | Formula OpenSoft

    +32(0)496.540.166 | philippe.jacqmin@formulaopensoft.com | http://www.formulaopensoft.com

     

    Database Tuning & Security (Oracle, SQL Server, IDMS, …) | Development Tools | Modelling Tools

    MF Integration & Modernization (IDMS, IMS, Datacom, VSAM, ...)

    fos Web Services Requester and Provider for IDMS - A fully integrated web services solution for IDMS

    CA Certified IDMS/ADS Trainer



  • 22.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 09, 2017 12:13 PM

    You did not mention this.

    The jobs that are running slower, do they use relatively less CPU per I/O than the IDMS jobs that slowed down? 

    During the periods the jobs slow down, am I correct in assuming that you are at or close to 100% CPU usage?



  • 23.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 11, 2017 08:13 AM

    tommy.petersen wrote:

    The jobs that are running slower, do they use relatively less CPU per I/O than the IDMS jobs that slowed down? 

    If I'm understanding the question correctly: negative.  Generally, the LOCAL mode jobs that are running longer use more CPU and have more I/O relative to the CV mode jobs.

    I say 'generally' because we've only targeted the more affected, more critical, more noticeable jobs.  We have not reviewed every single job (database related or otherwise).

     

    Yes, you are correct in assuming that the CPU is running at or near capacity (98%-100%).



  • 24.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 13, 2017 08:24 AM

    WLM changes were made; IDMS was assigned its own service class with a higher velocity goal than CICS & MQ.  CICS and MQ had their goals adjusted to differentiate them from the IDMS goal.  I'm still reviewing the performance indexes for last night, but I can see there wasn't any noticeable impact to the elapsed times of the overnight jobs.  Nothing ran worse.  Nothing ran better.



  • 25.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 13, 2017 10:07 AM

    The WLM adjustment you made is as it should be, but it only helps IDMS central version activity.  A batch job accessing IDMS databases in local mode (i.e., not running through the central version) loads its own copy of the IDMS nucleus inside its own region.



  • 26.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 13, 2017 01:59 PM

    Understood.  I was hoping for some improvement even if it was for a subset of the jobs.

    The plan is to leave the change in place and continue monitoring.  More MIPS may be on the horizon.  We have a meeting early next week with our IBM BP to review a capacity study.



  • 27.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 18, 2017 09:39 AM

    Had a thought over the weekend; I think our issue is due to the initiator change.

    When we consolidated work from SYSB to SYSA, we increased the number of available initiators to account for the additional concurrent workload.  The initiators are static, not managed by WLM.  So when we moved the work, which was experiencing large waits on SYSB, to SYSA, the LPAR couldn't manage the workload appropriately, and that compounded the wait issue.  That would explain the similar CPU and I/O stats for individual jobs and the drastic increase in wait times.

    If we drop the initiator count to throttle the incoming jobs, individual jobs should complete faster.  However, this may not directly affect the overall batch window.  We could also use the job scheduler to throttle jobs.  I'll need to discuss action items with the team; see what they think.
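    The latent-demand effect can be sketched with a trivial processor-sharing model (assumptions: a single saturated CPU shared equally among identical jobs; the function name and numbers are invented for illustration): each job burns the same CPU seconds either way, but its elapsed time stretches with the number of jobs the initiators let in concurrently.

```python
# Toy processor-sharing model: one saturated CPU shared equally by N
# concurrent jobs. Each job burns the same CPU seconds either way, but
# its elapsed time stretches roughly with N. Numbers are invented.

def elapsed_at_saturation(cpu_seconds, concurrent_jobs):
    """Approximate elapsed time of one job when a saturated CPU is
    shared equally among concurrent_jobs identical jobs."""
    return cpu_seconds * concurrent_jobs

# Same job, same 300 CPU seconds, different initiator counts:
fewer_inits = elapsed_at_saturation(300, 4)
more_inits = elapsed_at_saturation(300, 8)
print(fewer_inits, more_inits)   # per-job CPU unchanged, elapsed doubles
```

    Under this model, throttling initiators trades per-job elapsed time for queue time, so at 100% CPU the total batch window need not shrink, which matches the caveat about the overall window.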

    I'll hold off on any changes until our capacity study review this Wednesday (hopefully it confirms the issue).  I think we're headed down the right path though.  Even if the cause was my miscalculation of how much work our mainframe could tackle at the same time.  :-)



  • 28.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 18, 2017 09:43 AM

    p.s. My apologies for suspecting IDMS, and thank you (community and CA Support) for a crash course in IDMS behavior.  I promise I will no longer attempt to use my DB2 knowledge when interacting with IDMS.  ;-)



  • 29.  Re: Longer run times after consolidating batch to single LPAR

    Posted Dec 18, 2017 09:44 AM

    p.p.s. Sorry for my poor grammar.  Ugh... more caffeine....