Endevor

  • 1.  ENDEVOR IS DOWN !

    Posted May 15, 2025 09:51 AM
    Endevor is Development and Operations, but sometimes it seems to sit in a no-man's-land between test and prod.
     
    How can we be ready to recover from an unexpected loss of some of the Endevor datasets?
     
    That is potentially a significant headache for an Endevor administrator, but it is not like a PRODUCTION event that is going to trigger a site-wide switch to your DR site.
     
    The manual says "Use the Endevor Unload, Reload, and Validate utilities, also known as program C1BM5000, to mitigate physical device failure or site disasters." Perhaps.
     
    A focus on using Endevor UNLOAD for elements implies the ability to RELOAD them if recovery is required.
     
    Is there a single RELOAD job waiting for me to submit, or would I have to find a previous full backup, work out which incrementals to RELOAD on top of it, and build the JCL in the middle of the night?
     
    RELOAD is great when helping a user who deleted an element by mistake.
     
    But what is the best strategy for larger recovery scenarios that are not site-wide disasters?
     
    There may be HSM backups taken periodically, but can they take into account Endevor's need to co-ordinate updates across the base, delta, MCF, output and package datasets to maintain integrity?
     
    It seems recovery scenarios where RELOAD alone is sufficient are limited. 
     
    RELOAD could be slow, and because it does not restore the outputs or ACMQ info you would also have to regenerate all those outputs and create new footprints. On its own it is only usable for un-generated source.
     
    So it seems you must also have co-ordinated images of the output libraries taken at the same time as the UNLOAD.
     
    Even if some HSM backups are taken periodically, are they timed to give a synchronised snapshot of the base, delta, MCF, output and package datasets, and are they easy to co-ordinate in a recovery?
     
    So we probably need DFDSS backups of all datasets associated with the source, co-ordinated so that they are all taken at the same point in time.
     
    Instead of using RELOAD to get the Endevor source and MCF back as they were, it could be much faster to DFDSS dump the MCF, BASE and DELTA datasets for an Environment at the same time as the outputs of the generated source for that Environment.
    Is a DFDSS dump of all datasets together at Environment level the preferred approach?
     
    Should the dump run with datasets open, or should users be forced off first?


  • 2.  RE: ENDEVOR IS DOWN !

    Posted May 15, 2025 09:52 AM

    Are any other methods suitable for handling events that fall short of warranting a full DR? It's only an Endevor outage, after all.




  • 3.  RE: ENDEVOR IS DOWN !

    Posted May 18, 2025 10:37 PM

    I've only ever needed RELOAD (full recovery), RESTORE (individual elements), and BACKOUT (package). If restoring old elements from backup, they get placed at the entry point of Endevor so that they go through full testing again. 




  • 4.  RE: ENDEVOR IS DOWN !

    Posted May 20, 2025 01:00 PM

    Yes, it's very useful for that. I am concerned about the time it would take to recover many elements and their associated outputs.

    I just found this recent article: Tips on maintaining Endevor for Administrators

    It strongly suggests using Endevor unload/reload rather than DFDSS, but mentions the option of using DFDSS to back up everything (maybe for one environment) and UNLOAD FULL CHECKPOINT, which then allows subsequent UNLOAD INCREMENTAL runs to capture only the changes since. It also suggests DISP=OLD for housekeeping to get exclusive access, but I wonder whether that could lead to users left logged on holding up the housekeeping?

    Would there be an automated way to cancel users who are delaying housekeeping? *MIM1040I MCFBACKUP WAITING FOR RESOURCES FOR 5 MINUTES




  • 5.  RE: ENDEVOR IS DOWN !

    Broadcom Employee
    Posted May 20, 2025 02:11 PM
    The full / incremental unloads are important (I think we all agree on that)
    but yes, they require exclusive access (no one can be in the Environment /
    System being unloaded) and that's a problem for any company with offshore
    developers; i.e. it's very hard to find a time when no usage is occurring.
    We had to pivot to a mix of unloads *and* DFDSS. So we performed weekly
    full unloads every weekend and had nightly DFDSS backups; not ideal but
    Applications wouldn't agree to a nightly maintenance window. Obviously, a
    true disaster occurring mid-week would have meant the loss of recent
    changes but the DFDSS backups would at least give us a starting point.
    Also, our move processors, on a move to PROD, had a step to copy the
    current output executables (DBRMs, loadmods, etc.) to a MINUS1 library. So
    if production was NDV.PROD.LOADLIB, we also had a NDV.PROD.LOADLIB.MINUS1
    that could be steplib'd to etc. We also did this when they wanted to
    obsolete something from PROD - the processor would save all the output to
    the MINUS1 libs. With all 3, you're covered for a lot of different
    situations.

    Back to unloads... Even during the weekend maintenance window, we still had
    occasional trouble with developers being in Endevor during the scheduled
    run of the full unload. To make them comply, we added a batch job that
    would cancel TSO sessions of anyone ignoring the maintenance window. I
    think we checked for an enqueue on the element catalog. I know that might
    not be an option at some companies but it was effective for us.

    To keep the developers informed (and keep them out during maintenance), we
    added code to the Endevor CLIST (see below). The 'NEWS' dataset would
    inform them of current events and remind them to stay out during
    maintenance. The 'STOP' dataset (which the TSO session cancel job would
    allocate at the start / remove at the end) would simply exit them without
    invoking Endevor / QuickEdit. Not foolproof but worked well.

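    /* STOP dataset exists (maintenance in progress): execute it,   */
    /* then exit without invoking Endevor / QuickEdit.              */
    IF &SYSDSN('NDV.ENDEVOR.STOP') EQ &STR(OK) THEN DO
      EX 'NDV.ENDEVOR.STOP'
      EXIT CODE(0)
    END
    /* NEWS dataset exists: execute it to show current events.      */
    IF &SYSDSN('NDV.ENDEVOR.NEWS') EQ &STR(OK) THEN DO
      EX 'NDV.ENDEVOR.NEWS'
    END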



    Dave Harding
    Client Services Consultant • Mainframe Software Division
    Broadcom Software




  • 6.  RE: ENDEVOR IS DOWN !

    Broadcom Employee
    Posted May 20, 2025 02:41 PM

    In addition to what my colleague Dave has already mentioned about recovery using Full/Daily backups, Point in Time Recovery (PITR) could also be used.

    To implement PITR, it is necessary to activate the LSERV legacy component of Common Services. Additional information can be found at: https://techdocs.broadcom.com/us/en/ca-mainframe-software/devops/ca-endevor-software-change-manager/19-0/administrating/point-in-time-recovery.html



    ------------------------------
    José B. González L.
    Client Services Consultant SPAIN
    Mainframe Software Division - Broadcom
    ------------------------------



  • 7.  RE: ENDEVOR IS DOWN !

    Posted May 20, 2025 07:45 PM

    Nice summary Dave, just one or two amplifications...

    Yes, Unload (full or incremental) will hold a lock at the Environment/System level, but in practice the backup is only likely to be held up by TSO/QuickEdit users who are in an active session. Of course, if the unload starts while they are active, it will wait behind them, and because that wait is for exclusive access it could stop other users accessing Endevor.

    Our solution was to add a pre-backup step that examines the enqueues (you can see them yourself if you have SDSF using the "ENQ CTLIELEM *" command, and/or you can use the ISPF QUERYENQ service to build a table of active enqueues). We wrote a REXX to find all the active users and send them a TSO message, then wait and check again after a couple of minutes. If they are still active after, say, 30 minutes, we can either cancel them (again using the SDSF API), set a high RC to skip the backup, or just let the backup go ahead. In that case it will wait for the user to exit, who presumably is either actively busy coding a change or has just wandered away without closing their session. For us a TSO timeout normally cleans up inactive users after 4 hours. If the backup is still late, Shift Operations will have the messages from the Check step and can issue CANCEL for the users if they have still not timed out, or cancel the backup if there are now new users waiting.
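
    For anyone trying to picture the escalation logic in that pre-backup step, here is a minimal sketch. It is written in Python purely for illustration (the real step is REXX driving SDSF / QUERYENQ). The names `Holder`, `Action` and `decide`, and the idea of tracking `minutes_seen` per user, are assumptions of this sketch and not part of Endevor or of the job described above. Collecting the enqueue holders and sending the TSO messages are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    PROCEED = "proceed"          # nobody holds the enqueue: run the backup
    NOTIFY_AND_WAIT = "notify"   # message the holders, re-check in a few minutes
    ESCALATE = "escalate"        # grace period used up: site policy decides


@dataclass(frozen=True)
class Holder:
    userid: str        # hypothetical: TSO user seen holding the enqueue
    minutes_seen: int  # hypothetical: how long the check step has seen them


def decide(holders: Sequence[Holder], grace_minutes: int = 30) -> Action:
    """Return the next action for one pass of the check step."""
    if not holders:
        return Action.PROCEED
    # Any single user past the grace period is enough to escalate.
    if any(h.minutes_seen >= grace_minutes for h in holders):
        return Action.ESCALATE
    return Action.NOTIFY_AND_WAIT
```

    Each pass of the check step would call `decide` with the current holders. Only the ESCALATE outcome needs a site decision: cancel the users, set a high RC to skip the backup, or let the backup go ahead and wait.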

    Please reach out to me directly if you want more details on the Pre-Backup step, but note there is always going to be a gap between running that step and the backup for the relevant system starting, during which a new Q/E user could slip in. Broadcom have discussed options for a Quiesce: a built-in command/state that would allow a graceful shutdown or limit access for maintenance windows, but it's not live yet. If that makes sense to you, add your votes to the idea(s) in ideation or add your own suggestion.

    There is one other important wrinkle for LSERV users. LSERV provides an option to REPRO out your managed VSAM datasets while the system is active (you don't have to shut down), so you can REPRO out to flat files and then use DFDSS to dump all non-VSAM files with TOL(ENQF). That should give you a fast recovery point for all datasets, but after a restore you should run the validate/catalog checks mentioned and be prepared to restore any 'in-flight' elements.

    Note: this even works for the return trip (so long as your clusters are defined with the REUSE attribute). Even VSAM ELIBs will work with the restore (REPRO REUSE), but you need an LSERV PTF (LU12798) to ensure the ELIB is compatible with Endevor (your friendly support engineer can determine the required PTFs/service levels).

    To sum up:
    Full unload weekly, with daily incrementals for the ultimate flexibility and integrity.
    Daily DFDSS dumps (with REPRO'd LSERV datasets) for the fastest restore, backed up by VALIDATE.
     & consider including executable libraries (LOADLIBs, LISTINGS, etc.) so you don't have to re-generate.
    Plan and TEST your restore (ideally in a separate LPAR) and make sure you have the JCL to define/allocate/re-org etc. ready to hand (in Endevor and externally as a plain JCL dataset on your backup machines) so you don't have to go looking.
    ...and don't do it alone!  Log an Incident with support.
    ...and if you have a corrupted dataset/file, don't delete it, rename it!  You might be able to perform surgical restores, or fault diagnosis, later.



    ------------------------------
    Eoin O'Cleirigh
    Lead Systems Engineer @ ANZ +64273888404
    ------------------------------



  • 8.  RE: ENDEVOR IS DOWN !

    Posted May 22, 2025 06:02 AM
    Cheers Eoin. We are not using LSERV. I created a REXX to check for an ENQ on the MCF but will now consider CTLIELEM too. I think the need to have outputs in sync with source means the DFDSS dump for them needs to run every day, so, if the housekeeping window allows, I will end up with a DFDSS dump of everything every day and full recoverability to that point using DFDSS restore alone. I will keep doing UNLOADs, but I think of them more as a nice-to-have bonus for recovering individual elements when someone deletes their source.




  • 9.  RE: ENDEVOR IS DOWN !

    Posted May 20, 2025 06:24 PM

    Hi John

    Long ago and far away, I thought there was a KD or even a chapter in the book on this topic. You are correct that a combination of DFDSS and unload/reload can achieve a complete recovery. If memory serves, the recommended approach is that immediately after your full-volume DFDSS backups complete, you run an Endevor "full" unload as CHECKPOINT ONLY. The rest of the week you run incremental unloads to back up any changed elements.

    This way, if Endevor were to die in the middle of the week, you restore/recover the needed files from the full HSM backup, then apply the incremental unloads one at a time (or concatenated in order? I'm sure someone will correct me), oldest to newest.

    You'll also need to run the element catalog sync job to repair any potentially lost element pointers.
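
    The restore order described above (latest full backup first, then every later incremental, oldest to newest) can be sketched as follows. This is only an illustration in Python under stated assumptions: the `Backup` record, its `kind` values "FULL" and "INCR", and the `recovery_chain` function are hypothetical names for this sketch, not Endevor or DFDSS objects, and it says nothing about how the RELOAD itself is run.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken: datetime  # when the backup completed
    kind: str        # hypothetical: "FULL" (dump + checkpoint) or "INCR"
    dataset: str     # hypothetical backup dataset name


def recovery_chain(backups: List[Backup], failure: datetime) -> List[Backup]:
    """Pick the backups to apply, in order, to recover up to `failure`."""
    # Ignore anything taken after the failure, and sort oldest first.
    usable = sorted((b for b in backups if b.taken <= failure),
                    key=lambda b: b.taken)
    fulls = [b for b in usable if b.kind == "FULL"]
    if not fulls:
        raise ValueError("no full backup taken before the failure")
    base = fulls[-1]  # most recent full backup before the failure
    # Only incrementals newer than that full backup are relevant.
    incrementals = [b for b in usable
                    if b.kind == "INCR" and b.taken > base.taken]
    return [base] + incrementals
```

    Keeping a small inventory like this up to date (even as a flat file) is one way to avoid hunting for the right full and incremental datasets in the middle of the night.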

    As with all things Endevor, there are hundreds of combinations and opinions on the subject, depending on your site's configuration with regard to base/delta combinations.

    If you should ever find yourself in such a situation, your first call should always be to Level 1 support. Provide them with the particulars of all your backups, your base/delta configuration and the severity of the corruption. They will be able to get you back up and running.



    ------------------------------
    Karen
    ------------------------------



  • 10.  RE: ENDEVOR IS DOWN !

    Posted May 22, 2025 05:41 AM
    Thanks Karen,

    This is useful confirmation - I was also reading this: Tips on maintaining Endevor for Administrators <https://knowledge.broadcom.com/external/article/136717/tips-on-maintaining-endevor-for-administ.html>
    Had not noticed CHECKPOINT ONLY before.

    So yes, I am thinking of DFDSS dumps of the MCFs, BASE/DELTAs and OUTPUTS for each Environment, running in parallel.
    I will allocate the MCFs with DISP=OLD to prevent access for the duration of all the dumps for each environment.
    Some reorgs can go in here too, and ELB reporting.
    Then UNLOAD FULL CHECKPOINT ONLY to reset the incrementals.
    Then UNLOAD INC for changes only on each environment during the week.
    For smaller environments I might do more frequent DFDSS dumps too.

    I am thinking that if recovery is required during the week, I can restore from the DFDSS dump and apply the incrementals. But where elements were changed and generated after the dump, the reload will recover the source only, so will the outputs (e.g. LOADLIBs) be out of step?

    I will also need a REXX at the start of the dumps to check for contention on the MCFs and cancel any users who finished work but left their screen displaying elements in Endevor.

    I realise I also need the Endevor software plus the common stuff (package, ACMQ, ELMCATLG, EINDEX, etc.), and that batch admin build SCL for everything, plus the shipment rules, could be useful too.

    I am also about to move Endevor from one plex to another, so that will provide a good chance to run the recovery scenarios.