ESP Workload Automation

 View Only
  • 1.  ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 09, 2017 05:36 PM

    Hello!

     

    We have had an issue where a job seemingly ran twice in the same generation of an application.  It failed, but also completed.  Looking at the history, both instances of the job were given the same job number. I thought I would see if anyone in communities has seen a similar issue before I open a ticket.  Below I have included information from the job history as well as entries for the job from the log:

     

    JOB ZB395W01, MODEL MODEL1, OWNER NONE, 30 JOBS INDEXED, 30 MAX                                 
     JOB08792 ON RDR AT 01.02.30 THU  9NOV2017, COMPLETED, CC 0, 14.2 MINS_EXEC, 0:00 CPU           
     JOB08792 ON RDR AT 01.02.30 THU  9NOV2017, COMPLETED, CC 222, CCFAIL, 0.0 MINS_EXEC, 0:00 CPU  
     JOB08380 ON RDR AT 01.00.01 WED  8NOV2017, COMPLETED, CC 0, 20.5 MINS_EXEC, 0:00 CPU           
     JOB03176 ON RDR AT 01.00.00 TUE  7NOV2017, COMPLETED, CC 0, 13.3 MINS_EXEC, 0:00 CPU           
     JOB06944 ON RDR AT 01.00.01 MON  6NOV2017, COMPLETED, CC 0, 18.2 MINS_EXEC, 0:00 CPU           
     JOB03324 ON RDR AT 05.16.23 SUN  5NOV2017, COMPLETED, CC 0, 11.7 MINS_EXEC, 0:00 CPU           
     JOB04720 ON RDR AT 01.00.01 SAT  4NOV2017, COMPLETED, CC 0, 12.6 MINS_EXEC, 0:00 CPU           

     
     
    01.00.00 THU  9NOV17 ESP6260I ResoMGR: Job ZB395W01 in Appl FACETS.1109 Resources Satisfied  
    01.00.00 THU  9NOV17 ESP6633I APPLMGR: APPL FACETS.1109 JOB ZB395W01 READIED  
    01.00.00 THU  9NOV17 ESP6108I EVENTEX: ESPPROD.FACETS scheduled, for Job ZB395W01 in Appl FACETS.1109, Unconditional 
    01.00.30 THU  9NOV17 ESP6220I  MgrMsg: . ZB395W01/FACETS.1109/MAIN State SUBDELAY Status(Transmission error)  
    01.01.58 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 ZB395W01/FACETS.1109/MAIN State EXEC SetStart Status(Executing at bcapp602) Jobno(2556)
    01.01.58 THU  9NOV17 ESP6220I          JobLogId(C91B6FB7227BF5D0E16C0F4E5E43F92908744AF2)  User(bcapp602) Host(bcapp602)           
    01.02.30 THU  9NOV17 ESP6220I  MgrMsg: . ZB395W01/FACETS.1109/MAIN State READY Status(Agent Notified) 
    01.02.30 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 ZB395W01/FACETS.1109/MAIN State EXEC SetStart Status(Executing at bcapp602) Jobno(8792) 
    01.02.30 THU  9NOV17 ESP6220I          JobLogId(7978C65C04A2F7E5ED32340903C06BCE2E42AE32)  User(bcapp602) Host(bcapp602)  
    01.02.30 THU  9NOV17 ESP6641I APPLMGR: APPL FACETS.1109 JOB ZB395W01 Failed 
    01.02.31 THU  9NOV17 ESP6015I AgentMR: 20171109 01023116+0500 PROD_MANAGER1 bcapp602 D3696D57.2D5BBEC7/A/QUERY RESPONSE SPOOLFILE D
    01.02.31 THU  9NOV17 ESP6015I          ZB395W01/FACETS.1109/MAINStart date Thu Nov 09 01:01:58 2017------ 
    01.04.43 THU  9NOV17 ESP6612I APPLMGR: APPL FACETS.1109 JOB ZB395W01 USER STATUS UPDATED BY T72363B
    01.16.40 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 ZB395W01/FACETS.1109/MAIN State COMPLETE Cmpc(0) SetEnd                   
    01.16.40 THU  9NOV17 ESP6220I          JobLogId(C91B6FB7227BF5D0E16C0F4E5E43F92908744AF2)  User(bcapp602) Host(bcapp602)   
    01.16.40 THU  9NOV17 ESP6640I APPLMGR: APPL FACETS.1109 JOB ZB395W01 Now Complete



  • 2.  Re: ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 09, 2017 05:50 PM

    Hi Jonathan, 

    In one shop I worked there was a requeue job that ran on the system. If there was contention when the job started(for example it needed a dataset that was in use) there was a process behind the scenes to S222 the job and put it back on the input queue on hold with the same job number. This prevented these jobs that couldn't run from holding initiators so other jobs could run. When the dataset became available the job was taken off of hold.

     

    It could be something like this

    2¢ 

    Don



  • 3.  Re: ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 09, 2017 05:54 PM

    Don,

     

    This is a distributed job, and doesn't use or require a dataset.  Does that matter?



  • 4.  Re: ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 09, 2017 05:57 PM

    This is not the issue I described. It was not a S222 it was a CC 222



  • 5.  Re: ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 09, 2017 06:02 PM

    The CC 222 is the result of output sent to STDOUT from the script.  The output comes from a internally written scripting framework that tries to delete an existing log file on the server (in this case BCAPP602) before writing a new one.  If the log was being used by a process, it cannot be deleted, so the script ends (and fails) with a an error code of 222.

     

    In this case, what happened was the job started (first instance) then tried to run a second time, abending with a 222.  The developer complained that they got a ticket (for the 222), but that the job actually completed, which it did.



  • 6.  Re: ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 09, 2017 06:19 PM

    I just found another instance of this that occurred on the same server at the same time last night.  What's strange is that there is a "transmitter busy" message on both at the same time (01:30).  I wonder if this is / was an agent issue?

     

    JOB OJ400W01, MODEL MODEL1, OWNER NONE, 30 JOBS INDEXED, 30 MAX                                 
     JOB02472 ON RDR AT 01.02.30 THU  9NOV2017, COMPLETED, CC 0, 2.8 MINS_EXEC, 0:00 CPU            
     JOB02472 ON RDR AT 01.02.30 THU  9NOV2017, COMPLETED, CC 222, CCFAIL, 0.0 MINS_EXEC, 0:00 CPU  
     
     
     
    01.00.00 THU  9NOV17 ESP6633I APPLMGR: APPL OJ003DLY.2619 JOB OJ400W01 READIED  
    01.00.00 THU  9NOV17 ESP6108I EVENTEX: ESPPROD.OJJOBS scheduled, for Job OJ400W01 in Appl OJ003DLY.2619, Unconditional 
    01.00.30 THU  9NOV17 ESP6220I  MgrMsg: . OJ400W01/OJ003DLY.2619/MAIN State SUBDELAY Status(Transmission error)  
    01.01.58 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 OJ400W01/OJ003DLY.2619/MAIN State EXEC SetStart Status(Executing at bcapp602) Jobno(3880)
    01.02.30 THU  9NOV17 ESP6220I  MgrMsg: . OJ400W01/OJ003DLY.2619/MAIN State READY Status(Agent Notified)
    01.02.30 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 OJ400W01/OJ003DLY.2619/MAIN State EXEC SetStart Status(Executing at bcapp602) Jobno(2472)
    01.02.30 THU  9NOV17 ESP6220I          JobLogId(DA15BE771D9A7AEB86A8517F19E94686BD5075E6)  User(bcapp602) Host(bcapp602)           
    01.02.30 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 OJ400W01/OJ003DLY.2619/MAIN State FAILED Cmpc(222) SetEnd  
    01.02.30 THU  9NOV17 ESP6641I APPLMGR: APPL OJ003DLY.2619 JOB OJ400W01 Failed 
    01.04.55 THU  9NOV17 ESP6612I APPLMGR: APPL OJ003DLY.2619 JOB OJ400W01 USER STATUS UPDATED BY T72363B 
    01.05.19 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 OJ400W01/OJ003DLY.2619/MAIN State COMPLETE Cmpc(0) SetEnd
    01.05.19 THU  9NOV17 ESP6220I  MgrMsg: bcapp602 OJ400W01/OJ003DLY.2619/MAIN State COMPLETE Cmpc(0) SetEnd                
    01.05.19 THU  9NOV17 ESP6220I          JobLogId(A530278EF46C38D59A00A633152FCFCF42EC1424)  User(bcapp602) Host(bcapp602)  
    01.05.19 THU  9NOV17 ESP6640I APPLMGR: APPL OJ003DLY.2619 JOB OJ400W01 Now Complete
    01.05.19 THU  9NOV17 ESP6610I APPLMGR: APPL OJ003DLY.2619 JOB OJ400W01 FORCED COMPLETE BY T72363B



  • 7.  Re: ESP: Job runs and abends in the same generation with the same job number
    Best Answer

    Broadcom Employee
    Posted Nov 10, 2017 08:34 AM

    Hi Jonathan,

     

    I believe here is what happened:

    - ESP sent a request to the agent, and agent didn't reply as needed within 30 seconds, so the related job was marked with Transmission error or transmitter busy;

    - Then agent replied after 1.58 min (very close to 2 mins) that now the job was executing;

    - ESP however tried to resend it after two mins.

     

    This will cause two runs on agent server, the initial request and the resent one. But since both are pointed to the same job, the last update will take place.

     

    You can check related agent receiver.log and transmitter.log, where the timestamps are more accurate.

     

    The solution that resolved similar problems is to increase RETRYINTERVAL on related AGENT definition, the default is 2 mins. So that ESP won't resend the request before or around the same time when the connection to agent is resumed.

     

    Hope this helps,

     

    Lucy



  • 8.  Re: ESP: Job runs and abends in the same generation with the same job number

    Posted Nov 10, 2017 09:35 AM

    Lucy,

    Thank you very much for this!  We are going to monitor to see if this continues to happen, and if so, we will have the server owner make the change you suggested.

     

    Thanks again!