ESP dSeries Workload Automation

  • 1.  DE 12.2

    Posted Dec 10, 2019 11:25 AM
    ​Hi,
    We recently upgraded to DE R12.2. At the beginning of each month we run the FAILED_JOBS report for the previous month. For November we noticed it was much smaller than expected. We did expect it to be smaller, as we didn't take history, but it is still too small. Has anyone else noticed a difference in the FAILED jobs report? We are not sure whether there is an issue or not.

    When we run the failed jobs report or the SQL below, we get 379 records (Nov 9 - today). We are happy that both give the same result, because essentially they should be the same.

    However, every job failure, suberror, or overdue job triggers an alert that triggers an application, ZZAUTOCUT, and this alert was triggered 1119 times from Nov 9 to today. We use auto resubmit for many jobs and only trigger the alert if the final resubmit fails, so the alert count should be smaller than the total number of jobs that failed. We have been tracking failed jobs and alerts for 2 years now, so something changed.

    379 records - FAILED_JOBS report same results as sql Nov 9-today

    379 – number of jobs in ESP_GENERIC_JOB that have a failed state.  Because of auto resubmit we only cut tickets if final resubmit fails. This number should be less than number of times the ZZAUTOCUT application triggered

    select WA.ESP_APPLICATION.APPL_NAME, WA.ESP_APPLICATION.APPL_GEN_NO, WA.ESP_GENERIC_JOB.JOB_NAME, WA.ESP_GENERIC_JOB.STATE, WA.ESP_GENERIC_JOB.END_DATE_TIME
    from WA.ESP_APPLICATION, WA.ESP_GENERIC_JOB
    where WA.ESP_APPLICATION.APPL_ID = WA.ESP_GENERIC_JOB.APPL_ID
    and (WA.ESP_GENERIC_JOB.STATE like 'FAILED' or WA.ESP_GENERIC_JOB.STATE like 'SUBERROR' or WA.ESP_GENERIC_JOB.STATE like 'OVERDUE')

    1119 – number of times our ZZAUTOCUT auto ticket cutting job ran

    select WA.ESP_APPLICATION.APPL_NAME, WA.ESP_APPLICATION.APPL_GEN_NO, WA.ESP_GENERIC_JOB.JOB_NAME, WA.ESP_GENERIC_JOB.STATE, WA.ESP_GENERIC_JOB.END_DATE_TIME
    from WA.ESP_APPLICATION, WA.ESP_GENERIC_JOB
    where WA.ESP_APPLICATION.APPL_ID = WA.ESP_GENERIC_JOB.APPL_ID
    and (WA.ESP_APPLICATION.APPL_NAME = 'ZZAUTOCUT')

     



  • 2.  RE: DE 12.2

    Posted Dec 11, 2019 08:22 AM
    Hi Sharon,

    At least for 12.1.0 you can't get a true count of job failures. This has been raised in the link below, but I don't recall seeing anything about it being fixed in 12.2, so you may want to do some testing to validate. With 12.1, if a job fails it is logged with a state of FAILED. However, if it is then force completed, the FAILED record in the database is overwritten with a state of COMPLETE and a condition of Forced. So if you run a query looking for a state of FAILED, you won't find it, because the record has a state of COMPLETE even though the job did fail. This would explain the discrepancy you are seeing, right?

    Please let me know if you can do some testing here. I'd like to confirm whether this issue has been addressed.

    In testing I have done just now, it seems that if a job has completed and is then resubmitted and fails, the record of the job having completed is overwritten with a failure record. I was not aware of this before, but with this logging you can't even get a true count of how many jobs have run and completed.

    https://community.broadcom.com/participate/ideation-home/viewidea?IdeationKey=db526718-188b-4366-8067-fbfc84b968ac
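The overwrite behavior described above can be illustrated with a small toy model. This is only a sketch: the `job_state` table, its columns, and the one-row-per-job assumption are hypothetical and are not the actual DE schema. It replays a sequence of state changes into an in-memory SQLite table that keeps only the latest state per job, then counts FAILED rows the way a failed-jobs query would.

```python
import sqlite3


def count_failed_after(events):
    """Replay (job_name, state) events into a table that keeps one row per job.

    Hypothetical model: the table name and columns are illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job_state (job_name TEXT PRIMARY KEY, state TEXT)")
    for job_name, state in events:
        # INSERT OR REPLACE mimics the overwrite: the latest state wins.
        conn.execute(
            "INSERT OR REPLACE INTO job_state (job_name, state) VALUES (?, ?)",
            (job_name, state),
        )
    (failed,) = conn.execute(
        "SELECT COUNT(*) FROM job_state WHERE state = 'FAILED'"
    ).fetchone()
    conn.close()
    return failed


# JOB_A fails and is then force completed; JOB_B fails and stays failed.
events = [("JOB_A", "FAILED"), ("JOB_A", "COMPLETE"), ("JOB_B", "FAILED")]
actual_failures = sum(1 for _, state in events if state == "FAILED")

print("failures that happened:", actual_failures)
print("failures the query finds:", count_failed_after(events))
```

In this model two failures occur but the query finds only one, which is the same kind of undercount seen when comparing the report against the alert count.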


  • 3.  RE: DE 12.2

    Posted Dec 11, 2019 10:39 AM
    Thanks Travis! I will look into it more.


  • 4.  RE: DE 12.2

    Posted Dec 11, 2019 11:54 AM
    You didn't have this issue prior to 12.1, correct?


  • 5.  RE: DE 12.2

    Posted Dec 11, 2019 12:00 PM
    We upgraded from 11.3 to 12.1 back in Nov last year. I don't recall having the issue before, but I don't know whether it simply wasn't there in 11.3 or was just something I did not come across until 12.1.


  • 6.  RE: DE 12.2

    Posted Dec 11, 2019 12:11 PM
    We just upgraded from 12.0 SP2 to 12.2 on Nov 9, 2019. We tracked failures and runs monthly for years without issue and only noticed this after the upgrade. Something must have changed in 12.1. I did review the release notes for both 12.1 and 12.2 but didn't find anything.

    We tracked failures/runs with the failed jobs report.

    We also trigger an alert for all job failures/suberror/overdue. It only triggers for the final failure, and then we automatically cut tickets, so we know the number of failures that cut a ticket. We used the FAILED jobs report for actual failures so we could see how many tickets didn't get cut because we utilize auto resubmit.

    Will have to think on this to figure out if we just track jobs run vs failures/suberror/overdue that trigger the alert/application/job.


  • 7.  RE: DE 12.2
    Best Answer

    Broadcom Employee
    Posted Dec 12, 2019 11:18 AM
    Hi,

    I suggest raising a ticket for this difference in behavior, as it requires further investigation.

    Ravi Kiran


  • 8.  RE: DE 12.2

    Posted Dec 19, 2019 10:29 AM

    Hi Kiran

    I have opened a ticket about failed jobs report. 

     

    20153766  Failed jobs report

    Hi, 

    On the first workday of each month we run a report for the previous month's job failures. Since we upgraded to 12.2 we have noticed a significant decrease in the number of failed jobs. Because we trigger an alert/application/job that automatically creates a service center task, we know we have more job failures than the failed jobs report is showing. Travis Anderson (another DE user), Kiran Kunduri from Broadcom/CA, and I had a call about this Monday. I have since been tracking it more closely by comparing what is in the db to the service center tasks. We noticed a change going from 12.0 SP2 to R12.2. Travis said they noticed the change going from R11.3 to R12.1.

    What I found was:

    Jobs that fail and are then force completed have their state changed to complete, so the job doesn't show up as failed.

    For any job with auto resubmit 1x and notify on last retry failure, the first instance doesn't show up on the failed jobs report.

    What we have been doing for years now is running the failed jobs report for the previous month to get the total number of failures and the total number of jobs run. Then we would see how many times our alert/application/job ran. The difference between the two is how many tickets were not created because we utilize auto resubmit. I am not sure whether one or both of my findings are contributing to the difference we are now seeing in our reporting. I have attached several documents. Unfortunately I have no previous reporting that shows submission instance.
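The monthly calculation described above can be sketched as a small helper. The function name and the sanity check are assumptions added for illustration; the only inputs are the two counts mentioned in the thread (failures from the report, and runs of the alert application).

```python
def tickets_avoided(failed_report_count, alert_run_count):
    """Failures that did not cut a ticket because auto resubmit recovered them.

    Hypothetical helper: every alert run corresponds to a final failure, so the
    report count should never be lower than the alert count.
    """
    diff = failed_report_count - alert_run_count
    if diff < 0:
        raise ValueError(
            f"report shows {failed_report_count} failures but the alert ran "
            f"{alert_run_count} times; the report is undercounting"
        )
    return diff


# Made-up numbers for a month where the report is consistent.
print(tickets_avoided(1500, 1119))

# The numbers from this thread (379 vs 1119) trip the sanity check.
try:
    tickets_avoided(379, 1119)
except ValueError as err:
    print("inconsistent:", err)
```

A negative difference is the quickest signal that failure records are missing from the database, since it cannot occur when every failure is logged.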

     

     

     




    Attachment(s)

    xlsx
    failed job tracking.xlsx   14 KB 1 version
    xlsx
    sql failed jobs.xlsx   13 KB 1 version


  • 9.  RE: DE 12.2

    Broadcom Employee
    Posted Dec 20, 2019 12:26 AM
    Hi,

    OK. Thank you. Support will work on this issue and update you.

    -Ravi kiran


  • 10.  RE: DE 12.2

    Posted Jan 08, 2020 08:44 AM
    Support worked with the development team and has identified this as a defect and will work on fixing this in the next release.

    thanks
    Sharon​​


  • 11.  RE: DE 12.2

    Posted Jan 09, 2020 10:47 AM
    Hi Support,

    Is there a timeline for when a fix will be in place?


  • 12.  RE: DE 12.2

    Broadcom Employee
    Posted Jan 10, 2020 01:31 AM
    Hi,

    Currently the team is working on it. Will update you once it is ready.

    Ravi Kiran


  • 13.  RE: DE 12.2

    Posted Jan 22, 2020 06:07 PM
    Hi Kiran, will this update fix things with auto resubmit jobs? Right now with 12.1, if a job is set up to auto resubmit on failure, the failure record gets overwritten each time. So for example, a job fails, auto resubmits, and fails again. There will only be 1 record of the failure in the db. However, shouldn't there really be 2 records, 1 record for each and every failure?? I would think a failure record should never be overwritten.

    I know we discussed other things around this logging of failures but not sure if this was brought up. I just came across it today myself, is it something you are aware of?


  • 14.  RE: DE 12.2

    Broadcom Employee
    Posted Jan 24, 2020 12:25 AM
    Hi,

    We will look into this problem as well and update you. 
    Thank you for bringing it to our notice.

    Thanks and regards,
    Ravi Kiran


  • 15.  RE: DE 12.2

    Posted Jan 28, 2020 03:04 PM
    Hi Kiran, is there any timeline on when the next release will come out that will have this change? We are being pressed by upper management to ask this to determine what kind of timeframe we are looking at to be able to have more accurate data around the number of failures.


  • 16.  RE: DE 12.2

    Broadcom Employee
    Posted Jan 29, 2020 01:50 AM
    Hi,

    We hope to provide a server patch to fix this problem by the end of next month. We will update you once it is available.


    Thanks and regards,
    Ravi Kiran


  • 17.  RE: DE 12.2

    Broadcom Employee
    Posted Feb 06, 2020 04:28 AM
    Hi,

    The fix for the problem is now available for the 12.2 release.

    It solves the following problem:

    The Failed Jobs report does not show jobs that are Force Completed or jobs that are automatically resubmitted on failure.

    Note: The Failed Jobs Report will give accurate numbers from the time the fix is applied.

    Thanks and regards,
    Ravi Kiran




  • 18.  RE: DE 12.2

    Posted Jun 15, 2020 08:40 AM
    Hi Kiran,

    We installed this patch (T6F6016) to get us to 12.1 build 2001. I'm a little confused as to why any records of a job within the database should ever be lost. Is it by design or simply an oversight? It seems that failures are logged more accurately now with this patch, but still not every record of the job is logged. For example, if a job completes, a record is logged as expected. However, if that same job is resubmitted and fails, the record of that prior completion is overwritten with the failure record. The same goes if the job was force completed and then resubmitted (that f/c is overwritten).

    Upon looking in our database at a job I was testing with, I see a record for submission instances 1, 2, 4, 6 and 7 (3 and 5 were overwritten). Regardless of what occurs with the job, no records should ever be lost. We are big on providing accurate statistics at our company, and right now we can't get them from the database.
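One way to spot overwritten records like these is to look for gaps in the submission instance numbers recorded for a job. The helper below is a minimal sketch that assumes instance numbers start at 1 and increase by 1 per submission; the function name is made up for illustration.

```python
def missing_instances(seen):
    """Return submission instance numbers absent from the recorded ones.

    Assumes numbering starts at 1 and increments by 1 for each submission.
    """
    if not seen:
        return []
    present = set(seen)
    return [n for n in range(1, max(present) + 1) if n not in present]


# Instances found in the database for the test job in this post.
print(missing_instances([1, 2, 4, 6, 7]))  # [3, 5]
```

It can only detect gaps below the highest recorded instance, so an overwritten final submission would go unnoticed.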


  • 19.  RE: DE 12.2

    Posted Jun 15, 2020 03:02 PM
    Good point Travis. I didn't test this as we rarely resubmit completed jobs.  I am thinking it was missed in the resolution.


  • 20.  RE: DE 12.2

    Broadcom Employee
    Posted Jun 16, 2020 08:58 AM
    Hi,

    Thank you for the update Travis.
    I think that only for FAILED job records do we create a new record each time, in order to have a count of them; the others are by default overwritten with the latest state.

    Ravi Kiran


  • 21.  RE: DE 12.2

    Posted Jun 16, 2020 09:35 AM
    OK, if that is how it is designed, then what would be the reason for that? Would you not want to know EVERY instance of when a job ran?? I'm not seeing the reason why you would have things set up to only log the latest state.