We received SQL job failure alerts for a job that failed yesterday, 2nd October, at 10:55 AM.
The check interval is 5 minutes.
However, the QOS is still being appended to the old alerts in the Alarm History tab of the UMP console, even though the job failed again afterwards (after 10:55 AM, 2nd October) and new alerts were triggered for the latest failures.
I don't understand why the QOS is being populated in the old alerts, which are no longer relevant. Is this the expected behaviour? We have a requirement to acknowledge these alerts in CA UIM, but new alerts keep getting triggered because the QOS is being generated continuously.
This sounds a bit confusing; can you elaborate with some screenshots?
What is the interval of the failing job? Is it possible the job succeeds in between and fails at random intervals? Also, what suppression key are you getting in both alerts?
This is the suppression key for the job failure alerts:
Profile $profile, instance $instance, job $job_name (category $category_name), has failed. Run time of job: $rundate
The job was failing continuously every 5 minutes yesterday, and we received bulk alerts in our queue. The database team has disabled the job on their end, as it now runs on a different node, and they are asking us to close the alerts.
However, when I acknowledge the alarm, I get a new alarm stating that the job failed, with yesterday's date and time as the run time, which should not be the case.
The suppression key is the same for all the job failure alerts.
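For context, this is the general deduplication mechanism alarm consoles use: while an alarm with a given suppression key is open, every new event carrying the same key is folded into it (the count goes up, but the alarm keeps its original timestamp); once you acknowledge it, the very next matching event opens a brand-new alarm. A minimal sketch of that logic, with hypothetical names, not the probe's or NAS's actual code:

```python
from dataclasses import dataclass

@dataclass
class Alarm:
    suppression_key: str
    message: str
    count: int = 1  # how many events have been folded into this alarm

class AlarmServer:
    """Hypothetical sketch of suppression-key deduplication."""

    def __init__(self):
        self.open_alarms = {}  # suppression_key -> open Alarm

    def on_event(self, suppression_key, message):
        # If an open alarm shares the key, append to it instead of
        # raising a distinct alarm; otherwise create a fresh one.
        alarm = self.open_alarms.get(suppression_key)
        if alarm:
            alarm.count += 1        # old alarm keeps accumulating
            alarm.message = message
        else:
            self.open_alarms[suppression_key] = Alarm(suppression_key, message)

    def acknowledge(self, suppression_key):
        # Clearing the alarm removes the key, so the next matching
        # event opens a brand-new alarm with the same key.
        self.open_alarms.pop(suppression_key, None)

server = AlarmServer()
key = "sql_job_failure/ProfileA/JobX"        # same key for every failure
server.on_event(key, "job failed at 10:55")
server.on_event(key, "job failed at 11:00")  # folded in, count becomes 2
server.acknowledge(key)
server.on_event(key, "job failed at 11:05")  # fresh alarm, count back to 1
```

If something like this is what is happening here, it would explain the symptom: as long as the probe keeps emitting failure events with the same suppression key, acknowledging the alarm only resets the cycle rather than silencing it.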
Can you check the threshold values? Also, please confirm you are using the latest version of the sqlserver probe.
Please try disabling and re-enabling the probe. If you are still getting alerts, log a support case with loglevel 5, since this needs deeper investigation.