We run a process against the message log (table MELD) that traps all unexpected severe errors and raises the appropriate alarms. Among other things, this catches dropped communications with the various agents. We run it once every 30 minutes.
This is the SQL Server query we use for that process:
-- finds all RED messages on the message log
select distinct
-- converts UTC to the server's local time (Pacific time for us)
dateadd(hour, datediff(hour, getutcdate(), getdate()), meld_timestamp) as TIMESTAMP
, meld_msginsert as MESSAGE_DETAILS
, meld_type
, meld_msgnr as MESSAGE_NUM
from meld
where ((meld_seen = 1 and meld_client = &$CLIENT#) or meld_msgnr = 11885)
-- *********************FILTER-RULES***************************************************
and meld_type <> 30 -- filter out all non-critical messages
and meld_msgnr <> 00050 -- filter out all "UC4 alarm popup was deliberately cancelled"
and meld_msgnr <> 11003 -- filter out bad RC messages (oncall alerts handle this)
and meld_msgnr <> 11007 -- filter out bad RC messages (oncall alerts handle this)
and meld_msgnr <> 11067 -- filter out Notification Aborted
and meld_msgnr <> 11051 -- filter out initiated CANCEL messages
and meld_msgnr <> 11506 -- filter out CANCEL messages
and meld_msgnr <> 12111 -- filter out CANCEL messages
and meld_msgnr <> 20582 -- filter out CANCEL messages
and meld_msgnr <> 11054 -- filter out "terminated with CANCEL"
and meld_msgnr <> 11069 -- filter out "aborted by escalation"
and meld_msgnr <> 11544 -- filter out "runtime of task 'X' has been exceeded. Task 'Y' will be started"
and meld_msgnr <> 11547 -- filter out "Time Checkpoint for task 'X' has been exceeded. Task 'Y' will be started"
and meld_msgnr <> 50031 -- filter out "SMTP client cannot get host information..." (This happens in DR tests)
and meld_msgnr <> 11348 -- filter out "Task 'X' was stopped for modification." MRT.DAILY can cause this.
and meld_msgnr <> 11104 -- filter out "Task 'X' was started." MRT.DAILY can cause this.
-- *********************FILTER-RULES***************************************************
and dateadd(hour, datediff(hour, getutcdate(), getdate()), meld_timestamp)
>
dateadd(minute, -180, getdate())
order by 1
;
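For anyone adapting this, here is a more compact sketch of the same query. It is not what we run in production; it only restates the filter rules as a single NOT IN list and compares meld_timestamp directly against UTC. It assumes that meld_timestamp is stored in UTC, that meld_msgnr is numeric, and that the server's offset from UTC is a whole number of hours (true for Pacific time). If any of those assumptions does not hold for your system, stay with the original form.

```sql
-- Sketch only: same columns and filter rules as the query above.
-- Assumes meld_timestamp is stored in UTC and the server's UTC offset
-- is a whole number of hours.
select distinct
    -- converts UTC to the server's local time for display
    dateadd(hour, datediff(hour, getutcdate(), getdate()), meld_timestamp) as TIMESTAMP
  , meld_msginsert as MESSAGE_DETAILS
  , meld_type
  , meld_msgnr as MESSAGE_NUM
from meld
where ((meld_seen = 1 and meld_client = &$CLIENT#) or meld_msgnr = 11885)
  and meld_type <> 30            -- filter out all non-critical messages
  and meld_msgnr not in (        -- same message numbers as the FILTER-RULES block
        50, 11003, 11007, 11067, 11051, 11506, 12111, 20582,
        11054, 11069, 11544, 11547, 50031, 11348, 11104)
  -- last 180 minutes, compared in UTC so the column itself is not wrapped in a function
  and meld_timestamp > dateadd(minute, -180, getutcdate())
order by 1
;
```

Comparing the raw column against a UTC cutoff lets SQL Server use an index on meld_timestamp if one exists; whether that matters depends on the size of your MELD table.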
------------------------------
Pete Wirfs
SAIF Corporation
Salem Oregon USA
------------------------------
Original Message:
Sent: 01-28-2021 07:10 PM
From: Johnny andrey
Subject: How do you monitor your AE?
Hi team,
This thread leads to what I am looking for. I want to create an alert that tells me by mail when the AE is not working, but I have not succeeded. I have a PowerShell script that monitors all components.
But in recent days we had a network problem while all components (WP, CP, JCP, etc.) were still up.
We only found the error by looking at the AE logs: the AE had no connection to the DB because of the network problem.
How could I do this? Has anyone implemented a free tool or created an external script for this?
I tried to monitor with Oracle Java Mission Control using EMI.JAR, but they removed the functionality to configure mail and send alerts.
Thank you very much for your comments.
------------------------------
Computer Engineer
N.A
Original Message:
Sent: 03-31-2018 03:31 AM
From: Anon Anon
Subject: How do you monitor your AE?
Many thanks for your input!
cheers, Wolfgang