We are encountering an issue with CA UIM 8.31 nas alerts: after we acknowledge an alert, it reappears with a current timestamp even though the metric associated with it clearly shows the system is up and running.
At first we thought it was the probe. We first saw this with websphere_mq, which initially reported that the multi-instance queue manager was down. We disabled all alerts in the probe and acknowledged the alarm, but it came back as a critical alarm saying the queue manager was back up. We are not even sure why a positive status would raise an alarm at all when the system is up and no alarms are configured as active.
We have encountered a similar situation with the sqlserver probe, which kept reporting that the database was down even though the metrics clearly show it was up the whole time during the period specified in the alert itself.
This all results in a flood of emailed alerts, which is not the objective. Does the problem lie in the nas probe, or in the individual probes where the alarms are configured (or not configured, in the case of websphere_mq)? Any suggestions would be greatly appreciated.
I would start by running a sniffer in Dr. NimBUS to see whether the probes in question are actually sending the alarms.
Also, check whether nas is reposting any alarms to the hub; filter for all nas messages.
Does it work fine with the default nas.cfg?
If you are not on nas 4.80, it is worth upgrading.
If you only have one nas in your environment, the alarm is most likely coming from the probe itself.
If you have multiple nas instances set up with replication and forwarding, there is a chance this alarm got stuck in some sort of loop.
I would suggest updating all nas instances to 4.80 from our support hotfix page.
CA Unified Infrastructure Management Hotfix Index - CA Technologies
When this type of thing is happening, I generally open Dr. NimBUS and look at the message bus.
Narrow the filter down to alarms and the probe and robot in question, and see whether the alarm is coming in as a new message.
If it is, check the probe and the probe logs.
If it is not, check nas.
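The triage above can be sketched roughly like this. Note this is only an illustration of the decision flow, not real UIM code: the field names (`subject`, `probe`, `robot`, `source`) are hypothetical stand-ins for whatever headers your sniffer capture actually shows.

```python
# Hypothetical sketch: given messages captured from the bus, filter down
# to alarms from one probe/robot and check whether any arrived as a
# genuinely new alarm (i.e. sent by the probe, not reposted by nas).
# Field names are illustrative, not actual UIM message headers.

def new_alarms(messages, probe, robot):
    """Return alarm messages from the given probe/robot that did not
    originate from nas (nas reposts would show nas as the sender)."""
    return [
        m for m in messages
        if m.get("subject") == "alarm"
        and m.get("probe") == probe
        and m.get("robot") == robot
        and m.get("source") != "nas"
    ]

captured = [
    {"subject": "alarm", "probe": "sqlserver", "robot": "db01", "source": "sqlserver"},
    {"subject": "alarm", "probe": "sqlserver", "robot": "db01", "source": "nas"},
    {"subject": "QOS_MESSAGE", "probe": "cdm", "robot": "db01", "source": "cdm"},
]

fresh = new_alarms(captured, "sqlserver", "db01")
print(len(fresh))  # -> 1: the probe really is sending the alarm, so check probe config/logs
```

If the filter comes back empty while the alarm keeps reappearing in the console, that points at nas rather than the probe.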
Sometimes we see clients who do not have alarms broken out into their own queues, i.e. they are using a * queue or combining alarms with QOS_MESSAGE.
This can cause a problem where alarms get backed up behind QOS data and arrive much later than when the problem actually happened.
Because of this delay, an alarm may even come in after the event is over.
So that is something else to check.
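The backlog effect described above can be shown with a toy FIFO simulation. The backlog size and drain rate here are made up purely for illustration; the point is that an alarm sharing a queue with QOS traffic waits behind the entire backlog.

```python
from collections import deque

# Toy FIFO sketch: an alarm enqueued behind a QOS backlog in a shared
# queue is delivered long after it was generated. Numbers are invented.

queue = deque()
for i in range(1000):                          # backlog of QOS messages
    queue.append(("QOS_MESSAGE", i))
queue.append(("alarm", "server down at t=0"))  # alarm arrives last

drain_rate = 10.0                              # messages consumed per second
delivered_at = {}
t = 0.0
while queue:
    subject, payload = queue.popleft()
    t += 1.0 / drain_rate                      # time to consume one message
    if subject == "alarm":
        delivered_at["alarm"] = t

print(f"alarm delivered {delivered_at['alarm']:.0f}s after it was queued")
# With a 1000-message backlog at 10 msg/s, the alarm lands ~100s late,
# which is why it can show up after the underlying event has cleared.
```

Breaking alarms out into their own queue means the alarm subject never waits behind QOS volume.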