We have Notifier configured to send emails on device-down events to our NetCool server (and others) for ticketing purposes. Yesterday afternoon, we noticed that we were not receiving emails or tickets for devices going down. When I looked in our NOTIFIER.OUT log, the last event received appeared to have been processed only partway and was followed by this text:
EventMessage: Thu 23 Mar, 2017 - 16:28:12 [Event 58d42fdc-1726-109c-0180-800005001bc9 is unavailable from Archive Manager: Response not received in time ]Only displaying most recent of 3 event messages.
When I tried to stop Archive Manager from SCP on our Primary server, it did not appear to respond; i.e., the Stop Archive Manager option remained grayed out after it was clicked.
When I eventually killed the Archive Manager process on our Primary server, our Secondary server took over processing events. When I restarted the Archive Manager process on our Primary server, it did not immediately take over processing events. We left for the day assuming that event processing would remain on our Secondary server. At some point during the evening, the Primary returned to processing events and emails were being sent as expected.
Has anyone else experienced this behavior since moving to 10.2? We have had other issues with Archive Manager between our Primary and Secondary servers. However, we have applied all the recommended patches to address these. If anyone from CA is reading, does this point to Archive Manager slowing down/hanging? Is there a way to troubleshoot if this occurs again?
We think we were able to see what caused this but are not sure why this occurred. The issue occurred again today. I had clicked the VNM model of our Spectrum server and was looking at the events coming into the VNM model. Shortly after I did this, I got another message written to the NOTIFIER.OUT log:
EventMessage: Fri 24 Mar, 2017 - 11:34:50 [Event 58d53c9a-0657-103d-0296-800005006ff3 is unavailable from Archive Manager: Response not received in time ]Only displaying most recent of 25 event messages.
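If it helps with spotting when this happens, here is a rough Python sketch for scanning NOTIFIER.OUT for these timeout messages. The pattern is modeled only on the two log lines quoted in this thread, so treat the exact format as an assumption; `find_archive_timeouts` is a made-up helper name, not anything shipped with Spectrum.

```python
import re
from datetime import datetime

# Pattern modeled on the two NOTIFIER.OUT lines quoted in this thread.
# Assumption: other versions or locales may format the message differently.
TIMEOUT_RE = re.compile(
    r"EventMessage: (?P<ts>\w{3} \d{1,2} \w{3}, \d{4} - \d{2}:\d{2}:\d{2}) "
    r"\[Event (?P<event_id>[0-9a-f-]+) is unavailable from Archive Manager: "
    r"(?P<reason>[^\]]*?)\s*\]"
)

def find_archive_timeouts(lines):
    """Return (timestamp, event_id, reason) for each Archive Manager timeout line."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            # Timestamp layout as seen in the log, e.g. "Thu 23 Mar, 2017 - 16:28:12"
            ts = datetime.strptime(m.group("ts"), "%a %d %b, %Y - %H:%M:%S")
            hits.append((ts, m.group("event_id"), m.group("reason")))
    return hits

if __name__ == "__main__":
    sample = [
        "EventMessage: Thu 23 Mar, 2017 - 16:28:12 [Event "
        "58d42fdc-1726-109c-0180-800005001bc9 is unavailable from Archive "
        "Manager: Response not received in time ]Only displaying most recent "
        "of 3 event messages.",
    ]
    for ts, event_id, reason in find_archive_timeouts(sample):
        print(ts.isoformat(), event_id, reason)
```

To run it against a real log, pass an open file handle instead of `sample`; the timestamps then show whether the timeouts line up with someone opening a large Events view.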
I did the same thing yesterday while looking at unknown devices sending us traps. I didn't think that would have caused this issue.
Today, we didn't kill any process. We brought some test devices down to watch them alarm in Spectrum. The devices went down as expected but it took longer for the alert to get written to NOTIFIER.OUT and for the emails to get sent. Our SANM policy is set for 6 minutes. We got the email 15 - 20 minutes later. We also saw the delay in the event getting written to NOTIFIER.OUT. It seems to have caught back up. Any ideas on how looking at events against our VNM model could cause this delay in Archive Manager/Notifier?
A couple of things come to mind:
1. What is your event rate?
a. To calculate the rate, open the Attributes tab on the VNM model
b. Filter for EventsGenerated and move it to the right
c. Wait exactly 60 seconds and click Refresh.
d. Subtract the first number from the second and divide by 60 to see how many events you are getting per second.
e. If it’s higher than 30 then you may have a busy SpectroServer (SS)…
2. What is your default event filter “time” set to? The filter defaults to 4 hours. We’ve seen where if the default is set to something much larger (ie 24 hours) it can cause an issue
3. Do you have a large DDM database?
a. Look at the /SS/DDM/scripts
4. You can turn on ArchMgr debug: add the following to the .configrc and stop/restart ArchMgr. Review ARCHMGR.OUT
Hope that helps
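The event-rate calculation from step 1 can be sketched in a few lines of Python. The two readings below are invented for illustration; in practice they would be the EventsGenerated values read off the VNM model 60 seconds apart. The threshold of 30 is the rule of thumb from the reply above, not an official limit.

```python
BUSY_THRESHOLD = 30  # events/second rule of thumb quoted in this thread

def events_per_second(first_reading, second_reading, interval_seconds=60):
    """Average event rate between two EventsGenerated readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if second_reading < first_reading:
        # The counter should only grow; a drop suggests it was reset.
        raise ValueError("second reading is lower than the first")
    return (second_reading - first_reading) / interval_seconds

if __name__ == "__main__":
    # Hypothetical readings taken 60 seconds apart.
    rate = events_per_second(1_200_000, 1_204_050)
    status = "possibly busy" if rate > BUSY_THRESHOLD else "within rule of thumb"
    print(f"{rate:.1f} events/sec ({status})")
```

Taking several samples at different times of day gives a better picture than a single 60-second window.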
So, our default event filter is set to 48 hours. I don't believe this was ever an issue before. When you say "We’ve seen where if the default is set to something much larger (ie 24 hours) it can cause an issue", do you mean that this is new to 10.2?
No, it’s not new to 10.2. But maybe your event rate is higher now than before? If that’s the default for all users, it can overload the ArchMgr if many users are clicking on event tabs. An easy, obvious test would be to knock it back down to 4 hours and see if the problem still happens.
I reset my default window back to 4 hours. I don’t think everyone is set to 48 hours here…just a handful of people. I will test this out on Monday morning and will also look at calculating the event rate then. I don’t want to cause any issues on a Friday afternoon.
Oh, one other thing that might affect the Notifier is if you have customized the SetScript…
Jay, we have an event rate of between 65 and 70 per second. I have used your formula several times throughout the day. Looking at the server's operation, I do not see any performance issues. How is 30 and above determined to be a busy SpectroServer? I have moved my default event time frame back to 4 hours.
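For a sense of scale, here is a back-of-the-envelope Python sketch of how many events fall inside the default filter window at that rate. It assumes the rate is constant and that the whole window's worth of events is in scope for a query, which is a simplification; `events_in_window` is just an illustrative helper.

```python
def events_in_window(rate_per_second, window_hours):
    """Approximate number of events generated during the filter window."""
    return int(rate_per_second * window_hours * 3600)

if __name__ == "__main__":
    for rate in (65, 70):          # events/sec measured in this thread
        for hours in (4, 48):      # default window vs. the 48-hour setting
            total = events_in_window(rate, hours)
            print(f"{rate} events/sec over {hours:>2} h: about {total:,} events")
```

At 65 to 70 events per second, a 4-hour window covers roughly 0.94 to 1.0 million events, while a 48-hour window covers roughly 11.2 to 12.1 million, twelve times as many for each user who opens an Events tab.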