DX NetOps Manager

CA Spectrum Best Practices:  Fault Tolerant SANM (works in Spectrum DSS environment) 

Aug 03, 2016 10:38 AM

A revision has been made to prevent the script from cycling through all active OneClick servers during the REST call. Once it makes a successful connection, it proceeds with the rest of the script. It only cycles through every OneClick server if one or more of them cannot be queried.
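The "stop at the first successful connection" behavior can be sketched roughly as follows. This is not the code from the attached PDF, only a minimal illustration in Python; the function name `query_first_reachable` and the `query` callable are hypothetical, and the actual REST request to OneClick is deliberately left to the caller.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def query_first_reachable(
    servers: Iterable[str],
    query: Callable[[str], T],
) -> Tuple[str, T]:
    """Try each OneClick server in order and stop at the first success.

    `query` is a hypothetical callable that performs the REST call against
    one server and raises an exception if that server cannot be queried.
    """
    failures: List[Tuple[str, Exception]] = []
    for server in servers:
        try:
            # First successful answer wins; remaining servers are not contacted.
            return server, query(server)
        except Exception as exc:
            # This server could not be queried, so fall through to the next one.
            failures.append((server, exc))
    # Reached only when every server failed (or the list was empty).
    raise ConnectionError(f"no OneClick server could be queried: {failures}")
```

With a list such as `["oc1", "oc2", "oc3"]` (placeholder host names), a healthy `oc1` means only one call is made; the loop only walks further down the list when earlier servers fail.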

 

**UPDATE**

The same SetScript and ClearScript can be used on both primary and secondary servers.

Attachment(s)
pdf file
Spectrum_FT_SANM_v5.pdf   789 KB   1 version
Uploaded - May 29, 2019

Comments

Oct 13, 2016 10:45 AM

Hello Sean,

 

Currently, as soon as GAS registers the alarm service on the secondary, the notification service kicks in, so it is all a matter of how long the failover takes to complete. Long activation times should not be an issue. My current customer had extremely long activation times at one point (more than 5 hours), and SANM still processed notifications, at a slightly slower rate, until the models were 100% activated.

 

I ran a test again today in the Dev environment, and as soon as the failover (MLS failover) was complete, I got bombarded with notifications. I have been unable to verify whether there is any store-and-forward mechanism.

 

I do not have any heartbeat checks in the code; this is all based on the out-of-box functionality. If this is something you'd like to try out, be my guest. Hopefully, with a little more work, we can get this or something similar added into the official Spectrum code.

Sep 05, 2016 12:00 PM

Karen, this is a very clever solution. Kudos! 


I have some follow-up questions about the delay between the time an MLS fails and the time the OC detects and registers it:

- Is there a heartbeat between OC and MLS, and how do you configure its frequency?

- How does GAS (global alarm sync) affect this solution's timing in failure and recovery?

- How much delay does this test introduce into notification? In a major outage (e.g., 1,000 alarms), could this add up to a measurable aggregate delay?


Not that delay is bad, just that calling it out helps to anticipate the worst-case aggregate delay. For instance, if the delay is 5 minutes in a major outage, it may be wise to wait until alarms settle before restoring the primary SS.
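As a rough way to reason about that worst case, the arithmetic can be sketched as below. It assumes notifications are processed one at a time and that each one pays a fixed overhead for the check; both are assumptions rather than measured Spectrum behavior, and the 300 ms figure is hypothetical, chosen only so that 1,000 alarms add up to the 5-minute example.

```python
def aggregate_delay_minutes(alarm_count: int, per_alarm_overhead_ms: int) -> float:
    """Total added delay, in minutes, if each alarm pays the overhead serially."""
    total_ms = alarm_count * per_alarm_overhead_ms
    return total_ms / 1000 / 60


# Hypothetical numbers: 1,000 alarms, 300 ms of added check time per alarm.
print(aggregate_delay_minutes(1000, 300))
```

If the checks run in parallel, or the overhead varies per alarm, the real figure would differ, so measuring the per-alarm overhead in a test environment is the more reliable input.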
