I have Spectrum 9.4.2 configured in a distributed, fault-tolerant environment. The fault tolerance is passive.
Scenario: I have discovered the network devices in both of the active landscapes, so that if one SS goes down and its fault-tolerant SS doesn't come up, the devices are still monitored through the other landscape. Is this a good practice to follow?
Does it have any impact on correlation/root cause?
I would think this is not a good practice, since that would lead to duplicate alarms on a normal day-to-day basis (the same alarm asserting on both landscapes). The only way I could see this working for your Operators is if you excluded that other Active Landscape from their Alarm Filter, and only included it during a Failure event like you described - but that's a slower manual process and prone to error.
It's very obvious to an Operator when the Secondary SS hasn't taken over (Red vs. Yellow border in OneClick), so your team should be able to notify you promptly if something doesn't look right. There are also a few OOTB alarms that Spectrum will assert if something goes awry with the Fault Tolerance (e.g. contact lost to the Secondary SS, alarm synchronization failing, etc.).
I suggest you 1.) Run through a few scheduled failures of the Primary to show the Secondary SS will take over correctly, and then 2.) Just handle any issues with the Secondary as they arise. For what it's worth, I've yet to have a problem with the Secondary SS.
A few things to consider:
- How frequently you run your OnlineBackups dictates the "freshness" of the Secondary's database of devices/models
- On the Secondary, increase "max_event_records" in the $SPECROOT/SS/.vnmrc so the Secondary retains a longer history of events to then sync back to the Primary
- On the Secondary, set "secondary_polling=yes" in the $SPECROOT/SS/.vnmrc so the Secondary is "Hot" and can take over immediately
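As a rough way to sanity-check the two .vnmrc settings mentioned above, here is a minimal Python sketch. It assumes .vnmrc holds simple key=value lines; the helper names parse_vnmrc and check_secondary_settings, the sample text, and the min_event_records threshold are made up for illustration and are not part of Spectrum.

```python
def parse_vnmrc(text):
    """Parse key=value lines into a dict (assumed .vnmrc layout)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue  # skip blank lines and anything that is not key=value
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_secondary_settings(settings, min_event_records):
    """Return a list of warnings about the Secondary's settings."""
    warnings = []
    if settings.get("secondary_polling", "").lower() != "yes":
        warnings.append("secondary_polling is not 'yes': Secondary is not hot")
    try:
        records = int(settings.get("max_event_records", ""))
    except ValueError:
        records = None
    if records is None:
        warnings.append("max_event_records is missing or not a number")
    elif records < min_event_records:
        warnings.append(
            f"max_event_records={records} is below the chosen minimum "
            f"of {min_event_records}"
        )
    return warnings


# Hypothetical file contents; the value 50000 is only an example.
sample = "secondary_polling=yes\nmax_event_records=50000\n"
print(check_secondary_settings(parse_vnmrc(sample), min_event_records=20000))
```

To use it against a real server, read the text of $SPECROOT/SS/.vnmrc on the Secondary and pass it to parse_vnmrc; what counts as a sensible minimum for max_event_records depends on how long you expect the Secondary to stay active.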
Adding a few more facts to Justin's comment:
We see that CA Spectrum FT pairing works very well, and watching for a "red-framed" OC console (which indicates an OC server problem) or a "red" or "yellow" landscape status (down or switched over) is the best practice, as these are directly visible. We see FT pairs running stably for 6+ months. The OLB/OnlineBackup should clearly be done once per night to ensure the sync is always fine.
Setting up the secondary for "active polling" / hot-standby will double the number of SNMP polls sent to the devices (you should consider this), and gives you the advantage of an update that is a few seconds quicker when an FT pair switches from the primary to the secondary SpectroSERVER. The next code improvement here is the FT-pair-capable ArchiveManager, which allows the secondary SpectroSERVER to stay active for a long time without the past risk of having only a limited event cache on the secondary.
So, finally, I would encourage you to raise any instability of the SpectroSERVER or of the monitoring as a support issue/case, as we would like to improve Spectrum to be 100% reliable and stable even if you run without FT-paired SpectroSERVERs. :-) Cheers, Joerg
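To put a rough number on the doubled polling load, here is a back-of-the-envelope Python sketch. The device count, the poll interval, and the polls_per_hour helper are all hypothetical, and it simplifies each poll cycle to one unit of load per device.

```python
def polls_per_hour(devices, poll_interval_s, polling_servers):
    """Approximate poll cycles per hour hitting the network devices."""
    cycles_per_device = 3600 / poll_interval_s
    return devices * cycles_per_device * polling_servers


# Hypothetical numbers: 2000 devices polled every 300 seconds.
primary_only = polls_per_hour(2000, 300, polling_servers=1)
hot_standby = polls_per_hour(2000, 300, polling_servers=2)
print(primary_only, hot_standby)
```

The point is only that the load on the devices scales with the number of servers actively polling them, so a hot secondary is worth weighing against device CPU and any SNMP rate limits.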
There is another option you can consider here if you really want to run a third SpectroSERVER in case the secondary doesn't kick in. For this scenario you can run an additional fault-tolerant server, a tertiary SpectroSERVER. In fact you can have a 4th, 5th, 6th, etc. if you really want to. Is there some reason you haven't considered this?
For info, you do this by setting the precedence number for the tertiary server higher than that for the secondary. It's all described in the Distributed SpectroSERVER manual. I haven't ever come across anyone who has run a tertiary SpectroSERVER in an FT environment, but it should work seamlessly and give you a 3rd FT server.
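To illustrate the precedence idea, here is a small Python sketch of how an ordered failover could choose the active server. It assumes that a lower precedence number means a more preferred server, which is consistent with the tertiary getting a higher number than the secondary. The host names, the numbers 10/20/30, and the pick_active function are hypothetical; this is not Spectrum's actual selection code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    precedence: int  # assumed: lower number = more preferred
    running: bool


def pick_active(servers: List[Server]) -> Optional[Server]:
    """Return the running server with the lowest precedence number, if any."""
    candidates = [s for s in servers if s.running]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.precedence)


# Hypothetical FT set where the primary and secondary are both down.
ft_set = [
    Server("ss-primary", 10, running=False),
    Server("ss-secondary", 20, running=False),
    Server("ss-tertiary", 30, running=True),
]
active = pick_active(ft_set)
print(active.name if active else "no server available")
```

In this model the tertiary only becomes active once every server with a lower precedence number is unavailable, which is the extra safety net described above.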