DX Infrastructure Manager


Nimsoft Disaster Recovery

  • 1.  Nimsoft Disaster Recovery

    Posted 08-12-2014 09:02 PM



    Has anyone implemented a disaster recovery server for Nimsoft? What is the best solution? Would database mirroring alone suffice? If not, are there any dedicated probes for this? Is there any documentation that covers this?




    Ananda Guberan K

  • 2.  Re: Nimsoft Disaster Recovery

    Posted 08-14-2014 10:43 AM

    In addition to a dual-site database, we also have the primary hub mirrored across two different sites. You need to deactivate a bunch of the probes and use the HA probe to activate them should the active primary hub go down.


    We do this in all layers of our hub infrastructure.

  • 3.  Re: Nimsoft Disaster Recovery

    Posted 08-14-2014 06:24 PM

    Hi Andres,


    When we use the HA probe, the switch from primary to secondary happens only when the entire robot goes down, at least in my environment. Is there a way to configure the HA probe so that it switches over to the secondary if just one of the primary probes (such as NAS) goes down?




    Ananda Guberan K

  • 4.  Re: Nimsoft Disaster Recovery

    Posted 08-15-2014 09:28 AM

    I guess you would have to do something with Lua scripts (or one of the other SDKs) using some callback magic.

    As far as I know, the HA probe can only check whether the hub is responding or not.
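
    A minimal sketch of the decision such a script would have to make, assuming the up/down state of each probe has already been collected somehow (the collection step, for example via a callback from a Lua or SDK script, is not shown). The function name and the status map are hypothetical and are not part of the HA probe or any Nimsoft SDK.

```python
from typing import Iterable, Mapping


def should_fail_over(probe_up: Mapping[str, bool], critical: Iterable[str]) -> bool:
    """Return True if any critical probe is down or missing from the status map."""
    return any(not probe_up.get(name, False) for name in critical)


# Hypothetical status snapshot: the hub still answers, but nas is down.
status = {"hub": True, "nas": False, "data_engine": True}
print(should_fail_over(status, ["hub", "nas"]))  # True
```

    A probe that is missing from the status map is treated as down, which errs on the side of failing over.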

  • 5.  Re: Nimsoft Disaster Recovery

    Posted 08-19-2014 01:42 PM

    We also do some "HA"; well, to be honest, it doesn't deserve the name HA :->


    We have two QoS DBs running. Our primary exports its data every night, and we take a transaction log backup every 15 minutes. Both are transferred to the backup DB; the export is imported every night, and the transaction logs would be applied if a takeover should occur (manually, yes).
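
    A minimal sketch of the bookkeeping behind such a takeover, assuming only a nightly import plus log backups every 15 minutes as described above. It picks the log backups that still have to be applied on the backup DB and shows how much data would be lost. The function names and timestamps are made up for illustration; the actual restore is done with the database's own tools.

```python
from datetime import datetime, timedelta


def logs_to_replay(last_import, log_backups, failure):
    """Log backups taken after the last imported export and up to the failure, oldest first."""
    return sorted(t for t in log_backups if last_import < t <= failure)


def data_loss_window(last_import, log_backups, failure):
    """Time between the newest restorable point and the failure."""
    replay = logs_to_replay(last_import, log_backups, failure)
    restored_to = replay[-1] if replay else last_import
    return failure - restored_to


midnight = datetime(2014, 8, 19, 0, 0)                                 # nightly import
backups = [midnight + timedelta(minutes=15 * n) for n in range(1, 5)]  # 00:15 .. 01:00
failure = midnight + timedelta(minutes=50)                             # primary dies at 00:50

print(len(logs_to_replay(midnight, backups, failure)))  # 3
print(data_loss_window(midnight, backups, failure))     # 0:05:00
```

    With a 15-minute log interval, the worst case is just under 15 minutes of lost QoS data, provided every log backup actually reached the backup site.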

    The primary hub has a "secondary". There was once an HA probe running, but for nas I also have to do some things manually, because the nas replication copies the nas.cfg, but to the "wrong place" for smooth, automatic operation. I should script something, but I don't trust the alarms :->


    Therefore we stopped the HA probe and are going to switch purely manually, because we had too many problems with hub timeouts etc.


    Yes, we do use the HA probe for one thing: firing up a backup emailgtw probe :->

  • 6.  Re: Nimsoft Disaster Recovery

    Posted 08-19-2014 05:37 PM

    Yes, I have the same feeling about HA :-)


    I am able to make use of this facility only when the hub probe fails.






  • 7.  Re: Nimsoft Disaster Recovery

    Posted 08-22-2014 02:16 PM



    Just discovered that our solution doesn't work anymore :-(, had that running till NMS 3.31 :->


    I just tried to switch from my primary hub/DB package to my standby hub with a mirrored DB, so I stopped my main hub and main DB; in fact, I switched those machines off. I fired up the data_engine on the "new" primary, formerly the standby hub, and what does it tell me?

    You are a secondary hub, you cannot configure the data_engine, <sarcasm> nice.... </sarcasm>

    Up to 5.61 that was working fine; now at 6.5, "no no" :-(


    Well, I had a glimpse into the DB and found a nice table, "dbo.tbn_de_Controller", which lists the data_engines and their state. After switching "primary" <-> "secondary" there, my old concept works again, until I switch it back :->


    Hmm... but that's no solution, <sarcasm> thanks a lot, CA, for not documenting that officially </sarcasm>. However, I'm going to rethink our standby concept, probably switching to a VM and using Nagios to monitor that VM :->





  • 8.  Re: Nimsoft Disaster Recovery

    Posted 08-22-2014 02:35 PM

    Now, this scares me. We are running 6.5 and plan to upgrade to 7.6 early next week. I am not sure this is a good sign. With just the HA probe in our setup, we still have a long way to go to achieve this.



    Thanks for all the support :-)

  • 9.  Re: Nimsoft Disaster Recovery

    Posted 12-23-2014 10:35 PM

    This is all unofficial at this point, but I've pieced together some information.  There is a lot more I don't know.  I've attributed sources where I could since this gets pretty far off the reservation.  


    It's noteworthy that the support team doesn't know how HA is supposed to work and was still operating under the old idea, based on the re-released but not updated HA doc for 8.0. A smattering of KB articles and some other contradictory documentation seems to lay out the new way some of this is supposed to work. I get the impression that the data_engine had significant improvements that are part of the product but completely unknown to customers and mostly forgotten or lost at CA, possibly in an ownership transition.


    Run multiple data_engines! Yes, really. Apparently this is "Typical" :-o


    The HA probe 1.4 document directly contradicts the HA guide for 8.0 and all other documentation, which show that only one DE should run at a time and that the HA probe could be used to flip it. Support told me that multiple data_engines were not supported and that they hadn't heard of it being on the road map. I've known it was on the road map since at least '11.


    "Important: Typically, the data_engine is configured to be always running on the secondary hub. The secondary data_engine cannot perform maintenance operations, which are relegated to the primary data_engine, but it can provide connection information to clients that request such information as well as persist QoS information to the NIS database."


    http://docs.nimsoft.com/prodhelp/en_US/Probes/Catalog/ha/1.4/index.htm?toc.htm?2025849.html  - HA probe 1.4


    The primary/secondary relationship between data_engines is negotiated through the DB and can be manipulated via a callback.


    How communication happens within the DB is not spelled out, but multiple data_engines will learn and know about each other even if you install under a different security domain pointed at the same DB while the other DE is disabled. Mgruber's digging revealed that at least part of this appears to happen via the tbn_de_Controller table. The new DE will come online as secondary and blank out the maintenance options with a note that it is secondary. This functionality has been in place since around 2012, based on the date of KB 3292. From my testing, it seems some of the callbacks that look like they would demote the primary won't work and give an error saying that they should only be run by another data_engine, even when you execute them with superuser privileges. The promote-primary callback does work on the secondary. It appears to orchestrate the switch and safely promote itself. As near as I can tell, the secondary avoids updating data_defs directly. I suspect it uses the hidden (expert mode) _insert_qos_definition callback to the primary in order to create a new definition when it needs to, without conflict. It also disables maintenance and does some type of config sync, which may be partially DB based and partially callback based (unknown).


    From kb3292: https://na4.salesforce.com/articles/TroubleshootingObj/The-primary-data-engine-is-set-incorrectly?popup=true#




    ·       Cannot edit the configuration of data_engine probe

    ·       Message appears:

    There are multiple data_engines using the same NIS database.

    Current data_engine is working in a 'secondary' mode, which means it will not perform maintenance.

    Current primary data_engine is at: /<Domain>/<hub>/<Robot>/data_engine

    ·       In raw configure for the data_engine the data_engine_id key is set to a value other than 1 on the system which is expected to be the primary. (1 indicates that this is the primary.)



    In the setup section of the data_engine config, add the following key:

    show_admin_upgrade_primary = yes

    This will cause the callback admin_upgrade_primary to be available in the data_engine - call this with the probe utility (highlight the data_engine and press ctrl-p), and the data_engine should be properly promoted to primary.

    This functionality exists to prevent conflicting maintenance and parallel updates of  QOS definitions and data.
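
    A minimal sketch of adding that key with a script instead of raw configure, assuming the data_engine .cfg file uses the usual `<section> ... </section>` layout with `key = value` lines. The helper name is hypothetical, and the sample config below is illustrative rather than a real data_engine.cfg. Work on a copy of the file and deactivate the probe before changing anything.

```python
def set_setup_key(cfg_text: str, key: str, value: str) -> str:
    """Set `key = value` directly inside the top-level <setup> section."""
    out, stack, found = [], [], False
    for line in cfg_text.splitlines():
        s = line.strip()
        if s.startswith("</") and s.endswith(">"):
            # Closing <setup> without having seen the key: add it before the tag.
            if stack == ["setup"] and not found:
                out.append(f"   {key} = {value}")
                found = True
            if stack:
                stack.pop()
        elif s.startswith("<") and s.endswith(">"):
            stack.append(s[1:-1])
        elif stack == ["setup"] and s.split("=", 1)[0].strip() == key:
            # Key already present at the top level of <setup>: overwrite it.
            line = f"   {key} = {value}"
            found = True
        out.append(line)
    if not found:
        raise ValueError("no <setup> section found")
    return "\n".join(out) + "\n"


sample = "<setup>\n   loglevel = 0\n</setup>\n"  # illustrative config only
print(set_setup_key(sample, "show_admin_upgrade_primary", "yes"))
```

    The helper only touches keys at the top level of `<setup>`, so a key with the same name in a nested subsection is left alone, and running it twice gives the same result.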



    Nis bridge in NAS


    Only one should run; the secondary nas will queue data and send it to the primary when it comes back online (per the HA 1.4 doc).




    HA probe doc claims to be able to do this.


    Data engine best practices


    This article explains how to implement parallel processing partitioning and other features for high throughput on mssql.




    Benchmarking in the DE doc describes what the options mean a little better.




    Unknown – Other database probe setup?


    What about all of the other probes (*_engine, qos_processor, etc.)? Should only one be running? Should they be managed by the HA probe?




    I’m guessing UMP can use the redundant DE? What are the implications for geographical redundancy?




    You can run more than one set of root infrastructure in a security domain including multiple sets of data_engines talking to different databases. Discovery_server seems to be the big place of contention since it polls and cleans downstream niscache.


    You may also run a data_engine in one security domain as primary, and a secondary in another security domain. They seem to negotiate the relationship fine via the database, but I suspect callbacks to _insert_qos_defs would fail. Not sure what the impact of that would be.


    This may lead to viable stage / test / release options or better options for DR redundancy or sandboxing an upgrade.  




    Nas replication including one way nas replication with the alert responder option can be combined with the full replication in various configurations according to a convo with someone close to the dev team for that probe. (Sorry for not crediting. You were very helpful, but I’m bad with names.)


    The nas also adds a field telling you what nas processed an alarm.


    I don’t have time to document the idea, but a tiered / hierarchical nas seems possible: processing happens mostly on the dist hub, with DR replication of root hubs and AO processing / bridge on the root, combined with an ITSM integration that is active/active on both sides and uses the nas field to avoid taking redundant actions.


    We also have the advantage of a little probe that can use the bulk_post delivery method (post_queue in the hub) to send messages upstream with redundant targets across redundant tunnels, i.e., if you can't send to the primary, send to the secondary. This really simplifies HA by making failover of message paths immediate and delegates the action on failed paths downstream. It’s also way more efficient and eliminates the winsock-induced get_queue / subscriber limitations.
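
    A minimal sketch of the "if you can't send to the primary, send to the secondary" idea, using plain Python callables as stand-ins for the upstream targets. Nothing here is the bulk_post API or the probe mentioned above; the function and sender names are hypothetical.

```python
from typing import Callable, Sequence


def send_with_failover(message: bytes, targets: Sequence[Callable[[bytes], None]]) -> int:
    """Try each target in order; return the index of the first one that accepts the message."""
    errors = []
    for index, send in enumerate(targets):
        try:
            send(message)
            return index
        except OSError as exc:  # covers refused or reset connections and timeouts
            errors.append(exc)
    raise ConnectionError(f"all {len(targets)} targets failed: {errors}")


def primary_down(message: bytes) -> None:
    """Stand-in for an unreachable primary hub."""
    raise ConnectionRefusedError("primary hub unreachable")


received = []  # stand-in for the secondary hub's queue
used = send_with_failover(b"alarm", [primary_down, received.append])
print(used, received)  # 1 [b'alarm']
```

    Because the sender decides per message, no separate failover step is needed for the message path itself.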



  • 10.  Re: Nimsoft Disaster Recovery

    Posted 12-24-2014 01:02 PM

    Good stuff Ray!


    I'm just setting up a new environment for us that will make use of the HA probe. I had previously disabled the data_engine on the failover hub to avoid "double maintenance". I wasn't aware at all that they communicate like that. In my case it shouldn't make a big difference whether I run it or activate it on demand, but it's one less thing that can fail.


    I also wasn't aware that HA can replicate AO (and I guess PP too?) rules. I actually created a probe that checks audit messages from hubs that have a nas and replicates the rules from the "primary" nas to the secondary. I'm also planning to make the secondary "read-only", so the probe will likely revert all changes made on it (except for changes made by the probe itself). I don't know if HA (or maybe generic_cluster) can do that, but now I'm definitely going to look into it.
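
    A minimal sketch of the comparison step such a replication probe needs, assuming the rules of each nas have already been read into a dict keyed by rule name. That representation and the function name are hypothetical and are not how nas stores its rules. Treating the primary as the source of truth also gives the "read-only" secondary: any local change on the secondary shows up as an update or delete and is reverted on the next pass.

```python
def plan_rule_sync(primary: dict, secondary: dict) -> dict:
    """Work out what to change on the secondary so its rules match the primary."""
    return {
        "create": sorted(name for name in primary if name not in secondary),
        "update": sorted(
            name for name in primary
            if name in secondary and primary[name] != secondary[name]
        ),
        "delete": sorted(name for name in secondary if name not in primary),
    }


# Hypothetical rule sets: one rule missing, one edited, one extra on the secondary.
primary_rules = {"escalate_disk": {"severity": 5}, "mail_oncall": {"delay": 60}}
secondary_rules = {"mail_oncall": {"delay": 300}, "local_test": {"delay": 0}}
print(plan_rule_sync(primary_rules, secondary_rules))
```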



  • 11.  Re: Nimsoft Disaster Recovery

    Posted 12-29-2014 11:47 AM

    Ray, excellent information!

  • 12.  Re: Nimsoft Disaster Recovery

    Posted 12-30-2014 09:36 AM



    I was doing some testing with running multiple data_engines and came across a missing table with de_upgrade. I opened a case about it and it went back to engineering. It came back with the following comment:


    "We have discussed this with development and here's the official word...


    running multiple active data_engines (e.g. master/slave or horizontal scaling of data_engine) is not supported.


    This was at one time on the roadmap but is not anymore... The callbacks such as de_upgrade which are related to running multiple data_engines are actually slated to be removed from the data_engine in a future release -- we actually never really intended them to get into the official product in the first place, it was a last-minute beta implementation, currently we don't do any testing around this and the result of running multiple data_engines and/or manipulating the state of data_engines (e.g. primary/secondary) is currently unknown and undefined."



  • 13.  Re: Nimsoft Disaster Recovery

    Posted 12-30-2014 09:49 AM

    Dang. So, shut down all secondary data_engines.

  • 14.  Re: Nimsoft Disaster Recovery

    Posted 12-30-2014 10:41 PM

    The whole subject is confusing. I got the same response. Then I asked why, if it's not supported and slated for removal, it was put into the HA probe documentation in June of this year and referred to as typical. And what is the back story on the removal of a long-held objective with obvious HA and scalability benefits?


    Last I heard, the HA team and the data_engine team were being pulled in together to get some consensus. I suspect it will get clearer when the architect-level developers get back after the new year.

  • 15.  Re: Nimsoft Disaster Recovery

    Posted 12-31-2014 10:43 PM

    So the plot thickens.  Multi data_engine for DR might be supported, but they don't test and support multi data_engine for scaling.  There is a lack of consensus and we should have more info when the full team is back in after the holiday.


    Also... Hunch confirmed. The multi-data_engine work goes back to 2011 in Oslo. gmh is Guttorm Husveg. Multi-data_engine support might have been one of his last contributions before leaving. I suspect the new team never continued that work and may have been under the impression that it was never in use, while the HA team may have been recommending the use of it.


    From an embedded change log in nis/sql/sqlserver/sqlserver_dataengine_create.sql


    26.05.2011 GMH: added tbn_de_Controller, spn_de_Register and spn_de_GetRegistration (US3650)


    03.08.2011 gmh: Changed min raw age to 1 in spn_de_DataMaint_Configure and added similar check to spn_de_qos_put_config


    11.08.2011 gmh: updated t_qos_snapshot/spn_de_UpdateQosSnapshot to handle multiple data_engines.

  • 16.  Re: Nimsoft Disaster Recovery

    Posted 01-02-2015 12:21 AM

    Hmm, it will be interesting to see what they end up doing. For now I have deactivated the DE on my secondary and leave it up to the HA probe to activate it.