This is all unofficial at this point, but I've pieced together some information. There is a lot more I don't know. I've attributed sources where I could, since this goes well beyond what the official documentation covers.
It's noteworthy that the support team doesn't know how HA is supposed to work and is still operating under the old model from the re-released but not updated HA doc for 8.0. A smattering of KB articles and some other contradictory documentation seems to lay out the new way some of this is supposed to work. I get the impression that the data_engine had significant improvements that are part of the product but completely unknown to customers and mostly forgotten or lost at CA, possibly in an ownership transition.
Run Multiple data_engines! Yes, really. Apparently this is "Typical"
The HA probe 1.4 document directly contradicts the HA guide for 8.0 and all other documentation, which show that only one DE should run at a time and that the HA probe can be used to flip which one is active. Support told me that multiple data_engines weren't supported and that they hadn't heard of it being on the road map. I've known it was on the road map since at least 2011.
"Important: Typically, the data_engine is configured to be always running on the secondary hub. The secondary data_engine cannot perform maintenance operations, which are relegated to the primary data_engine, but it can provide connection information to clients that request such information as well as persist QoS information to the NIS database."
http://docs.nimsoft.com/prodhelp/en_US/Probes/Catalog/ha/1.4/index.htm?toc.htm?2025849.html - HA probe 1.4
The primary/secondary relationship between data_engines is negotiated through the DB and can be manipulated via a callback.
How communication happens within the DB is not spelled out, but multiple data_engines will learn about each other even if you install one under a different security domain pointed at the same DB while the other DE is disabled. Mgruber's digging revealed that at least part of this appears to happen via the tbn_de_Controller table. The new DE will come online as secondary and blank out the maintenance options with a note that it is secondary. This functionality has been in place since around 2012, based on the date of KB 3292. From my testing, some of the callbacks that look like they would demote the primary won't work; they return an error that they should only be run by another data_engine, even when executed with superuser privileges. The promote-to-primary callback (admin_upgrade_primary, per KB 3292 below) does work on the secondary; it appears to orchestrate the switch and safely promote itself. As near as I can tell, the secondary avoids updating data_defs directly. I suspect it uses the hidden (expert mode) _insert_qos_definition callback to the primary to create a new definition when it needs to, without conflict. It also disables maintenance and does some type of config sync, which may be partially DB-based and partially callback-based (unknown).
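For what it's worth, the role a given instance believes it holds shows up in its raw configuration via the data_engine_id key described in KB 3292 below: 1 means that instance considers itself the primary, anything else means secondary. A minimal sketch (the surrounding layout is my assumption):

    <setup>
       data_engine_id = 2
    </setup>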
From kb3292: https://na4.salesforce.com/articles/TroubleshootingObj/The-primary-data-engine-is-set-incorrectly?popup=true#
Symptoms:
· Cannot edit the configuration of the data_engine probe.
· Message appears: "There are multiple data_engines using the same NIS database. Current data_engine is working in a 'secondary' mode, which means it will not perform maintenance. Current primary data_engine is at: /<Domain>/<hub>/<Robot>/data_engine"
· In raw configure for the data_engine, the data_engine_id key is set to a value other than 1 on the system which is expected to be the primary. (1 indicates that this is the primary.)

Solution:
In the setup section of the data_engine config, add the following key: show_admin_upgrade_primary = yes. This will cause the callback admin_upgrade_primary to be available in the data_engine. Call this with the probe utility (highlight the data_engine and press Ctrl-P), and the data_engine should be properly promoted to primary.
This functionality exists to prevent conflicting maintenance and parallel updates of QoS definitions and data.
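Putting the KB together, the promotion amounts to two steps. The raw-configure key and callback name are straight from KB 3292; the pu command line is just one way to fire the callback and its exact flags are from memory, so treat it as a sketch. First, raw configure the data_engine you want to promote and add to its setup section:

    <setup>
       show_admin_upgrade_primary = yes
    </setup>

Then invoke the callback, either from the probe utility (highlight the data_engine, Ctrl-P) or from a command line, e.g.:

    pu -u <admin_user> -p <password> /<Domain>/<hub>/<Robot>/data_engine admin_upgrade_primary

The data_engine should then be promoted to primary, and the data_engine_id on that instance should flip to 1.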
Nis bridge in NAS
Only one NiS bridge should run; the secondary nas will queue data and send it to the primary when it comes back online (per the HA 1.4 doc).
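I haven't verified the exact raw key, but based on the nas GUI the NiS bridge toggle lives in the nas setup section, something along these lines (the nis_bridge key name is an assumption from memory, so check an actual nas.cfg before relying on it):

    primary nas:
       <setup>
          nis_bridge = yes
       </setup>

    standby nas:
       <setup>
          nis_bridge = no
       </setup>

Presumably whatever handles failover (the HA probe) is what flips this, while the standby nas just queues and forwards as described above.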
AO
The HA probe doc claims to be able to handle this (AO failover).
Data engine best practices
This article explains how to implement parallel processing, partitioning, and other features for high throughput on MSSQL.
https://na4.salesforce.com/articles/Best_Practices/Data-engine-Best-Practices?popup=true
The benchmarking doc for the data_engine describes what the options mean a little better.
https://wiki.ca.com/display/UIMPGA/data_engine+IM+Basic+Benchmarking+v8.0
Unknown – Other database probe setup?
What about all of the other *_engine, qos_processor, etc. probes? Should only one of each be running? Should they be managed by the HA probe?
UMP?
I’m guessing UMP can use the redundant DE? What are the implications for geographical redundancy?
Experimental:
You can run more than one set of root infrastructure in a security domain, including multiple sets of data_engines talking to different databases. The discovery_server seems to be the big point of contention, since it polls and cleans the downstream niscache.
You may also run a data_engine in one security domain as primary and a secondary in another security domain. They seem to negotiate the relationship fine via the database, but I suspect cross-domain callbacks to _insert_qos_definition would fail. Not sure what the impact of that would be.
This may lead to viable stage / test / release options or better options for DR redundancy or sandboxing an upgrade.
Ramble
Nas replication, including one-way nas replication with the alert responder option, can be combined with full replication in various configurations, according to a conversation with someone close to the dev team for that probe. (Sorry for not crediting you; you were very helpful, but I'm bad with names.)
The nas also adds a field to each alarm telling you which nas processed it.
I don't have time to document the idea fully, but a tiered / hierarchical nas seems possible: processing happens mostly on the distribution hubs, with DR replication of the root hubs and AO processing / bridge on the root, combined with an ITSM integration that is active/active on both sides and uses the nas-origin field to avoid taking redundant actions.
We also have the advantage of a little probe that can use the bulk_post delivery method (post_queue in the hub) to send messages upstream with redundant targets across redundant tunnels. I.e., if you can't send to the primary, send to the secondary. This really simplifies HA by making failover of message paths immediate and delegating the action on failed paths downstream. It's also way more efficient and eliminates the winsock-induced get_queue / subscriber limitations.
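For context, the delivery path rides on an ordinary hub post queue. A rough sketch of what one looks like in hub.cfg follows; section and key names are from memory and may differ by hub version, so treat it purely as an illustration. Note that the redundant-target / fall-back-to-secondary behavior described above comes from the probe, not from the stock queue definition:

    <queues>
       <to_primary_hub>
          active = yes
          type = post
          subject = QOS_MESSAGE
          address = /<Domain>/<PrimaryHub>/<robot>/hub
          bulk_size = 100
       </to_primary_hub>
    </queues>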