Attached is the Lua script I've written to fail over a multi-instance UMP from a primary hub to a secondary hub.
As most of you know, the HA probe only stops and starts probes (and queues), which works for UMP only if UMP is local to the primary and secondary hubs. When one or more separate robots host UMP, failover is difficult to achieve and turns into a manual process. With a single instance of UMP, reconfiguring to point to a different data_engine and nas is a relatively straightforward process and is often sufficient for smaller installations, but when there are two or more instances set up as multi-instance UMP, it is more of a task and it's easy for the user to make mistakes.
So when you have one or more UMP instances running on separate robots and a primary and secondary hub controlled by HA then you may find this script useful to "re-point" UMP at the failed-over data engine and nas etc.
The script is triggered by the alarms generated by the HA probe, which runs on the secondary hub. The script is installed on the nas on the secondary hub, and I have set it up to trigger on the alarms that start with "Initiating failover" and "Stopping failover", using the message to decide which hub UMP should be configured to point to.
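The trigger logic boils down to a prefix match on the HA alarm message. The actual script is Lua; this is just a rough Python sketch of the decision, and the function and argument names are illustrative, not taken from the script:

```python
def target_hub(alarm_message, primary_hub, secondary_hub):
    """Decide which hub UMP should point at, from the HA probe alarm text.

    "Initiating failover" -> HA has taken over; point UMP at the secondary.
    "Stopping failover"   -> HA is failing back; point UMP at the primary.
    """
    if alarm_message.startswith("Initiating failover"):
        return secondary_hub
    if alarm_message.startswith("Stopping failover"):
        return primary_hub
    return None  # any other alarm is ignored
```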
Setup is easy: just put the Nimsoft addresses of the two hubs and the hub names into the script, then populate the array of UMP robot names.
The first thing the script does is check that the UMP robots themselves have moved to their secondary hub, since the assumption is that their primary hub has just gone down.
It then loops through each instance, reconfiguring the DAP, Dashboard Engine and WASP with the secondary nas and data_engine addresses.
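In outline, that loop produces one configuration change per robot/probe/key combination. Here is a hedged Python sketch of the iteration (the real script is Lua; the probe names come from the description above, but which keys each probe needs is version-dependent, so this mapping is illustrative only):

```python
# Illustrative mapping: which keys to re-point for each probe on a UMP robot.
# The real key names vary by UMP version; treat these as placeholders.
PROBE_KEYS = {
    "dap": ("nas",),
    "dashboard_engine": ("data_engine", "nas"),
    "wasp": ("data_engine", "nas"),
}

def planned_updates(ump_robots, hub_probe_addr):
    """Yield (robot, probe, key, new_value) for every config change to make.

    hub_probe_addr is the surviving hub robot's address prefix, e.g.
    "/dom/hub/hubrobot/"; each key is re-pointed at that hub's probe.
    """
    for robot in ump_robots:
        for probe, keys in PROBE_KEYS.items():
            for key in keys:
                yield (robot, probe, key, hub_probe_addr + key)
```

In the real script, each such tuple becomes a probe config set request against the robot's controller.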
In my particular customer installation they don't use all the functions available in UMP, so not all keys are updated; for anyone else using this, it should be easy to copy and paste extra sections for any other keys you need to change.
This has been fully tested in a 2-node UMP installation; the production environment is a 5-robot multi-instance UMP deployment, which works the same way, just looping through an extra three times.
Any bugs/improvements or requests for features, please post back.
I just created an Idea two days ago for building these features into UMP.
Here's the latest UMP failover script, which I have just tested on 8.4 SP1 (it will work on 8.4 as well).
Main updates are for new keys in wasp.
There have been quite a few improvements made since my original version and here are a couple of highlights worth mentioning.
1. The first improvement is running the script on both the primary and secondary NAS. The secondary NAS profile is triggered by the "Initiating failover" alarm from the HA probe (which only runs on the secondary). The primary NAS profile is triggered by the "Stopping failover" alarm, which also comes from the HA probe and is replicated over from the secondary. [This setup depends on NAS replication between primary and secondary.] This is needed if the NAS AO is controlled by the HA probe: on HA failback the secondary NAS AO is deactivated, so the script would never be triggered. Note: the primary and secondary scripts are exactly the same.
2. The script has the ability to turn on the nis_bridge on the secondary hub NAS so that alarms will be inserted into the database and therefore be available in UMP. This is important if the secondary hub is to be used for any length of time.
A key in the input file sets this in motion.
3. An input file holds the parameters for your environment so that you don't have to edit the script. It is located on each hub in the NAS directory (it can be anywhere; just update the script), and if you cannot visit the host due to remote-access limitations, it should be placed there using a package. Here is an example input file:
PriHubAddr = /rc82-dom/rc82_phub/
SecHubAddr = /rc82-dom/rc82_shub/
PriHub = colro22-i147145
SecHub = colro22-i159911
nis_bridge_activate = yes
umpname = colro22-i147365,colro22-i176236
PriIP = 10.131.nn.nn
SecIP = 10.131.nn.nn
wasp = 100
robot = 20
This is taken from a secondary hub NAS; note that the nis_bridge_activate key is set to "yes". The input file for the primary would be the same except that nis_bridge_activate is set to "no". This doesn't mean it will turn off the nis_bridge on the primary, only on the secondary: the script only turns the nis_bridge on or off on the secondary NAS and never touches the primary NAS. What it does mean is that when the script turns the nis_bridge on, on the secondary, it has to restart the secondary NAS (because it does a probe config set), even though the script hasn't finished. Activating the nis_bridge is the last operation the script performs, and all previous updates have already been committed, so the secondary profile always finishes in an error state; this is fine. The primary NAS profile for UMP failover should finish without errors.
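The input file is plain key = value text, with the umpname value being a comma-separated list. The script itself is Lua; this Python sketch just illustrates how the format parses (parse_inputfile is a name I've made up for the illustration):

```python
def parse_inputfile(text):
    """Parse the key = value input file into a dict.

    Comma-separated values (like umpname) become lists of robot names.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blanks and anything that isn't key = value
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        params[key] = [v.strip() for v in value.split(",")] if "," in value else value
    return params

example = """PriHubAddr = /rc82-dom/rc82_phub/
SecHubAddr = /rc82-dom/rc82_shub/
nis_bridge_activate = yes
umpname = colro22-i147365,colro22-i176236"""
```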
4. Note the fname in the script, which might need to be updated with Program Files (x86) for the location of the input file.
Two scripts are attached: failover for the secondary and failback for the primary (both are identical; I just want to emphasize the use of both NASes).
Thanks for posting this Rowan. It is amazing the life it has lived since its 2012 inception.
Thanks Rowan. It actually helps with one of my customer's requirements.
I have one more requirement from a customer: he has the UIM DB in the primary data center, with data replicated to a DR site. If the primary UIM DB fails, the customer wants to move the primary hub's connection to the DR DB. Would you have any script to modify the hub configuration (I think data_engine) in such a scenario?
I think I should be able to modify the attached script for such a scenario.
Is there any alarm generated when the UIM DB is not accessible to the primary hub and UMP?
Here is an example of the alarm (from data_engine) when the backend UIM database (MS SQL Server) is down.
[Microsoft OLE DB Provider for SQL Server] [DBNETLIB][ConnectionRead (recv()).]General network error. Check your network documentation.
Jitendra, create a script that is triggered by the data_engine alarm (and maybe a couple of other alarms, such as the SQL Server database-alive check) and do a probe_config_set on the database connection details.
Not many customers do this but if it works for you, great.
Thanks Rowan. Let me try this.
For those of you who are using this UMP failover script, I have updated it for UIM version 8.5.1.
Two main updates:
1. Cope with a new key in the wasp.cfg ump_common section (ems).
2. Failover for the CABI robot as well as the UMP robots.
For number 2, I have added a new key to the input file, which should look like the following:
cabirobotname = <CABI robotname>
eg cabirobotname = colro22-i123456
Also needed is the CABI robot name in the list of UMP robots (the umpname key).
So in the example above, the key will now look like...
umpname = colro22-i147365,colro22-i176236,colro22-i123456
In this example there are 2 ump robots and 1 cabi robot.
The script will change the data_engine key in the setup section of the CABI wasp.
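The arrangement above (the CABI robot listed in umpname, but also named in cabirobotname and treated differently) can be sketched in Python like so (split_robots is a hypothetical helper for illustration, not a function from the script):

```python
def split_robots(umpname, cabirobotname):
    """Split the umpname list into plain UMP robots and the CABI robot.

    The CABI robot must appear in umpname as well; per the post above,
    it only gets the data_engine key in its wasp setup section changed,
    while the UMP robots get the full set of key updates.
    """
    robots = [r.strip() for r in umpname.split(",")]
    ump_robots = [r for r in robots if r != cabirobotname]
    cabi_robots = [r for r in robots if r == cabirobotname]
    return ump_robots, cabi_robots
```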
There are two scripts attached to highlight that one needs to be on the primary NAS and one on the secondary NAS, but they are both the same script.
I have tested this on a 2-UMP, 1-CABI system, all on version 8.5.1 of UIM.
Hi Rowan, thanks for the script.
But I have some trouble understanding how to apply it, as I'm rather new to CA UIM.
Could you help me understand how to configure the script for UMP failover?
I have the nas probe on my primary hub, and a secondary nas probe on my HA primary hub. So I'm supposed to configure the secondary nas to run your script, correct? May I know how it is done?
I'm confused about the UMP robot configuration. Do you mean the robots on both my primary UMP and secondary UMP servers? Or where do I configure them?
The scripts need to be placed on both NASes: failback on the primary and failover on the secondary (the HA primary, as you call it). Then create the inputfile.txt, which you need to place in a conf folder inside the NAS probe folder, with the contents described above. Then create the NAS profile that looks for the alarm from the HA probe with the message "Initiating failover" (secondary hub NAS) or "Stopping failover" (primary hub NAS). It does rely on replication between the NASes being activated.
Hope that helps.
It seems that after failover I was able to log in to the UMP servers, so I believe the script had taken effect (previously I couldn't log in).
However, at the starting dashboard, when it starts initializing, it shows the following error. I tried refreshing and the same error still appears. Do you know the cause?
An unknown error has occurred.Refreshing your browser may resolve the issue.
Details:com.firehunter.ump.exceptions.DataFactoryException : com.firehunter.ump.exceptions.DataFactoryException: Column 'CM_CONFIG_ITEM_TO_MASTER.master_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.Please check the log for more information.Stack Trace:(1) error, com.firehunter.ump.exceptions.DataFactoryException: Column 'CM_CONFIG_ITEM_TO_MASTER.master_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.: Column 'CM_CONFIG_ITEM_TO_MASTER.master_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. at com.firehunter.usm.alarms.NisDbAlarmProvider.getAlarmSummary(NisDbAlarmProvider.java:670) at com.firehunter.usm.alarms.NisDbAlarmProvider.getAlarmSummary(NisDbAlarmProvider.java:642) at com.firehunter.usm.AlarmUtils.getAlarmSummary(AlarmUtils.java:970) at com.firehunter.usm.DataFactory.getRoot(DataFactory.java:4227) at com.firehunter.usm.DataFactory.getCacheEntry(DataFactory.java:3649) at com.firehunter.usm.DataFactory.getGroups(DataFactory.java:3353) at com.firehunter.usm.DataFactory.getGroups(DataFactory.java:2937) at com.firehunter.usm.DataFactory.getGroups(DataFactory.java:2928) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at flex.messaging.services.remoting.adapters.JavaAdapter.invoke(JavaAdapter.java:421) at flex.messaging.services.RemotingService.serviceMessage(RemotingService.java:183) at flex.messaging.MessageBroker.routeMessageToService(MessageBroker.java:1503) at flex.messaging.endpoints.AbstractEndpoint.serviceMessage(AbstractEndpoint.java:884) at 
flex.messaging.endpoints.amf.MessageBrokerFilter.invoke(MessageBrokerFilter.java:121) at flex.messaging.endpoints.amf.LegacyFilter.invoke(LegacyFilter.java:158) at flex.messaging.endpoints.amf.SessionFilter.invoke(SessionFilter.java:44) at flex.messaging.endpoints.amf.BatchProcessFilter.invoke(BatchProcessFilter.java:67) at flex.messaging.endpoints.amf.SerializationFilter.invoke(SerializationFilter.java:146) at flex.messaging.endpoints.BaseHTTPEndpoint.service(BaseHTTPEndpoint.java:278) at flex.messaging.MessageBrokerServlet.service(MessageBrokerServlet.java:322) at javax.servlet.http.HttpServlet.service(HttpServlet.java:731) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at com.firehunter.ump.auth.InvalidHttpSessionFilter.doFilter(InvalidHttpSessionFilter.java:29) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.doFilter(InvokerFilterChain.java:73) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.doFilter(InvokerFilterChain.java:117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.liferay.portal.kernel.bean.ClassLoaderBeanHandler.invoke(ClassLoaderBeanHandler.java:67) at com.sun.proxy.$Proxy874.doFilter(Unknown Source) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.doFilter(InvokerFilterChain.java:73) at 
com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.processDirectCallFilter(InvokerFilterChain.java:168) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.doFilter(InvokerFilterChain.java:96) at com.liferay.portal.kernel.servlet.PortalClassLoaderFilter.doFilter(PortalClassLoaderFilter.java:72) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.processDoFilter(InvokerFilterChain.java:207) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.doFilter(InvokerFilterChain.java:109) at com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilter.doFilter(InvokerFilter.java:84) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:218) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:442) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1083) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:640) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at 
java.lang.Thread.run(Thread.java:745)Caused by: com.firehunter.ump.exceptions.DataFactoryException: Column 'CM_CONFIG_ITEM_TO_MASTER.master_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:196) at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1454) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:388) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:338) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4026) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1416) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:185) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:160) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:281) at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96) at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96) at com.firehunter.ump.db.PartialPreparedStatement.executeInternal(PartialPreparedStatement.java:284) at com.firehunter.ump.db.PartialPreparedStatement.execute(PartialPreparedStatement.java:260) at com.firehunter.usm.alarms.NisDbAlarmProvider.getUsmAlarmSummaries(NisDbAlarmProvider.java:939) at com.firehunter.usm.alarms.NisDbAlarmProvider.getAlarmSummary(NisDbAlarmProvider.java:668) ... 59 more
Sorry, I don’t know the reason for that error.
I suggest raising a case with CA support to get someone to take a look at your environment.
Does it only happen when UMP is pointing to the secondary?
What about when you failback?
Hmm, yes, it only happens when it's pointing to the secondary hub (when failover kicks in). Normally it works well before failover, and when I fail back everything also works as normal.
One question: how do we know the script worked, other than being able to log in? Was the wasp.cfg changed on the UMP server itself too? I noticed the timestamp changed, but the values seem to be the same.
Can you compare the wasp.cfg when you failover to the secondary with the original when pointing to the primary?
Hmm, I noticed the timestamp of the wasp.cfg on the UMP server changed after failover. I compared both wasp.cfg files (before and after failover). The contents seem similar, except that the post-failover wasp.cfg has the <ump_common> tag moved to the top (see below), with everything pointing to the primary hub. Is this correct? Did the script do what it was supposed to do?
<ump_common>
   maintenance_mode = /cauimdomain/cauim001_hub/cauim001/maintenance_mode
   ems = /cauimdomain/cauim001_hub/cauim001/ems
   nas = /cauimdomain/cauim001_hub/cauim001/nas
   ace = /cauimdomain/cauim001_hub/cauim001/ace
   automated_deployment_engine = /cauimdomain/cauim001_hub/cauim001/automated_deployment_engine
   discovery_server = /cauimdomain/cauim001_hub/cauim001/discovery_server
   mpse = /cauimdomain/cauim001_hub/cauim001/mpse
   sla_engine = /cauimdomain/cauim001_hub/cauim001/sla_engine
   udm_manager = /cauimdomain/cauim001_hub/cauim001/udm_manager
   service_host = service_host
</ump_common>
After failover they (ump_common keys) should be pointing to the secondary hub.
Can you check the nas.log when you see the "Initiating failover" alarm; you will see whether the script was triggered.
The script will write about 30 lines or so to the nas.log (search for ump_failover).
Also check that your profile in the NAS has been triggered (right click / view activity).
Oh no... it seems my NAS profile is not running the script (I couldn't see the alarm you mentioned). Could you help check whether the settings I made on the secondary NAS are correct? Thanks!
Firstly, if you don't see the alarm then your HA isn't running, or you need to log into the secondary NAS in Infrastructure Manager, as that is where it will be processed.
Secondly try the following in the message string –
And select all the severities in the profile (I can’t remember the alarm severity)