We are concluding a long-term professional services engagement with CA to build out an HA/BC infrastructure for Privileged Identity Manager. At the end of September we conducted a BC test of the environment and learned a very painful, four-day lesson in how difficult it is to fail over to our contingency environment, even with hands-on assistance from the CA support team and from multiple CA consultants who had designed the failover model. The reason? The environment is highly complex, and both the failover to our BC environment at the alternate datacenter and the failback to normal production require many manual changes to configuration files. Most of our issues stem from the complexity surrounding Distribution Server processes and functions. There are DH, DMS, DH_WRITER, JBoss, Tibco queues... and their configuration is highly specialized, requiring a deep understanding of what you are doing to avoid mistakes when you have to fail over in a complex HA/BC environment. It is so complex that even our CA consulting team did not seem to have a good handle on what the configuration steps should be to make failover work, even though they designed it.
Configuration changes have to be made manually, no exceptions, on at least 15 or so different files in our current environment, which CA designed for us. If we add more servers in the future for higher performance and/or resilience, the configuration becomes even harder to manage during an emergency failover, which as everyone knows is a terrible time to have problems with the failover process itself. It is not humanly possible to make all of these changes without mistakes, even with good instructional documentation and years of experience with the product. Even CA consultants with extensive product experience made configuration mistakes during our BC failover test.
An intelligent front-end that guides us through the PIM infrastructure failover process would help tremendously. Currently, we have to modify multiple configuration files manually on multiple infrastructure server components. If we make a single typo, or enter a wrong server name or port number, it can take days to find the problem, even with the assistance of CA consulting and support. A failover solution that is aware of all of the infrastructure components throughout the environment (both regular production and BC) and does not allow us to enter bad values (or at least warns us and makes us override) would help a lot. Perhaps this solution could also include an overall environment health check, so that we are proactively alerted when a PIM server infrastructure component has failed or is not responding. Today we only see symptoms on endpoints well after the problem has occurred, and then have to troubleshoot and backtrack to the source of the failed infrastructure service.
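To make the request more concrete, here is a rough Python sketch of the two behaviours described above: rejecting (or forcing an override on) a server name or port that does not match a known component, and a simple health check over the components at one site. Everything in it is an assumption for illustration only: the `Component` record, the `INVENTORY` list, the host names, the port numbers, and the function names are all hypothetical and are not taken from PIM or any CA tooling.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    site: str   # "prod" or "bc"
    role: str   # e.g. "DMS", "DH", "DH_WRITER"
    host: str
    port: int


# Hypothetical inventory: host names and ports are placeholders, not real values.
INVENTORY = [
    Component("prod", "DMS", "pim-dms-prod.example.com", 10001),
    Component("bc", "DMS", "pim-dms-bc.example.com", 10001),
    Component("prod", "DH", "pim-dh-prod.example.com", 10002),
    Component("bc", "DH", "pim-dh-bc.example.com", 10002),
]


def validate_target(inventory, site, role, host, port):
    """Return a list of problems with a proposed host/port for a role at a site."""
    problems = []
    if not (1 <= port <= 65535):
        problems.append(f"port {port} is outside 1-65535")
    candidates = [c for c in inventory if c.site == site and c.role == role]
    if not candidates:
        problems.append(f"no {role} component is registered for site '{site}'")
        return problems
    if not any(c.host == host for c in candidates):
        known = ", ".join(sorted(c.host for c in candidates))
        problems.append(
            f"host '{host}' is not a known {site} {role} server (known: {known})"
        )
    elif not any(c.host == host and c.port == port for c in candidates):
        problems.append(f"port {port} does not match the registered port for '{host}'")
    return problems


def accept_change(inventory, site, role, host, port, override=False):
    """Accept a value only if it validates, or if the operator explicitly overrides."""
    problems = validate_target(inventory, site, role, host, port)
    return (not problems or override), problems


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_check(inventory, site, probe=tcp_probe):
    """Return the components at a site that do not answer the probe."""
    return [c for c in inventory if c.site == site and not probe(c.host, c.port)]
```

With this sketch, a call such as `accept_change(INVENTORY, "bc", "DMS", "pim-dms-bd.example.com", 10001)` would be refused because the mistyped host is not in the inventory, unless the operator passes `override=True`. A TCP connect only shows that a port is listening, so a real health check would presumably need product-specific probes for each service.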