We have identified an unexpected behavior in a CA SSO 12.7 bi-cluster setup that leads to a downtime condition. We perceive it as a deviation from the intended functionality and would like it to be resolved by producing a patch for the current or closest future release. Here is the scenario.
The Setup: The CA SSO PS infrastructure is set up in 2 clusters with 3 nodes in each (1.1, 1.2, 1.3 and 2.1, 2.2, 2.3 respectively) and a failover threshold of 50%. The "enable failover" feature between cluster nodes is turned off.
The bug: In a failover scenario we managed to reach a reproducible state where the entire agent infrastructure was down while 2 out of 6 Policy Servers were still up but idling, by executing the following failover test scenario:
1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually fail over from Cluster 1, as expected, due to availability dropping to 33% and Cluster 2 becoming preferred.
2. Node 2.2 is shut down. Result: All agents stay on Cluster 2 because it is still at 66%.
3. Node 2.3 is shut down. Result: All agents are down and disconnected from both Cluster 1 and Cluster 2, which still have 33% capacity each.
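The arithmetic behind the three steps can be sketched as follows. This is only an illustrative model, not CA SSO code: the function names are hypothetical, and it assumes a cluster counts as failed once its share of running nodes drops below the failover threshold.

```python
# Illustrative model only; not CA SSO code. All names are hypothetical.
THRESHOLD = 50  # failover threshold in percent, as in the setup above


def availability(up, total):
    """Percentage of nodes in a cluster that are still running."""
    return 100 * up / total


def below_threshold(up, total, threshold=THRESHOLD):
    # Assumption: a cluster counts as failed once its availability
    # drops below the threshold.
    return availability(up, total) < threshold


# Nodes still running in (Cluster 1, Cluster 2) after each step of the test.
steps = [
    ("start", 3, 3),
    ("step 1: 1.1 and 1.2 down", 1, 3),
    ("step 2: 2.2 down", 1, 2),
    ("step 3: 2.3 down", 1, 1),
]


def summarize(steps, total=3):
    """Return (label, cluster 1 failed?, cluster 2 failed?) for every step."""
    return [
        (label, below_threshold(up1, total), below_threshold(up2, total))
        for label, up1, up2 in steps
    ]


for label, c1_failed, c2_failed in summarize(steps):
    print(f"{label}: cluster 1 failed={c1_failed}, cluster 2 failed={c2_failed}")
```

Under these assumptions both clusters end up below the 50% threshold after step 3, even though one node is still running in each.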
This is obviously a problem: 2 servers are still up but do nothing, while the entire agent environment in both data centers is down. The only way to work around the bug is to NOT use the failover threshold at all, i.e. setting it to 33% so that agents keep hammering the poor Cluster 1 until it faints, all the while Cluster 2 would enjoy its 100% capacity. This has to be addressed.
Here is a sample to illustrate it:
We have 6 Policy Servers configured in 2 clusters as follows:
Cluster A: 1.1, 1.2, 1.3
Cluster B: 2.1, 2.2, 2.3
The failover threshold is set to 50%, which means that a cluster is considered down when at least 50% of the Policy Servers in that cluster are unavailable. They do the following:
1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually fail over from Cluster A, as expected, due to availability dropping to 33% and Cluster B becoming active.
We expect (as the documentation mentions) requests to keep going to the available nodes in Cluster B.
The behavior observed is expected and working as designed. To avoid the cluster being considered down when there is still 1 Policy Server up, the failover threshold should be set to less than 30%. Another option is to set up a third cluster containing all the Policy Servers, so that the two available nodes would be there.
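A quick check of the threshold recommendation, as a sketch only. The helper below is hypothetical, not a CA SSO API, and it assumes a cluster is considered down when its share of available Policy Servers is below the failover threshold.

```python
# Hypothetical helper for illustration; not part of CA SSO.
def considered_down(up, total, threshold_percent):
    # Assumption: a cluster is down when the percentage of available
    # Policy Servers is below the failover threshold.
    return 100 * up / total < threshold_percent


# One Policy Server left out of three, checked against two thresholds.
for threshold in (50, 30):
    state = "down" if considered_down(1, 3, threshold) else "up"
    print(f"threshold {threshold}%: cluster considered {state}")
```

With one of three servers running, availability is about 33%, so under this assumption the cluster counts as down at a 50% threshold but stays up at a threshold below 30%.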
KB: KB000099632