Tech Tip : CA Single Sign-On : Clusters fail while nodes are still up

    Broadcom Employee
    Posted 06-01-2018 07:00 AM

    Issue:

     

    We have identified unexpected behavior in a CA SSO 12.7 two-cluster
    setup that leads to downtime. We consider it a deviation from the
    intended functionality and would like it resolved with a patch for
    the current or the next release. Here is the scenario.

     

    The Setup: The CA SSO Policy Server (PS) infrastructure is set up as
    2 clusters with 3 nodes each (1.1, 1.2, 1.3 and 2.1, 2.2, 2.3
    respectively) and a failover threshold of 50%. The "enable failover"
    feature between cluster nodes is turned off.

     

    The bug: In a failover test we reached a reproducible state in which
    the entire agent infrastructure was down while 2 of the 6 Policy
    Servers were still up but idle. The test scenario was as follows:

     

    Steps :

     

    1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually
    fail over from Cluster 1, as expected, because its availability
    drops to 33% and Cluster 2 becomes preferred.

     

    2. Node 2.2 is shut down. Result: All agents stay on Cluster 2
    because it is still at 66%.

     

    3. Node 2.3 is shut down. Result: All agents are down and
    disconnected from both Cluster 1 and Cluster 2, which still have 33%
    capacity each.
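
    The three steps above can be replayed with a small model. This is
    only a sketch of the threshold arithmetic, not the agent's actual
    failover code: it assumes a cluster counts as usable while its share
    of running nodes is at least the failover threshold, and the names
    usable_clusters and run_scenario are hypothetical.

```python
# Simplified model of the failover threshold check (an assumption,
# not the real CA SSO agent logic).
CLUSTERS = {
    "Cluster 1": ["1.1", "1.2", "1.3"],
    "Cluster 2": ["2.1", "2.2", "2.3"],
}


def usable_clusters(clusters, down, threshold_pct):
    """Return the names of clusters whose share of running nodes meets the threshold."""
    usable = []
    for name, nodes in clusters.items():
        up = [node for node in nodes if node not in down]
        # Integer form of: len(up) / len(nodes) >= threshold_pct / 100
        if len(up) * 100 >= threshold_pct * len(nodes):
            usable.append(name)
    return usable


def run_scenario(threshold_pct=50):
    """Apply the three shutdown steps in order and record the usable clusters after each."""
    down = set()
    history = []
    for step in (["1.1", "1.2"], ["2.2"], ["2.3"]):
        down.update(step)
        history.append(usable_clusters(CLUSTERS, down, threshold_pct))
    return history


if __name__ == "__main__":
    for number, usable in enumerate(run_scenario(), start=1):
        print(f"Step {number}: usable clusters = {usable or 'none'}")
```

    Under this assumption, a 50% threshold leaves only Cluster 2 usable
    after steps 1 and 2, and no cluster at all after step 3, even though
    nodes 1.3 and 2.1 are still running.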

     

    This is clearly a problem: 2 servers are still up but do nothing,
    while the entire agent environment in both data centers is down. The
    only way to work around the bug is to NOT rely on the failover
    threshold at all, i.e. to set it to 33%, so that agents keep
    hammering the poor Cluster 1 until it gives out, all the while
    Cluster 2 enjoys its 100% capacity. This has to be addressed.

     

    Here's a sample to illustrate it :

     

    We have 6 Policy Servers configured in 2 clusters as follows:

     

    Cluster A : 1.1, 1.2, 1.3
    Cluster B : 2.1, 2.2, 2.3

     

    The failover threshold is set to 50%, which means that a cluster is
    considered down once at least 50% of its Policy Servers are
    unavailable. The following steps are then performed:

     

    Steps :

     

    1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually
    fail over from Cluster A, as expected, because its availability
    drops to 33% and Cluster B becomes active.

    2. Node 2.2 is shut down. Result: All agents stay on Cluster B
    because it is still at 66%.

    3. Node 2.3 is shut down. Result: All agents are down and
    disconnected from both Cluster A and Cluster B, which still have 33%
    capacity each.

     

    We expect (as the documentation states) requests to keep going to
    the available nodes in Cluster B.

     

    Resolution:

     

    The observed behavior is expected and working as designed. To avoid
    a cluster being considered down while 1 Policy Server is still up,
    the failover threshold should be set to less than 30%. Another
    option is to set up a third cluster containing all the Policy
    Servers, so that the two available nodes would be found there.
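
    To see why a threshold below 30% keeps a 3-node cluster alive with a
    single node, the arithmetic can be sketched as follows. The helper
    min_nodes_required is hypothetical and assumes a cluster stays up
    while the percentage of running nodes is at least the threshold; the
    product's exact comparison rule may differ.

```python
def min_nodes_required(total_nodes, threshold_pct):
    """Smallest number of running nodes that keeps a cluster at or above the threshold.

    Assumes the rule "up while running share >= threshold", which is a
    simplification and not confirmed product behavior.
    """
    # Integer ceiling division: ceil(total_nodes * threshold_pct / 100)
    return -(-total_nodes * threshold_pct // 100)


if __name__ == "__main__":
    for pct in (50, 30):
        needed = min_nodes_required(3, pct)
        print(f"threshold {pct}%: {needed} of 3 nodes must be running")
```

    With 3 nodes, one running node is about 33% availability, so in this
    model a 50% threshold needs 2 running nodes while a 30% threshold
    needs only 1.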

     

    KB : KB000099632