Symantec Privileged Access Management

 Recovery in the event of a failure in a cluster

MARUBUN SUPPORT posted Apr 09, 2025 05:36 AM

Hi Team,


I received a question from a customer and would like an answer.

[Product]
PAM


[Current situation]
We are currently operating a cluster with two members (A and B).
An error occurred that caused services to stop on both members.
We then shut down the server of member (B), where the error was thought to have occurred,
and checked the operation of member (A), but it did not recover.
We then restarted member (A), and the following message was displayed:

PAM-CMN-5190: This CA PAM appliance lost the connection to the member(s) in the primary site and is in the mode only admin level users can login.


[Question]
In the above situation, is it possible to operate with a single system (single system operation) by removing member (A) from the cluster?
If there are any risks in this case, please let us know.

After removing it from the cluster, is it possible to re-register and restore the original cluster configuration (two-system configuration)?
If possible, we would appreciate it if you could let us know the necessary conditions and procedures.

Thanks,

Broadcom Employee Joseph Fry

Sounds like a classic split-brain / loss of quorum scenario.  The documentation discusses recovery:
https://techdocs.broadcom.com/us/en/symantec-security-software/identity-security/privileged-access-manager/4-1-1/deploying/set-up-a-cluster/cluster-synchronization-promotion-and-recovery.html

What you propose should work fine.  The database on 'B' should be the current/correct data, so long as you start the rebuilt cluster using that node as the source, it should recover well.

Please advise the customer to take snapshots/backups before proceeding.

And please explain to the customer that it is for this reason that we NEVER recommend a primary site with two nodes.  If adding a third node is impossible, then it is better to split those two nodes into two different sites.  You will still have an outage if the primary node goes offline, but it will self heal when the node recovers.

MARUBUN SUPPORT

> The database on 'B' should be the current/correct data, so long as you start the rebuilt cluster using that node as the source, it should recover well.

When operating with two members, if a problem occurs with one of the members, only the remaining member will take over. Is such a setting possible? If a problem occurs during operation and one member is removed, will the system still be able to operate?

Broadcom Employee Joseph Fry

> When operating with two members, if a problem occurs with one of the members, only the remaining member will take over. Is such a setting possible? If a problem occurs during operation and one member is removed, will the system still be able to operate?

This is not possible. Imagine that the network between the two nodes breaks: both nodes would then think the other node is offline and would continue to operate. This is the "split-brain" condition. When connectivity is restored, the two nodes would have different data, and there would be no way of knowing which database is more accurate without a VERY complex analysis and merge of every database operation that occurred.
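
To make the divergence concrete, here is a toy sketch in Python. It is not PAM code: the two dictionaries stand in for the databases on nodes A and B, and the key names and values are made up purely for illustration.

```python
def simulate_partition():
    """Model two nodes that both keep accepting writes during a network break."""
    shared = {"server1_password": "rev1"}  # state both nodes agreed on before the break
    node_a = dict(shared)
    node_b = dict(shared)

    # During the partition each node believes the other is offline
    # and accepts writes on its own (hypothetical values).
    node_a["server1_password"] = "rev2-written-on-A"
    node_b["server1_password"] = "rev2-written-on-B"
    node_b["new_device"] = "added-on-B"
    return node_a, node_b


def diverged_keys(a, b):
    """Return the sorted keys whose values differ between the two copies."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    a, b = simulate_partition()
    print(diverged_keys(a, b))
```

Every key reported here would need a manual decision about which copy wins, which is the kind of merge analysis described above.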

To prevent this, a PAM node stops all database operations (and thus user activity) when it detects a loss of quorum. A quorum is connectivity with more than 50% of the nodes. With just two nodes, the quorum is two nodes, so losing either one halts both. With three nodes, two nodes form a quorum; the remaining node would not have quorum and would halt database operations until it can communicate with the cluster again.
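
The quorum arithmetic can be sketched in a few lines of Python. This is only an illustration of the "more than 50%" rule stated above, not PAM's actual implementation; the function names are hypothetical, and `reachable_nodes` is assumed to count the node itself plus every peer it can still reach.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest node count that is strictly more than half of the cluster."""
    return total_nodes // 2 + 1


def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True if this node (counting itself) still sees a majority of the cluster."""
    return reachable_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    for total in (2, 3):
        for reachable in range(1, total + 1):
            state = "keeps running" if has_quorum(reachable, total) else "halts"
            print(f"{total}-node cluster, {reachable} reachable: {state}")
```

With two nodes the quorum size equals the cluster size, so there is no failure the cluster can tolerate; with three nodes it can lose one.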

This is why, with just two nodes, the recommendation is to separate them into different logical sites.  This will mean that one node is always the primary.  The cluster will self heal if a node is rebooted or experiences a temporary outage; if there is a prolonged outage of the primary node, then you can promote the secondary site and resume operations with confidence that the data is intact.

I hope this makes sense.

MARUBUN SUPPORT

> The cluster will self heal if a node is rebooted or experiences a temporary outage; 

If we split into these two logical sites, will the PAM service continue to operate on the remaining site if one of them stops?
Or will the PAM service stop unless we restart one of the sites and return it to the multi-cluster?

Thanks, 

Broadcom Employee Joseph Fry

> If we split into these two logical sites, will the PAM service continue to operate on the remaining site if one of them stops?
> Or will the PAM service stop unless we restart one of the sites and return it to the multi-cluster?

If you split them into two sites, one node becomes the primary site and the other becomes a secondary site.

If the secondary site node fails, the primary site is not impacted and can continue to service user sessions.

If the primary site node fails, then the entire cluster halts until either the primary site is restored OR the secondary site is manually promoted as the new primary site.

This is actually more resilient than a two-node primary site, which will halt if either node fails and will require manual intervention to recover.

The primary disadvantage of splitting the nodes is that the built-in load balancer will no longer work. Users will need to manually select which node to log in to, or you will need to deploy an external load balancer.

Keep in mind that the optimal solution would be to simply have three nodes in the primary site. Additional nodes are relatively inexpensive, and a three-node primary site is resilient against any single node failing.
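
As a rough way to compare the three layouts discussed in this thread, the Python sketch below applies one simplified rule: the cluster keeps serving users only while a majority of the primary-site nodes are up. This is an assumption made for illustration, not PAM's actual logic; the node names and the helper function are hypothetical, and manual promotion of a secondary site is not modelled.

```python
from typing import Set


def primary_site_available(primary_nodes: Set[str], failed: Set[str]) -> bool:
    """True while surviving primary-site nodes are a strict majority of that site."""
    surviving = primary_nodes - failed
    return 2 * len(surviving) > len(primary_nodes)


# Primary-site membership for each layout; in the two-site layout
# node B sits in the secondary site, so it is not listed here.
TOPOLOGIES = {
    "two-node primary site": {"A", "B"},
    "two sites, one node each": {"A"},
    "three-node primary site": {"A", "B", "C"},
}

if __name__ == "__main__":
    for name, primary in TOPOLOGIES.items():
        for node in ("A", "B"):
            ok = primary_site_available(primary, {node})
            print(f"{name}: node {node} fails -> {'running' if ok else 'halted'}")
```

Under this model only the three-node primary site survives the loss of any single node, and the two-site layout survives the loss of the secondary node but not the primary.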