LDAP failover performance in general
What is a reasonable LDAP failover time, and can failover achieve zero seconds of system downtime?
Even a high-performance failover setup should expect some delay between the primary server going down and the secondary server coming up and starting to take over traffic.
The exact delay varies widely from system to system, because it depends on many factors, including vendor, design, network, and machine capability.
SiteMinder can use CA LDAP as a policy store, user store, and session store.
Any failover recovery under 60 seconds is considered reasonable, given that the default LDAP ping interval is 30 seconds. 60 seconds is the longest gap between the previous server heartbeat check and the most recent one. On top of this, during failover the Policy Server (PS) needs to rebind to LDAP, which is a very expensive operation.
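The 60-second figure can be reproduced with simple arithmetic. The sketch below assumes, for illustration only, that the server fails immediately after a successful ping and that the next ping takes up to one further interval to time out. The function name and this timing model are assumptions, not documented Policy Server behavior.

```python
def worst_case_detection_seconds(ping_interval: float, ping_timeout: float) -> float:
    """Upper bound on the time between an LDAP server failing and the failure being noticed.

    Assumed model (not taken from product documentation): the server dies
    right after a successful ping, so a full interval passes before the next
    ping is sent, and that ping then needs its full timeout to be declared
    failed.
    """
    if ping_interval <= 0 or ping_timeout < 0:
        raise ValueError("ping_interval must be positive and ping_timeout non-negative")
    return ping_interval + ping_timeout


# Default 30-second ping interval, with the timeout assumed equal to the interval.
print(worst_case_detection_seconds(30, 30))
```

Under this model, the rebind time after detection comes on top of the returned value.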
If the client uses CA Directory, they should double-check its configuration to ensure that it is optimized.
In the CA Directory knowledge files, add "dsp-idle-time = 30" to each router DSA and "dsp-idle-time = 40" to each data DSA.
If there is no router DSA, set "dsp-idle-time = 30" on each data DSA in the knowledge files instead.
Then modify the CA Directory limits files to set the maximum operation time to 60 seconds, with "set max-op-time = 60".
This change ensures that CA Directory closes an idle socket connection, or a dead failover LDAP connection, once the time limit is reached.
By default, the idle time for these connections in CA Directory is 600 seconds, which is too long to wait during a failover incident.
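A small script can help audit these values across many DSAs. The sketch below is only an illustration: the helper names are hypothetical, the regex-based parsing is an assumption about how the setting appears in a knowledge file, and the 600-second fallback is the default stated in the text above. Real knowledge files contain many other fields that this sketch ignores.

```python
import re

# Idle times recommended in the text above, keyed by DSA role.
RECOMMENDED_IDLE = {"router": 30, "data": 40}
# Default idle time stated in the text above, used when the setting is absent.
DEFAULT_IDLE = 600


def idle_time(knowledge_text: str) -> int:
    """Return the dsp-idle-time value found in the text, or the default if absent."""
    match = re.search(r"dsp-idle-time\s*=\s*(\d+)", knowledge_text)
    return int(match.group(1)) if match else DEFAULT_IDLE


def is_recommended(knowledge_text: str, role: str, has_router: bool = True) -> bool:
    """Check one DSA's setting against the recommendation for its role.

    Without a router DSA, the text recommends 30 seconds on each data DSA.
    """
    expected = RECOMMENDED_IDLE[role] if has_router else 30
    return idle_time(knowledge_text) == expected


# Illustrative one-line fragment, not a complete knowledge file.
print(is_recommended("dsp-idle-time = 40", "data"))
```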
For more details, see: Policy Server Guides -> Policy Server Configuration Guide -> User Directories -> LDAP Load Balancing and Failover
Thanks for this article.
We are using CA Directory as our authentication store. I have a few questions about connections between the CA Policy Server and CA Directory.
1. Is there any option to customize the ping interval from the Policy Server (e.g. reducing the ping interval from 30 to 5 seconds)?
2. If the authentication load is, for example, approximately 300 TPS, then a 60-second failover is too long and will have a large impact on in-flight transactions while the Policy Server is failing over to another CA Directory node. What is your view on this?
3. None of our authentication calls takes longer than 1 second. A typical load balancer can have an LDAP monitor set up that makes a heartbeat LDAP search query every second and, after 2 or 3 failures, marks the node down and fails over to another active node. We have been told by CA that a load balancer is not recommended between the Policy Server and CA Directory. But not having a load balancer between them causes too many in-flight transactions to fail. I would like to know your views on this.
4. You mentioned that rebind is an expensive operation for the Policy Server. Can you explain a bit more? I am not able to understand why it would be an expensive operation; it's just a bind call, isn't it?
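The load-balancer style monitor described in question 3, and the transaction impact raised in question 2, can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, probe results are passed in as booleans rather than coming from a real LDAP search, and the impact estimate assumes that every request arriving during the detection window is affected.

```python
class NodeMonitor:
    """Marks a node down after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0
        self.up = True

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return whether the node is considered up."""
        if probe_ok:
            # Any successful probe resets the counter and restores the node.
            self.failures = 0
            self.up = True
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.up = False
        return self.up


def requests_at_risk(tps: float, window_seconds: float) -> float:
    """Rough count of requests arriving while a failure is still undetected."""
    return tps * window_seconds


monitor = NodeMonitor(threshold=3)
states = [monitor.record(ok) for ok in (True, False, False, False)]
print(states)                     # node is marked down on the third failure
print(requests_at_risk(300, 60))  # 60-second window at 300 TPS
print(requests_at_risk(300, 3))   # 3 one-second probes at 300 TPS
```

With these assumptions, a 60-second window at 300 TPS exposes roughly 18,000 requests, against roughly 900 for three one-second probes. Neither figure accounts for rebind time.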