I ran into this issue with a customer and found a couple of blog posts that described it, but none that clearly explained the cause or the solution. I decided to write it up once the issue was resolved.
The policy server tried to use a connection to LDAP (the user store) that, from the LDAP standpoint, had not been used for over an hour. Because there was no activity on the connection, something along the network path (for example, a firewall enforcing an idle timeout) dropped it. The connection no longer exists, but the policy server still sends the next SEARCH request (during the IsAuth call) over it and times out after about 30 seconds. Nothing is logged on the LDAP side during this time, because the firewall had already closed the connection. The policy server then marks the connection as dirty and sets it to Close Pending, opens a new connection to the same LDAP server, and succeeds. It resends the SEARCH request (during the IsAuth call) over the new connection and gets a response in under a second. This request is logged on the LDAP side. After an idle period longer than the firewall timeout, this behavior introduces severe latency across a cluster of policy servers.
[SmDsLdapProvider.cpp:1783][CSmDsLdapProvider::SearchImpl][search filter is : (cn=AAAAAA)]
[SmDsLdapConnMgr.cpp:1190][LogMessage:ERROR:[sm-Ldap-02230] Error# '85' during search: 'error: Timed out' Search Query = '(cn=AAAAAA)']
[SmDsLdapConnMgr.cpp:1201][CSmDsLdapConn::SearchExts][LDAP search of (cn=AAAAAA) took 30 seconds and 31460 microseconds]
[SmDsLdapFunctionImpl.cpp:3155][CSmDsLdapProvider::SearchExts][Ldap Search failed, ErrorMsg is Timed out]
[SmDsLdapConnMgr.cpp:501][CSmDsLdapConnMgr::AddDeadHandleList][Marked dir connection (seq: 3) ldapserver.ca.com:1111 as Close Pending]
[SmDsLdapConnMgr.cpp:501][CSmDsLdapConnMgr::AddDeadHandleList][Marked dir connection (seq: 1) ldapserver.ca.com:1111 as Close Pending]
[SmDsLdapConnMgr.cpp:501][CSmDsLdapConnMgr::AddDeadHandleList][Marked user connection (seq: 2) ldapserver.ca.com:1111 as Close Pending]
[SmDsLdapConnMgr.cpp:895][IsAvailable][Successful V3 Bind server][ldapserver.ca.com]
[SmDsLdapConnMgr.cpp:628][PingServer][LDAP Server Ping Successful][ldapserver.ca.com]
[SmDsLdapFunctionImpl.cpp:2110][CSmDsLdapProvider::RebindServer][Reconnect to server 'ldapserver.ca.com:1111' as it's previous connections are closed and it is available for connecting now]
[SmDsLdapFunctionImpl.cpp:2203][CSmDsLdapProvider::RebindServer][Rebind attempt on 'dir' connection to best LDAP server 'ldapserver.ca.com:1111']
[SmDsLdapConnMgr.cpp:1201][CSmDsLdapConn::SearchExts][LDAP search of (cn=AAAAAA) took 0 seconds and 2190 microseconds]
[SmDsLdapProvider.cpp:2311][CSmDsLdapProvider::Search][Ldap Search callout succeeds.][(Search) Base: 'dc=ca,dc=com', Filter: '(cn=AAAAAA)'. Status: 1 entries]
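To spot this pattern in a larger trace, the "took N seconds and M microseconds" lines can be pulled out and filtered for searches that ran up to the timeout. The following is a minimal sketch, assuming the trace lines look exactly like the ones above; the function names and the 30-second threshold are illustrative and not part of the product.

```python
import re

# Matches trace lines such as:
# [..][CSmDsLdapConn::SearchExts][LDAP search of (cn=AAAAAA) took 30 seconds and 31460 microseconds]
SEARCH_RE = re.compile(
    r"LDAP search of (?P<filter>.+?) took (?P<sec>\d+) seconds and (?P<usec>\d+) microseconds"
)


def search_durations(lines):
    """Yield (search_filter, duration_in_seconds) for every search timing line."""
    for line in lines:
        match = SEARCH_RE.search(line)
        if match:
            seconds = int(match.group("sec")) + int(match.group("usec")) / 1_000_000
            yield match.group("filter"), seconds


def stalled_searches(lines, threshold_s=30.0):
    """Return the searches that took at least threshold_s seconds.

    threshold_s is an assumption: set it to the search timeout configured
    on the user directory (30 seconds in the trace above).
    """
    return [(flt, dur) for flt, dur in search_durations(lines) if dur >= threshold_s]
```

Feeding the policy server trace file to `stalled_searches` line by line lists the filters that hit the timeout; a stalled search followed shortly by a sub-second search for the same filter matches the behavior described above.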
Nevertheless, both of these posts summarize the exact behavior.
Enhancement Request : https://communities.ca.com/ideas/235729474
Discussion : https://communities.ca.com/thread/117408416.
By Design :
The policy server will only close a connection to a user store on a network error or a search timeout. At the moment, nothing within the product tells the policy server to close a connection after a stipulated time. This effectively means that a connection the policy server opens to a user store / backend stays open indefinitely.
If a firewall is going to shut the connection (which is the scenario in our case), the firewall should send a RST. If a RST is sent, the policy server knows right away, when it next attempts to use the connection, that the connection has been broken, and it rebinds instead of waiting for the timeout value. If we go down this path, we need to involve the firewall / network teams and make sure that the firewall sends a RST when it closes a connection.
Set an IdleTimeout value on the LDAP user store. Note that this IdleTimeout needs to be shorter than the shortest idle timeout among all the firewalls on the network path. Example: there are three firewalls between the Policy Server and LDAP. Firewall-1 and Firewall-2 have a 3-hour IdleTimeout, and Firewall-3 has a 1-hour IdleTimeout. The IdleTimeout on LDAP then needs to be 50 or 55 minutes, so that LDAP sends a RST to the Policy Server before any firewall drops the connection. If a RST is sent, the policy server knows right away, when it next attempts to use the connection, that the connection has been broken, and it rebinds instead of waiting for the timeout value. If we go down this path, we need to involve the LDAP teams and make sure that LDAP sends a RST when it closes a connection.
Max time is the search timeout used by our USR and DIR connections; the default of 30 seconds is high. We can lower it: review the LDAP access log to find the average search time, then add a second or two. Setting it too low may cause multiple failover events.
We also discussed co-locating the Policy Servers and the LDAP servers, though we had a concern about the failover use case. There is a solution that makes the Policy Servers within a datacentre talk to a local LDAP first, before failing over to an LDAP in the other datacentre. We can use a masking approach: for example, in DC1 the hosts-file entry for UD1.ca.com points to the IP address of the user store in DC1, and in DC2 the hosts-file entry for UD1.ca.com points to the IP address of the user store in DC2. This means that when requests fail over from DC1 to DC2, the DC2 Policy Servers refer to UD1.ca.com, which in their hosts file is the IP address of the local LDAP in DC2. The same applies in the other direction for DC1.
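The arithmetic in the example above can be written down as a small helper. This is only a sketch: the function name and the 5-minute default margin are assumptions chosen to reproduce the 55-minute figure, not values taken from the product or from any LDAP server.

```python
def ldap_idle_timeout_minutes(firewall_idle_minutes, margin_minutes=5):
    """Pick an LDAP idle timeout below the shortest firewall idle timeout.

    firewall_idle_minutes: idle timeouts (in minutes) of every firewall on the
    path between the Policy Server and LDAP.
    margin_minutes: safety margin (hypothetical default) so that LDAP closes
    the connection before any firewall silently drops it.
    """
    if not firewall_idle_minutes:
        raise ValueError("at least one firewall timeout is required")
    shortest = min(firewall_idle_minutes)
    if margin_minutes <= 0 or margin_minutes >= shortest:
        raise ValueError("margin must be positive and smaller than the shortest timeout")
    return shortest - margin_minutes


# Firewall-1 and Firewall-2: 3 hours, Firewall-3: 1 hour
print(ldap_idle_timeout_minutes([180, 180, 60]))      # 55
print(ldap_idle_timeout_minutes([180, 180, 60], 10))  # 50
```

Only the shortest firewall timeout matters; the two 3-hour firewalls do not affect the result.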
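As a rough illustration of "average search time plus a second or two", the sketch below derives a candidate value from observed search durations. The function name and defaults are hypothetical. The floor at the slowest observed search is an added assumption meant to reduce the risk of spurious failovers, so the input should contain healthy searches only, not the 30-second stalls.

```python
import math
from statistics import mean


def suggest_search_timeout(durations_s, margin_s=2, current_timeout_s=30):
    """Suggest a search timeout (whole seconds) from healthy search durations.

    durations_s: observed search times in seconds, e.g. from the LDAP access log.
    margin_s: the "second or two" added on top of the average.
    current_timeout_s: the existing setting; the suggestion never exceeds it.
    """
    if not durations_s:
        raise ValueError("no search durations supplied")
    from_average = math.ceil(mean(durations_s)) + margin_s
    # Assumption: never go below the slowest healthy search that was observed.
    slowest = math.ceil(max(durations_s))
    return min(max(from_average, slowest), current_timeout_s)


print(suggest_search_timeout([0.002, 0.004, 0.15, 0.9]))  # 3
```

Any value produced this way is a starting point for testing, not a recommendation for every environment.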
ADDITIONAL NOTES :
A. Every customer environment is unique. Therefore these options need to be evaluated, applied, and tested in order of ease of implementation, feasibility, and long-term sustainability.
B. Others are most welcome to add any other solutions or suggestions that they have implemented.
Connections: Snippet from [C] Additional NOTES. Very good info.
Siteminder holds 3 connections open to each LDAP user directory. They are as follows:
- DIR: This connection is held open by the user who is configured in the “Credentials and Connections” tab of the User Directory. The initial search for users in authentication is done over this connection, and any WRITE operations (due to Password Services) are also done over this connection. There are a lot of questions about what permissions the Admin user needs to have, and it’s simple… if you will not be using Password Services, then the user just needs to have READ permissions to the section of the user directory where your users are (whatever you put in the Root DN of the directory). If you will be using Password Services, then that user needs to at least have the same READ permissions, and also have WRITE permissions to the attributes in the User Attributes tab of the user directory.
- USR: This connection is used by authentication to try the BINDs. No other data is sent across this connection. A BIND is attempted and if it works, then the user is authenticated. The connection is left in that state, owned by the user who just bound. When the next authentication attempt is made, the handling depends on the type of directory. If the directory supports REBIND, then that is what is done, meaning the connection is never broken down, it is just overtaken by the new BIND. If the directory does not support REBIND (like AD), then the connection is Unbound and then a BIND is performed with the new user.
- PING: This connection is used to monitor the health of the directory. It sends a very basic search to the directory. If it gets a response, then the directory is considered healthy, so the Policy Server will continue to send requests there.