I'm going to ask the question posed in RequestTimeout in a different way...
I'm working in a large environment that periodically experiences heavy loads. The referenced post dodges the question of what value to use for RequestTimeout and suggests adding more policy servers. Realizing it's not possible to recommend an optimal value of RequestTimeout to suit all situations, I would appreciate some feedback on the following:
Hi Richard - To confirm your questions/statements:
Statement - Increase RequestTimeout a significant amount from the default of 60 to, say 120. Observe results at 120 and reduce RequestTimeout 10 seconds at a time until the "optimal" value is determined.
Answer - Yes this is reasonable
Question - Is there an upper limit to RequestTimeout that you would consider unreasonable?
Answer - I'm putting this question to my fellow global senior team.
Statement - Increase RequestTimeout in 10-second increments and stop when connections are no longer timing out under heavy load.
Answer - Yes, this is a good plan
Question - Do you have any anecdotes to share regarding the circumstances and results when you tuned RequestTimeout in an enterprise environment?
Answer - I can't remember whether there is any documentation on this, but we can ask around.
RequestTimeout appears in two places.
The testing approach laid out is good for testing RequestTimeout.
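For what it's worth, the step-up variant of that test plan can be written down as a small loop. This is only a sketch of the procedure, not anything SiteMinder ships: `times_out` is a hypothetical callback standing in for "set RequestTimeout to this value, run the heavy-load test, and report whether connections still timed out", and the ceiling of 120 is just the figure floated earlier in the thread, not a recommended limit.

```python
from typing import Callable, Optional


def find_min_timeout(
    start: int,
    step: int,
    ceiling: int,
    times_out: Callable[[int], bool],
) -> Optional[int]:
    """Raise the timeout in fixed steps until a load test shows no timeouts.

    times_out(value) is a hypothetical hook: it should apply `value` as
    RequestTimeout, run the load test, and return True if connections
    still timed out. Returns None if the ceiling is reached first.
    """
    value = start
    while value <= ceiling:
        if not times_out(value):
            return value  # first value at which the load test was clean
        value += step
    return None  # never clean at or below the ceiling


# Made-up example: pretend the load test is clean from 90 seconds upward.
print(find_min_timeout(60, 10, 120, lambda t: t < 90))  # prints 90
```

A result of None is a useful signal in itself: if the test is never clean below the ceiling, the timeout is probably not the real problem.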
I personally feel increasing RequestTimeout is not a solution. From a user-experience perspective, no one would accept a 60-second lag for a response. Increasing RequestTimeout only further slows down failover, resending, or initiating a subsequent request once the first request fails. Hence, in all my years of experience, I have never increased the value of RequestTimeout. I'd prefer to fail fast and get to the next active Policy Server rather than stay stuck on a slow-responding Policy Server for longer. That being said, I'd always check whether the slow response is across all the Policy Servers or just one.
Under heavy load, have we checked on the Policy Server side the health of the Policy Server threads, the number of connections from Web Agents, and the connections to the backend server? Generally the bottleneck is between the Policy Server and the backend, causing the Policy Server to respond slowly. In newer versions of CA SSO Policy Server, a message is written to smps.log if the execution time is greater than 5 seconds. Do we see any of those messages under heavy load? Do we see a spike in backend execution times in smtracedefault.log? The point I'm getting to is: have we checked the health of the Policy Server and ascertained why it is slow?
All good points, HubertDennis, and I can't argue with your assessment that increasing RequestTimeout is not a solution. I think it's being perceived as a temporary fix to alleviate application hangs while the root cause is under investigation. My customer has actually suggested decreasing RequestTimeout, as you alluded to when you said you would "prefer to fail fast and get to the next active Policy Server". Have you ever reduced RequestTimeout? If so, how did you approach it?
Regarding bottlenecks: Yes, execution times >5 seconds have been observed, but the customer has been unwilling to enable a full suite of tracing so that the exact nature of the bottleneck can be characterized. The problem now is that so much capacity has been added -- along with some other tuning changes such as SM_ENABLE_TCP_KEEPALIVE and some per-policy-server tuning of LDAP user directory configuration -- that the max use of policy server threads is now about half what it used to be. We don't have enough direct evidence to know if the root cause is SiteMinder capacity, slow response from LDAP, network congestion, or ??? So, we wanted to investigate whether RequestTimeout could offer some temporary relief.
In my experience, the gains from playing with RequestTimeout are not strategic. It may give some short-term respite, depending on what we would like to showcase (as you rightly stated regarding your temporary goal). I very rarely play with RequestTimeout because, if all Policy Servers in the cluster are experiencing or reporting slowness, then no matter how soon we fail over to the next Policy Server or how long we stay stuck on one, the end result is slow.
I'd focus my attention on seeing what is taking time on the Policy Server and try to fix that. Time spent fixing that goes much further than time spent tweaking RequestTimeout to find some bearable value for the short term and then spending yet more time figuring out the issue on the Policy Server.
Have we checked whether the login process or page traversals take more than 60 seconds? I guess not. But it is mentioned that we are seeing timeout errors: where, and after how many seconds? Is a firewall in play (between the Web Agent and the Policy Server) that closes connections after 20 or 30 seconds? Maybe that is the issue and we are not really hitting the 60-second RequestTimeout. On the Web Agent specifically, if we are hitting RequestTimeout, the logs should show the Web Agent marking a Policy Server active or inactive.
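To put rough numbers on the fail-fast argument and on the all-servers-slow case, here is a deliberately simplified model. It assumes the agent tries Policy Servers strictly one after another and abandons each after RequestTimeout seconds; the real Web Agent's failover and load-balancing behavior is more involved, and the response times below are invented for illustration.

```python
from typing import Optional, Sequence


def time_to_answer(
    response_times: Sequence[float],
    request_timeout: float,
) -> Optional[float]:
    """Seconds until the first answer in a simplified sequential-failover model.

    response_times[i] is how long server i would take to respond (made-up
    figures). Each server is abandoned after request_timeout seconds.
    Returns None if every server times out.
    """
    elapsed = 0.0
    for response_time in response_times:
        if response_time <= request_timeout:
            return elapsed + response_time
        elapsed += request_timeout  # gave up on this server, move to the next
    return None


# One slow server (90 s) followed by a healthy one (1 s):
for timeout in (30, 60, 120):
    print(timeout, time_to_answer([90, 1], timeout))
# prints: 30 31.0 / 60 61.0 / 120 90.0

# Every server slow (45 s each): no timeout value makes this fast.
for timeout in (30, 60, 120):
    print(timeout, time_to_answer([45, 45], timeout))
# prints: 30 None / 60 45.0 / 120 45.0
```

In the first case a shorter timeout gets the user to the healthy server sooner; in the second, the user either waits 45 seconds or gets no answer at all, which is why the effort is better spent on the Policy Server side.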
Add the following components to WebAgentTrace.conf: AgentFunc, Agent_Con_Manager
Also execute this before starting Apache: export SM_TLI_LOG_FILE="__webagent_home__/logs/wa_tli_log.log"
Error Logs and Trace Logs - CA Single Sign-On - 12.52 SP1 - CA Technologies Documentation
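On the firewall question above: once the logs give you the elapsed time at which each failed request gave up, a quick tally shows whether failures cluster at RequestTimeout or well before it. The sketch below assumes you have already pulled those elapsed times out by hand; it does not parse any SiteMinder log format, and the sample values and the 2-second tolerance are made up.

```python
from typing import Dict, Sequence


def classify_failures(
    elapsed_seconds: Sequence[float],
    request_timeout: float = 60.0,
    tolerance: float = 2.0,
) -> Dict[str, int]:
    """Count failures that occurred near RequestTimeout versus earlier or later.

    elapsed_seconds holds hand-collected durations (hypothetical input);
    tolerance is an arbitrary allowance for timing jitter.
    """
    counts = {"at_request_timeout": 0, "earlier": 0, "later": 0}
    for seconds in elapsed_seconds:
        if abs(seconds - request_timeout) <= tolerance:
            counts["at_request_timeout"] += 1
        elif seconds < request_timeout:
            counts["earlier"] += 1
        else:
            counts["later"] += 1
    return counts


# Invented sample: most failures give up around 30 s, only one near 60 s.
print(classify_failures([29.8, 30.1, 30.0, 60.2]))
# prints {'at_request_timeout': 1, 'earlier': 3, 'later': 0}
```

If most failures land in "earlier", something other than RequestTimeout (such as a firewall closing connections) is the more likely cause, and changing RequestTimeout would not be expected to help.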