Recently we had an issue in our prod environment where, due to a network outage, the policy server could not connect to a subset of Active Directories. This caused the policy server queues to grow until the policy server stopped processing requests altogether.
Although the network outage lasted only 90 seconds, the policy server did not reconnect to the ADs after the network recovered and kept timing out while trying to establish connections to them.
We had to restart the policy server, which then restored connectivity.
With respect to socket connections, the policy server generally hovers around 2,500, but during the outage it spiked to 8,000.
We need expert guidance on why the policy server would not restore connectivity to the user directories even after network connectivity had been restored. Is it a capacity issue? Our CPU usage for the policy server stays at around 15-20%, and our general processing is around:
Average Throughput: 95 requests/sec
Average Transaction Time: 11.555901 ms
Policy Server version: R12sp3cr07
It is expected that the policy server queue grows during a network outage to the backend store, since the policy server can't process the requests fast enough.
By design, the policy server should restore the connection when the backend store comes back online. Since it didn't reconnect by itself, I suspect the policy server was in a hung state, which would explain why the restart restored the connection. If the policy server is on a UNIX system, a pstack capture taken at the time will help us understand what the policy server was doing. If it is on Windows, Process Monitor might give us some clues.
It's hard to determine whether this is a capacity issue or not. The policy server trace log will provide additional information if it is.
I agree that without logs it is hard to tell whether it's a capacity problem, but over time I have captured the response times and average transactions using the smtrace tool analyzer, and I see 100 requests/sec with a 10-11 ms response time from the policy server and 18-20% CPU usage on the policy server process, which looks good to me.
It's a UNIX box, and given how critical it was to bring it back up, I had to restart it. Next time I will make sure to take a thread dump.
Looking at the stats, my feeling is that because of the high queues the policy server ping thread didn't get a response in time and timed out, but this is only a theoretical explanation; I will generate traces the next time I see this.
Vivek, given the changes over time, namely the addition of trace lines that include the queue size, I would recommend minimal tracing to capture the queue size and monitoring through that rather than through the stats.
Thanks Josh. We do monitor the queues; I have never seen our policy server queue up except when these timeouts happen. As suggested, I will get some traces going to see what the policy server reports during those times.
In addition to the policy server trace log and the pstack captures (I suggest capturing 3 pstacks at 1-minute intervals), please run a cron job that executes smpolicysrv -stats every 5 minutes. This will print the policy server statistics to the smps log, and we can get better information on the policy server status at that time.
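The capture routine above can be sketched roughly as follows. This is only an illustrative helper, not a CA-supplied script: the smpolicysrv install path, the output directory, and the helper names are assumptions for the sketch; adjust them for your environment.

```python
#!/usr/bin/env python3
"""Sketch of a diagnostics capture helper: 3 pstacks at 1-minute
intervals, plus the command for the 5-minute stats cron job.
Paths and filenames are assumptions, not SiteMinder defaults."""
import subprocess
import time

SMPOLICYSRV = "/opt/CA/siteminder/bin/smpolicysrv"  # assumed install path

def pstack_cmd(pid):
    # One pstack invocation against the policy server process.
    return ["pstack", str(pid)]

def stats_cmd():
    # Prints the policy server statistics to the smps log.
    return [SMPOLICYSRV, "-stats"]

def capture_pstacks(pid, count=3, interval=60):
    """Capture `count` pstacks, `interval` seconds apart, as suggested."""
    for i in range(count):
        out = subprocess.run(pstack_cmd(pid), capture_output=True, text=True)
        with open(f"/tmp/pstack.{pid}.{i}.txt", "w") as f:  # assumed output dir
            f.write(out.stdout)
        if i < count - 1:
            time.sleep(interval)

if __name__ == "__main__":
    # The 5-minute stats dump belongs in cron rather than here, e.g.:
    #   */5 * * * * /opt/CA/siteminder/bin/smpolicysrv -stats
    import sys
    capture_pstacks(int(sys.argv[1]))
```

Run it with the policy server PID as the only argument while the problem is occurring, so the three stacks show whether the threads are stuck in the same place.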
We already run the policy server stats every 10 minutes, and I see the policy server queue depth as 0 most of the time, except during the timeouts when the queues start to grow. If you could help me understand how failover works when an IP is marked as bad, that would be a great help.
In general, there are two main failover scenarios.
Scenario 1: a request fails with a network error. In this case, the connection is re-initialized, and the current request and all subsequent requests are sent to the new server.
Scenario 2: before a request is received, the ping thread detects that the server is not available. The connection to that server is marked as bad, the request thread creates a new connection, and all subsequent requests are sent over the new connection.
Is the stats output just for queue size?
If so, and the version is 6 SP6, 12.0 SP3, or any 12.5, wouldn't it be more efficient to use the trace log, given the count is displayed there?
I'm just thinking that if he's in prod, less tax on the system is desirable.
Not only the queue size; we also want to see the connections. The stats provide more straightforward information, and I'm glad Vivek already has that running in the system every 10 minutes.
Have you tried exploring the thread pool?
You can tweak the High Priority Thread pool; that should improve connection management.
If it was a one-time event, I'm not sure you can recreate it to test (considering you cannot bring down production and cannot bring traffic into DEV or QA).
If it is a repeat event, you might want to use Wily, if available.
You can also ask CA to assist or to provide the scripts for collecting the dumps if you do not have them already.
The dump analysis can help identify what was happening.
Thanks for the reply, Santosh.
Currently we don't have Wily, and since it was a one-time outage, it's hard to reproduce. Moreover, there are no queues with respect to either High Priority or Normal Priority.
Regarding the High Priority Thread pool, can you please elaborate? Are you talking about increasing the number of threads?
Quick answer: yes, I was talking about increasing the number of threads.
Long answer: there are many things to consider before you tweak your config.
You certainly want a script in place to capture the policy server dump, so you have data if any dump is generated in the future.
The High Priority threads' job is to ensure the policy server can manage connections, whereas the Normal Priority threads handle authentications.
If you foresee more connection issues in the future, you may want to review the OS limitations, the sockets, and the high priority thread count.
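As a starting point for that OS review, the descriptor limits and per-process socket usage can be checked with a short script like the one below. The /proc lookup is Linux-specific (on Solaris, `pfiles <pid>` serves a similar purpose); which UNIX this box runs is an assumption here.

```python
"""Sketch: compare the process's open-descriptor count (sockets count
against this) with the OS limits. Linux /proc layout is assumed."""
import os
import resource

def fd_limits():
    # (soft, hard) limit on open file descriptors for this process.
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def open_fd_count(pid):
    # Number of open descriptors (files + sockets) for a given pid.
    return len(os.listdir(f"/proc/{pid}/fd"))

soft, hard = fd_limits()
print(f"fd limit: soft={soft} hard={hard}")
print(f"open fds for this process: {open_fd_count(os.getpid())}")
```

A socket count climbing from roughly 2,500 toward 8,000, as in the outage described earlier, is worth comparing against the soft limit: exhausting it would make every new backend connection attempt fail.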
To confirm my understanding:
a) you have multiple policy servers and one of them crashed
b) a certain set of agents and policy servers stopped working as expected
c) other servers worked as expected
d) the servers support different geographical locations or groups of apps
Is that correct?
Well, nothing crashed. What happened is that we have some Active Directories located in a different geographic location than our policy servers. Due to a network outage, the PS couldn't connect to the ADs in question, which is fine, but the problem came after the network recovered: even then, SM kept timing out to those Active Directories, and I had to restart the PS to bring it back to a functional state.
And yes, I totally agree that tweaking the configuration is a complex activity with many variables.