Symantec Access Management

Policy Server High Queue after network outage and AD timeout

  • 1.  Policy Server High Queue after network outage and AD timeout

    Posted Mar 18, 2015 09:53 AM

    We recently had an issue in our production environment: during a network outage, the policy server could not connect to a subset of our Active Directory servers. The policy server queues kept growing until the policy server stopped processing requests altogether.

     

    Although the network outage lasted only 90 seconds, the policy server did not reconnect to the ADs after the network recovered and kept timing out while trying to establish connections to them.

    We had to restart the policy server, which restored connectivity.

     

    With respect to socket connections on the policy server, we generally hover around 2500, but during the outage the count spiked to 8000.

     

    We need expert guidance on why the policy server would not restore connectivity to the user directories even after the network had recovered. Is it a capacity issue? CPU usage for the policy server stays at around 15-20%, and our typical processing figures are:

    Average Throughput : 95 (request/sec)

    Average Transaction Time: 11.555901ms

    Policy Server version: R12sp3cr07
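
    As a rough capacity check on the figures above, Little's law (average requests in flight = arrival rate x time in system) can be applied directly. This is only a back-of-the-envelope sketch: it assumes the quoted averages are representative and that agents keep sending at the normal rate during an outage, which retries or backoff would change.

```python
# Back-of-the-envelope check using the figures quoted in the post.
throughput_per_sec = 95.0          # "Average Throughput" from the post
avg_txn_time_sec = 11.555901e-3    # "Average Transaction Time" from the post
outage_sec = 90                    # outage duration from the post

# Little's law: average requests in flight = arrival rate x time in system.
in_flight = throughput_per_sec * avg_txn_time_sec

# Requests that would arrive during the outage if agents kept sending at
# the normal rate (an assumption, not something measured in the thread).
arrivals_during_outage = throughput_per_sec * outage_sec

print(f"average requests in flight: {in_flight:.2f}")
print(f"arrivals during outage: {arrivals_during_outage:.0f}")
```

    In steady state this gives roughly 1.1 requests in flight, which fits the low CPU usage. About 8550 requests would arrive over 90 seconds, which is in the same range as the socket spike, though that may be coincidence and says nothing about why the reconnect failed.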



  • 2.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 19, 2015 01:53 AM

    Hi,

    It is expected that the policy server queue grows during a network outage to the backend store, because the policy server can't process requests fast enough.

    By design, the policy server should be able to restore the connection when the backend store comes back online. If it didn't reconnect by itself, I suspect the policy server was in a hung state, which would explain why the restart restored the connection.
    If the policy server is on a UNIX system, a pstack capture taken at the time will help us understand what the policy server was doing. If it is on Windows, Process Monitor might give us some clues.

    It's hard to determine whether this is due to a capacity issue. The policy server trace log will provide additional information if it is capacity related.

     

    Regards,

    Kar Meng



  • 3.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 19, 2015 08:40 AM

    Thanks Karmeng,

     

    I agree that without logs it is hard to tell whether it's a capacity problem, but over time I have captured response times and average transactions using the smtrace analyzer tool. I see about 100 requests/sec with a 10-11 ms response time from the policy server and 18-20% CPU usage on the policy server process, which looks good to me.

     

    It's a UNIX box, and given how critical it was to bring it back up, I had to restart it. Next time I will make sure to take a thread dump first.

     

    Looking at the stats, my feeling is that because of the high queues the policy server ping thread didn't get a response in time and timed out, but this is only a theoretical explanation. I will generate traces the next time I see it.



  • 4.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 19, 2015 09:10 AM

    Vivek, since lines showing the queue size have been added to the trace log over time, I would recommend minimal tracing to track the queue size, and monitoring by that rather than by stats.



  • 5.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 19, 2015 09:16 AM

    Thanks Josh. We do monitor the queues, and I have never seen our policy server queue up except when these timeouts happen. As suggested, I will get some traces going to see what the policy server reports during those times.



  • 6.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 22, 2015 09:24 PM

    Hi,

    In addition to the policy server trace log and the pstack (I suggest capturing 3 pstacks at 1-minute intervals), please set up a cron job to run smpolicysrv -stats every 5 minutes. This prints the policy server statistics to the smps log, so we can get better information on the policy server's status at that time.
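
    The pstack capture can be scripted so that it is ready to run before a restart. The following is a minimal sketch, assuming `pstack` is on the PATH and the policy server PID is already known. The function name `capture_stacks`, the output file naming, and the `out_dir` parameter are illustrative choices, not part of any CA tooling.

```python
import os
import subprocess
import time
from datetime import datetime


def capture_stacks(pid, count=3, interval_sec=60, out_dir=".",
                   run=subprocess.run, sleep=time.sleep):
    """Run `pstack <pid>` `count` times, `interval_sec` apart.

    Returns the list of files written. `run` and `sleep` are injectable
    so the loop can be exercised on a machine without pstack.
    """
    written = []
    for i in range(count):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        path = os.path.join(out_dir, f"pstack-{pid}-{stamp}-{i + 1}.txt")
        result = run(["pstack", str(pid)], capture_output=True, text=True)
        with open(path, "w") as fh:
            fh.write(result.stdout)
        written.append(path)
        if i < count - 1:          # no need to wait after the last capture
            sleep(interval_sec)
    return written
```

    Calling `capture_stacks(pid)` with the defaults produces three files one minute apart. The five-minute `smpolicysrv -stats` run is an ordinary crontab entry; note that some UNIX cron implementations do not accept the `*/5` step syntax and need the minutes listed explicitly.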

    Thanks.

     

    Regards,

    Kar Meng

     




  • 7.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 23, 2015 08:32 AM

    Hi Karmeng,

     

    We do run the policy server stats every 10 minutes, and I see a policy server queue depth of 0 most of the time, except during the timeouts, when the queues start to grow. If you could please help me understand how failover works when an IP is marked as bad, that would be of great help.

     

    Thanks



  • 8.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 23, 2015 07:25 PM

    Hi Vivek,

    In general, there are two main failover scenarios.

    Scenario 1 is when a request fails with a network error. In this case, the connection is re-initialized. The current request and all subsequent requests are sent to the new server.

    Scenario 2 is when, before a request is received, the ping thread detects that the server is not available. The connection to this server is then marked as bad, the request thread creates a new connection, and all subsequent requests are sent over the new connection.
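
    The two scenarios can be illustrated with a toy model. This is not SiteMinder code: the class `DirectoryPool`, the server names `ad1` and `ad2`, and the `is_up` callback are all hypothetical, and the real policy server's threading and retry behaviour are not represented.

```python
class DirectoryPool:
    """Toy model of a connection with failover across directory servers."""

    def __init__(self, servers, is_up):
        self.servers = list(servers)
        self.is_up = is_up            # callable: server name -> bool
        self.current = self.servers[0]
        self.marked_bad = False

    def _reconnect(self):
        # Pick the first server that answers; raise if none does.
        for server in self.servers:
            if self.is_up(server):
                self.current = server
                self.marked_bad = False
                return
        raise ConnectionError("no directory server reachable")

    def ping(self):
        # Scenario 2: the ping thread notices the outage between requests
        # and only marks the connection as bad.
        if not self.is_up(self.current):
            self.marked_bad = True

    def send(self, request):
        # Scenario 2, continued: the request thread replaces a connection
        # that was marked bad before using it.
        if self.marked_bad:
            self._reconnect()
        # Scenario 1: the request itself hits a network error, so the
        # connection is re-initialized and the request goes to the new server.
        if not self.is_up(self.current):
            self._reconnect()
        return f"{request} -> {self.current}"


up = {"ad1": True, "ad2": True}
pool = DirectoryPool(["ad1", "ad2"], lambda s: up[s])
print(pool.send("bind"))            # bind -> ad1
up["ad1"] = False
print(pool.send("bind"))            # scenario 1: bind -> ad2
up["ad1"], up["ad2"] = True, False
pool.ping()                         # scenario 2: marks the ad2 connection bad
print(pool.send("bind"))            # bind -> ad1
```

    In this model a reconnect only succeeds if some server answers at the moment of the attempt; if every server is unreachable, the request fails with an error instead of waiting.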

     

    Regards,

    Kar Meng



  • 9.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 23, 2015 08:46 AM

    hey Kar,

     

    Are the stats just for queue size?

    If so, and the version is 6sp6, 12.0 sp3 or any 12.5, wouldn't it be more efficient to use the trace log, given that the count is displayed there?

     

    Just thinking that if he's in prod, less tax on the system is desirable.



  • 10.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 23, 2015 07:28 PM

    Hi Josh,

     

    Not only the queue size; I also want to see the Connections. The stats provide more straightforward information, and I'm glad that Vivek already has them running on the system every 10 minutes.



  • 11.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 27, 2015 01:41 PM

    Have you tried exploring the thread pool?

     

    You can tweak the High Priority thread pool; that should improve connection management.



  • 12.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 27, 2015 01:51 PM

    If it was a one-time event, I'm not sure you can recreate it to test (considering you cannot bring down production and cannot bring traffic into DEV or QA).

     

    If this is a repeat event, you might want to use Wily if available.

    You can also ask CA to assist or to provide the scripts for collecting the dumps if you do not have them already.

    Analysis of the dumps can help identify what happened.



  • 13.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 27, 2015 01:57 PM

    Thanks for the reply, Santosh.

     

    Currently we don't have Wily, and since it was a one-time outage it's hard to reproduce. Moreover, there are no queues with respect to either High Priority or Normal Priority.

     

    Regarding the High Priority thread pool, can you please elaborate? Are you talking about increasing the number of threads?

     

    Thanks



  • 14.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 27, 2015 02:45 PM

    Vivek

     

    Quick answer: yes, I was talking about increasing the number of threads.

     

    Long answer: there are many things to consider before you tweak your config.

     

    You certainly want the script in place to capture policy server dumps, so that you have data if a dump is generated in the future.

     

    The job of the High Priority threads is to ensure the policy server can manage connections, whereas the Normal Priority threads handle authentications.

     

    If you foresee more connection issues in the future, you may want to review OS limits, sockets, and the High Priority thread count.
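
    One OS limit worth checking is the per-process file descriptor limit, since every socket consumes a descriptor. The sketch below uses Python's standard `resource` module (UNIX only) to compare the socket counts mentioned in this thread against the limit. The helper `fd_headroom` is a hypothetical name, and the script reports the limits of the process running it, which may differ from those the policy server was started with.

```python
import resource


def fd_headroom(observed_sockets, soft_limit):
    """Fraction of the descriptor limit still free at a given socket count."""
    if soft_limit == resource.RLIM_INFINITY:
        return 1.0
    return max(0.0, 1.0 - observed_sockets / soft_limit)


# Limits of the *current* process; the policy server's own limits depend
# on the environment it was started from and may differ.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
for sockets in (2500, 8000):     # normal and outage counts from the thread
    print(f"{sockets} sockets: {fd_headroom(sockets, soft):.0%} headroom "
          f"(soft limit {soft}, hard limit {hard})")
```

    A headroom of 0% at the outage count would suggest the process could run out of descriptors during a spike; it would not by itself explain a failure to reconnect.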



  • 15.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 27, 2015 02:57 PM

    I assume

     

    a) you have multiple policy servers and one of them crashed

    b) a certain set of agents and policy servers stopped working as expected

    c) the other servers worked as expected

    d) the servers support different geographical locations or groups of apps

     

    Is that correct?



  • 16.  Re: Policy Server High Queue after network outage and AD timeout

    Posted Mar 27, 2015 03:33 PM

    Well, nothing crashed. What happened is that we have some Active Directory servers located in a different geographic location than our policy servers. Due to a network outage, the PS couldn't connect to the ADs in question, which is fine. The problem came after the network had recovered: even then SM kept timing out to the Active Directory servers, and I had to restart the PS to bring it back to a functional state.

     

    And yes, I totally agree that tweaking the configuration is a complex activity with many variables.