We have an issue with our policy server: sometimes it goes down. The policy server is set up as 2 servers in clustered mode, and we have 1 million users in the directory. What could cause the policy server to go down? Is it overload on the policy server? If so, what is the solution? For security reasons we cannot pull out the logs, so I just want to know your thoughts.
You NEED to look at the logs. The logs will tell you what is happening.
You have 1 million users, but how many SSO transactions (AuthAccept, AuthReject, AzAccept, ValidateAccept, etc.) are being processed per minute?
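If it helps once you do get log access, here is a minimal sketch of how the per-minute transaction count could be pulled out of an access log. The line layout in the comment is an assumption on my part (event name, host, then a bracketed timestamp), and the function names are hypothetical, so adjust the pattern to whatever your own log actually looks like.

```python
import re
from collections import Counter

# ASSUMED line layout (hypothetical, adjust to your own access log):
#   AuthAccept host1 [05/Aug/2015:10:15:02 -0500] ...
LINE_RE = re.compile(
    r"^(?P<event>\w+)\s+\S+\s+\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}"
)

# Event names mentioned in this thread; extend the set as needed.
EVENTS = {"AuthAccept", "AuthReject", "AzAccept", "ValidateAccept"}


def transactions_per_minute(lines):
    """Return a Counter mapping (minute, event) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("event") in EVENTS:
            counts[(match.group("minute"), match.group("event"))] += 1
    return counts


def busiest_minute(counts):
    """Return (minute, total) for the minute with the most transactions, or None."""
    totals = Counter()
    for (minute, _event), number in counts.items():
        totals[minute] += number
    return totals.most_common(1)[0] if totals else None
```

Feeding it the lines of one log file and looking at `busiest_minute(...)` gives a rough idea of peak load; it says nothing about why a server stops.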
I need to look into the configuration to confirm this; I am new to this project. I just want to ask whether it is possible for an overload of requests to bring the policy server down.
This question is loaded.
What do you mean by "going down"?
If you are on 12.5 or later, or a later version of 12.0 (SP3, potentially CR dependent), you can use the "count is #.]" string, where # is a number, to find the issue.
I mean the policy server sometimes shuts down. What I want to know is whether the policy server is affected if the number of users accessing the application exceeds a certain limit. I don't have access to the servers yet. I will look into the logs once I get access.
Number of users should not cause the policy server to stop and actually shut off.
until you see the logs there is no way to even speculate.
Ok. I will look into the logs and get back to you. Thank you for the help Josh.
You may not be able to share logs with the community, but that DOES mean we have no way of making a valid assumption.
Try the logging post for some pointers on how to get logging right: CA SiteMinder Logging
That could help.
Thank You Josh.
If the policy server is shutting down by itself, it sounds like a crash. I'm not sure which operating system is used in the environment. If it is Windows, you can monitor the policy server process with the Debug Diagnostic Tool (download Debug Diagnostic Tool v2.0 from the official Microsoft Download Center). If it is Unix, check whether a core dump was generated.
If the policy server went down because of a crash, the core dump file will help us troubleshoot further. In that case I suggest engaging CA Support to analyze the dump.
Besides, the policy server logs are the best way to determine whether the server is overloaded.
Hope this helps.
Great idea. And starting with Server 2008 (maybe earlier), Windows has a core dump equivalent called a user-mode dump: Collecting User-Mode Dumps (Windows)
For over a year now our Policy Servers (on Windows Server) have been crashing and restarting themselves 5+ times per day in PROD. CA has yet to really fix that issue (big surprise heh).
In our case it doesn't fully 'stop' (it only did that a couple of times); instead it just crashes, core dumps, and starts back up. Does yours do this, or does it completely stop so that you have to start it manually? Just curious whether your problem could be related to the crashing ours have been experiencing.
As a note: We've been using the Microsoft method that Josh had linked to and it works well. Just be sure to do a full dump - you can set it to only save X number so you don't use up too much space or anything. That's what we send to CA -- for what good it's done; the mini-dumps didn't have enough info by default unless you go enable additional flags.
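For anyone who wants to try the same thing, here is a rough sketch of the setup as I understand it from the Microsoft "Collecting User-Mode Dumps" page: a per-application key under Windows Error Reporting\LocalDumps with DumpFolder, DumpCount and DumpType values, where DumpType 2 is a full dump and 1 is a mini dump. The helper below only builds the `reg add` command lines so they can be reviewed before anyone runs them. The function name and the dump folder are hypothetical; smpolicysrv.exe is the process name mentioned in this thread.

```python
# Registry location documented by Microsoft for user-mode dump collection.
LOCAL_DUMPS_KEY = r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def local_dump_commands(exe_name, dump_folder, dump_count=5, full_dump=True):
    """Build (but do not run) the reg add commands for per-application dumps.

    dump_count caps how many dump files are kept, so the disk does not fill up.
    """
    key = LOCAL_DUMPS_KEY + "\\" + exe_name
    dump_type = 2 if full_dump else 1  # 2 = full dump, 1 = mini dump
    values = [
        ("DumpFolder", "REG_EXPAND_SZ", dump_folder),
        ("DumpCount", "REG_DWORD", str(dump_count)),
        ("DumpType", "REG_DWORD", str(dump_type)),
    ]
    return [
        f'reg add "{key}" /v {name} /t {kind} /d "{data}" /f'
        for name, kind, data in values
    ]


if __name__ == "__main__":
    # D:\CrashDumps is a made-up folder; pick one with enough free space.
    for command in local_dump_commands("smpolicysrv.exe", r"D:\CrashDumps", dump_count=3):
        print(command)
```

Keeping DumpCount small is what stops full dumps from eating the disk, since each one is roughly the size of the process memory.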
I'm sorry to hear that the policy server in your environment has been crashing for over a year. It sounds like you have engaged CA Support to help resolve the issue. If so, I believe CA Support is actively working with the engineering team to get the issue fixed in your environment. This could be a tricky problem whose root cause takes time to trace. If possible, share the case number with me; I would like to see the status of the case at this stage.
Same here. CA has been actively working on it for over 2 years. We got a few NIN patches specific to our environment, but they didn't fix the issue. CA knows about the many issues and has announced EOL for the R12.5 version, which is surprising.
kar, I will provide you with our CA case number; it has always been at P1.
Ours has been doing the same. PROD Policy Servers (8 X R12.51 CR01) crash up to 5 times a day. Automatically restart.
Case open for over a year. No resolution yet.
The good news for us is that with eight policy servers and clustering, the agents fail over fairly gracefully.
Hi CBertagnolli, vijaythrumaran, JMCColorado
Would you mind sharing the issue numbers that you opened with CA? I believe CA Support and the engineering team are actively working on the issues, but I wonder why they haven't been fixed yet. It might be due to some unforeseen circumstances. I might not be able to provide much input given that the issues are already with CA Support. However, I am thinking of giving feedback to the management team about our customers' concerns, based on the existing issues that have gone without a solution for an extended time.
A policy server that goes down and restarts generally suggests a policy server crash. If we have the corresponding crash files and relevant files in place, we should be able to trace the root cause and fix it, unless the issue is caused by a third-party product (i.e. backend store, backend server, etc.) and SM is a victim of it. That's just my hypothesis; it could be due to other reasons as well.
Current number 21955636 (there's been a few related to it and so that is the latest). We've got a DSE and engineering/dev is 'working' it...unfortunately working it doesn't mean much without providing an adequate solution to the problem in a timely fashion.
First it was SP1 was the solution, that didn't help and it kept on crashing. Then it was finally a custom NIN installer + hot fix dll files...that 'fix' introduced another issue which prevented us from pushing it through to PROD. And it still didn't fully resolve the problem in our testing - just seemed to make it better; although only way for us to know for certain is push it to higher environments which is currently held up due to the bug in the 'fix'...in non-prod it doesn't happen nearly as much most likely due to limited load/combination of transactions at any given time.
Thanks for your update. I will follow up with the DSE and team to see what we can do better to resolve the issue soon.
21747801-1: PROD POLICY SERVERS CRASHING
Thanks. I had a quick look at the issue. It is a bit complicated, as it involves a custom authentication scheme and memory corruption. Engineering is investigating the possible root cause and CA Support is following up closely with engineering. I believe we are getting closer to the root cause.
From my experience, in common implementations (no custom auth and active responses), the policy server (smpolicysrv.exe) memory usage on Windows Server should be in the area of 300MB.
How is your server behaving with regard to memory consumption? Is memory usage around 300MB or above? What is the memory consumption when the server crashes?
I got an update from CA saying this is a bug in version 12.5. They asked me to edit the sm.registry file under HKEY_LOCAL_MACHINE\SOFTWARE\Netegrity\SiteMinder\CurrentVersion\DS\LDAPProvider. I have to edit some referral parameters. The CA team said this is the workaround for the bug.
I had a similar SM restart issue in 12.51, and the SM registry entry below fixed it. You need to add:
EnableADEnhancedReferrals= 0; REG_DWORD
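To double-check that the entry actually landed in the file after editing, something like the sketch below could be used. It assumes value lines look like the one quoted above (`Name= value; REG_TYPE`) and it ignores which registry key a value sits under, so it is only a rough sanity check. Both function names are hypothetical.

```python
import re

# Matches value lines shaped like:  EnableADEnhancedReferrals= 0; REG_DWORD
# ASSUMPTION: this shape is inferred from the line quoted in this thread.
VALUE_RE = re.compile(
    r"^\s*(?P<name>[^=\s]+)\s*=\s*(?P<value>[^;]*?)\s*;\s*(?P<type>REG_\w+)\s*$"
)


def read_registry_values(lines):
    """Return {name: (value, type)} for every line that looks like a value entry."""
    values = {}
    for line in lines:
        match = VALUE_RE.match(line)
        if match:
            values[match.group("name")] = (match.group("value"), match.group("type"))
    return values


def enhanced_referrals_disabled(lines):
    """True only if EnableADEnhancedReferrals is present as a REG_DWORD equal to 0."""
    entry = read_registry_values(lines).get("EnableADEnhancedReferrals")
    if entry is None:
        return False
    value, kind = entry
    try:
        # Base 0 accepts both decimal ("0") and hex ("0x0") spellings.
        return kind == "REG_DWORD" and int(value, 0) == 0
    except ValueError:
        return False
```

Pass it the lines of a copy of sm.registry. A False result means the value is missing, set to something other than 0, or written in a shape this pattern does not recognise.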
I also discovered that the SM restart happens only when SM searches for users in directories (AD, CA Directory) that hold more than roughly 1000 users. SM did not restart when I pointed the User Directory at a specific OU with fewer users, whereas the auto restart happened when SM connected to an OU or the AD root holding users in bulk (more than 1000).
Interesting finds. We've got it turned 'on' (set to 1) on ours...guess we need to try setting it to 0 and see if it makes a difference like it did for ya'll (this would be in version 12.52 SP1).
rahulk.s, did you actually implement the change and did it work, or are you still waiting to test it?
There were a number of actual fixes in r12.52 sp1.
my last employer had an issue that CA tossed out as not possible to be them. i later found something talking about a federation thing. had NATHAN WARREN look into those two. there was enough collected it was likely the issue.
I think they have too many people who don't take that extra step now and who give up too easily.
If it's not logical for it to be SM, they don't go the extra mile; they don't find the trigger.
just my $0.02
I am waiting for the approval to make the changes. I will tell you the results once it is tested.
Thank you Ashok. We are seeing the back-end application server (WebLogic 10g) in a hung state when we get an error from the policy server.
The WebLogic server log says: SiteMinder has returned the following status Code is 500. I hope this will also be fixed after making the changes in the sm.registry file.
So we tried the referral change just to see....still core dumps. Didn't really expect much else but was worth a shot since it's simple enough update.
Just about 2 years now of crashing in PROD multiple times every day and CA still hasn't provided a solution. When folks have started calling it ShiteMinder, you basically know the product and CA support's reputation is shot in the eyes of our users.
I made the change in the sm.registry. We did performance testing in the test environment and didn't face any issues. We will do a load test and let you know the results.
We have done a peak load test and didn't face any issue. The change to the referral parameter is a solution. It is also mentioned in the SM documentation.
Thanks for the update and also for referring to the Documentation.
We are having the same issue. Setting EnableADEnhancedReferrals= 0 did not help.
Mike, if you are using CA Directory server, we need to make some changes there as well, in the settings.dxc file.
Add the entries below to the CA Directory server settings and check if it helps:
set concurrent-bind-user=<dc com><dc ABCD><ou admins><username smadmin>;
We are using Red Hat Directory.
The restarts started when we introduced certificate authentication (X509 Client Cert Template using smgetcred.scc)
Oddly, the restarts do not occur directly at the time of a cert auth. They are somewhat random and occur much later, after several cert auths have occurred.
Seems to align with the tidbits of info we got early on in our cases - that it was related to the cert auth stuff. We also use the ACA add-on module.
Since we rolled out with cert auth, could be why we've never had a single stable day.
One case back in the day we found was that you could almost force a crash if you had two user directories on an application and then did a smartcard log in. But it didn't happen for everyone, only some users but it crashed for those users 100% of the time.
I'd just be happy to replace SiteMinder with something stable at this point :/...not a single day since beginning using it that it hasn't crashed.