We started receiving intermittent 500 errors on our website, which uses SiteMinder. Restarting the web server instances resolved it. While analyzing the root cause, we found the following errors in the SM logs:
[8395/703276800][Tue Mar 26 2019 04:57:05][CSmLowLevelAgent.cpp:546][ERROR][sm-AgentFramework-00520] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_IsProtectedEx' returned '-1'.
[8395/703276800][Tue Mar 26 2019 04:57:05][CSmProtectionManager.cpp:192][ERROR][sm-AgentFramework-00420] HLA: Component reported fatal error: 'Low Level Agent'.
[8395/703276800][Tue Mar 26 2019 04:57:05][CSmHighLevelAgent.cpp:423][ERROR][sm-AgentFramework-00420] HLA: Component reported fatal error: 'Protection Manager'.
It looks like a problem between the policy server and the web agent. We also see the following messages on the server around that time:
SiteMinder agent is enabled.
SiteMinder agent is running.
It looks like the SM agent restarted because of a policy server error, but some web instances became unresponsive.
What we would like to understand is: will the SM agent restart automatically if it fails because of a policy server error? Or is it possible that, after the policy server failure was resolved, some web agents still did not recover and kept returning 500 until the web instance was restarted?
Please help us understand.
There are a couple of reasons why an agent might issue the "'Sm_AgentApi_IsProtectedEx' returned '-1'." message.
First, review the following KB for further information regarding the error message.
As the article discusses, it could be related to network issues, and normally is.
However, it could also be related to hitting the "maxconnections" setting on the policy server.
Connection Options Group Box
This group box allows you to specify the maximum number of Policy Server threads, and the idle timeout for a connection to the Policy Server.
Try this: enable the stats command on the policy servers, if it is not already enabled, and monitor to confirm that the policy servers are not hitting the configured "max connections" value.
Also enable the connections trace on the web agents in question; see "Collect Detailed Agent Connection Data with an Agent Connection Manager Trace Log".
With this information together, you might be able to determine whether this issue is related to the network or to the policy server's max connections configuration.
One last bit of information: if your policy servers are on Linux, then you might want to check entropy as well.
Agent connection issues are one of the symptoms of low entropy in a Linux environment.
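For the entropy check, a minimal sketch is shown here. It assumes a Linux host that exposes the kernel's estimate at /proc/sys/kernel/random/entropy_avail. The function names and the threshold of 200 are illustrative assumptions, not SiteMinder or Broadcom guidance, and newer kernels may report a fixed small value, so interpret the number against your kernel's documentation.

```python
from pathlib import Path

# Standard Linux procfs location for the kernel's available-entropy estimate.
ENTROPY_FILE = "/proc/sys/kernel/random/entropy_avail"


def read_entropy(path: str = ENTROPY_FILE) -> int:
    """Return the entropy estimate (in bits) stored in the given file."""
    return int(Path(path).read_text().strip())


def is_entropy_low(available: int, threshold: int = 200) -> bool:
    """Flag values below the threshold.

    The default threshold is an arbitrary illustrative value (assumption),
    not a vendor recommendation.
    """
    return available < threshold


if __name__ == "__main__":
    try:
        value = read_entropy()
    except OSError as exc:
        # Non-Linux hosts, or restricted containers, may not expose the file.
        print(f"could not read {ENTROPY_FILE}: {exc}")
    else:
        state = "low" if is_entropy_low(value) else "ok"
        print(f"entropy_avail={value} ({state})")
```

Running this periodically on the policy server hosts while the connection errors occur would show whether the two line up in time.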
If you need further assistance in reviewing this type of an issue, please open a new case with Broadcom SSO Support.
James Atchley, Principal Support Engineer - SSO, CA, a Broadcom Company
Thank you for the detailed response. The policy server in our case is maintained by a different team; we are working with them to check whether the recommendations are already in place.
There are two questions we are trying to get answered.
1. We noticed from SM log that the agent restarted, is that normal behavior?
2. We have multiple instances, and it looks like one of them could not re-establish its connection; once we restarted the web server instances, everything worked fine. Is there a chance that an instance may not be able to reconnect automatically after the policy server issue is resolved?
These are the questions our technical team is trying to answer in order to understand the SM agent's behavior. If you can help us understand, that would be a great help.
I'll try to answer the remaining questions below.
1. We noticed from SM log that the agent restarted, is that normal behavior?
I would want to validate that in the logs. Did the agent crash or stop?
If the agent is unable to connect to a policy server and all requests have timed out, it would fail to start or come to a graceful stop.
However, if it is restarting, a crash condition might have occurred.
What's the release of the agent?
Again, this might be related to configuration of the policy server connections or it might be related to the agent configuration.
For instance, you need to have a unique value for "ServerPath" in the webagent.conf file.
Also, adding "agentwaittime=N" to the webagent.conf file, where N = (30 x the number of Policy Servers in the bootstrap) + 10, might help with agents timing out during bootstrap under network latency.
If you have 6 servers defined in the SMHOST.conf file, then (6 * 30) + 10 = 190, so agentwaittime=190.
Also, if you have the Policy Servers listed by FQDN, consider testing with IP addresses instead.
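The arithmetic above can be sketched as a small helper. The function name and parameter names are hypothetical, chosen only for illustration; the 30-second factor and the 10-second padding come from the formula quoted in this reply.

```python
def agent_wait_time(policy_server_count: int,
                    per_server_seconds: int = 30,
                    padding_seconds: int = 10) -> int:
    """Compute agentwaittime as (30 x number of policy servers) + 10."""
    if policy_server_count < 1:
        raise ValueError("at least one policy server is required")
    return policy_server_count * per_server_seconds + padding_seconds


if __name__ == "__main__":
    # Six policy servers, as in the worked example in this reply.
    print(f"agentwaittime={agent_wait_time(6)}")  # prints "agentwaittime=190"
```

The count should match the number of policy servers actually listed in your SMHOST.conf file.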
In summary, I would:
1. Check the release notes for code issues within your release of the agent.
2. Remove DNS resolution from the equation by using IPs for the policy server connections by the agent in the SMHOST.conf files.
3. Confirm that the value for "ServerPath" is unique for each instance.
Note: ServerPath does NOT need to be an actual directory path; you can use any unique value, for instance "ServerPath=MyAppName".
4. Add "agentwaittime" to the webagent.conf.
Does this help?
Here is an extract from one of our agent logs.
We are running version 12.52.
[21327/4160678352][Tue Mar 26 2019 03:57:10][CSmHighLevelAgent.cpp:206][INFO][sm-AgentFramework-00390] HLA: Stopping.
[21327/4160678352][Tue Mar 26 2019 03:57:11][SmPlugin.cpp:103][INFO][sm-AgentFramework-00180] Agent Framework plug-in 'SM_WAF_HTTP_PLUGIN' shutdown.
[21327/4160678352][Tue Mar 26 2019 03:57:11][SmAgentAPI.cpp:1703][INFO][sm-AgentFunc-00040] Agent API has been released.
[31698/4160670528][Tue Mar 26 2019 04:47:42][LLAWorkerProcess.cpp:1893][WARNING][sm-AgentFramework-00700] LLAWP: DoManagement lost connection to Policy Server.
[13956/62794608][Tue Mar 26 2019 10:20:05][CSmLowLevelAgent.cpp:5207][INFO][sm-AgentFramework-00510] LLA: Logging initialized.
It looks like the agent stopped, but I am not sure whether it was a graceful shutdown or a crash. A few minutes later it started receiving requests again.
With the help of our middleware team, I am trying to get the configurations verified.
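To make extracts like the one above easier to review, here is a minimal sketch that splits agent log text into individual entries and picks out the non-INFO ones. The field layout ([pid/thread][timestamp][source:line][level][code] message) is inferred only from the lines quoted in this thread, and the names LogEntry and parse_entries are hypothetical; it is not an official log parser.

```python
import re
from datetime import datetime
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    pid: int
    thread_id: int
    timestamp: datetime
    source: str
    line: int
    level: str
    code: str
    message: str


# One entry: five bracketed fields, then free text up to the next entry header.
ENTRY_RE = re.compile(
    r"\[(\d+)/(\d+)\]"          # [pid/thread id]
    r"\[([^\]]+)\]"             # [timestamp]
    r"\[([^\]:]+):(\d+)\]"      # [source file:line]
    r"\[([A-Z]+)\]"             # [level]
    r"\[([^\]]+)\]"             # [message code]
    r"\s*(.*?)(?=\[\d+/\d+\]\[|\Z)",
    re.DOTALL,
)

SAMPLE = (
    "[21327/4160678352][Tue Mar 26 2019 03:57:10][CSmHighLevelAgent.cpp:206]"
    "[INFO][sm-AgentFramework-00390] HLA: Stopping."
    "[31698/4160670528][Tue Mar 26 2019 04:47:42][LLAWorkerProcess.cpp:1893]"
    "[WARNING][sm-AgentFramework-00700] LLAWP: DoManagement lost connection "
    "to Policy Server."
)


def parse_entries(text: str) -> List[LogEntry]:
    """Split log text into entries, even when several are fused on one line."""
    entries = []
    for match in ENTRY_RE.finditer(text):
        pid, tid, stamp, source, line, level, code, message = match.groups()
        entries.append(LogEntry(
            pid=int(pid),
            thread_id=int(tid),
            # Assumes English day and month names, as in the quoted logs.
            timestamp=datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S"),
            source=source,
            line=int(line),
            level=level,
            code=code,
            message=message.strip(),
        ))
    return entries


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        if entry.level != "INFO":
            print(entry.timestamp, entry.pid, entry.level, entry.message)
```

Grouping the parsed entries by pid would show which web server instance logged the stop, and whether the same instance later logged a reconnect.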