Today, one of the node in our cluster started behaving wierd. SSG log had the following entries
2021-09-10T09:31:36.934-0500 SEVERE 57 com.hazelcast.cluster.impl.operations.JoinCheckOperation: [10.202.78.163]:8777 [gateway] [3.6.1] Cannot send response: null to Address[10.202.78.252]:8777. Op: com.hazelcast.cluster.impl.operations.JoinCheckOperation{identityHash=1179682587, serviceName='hz:core:clusterService', partitionId=-1, replicaIndex=0, callId=2054, invocationTime=1631283349820 (Fri Sep 10 09:15:49 CDT 2021), waitTimeout=-1, callTimeout=10000}
com.hazelcast.spi.exception.ResponseNotSentException: Cannot send response: null to Address[10.202.78.252]:8777. Op: com.hazelcast.cluster.impl.operations.JoinCheckOperation{identityHash=1179682587, serviceName='hz:core:clusterService', partitionId=-1, replicaIndex=0, callId=2054, invocationTime=1631283349820 (Fri Sep 10 09:15:49 CDT 2021), waitTimeout=-1, callTimeout=10000}
at com.hazelcast.spi.impl.operationservice.impl.RemoteInvocationResponseHandler.sendResponse(RemoteInvocationResponseHandler.java:54)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.sendResponse(OperationRunnerImpl.java:278)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.handleResponse(OperationRunnerImpl.java:251)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:173)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:393)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processPacket(OperationThread.java:184)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:137)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:124)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:99)
2021-09-10T09:31:53.849-0500 SEVERE 59 com.hazelcast.cluster.impl.operations.JoinCheckOperation: [10.202.78.163]:8777 [gateway] [3.6.1] Cannot send response: JoinMessage{packetVersion=4, buildNumber=20160228, address=Address[10.202.78.163]:8777, uuid='c275136f-d324-4903-9aa3-5697bc263dba', liteMember=false, memberCount=1, dataMemberCount=1} to Address[10.202.78.252]:8777. Op: com.hazelcast.cluster.impl.operations.JoinCheckOperation{identityHash=551967137, serviceName='hz:core:clusterService', partitionId=-1, replicaIndex=0, callId=2055, invocationTime=1631283366748 (Fri Sep 10 09:16:06 CDT 2021), waitTimeout=-1, callTimeout=10000}
com.hazelcast.spi.exception.ResponseNotSentException: Cannot send response: JoinMessage{packetVersion=4, buildNumber=20160228, address=Address[10.202.78.163]:8777, uuid='c275136f-d324-4903-9aa3-5697bc263dba', liteMember=false, memberCount=1, dataMemberCount=1} to Address[10.202.78.252]:8777. Op: com.hazelcast.cluster.impl.operations.JoinCheckOperation{identityHash=551967137, serviceName='hz:core:clusterService', partitionId=-1, replicaIndex=0, callId=2055, invocationTime=1631283366748 (Fri Sep 10 09:16:06 CDT 2021), waitTimeout=-1, callTimeout=10000}
These entries were happening intermittently causing this gateway node to perform extremely slow when compared to rest of the nodes.
Then started seeing Error allocating socker processor at 13:13:18
2021-09-10T13:13:00.656-0500 SEVERE 2138 com.l7tech.server.SoapMessageProcessingServlet: RaiseErrorAssertion is stopping execution.
2021-09-10T13:13:18.302-0500 SEVERE 151 org.apache.tomcat.util.net.JIoEndpoint: Error allocating socket processor
java.lang.NullPointerException
at org.apache.catalina.core.StandardThreadExecutor.execute(StandardThreadExecutor.java:150)
at org.apache.tomcat.util.net.JIoEndpoint.processSocket(JIoEndpoint.java:806)
at org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:369)
at java.lang.Thread.run(Thread.java:748)
2021-09-10T13:47:44.278-0500 SEVERE 222 com.l7tech.se
Noticed several instances of this error in the ssg log
2021-09-10T13:52:20.053-0500 SEVERE 1643 org.apache.catalina.core.ContainerBase.[ssg].[node7.autozone.com].[/].[SoapMessageProcessingServlet]: Servlet.service() for servlet SoapMessageProcessingServlet threw exception
java.lang.NullPointerException
at com.l7tech.server.audit.AuditContextImpl.a(Unknown Source)
at com.l7tech.server.audit.AuditContextFactory.doWithNewAuditContext(Unknown Source)
at com.l7tech.server.SoapMessageProcessingServlet.service(Unknown Source)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.AppjlicationDispatcher.invoke(ApplicationDispatcher.java:646)
at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:436)
at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:342)
at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:302)
at com.l7tech.server.transport.http.HttpNamespaceFilter.doFilter(Unknown Source)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.l7tech.server.WsdlFilter.doFilter(Unknown Source)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.l7tech.server.transport.http.ConnectionIdFilter.doFilter(Unknown Source)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.l7tech.server.transport.http.InputTimeoutFilter.doFilter(Unknown Source)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.l7tech.server.log.HybridDiagnosticContextServletFilter.doFilter(Unknown Source)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:234)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:181)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at com.l7tech.server.tomcat.ResponseKillerValve.invoke(Unknown Source)
at com.l7tech.server.tomcat.ConnectionIdValve.invoke(Unknown Source)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:295)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:610)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:410)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
followed by this error
2021-09-10T13:52:30.108-0500 SEVERE 1 com.l7tech.server.policy.module.j: Module ApiPortalDataLensAssertion-9.2.00.aar: exception while notifying class com.l7tech.external.assertions.apiportaldatalens.listener.ModuleListener of module unload: java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.l7tech.server.policy.module.j.onModuleUnloaded(Unknown Source)
at com.l7tech.server.policy.module.j.a(Unknown Source)
at com.l7tech.server.policy.module.j.close(Unknown Source)
at com.l7tech.server.policy.module.ModularAssertionModule.close(Unknown Source)
at com.l7tech.server.policy.module.ModularAssertionsScanner.onModuleUnloaded(Unknown Source)
at com.l7tech.server.policy.module.ModularAssertionsScanner.onModuleUnload(Unknown Source)
at com.l7tech.server.policy.module.ModularAssertionsScanner.onModuleUnload(Unknown Source)
at com.l7tech.server.policy.module.ModulesScanner.destroy(Unknown Source)
at com.l7tech.server.policy.module.ScheduledModuleScanner.destroy(Unknown Source)
at com.l7tech.server.policy.module.ModularAssertionsScanner.destroy(Unknown Source)
at com.l7tech.server.policy.ServerAssertionRegistry.destroy(Unknown Source)
at org.springframework.beans.factory.support.DisposableBeanAdapter$1.run(DisposableBeanAdapter.java:178)
at java.security.AccessController.doPrivileged(Native Method)
at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:176)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:487)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:463)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:431)
at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1048)
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1022)
at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:970)
at com.l7tech.server.boot.GatewayBoot.destroy(Unknown Source)
at com.l7tech.server.boot.GatewayBoot.runUntilShutdown(Unknown Source)
at com.l7tech.server.boot.GatewayMain.main(Unknown Source)
Caused by: java.lang.NullPointerException
at java.util.concurrent.ConcurrentSkipListMap.remove(ConcurrentSkipListMap.java:1991)
at java.util.concurrent.ConcurrentSkipListSet.remove(ConcurrentSkipListSet.java:259)
at com.l7tech.server.service.ServiceTemplateManagerImpl.unregister(Unknown Source)
at com.l7tech.external.assertions.apiportaldatalens.listener.ModuleListener.unregisterServiceTemplates(ModuleListener.java:210)
at com.l7tech.external.assertions.apiportaldatalens.listener.ModuleListener.onModuleUnloaded(ModuleListener.java:68)
... 27 more
2021-09-10T14:04:28.562-0500 SEVERE 61 com.hazelcast.cluster.impl.operations.JoinCheckOperation: [10.202.78.163]:8777 [gateway] [3.6.1] Cannot send response: null to Address[10.202.78.252]:8777. Op: com.hazelcast.cluster.impl.operations.JoinCheckOperation{identityHash=1307830410, serviceName='hz:core:clusterService', partitionId=-1, replicaIndex=0, callId=2805, invocationTime=1631299721458 (Fri Sep 10 13:48:41 CDT 2021), waitTimeout=-1, callTimeout=10000}
com.hazelcast.spi.exception.ResponseNotSentException: Cannot send response: null to Address[10.202.78.252]:8777. Op: com.hazelcast.cluster.impl.operations.JoinCheckOperation{identityHash=1307830410, serviceName='hz:core:clusterService', partitionId=-1, replicaIndex=0, callId=2805, invocationTime=1631299721458 (Fri Sep 10 13:48:41 CDT 2021), waitTimeout=-1, callTimeout=10000}
at com.hazelcast.spi.impl.operationservice.impl.RemoteInvocationResponseHandler.sendResponse(RemoteInvocationResponseHandler.java:54)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.sendResponse(OperationRunnerImpl.java:278)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.handleResponse(OperationRunnerImpl.java:251)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:173)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:393)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processPacket(OperationThread.java:184)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:137)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:124)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:99)
2021-09-10T14:14:50.354-0500 SEVERE 161 org.apache.tomcat.util.net.JIoEndpoint: Error allocating socket processor
java.lang.IllegalStateException: StandardThreadPool not started.
at org.apache.catalina.core.StandardThreadExecutor.execute(StandardThreadExecutor.java:155)
at org.apache.tomcat.util.net.JIoEndpoint.processSocket(JIoEndpoint.java:806)
at org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:369)
at java.lang.Thread.run(Thread.java:748)
Tried restarting the gateway using reboot option, then the gw status was showing as Starting for a very long time.
By looking at the process controller log, I found this highlighted line repeating many times
2021-09-10T15:00:20.115-0500 INFO 1 com.l7tech.server.processcontroller.q: default starting
2021-09-10T15:00:25.260-0500 INFO 1 com.l7tech.server.processcontroller.ProcessController: Getting API port from /opt/SecureSpan/Gateway/node/default/var/processControllerPort
then this line also repeated every 5 to 6 seconds many times
2021-09-10T15:05:18.585-0500 INFO 1 com.l7tech.server.processcontroller.ProcessController: /opt/SecureSpan/Gateway/node/default/var/processControllerPort does not exist yet, will try default port
Then finally it started, took over 11 mins and 36 seconds. The longest i have seen is over 20mins on this node after reboot or restart.
2021-09-10T15:11:28.185-0500 INFO 1 com.l7tech.server.processcontroller.ProcessController: /opt/SecureSpan/Gateway/node/default/var/processControllerPort does not exist yet, will try default port
2021-09-10T15:11:33.281-0500 INFO 1 com.l7tech.server.processcontroller.ProcessController: Getting API port from /opt/SecureSpan/Gateway/node/default/var/processControllerPort
2021-09-10T15:11:36.591-0500 INFO 1 com.l7tech.server.processcontroller.q: default started successfully
2021-09-10T15:11:36.595-0500 INFO 1 com.l7tech.server.processcontroller.ProcessController: default started
Question: Why I am noticing the above errors all of sudden today on one of our node? Any tips/lead on how to resolve this issue permanently would be greatly appreciated.