We have tried shutting down the UC4 server processes to fix the issue, and when doing so, found that some of the CP processes will not terminate gracefully and must be terminated abnormally to get them to stop. Different CPs have shown this behavior, but CP1 appears to have the problem consistently. I have sent agent, WP and CP logs to tech support, but so far no smoking gun has been identified.

Recently we've discovered that when our operations crew first notices sluggish AWI performance and a small number of agents have begun to disconnect, we can quickly terminate CP1 and restart it. Performance then improves and the disconnected agents reconnect within 10 minutes.

Has anyone else experienced this issue or anything similar?

Here is a summary of our landscape:
1 UC4 Server (16 CPUs, 50 GB memory) on a RHEL 7 VM
1 Database backend (Oracle 12c) on AIX
Approximately 360 agents of multiple types, running on different platforms. Most are 12.3.2 agents; a few are still on 11.2 (our previous version) and are in the process of being upgraded. (The outage issue affects agents whether they are upgraded or not.)
We have continued to see this problem on version 12.3.3 HF4 and were able to upload the trace files to our ticket while the issue was occurring. Like others, we are able to resolve it by restarting CP01.

I am disappointed with the response from the vendor, as they have now stated that the root cause is a bug in an agent we are not currently running in our environment. They recommend upgrading and setting EH_KICK_INTERVAL to 60 to resolve this issue. We are still running version 11.2 agents and had never seen this issue until we upgraded the AE.

https://knowledge.broadcom.com/external/article?articleId=202385 : the Automation Engine becomes slow and unresponsive if a Unix agent has no free space left to write its agent logs.

How does restarting CP01 correct an agent disk-full problem?