For nearly a month, we have been experiencing sporadic outages with our UC4 server. An outage first manifests as a slow, unresponsive AWI, followed by a large number (but not all) of the UC4 agents disconnecting at once. The agents remain up and running the whole time, but they show as inactive in the UI and all their jobs begin to fail or fail to start. If we do not respond quickly, more agents disconnect, and eventually the AWI becomes inaccessible with an error message stating that the CPs are unavailable.
We have tried shutting down the UC4 server processes to fix the issue. When doing so, we found that some of the CP processes will not terminate gracefully and have to be killed to get them to stop. Which CPs are affected varies, but CP1 appears to have the problem consistently. I have sent agent, WP, and CP logs to tech support, but so far no smoking gun has been identified.
Recently we've discovered that if our operations crew catches the sluggish AWI performance early, when only a small number of agents have begun to disconnect, we can quickly terminate and restart CP1. Performance then improves and the disconnected agents reconnect within 10 minutes.
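For anyone who wants to automate the early-warning part of this workaround, here is a minimal sketch of the decision logic only. It assumes you already have some way to sample the number of inactive agents at regular intervals (that sampling is not shown, and nothing here calls a UC4 API). The function name `should_restart_cp` and the default threshold values are hypothetical, chosen purely for illustration.

```python
from typing import Sequence


def should_restart_cp(
    inactive_counts: Sequence[int],
    threshold: int = 5,      # hypothetical: inactive agents that count as "trouble"
    consecutive: int = 3,    # hypothetical: samples in a row needed before acting
) -> bool:
    """Return True when the most recent `consecutive` samples all
    meet or exceed `threshold`.

    `inactive_counts` is a chronological list of inactive-agent counts,
    oldest first. How those counts are collected is an assumption left
    to the reader.
    """
    if consecutive <= 0 or len(inactive_counts) < consecutive:
        return False
    recent = inactive_counts[-consecutive:]
    return all(count >= threshold for count in recent)


# Example: the last three samples (6, 8, 12) are all at or above 5.
samples = [0, 1, 6, 8, 12]
if should_restart_cp(samples):
    print("Inactive agents rising: consider restarting CP1")
```

Requiring several consecutive samples above the threshold is meant to avoid reacting to a single agent briefly dropping off; the right values would depend on your own landscape.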
Has anyone else experienced this issue or anything similar?
Here is a summary of our landscape:
1 UC4 server (16 CPUs, 50 GB memory) on a RHEL 7 VM
1 Database backend (Oracle 12c) on AIX
Approximately 360 agents of multiple types, running on different platforms. Most are 12.3.2 agents; a few are still on 11.2 (our previous version) and are in the process of being upgraded. (The outage issue affects agents whether they have been upgraded or not.)