Automic Workload Automation

  • 1.  Recurring problem with server processes (12.3)

    Posted Oct 23, 2020 09:04 AM
    For nearly a month, we have been experiencing sporadic outages with our UC4 server. The outage first manifests as a slow, unresponsive AWI, followed by a large number (but not all) of the UC4 agents disconnecting at once. The agents remain up and running the whole time, but they begin showing up as inactive in the UI and all their jobs begin to fail or fail to start. If this is not responded to quickly, more agents disconnect, and eventually the AWI becomes inaccessible with an error message stating that the CPs are unavailable.



    We have tried a shutdown of the UC4 server processes to fix the issue, and when doing so, found that some of the CP processes will not terminate gracefully and must be terminated abnormally to get them to stop. It has been different CPs with this issue, but CP1 appears to consistently have the problem. I have sent agent, WP and CP logs to tech support but so far no smoking gun has been identified.

    Recently we've discovered that when our operations crew first notices sluggish AWI performance and a small number of agents have begun to disconnect, we can come in and quickly terminate CP1 and restart it, and performance will then improve and the disconnected agents will reconnect within 10 minutes.
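
    The early-warning condition described above (a small group of agents going inactive together) could be checked automatically. The following is only a minimal sketch: the function names, the snapshot format (agent name mapped to an active flag) and the threshold are all hypothetical, and how the snapshots are collected from the system is deliberately left out.

```python
from typing import Dict, List


def newly_inactive(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return agents that were active in the previous snapshot and are inactive now.

    Both snapshots map an agent name to True (active) or False (inactive).
    The snapshot format is an assumption made for this sketch.
    """
    return sorted(
        name
        for name, active in current.items()
        if previous.get(name, False) and not active
    )


def is_mass_disconnect(
    previous: Dict[str, bool], current: Dict[str, bool], threshold: int = 5
) -> bool:
    """Flag the case where at least `threshold` agents dropped between snapshots."""
    return len(newly_inactive(previous, current)) >= threshold
```

    Comparing consecutive snapshots rather than counting inactive agents keeps agents that are down for planned reasons from triggering the alert repeatedly.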

    Has anyone else experienced this issue or anything similar?

    Here is a summary of our landscape:

    1 UC4 Server (16 CPUs, 50GB memory) on a RHEL 7 VM
    1 Database backend (Oracle 12c) on AIX
    Approximately 360 agents of multiple types, running on different platforms. Mostly 12.3.2 agents, a few still 11.2 (our last version) and in the process of getting upgraded. (The outage issue affects agents whether they are upgraded or not.)



  • 2.  RE: Recurring problem with server processes (12.3)

    Posted Oct 23, 2020 12:35 PM
    We had agent disconnects that were caused by a scheduled VM backup of our server.  The backup had to stun the server for a few seconds when it was deleting the snapshot.  We never saw a need to recycle CPs though and everything reconnected automatically a few seconds later.  We are a small shop with very few agents.

    Another random thought is that antivirus software can sometimes impact servers in negative ways.

    ------------------------------
    Pete Wirfs
    SAIF Corporation
    Salem Oregon USA
    ------------------------------



  • 3.  RE: Recurring problem with server processes (12.3)

    Posted Nov 11, 2020 08:53 AM
    We are seeing something very similar in our environment, which was upgraded to 12.3.3 HF4 and still has some version 11 agents. Luckily we have a monitor that lets us know when users are not able to log into the AWI, and we have also been restarting CP01 to resolve the problem. There is a fix out (12.3.4 HF1) for the AE becoming slow and unresponsive when an agent file system fills, but I do not think that is what we are currently seeing.
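
    For anyone without such a monitor, a plain TCP probe is one possible starting point. This is a rough sketch and not the monitor described above: the function names, endpoint names, hosts, ports and thresholds are placeholders, and it only checks whether a port accepts connections and how quickly. A hung process can still accept connections at the socket level, so a passing probe does not prove that AWI logins work.

```python
import socket
import time
from typing import Dict, Optional, Tuple


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


def check_endpoints(
    endpoints: Dict[str, Tuple[str, int]],
    timeout: float = 5.0,
    slow_after: float = 1.0,
) -> Dict[str, str]:
    """Classify each named endpoint as 'ok', 'slow' or 'down'."""
    results = {}
    for name, (host, port) in endpoints.items():
        elapsed = probe_tcp(host, port, timeout)
        if elapsed is None:
            results[name] = "down"
        elif elapsed > slow_after:
            results[name] = "slow"
        else:
            results[name] = "ok"
    return results
```

    Calling `check_endpoints({"CP1": ("ae-host.example", 2217)})` on a schedule (hostname and port are placeholders) and alerting on "slow" or "down" would give operations a chance to react before many agents have dropped.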


  • 4.  RE: Recurring problem with server processes (12.3)

    Posted Nov 12, 2020 01:16 AM
    We recently experienced this too with V12.3.3.HF3, and support have so far not been able to work out the root cause.
    It seems that user activity is what drives CP usage so high: I have seen a CP with 150 agents connected run at low usage, such as 5 or less, all day and night. However, some CPs with fewer than 20 connections (mainly users) go up to 100 and then eventually affect the entire set of CPs/WPs.

    Last week I did a Coldstart and the situation has improved somewhat.  I suggest trying a Coldstart as it will clear the MQCP* tables.
    I don't feel comfortable with this CP situation - it has us quite concerned. 

    CP User / Agents

    Regards,
    Matt



  • 5.  RE: Recurring problem with server processes (12.3)

    Posted Nov 20, 2020 09:20 AM

    We have continued to see this problem on version 12.3.3 HF4 and were able to upload trace files to our ticket while the issue was occurring. Like others, we are able to resolve it by restarting CP01. I am disappointed with the response from the vendor, as they have now stated that the root cause is a bug in an agent version we are not currently running in our environment. They recommend upgrading and setting EH_KICK_INTERVAL to 60 to resolve this issue. We are still running version 11.2 agents and never saw this issue until we upgraded the AE.

    https://knowledge.broadcom.com/external/article?articleId=202385 - the Automation Engine becomes slow and unresponsive if a Unix agent has no free space left to write its agent logs.

    How does restarting CP01 correct a problem caused by a full agent file system?




  • 6.  RE: Recurring problem with server processes (12.3)

    Posted Dec 01, 2020 03:56 PM
    Have a small bit of an update, based on our experience.

    My colleagues and I made a concerted effort to update as many of our 11.2 agents as we possibly could, and ever since we began reducing the number of 11.2 agents we were using, our environment has stabilized and (knock on wood) we have had no more outages and have had no more need to restart any CPs. We still have had no definitive cause identified for the issue, but upgrading the agents has been the only meaningful change we've made in our environment since the issue began, and the issue seems to have gone away. So for those seeing this same behavior, I would suggest upgrading your remaining 11.2 agents if possible to see if performance improves.