AppWorx, Dollar Universe and Sysload Community

  • 1.  Agents repeatedly disconnecting every Monday from Automic

    Posted Sep 07, 2020 01:05 PM
    Hello, dear community members,

    I am at a loss with an issue we have been facing for over two months. Let me briefly describe our environment first. We have two separate instances of Automic, running on Windows Server 2016 R2 with the following config:

    Automic v12.3
    MS SQL 2014 std edition

    We are currently facing a very weird scenario (in both environments). Every Monday morning, at (approximately) the same time, all the connected agents disconnect from the environment and after a certain period (usually a couple of minutes) connect back. In the agent log files, we observe the following messages:

    Win agents:

    20200907/163926.447 - U02000042 Connection aborted. Error code '10053', error description: 'An established connection was aborted by the software in your host machine.'.
    20200907/163926.447 - U02000010 Connection to Server 'ILMARINEN_1' terminated.
    20200907/163926.447 - U02000072 Connection to system 'UC4PE' initiated.
    20200907/163926.447 - U02000011 Connection to Server '172.23.248.35:2217' initiated.
    20200907/163926.447 - U02000011 Connection to Server '172.23.248.36:2218' initiated.
    20200907/163926.463 - U02000011 Connection to Server 'C105S273VM014:2219' initiated.
    20200907/163931.088 - U02001040 Error in function 'Connect', error code '10022', error description: 'An invalid argument was supplied.'.
    20200907/163931.088 - U02000012 Connection to Server 'C105S273VM014:2219' denied.
    20200907/163931.088 - U02000011 Connection to Server 'C105S273VM014:2220' initiated.
    20200907/163935.666 - U02001040 Error in function 'Connect', error code '10022', error description: 'An invalid argument was supplied.'.
    20200907/163935.666 - U02000012 Connection to Server 'C105S273VM014:2220' denied.

    UX agents:

    20200907/071115.727 - U02003044 Invalid 'send' call, socket '0'. Error code: ('88' - 'Socket operation on non-socket')

    or

    20200907/065125.119 - U02003044 Invalid 'read' call, socket 'UC4TE#CP005'. Error code: ('104' - 'Connection reset by peer')
    20200907/065125.119 - U02000010 Connection to Server 'C105S1449VM011:2219' terminated.
    20200907/065133.173 - U02003044 Invalid 'read' call, socket 'UC4TE#CP006'. Error code: ('104' - 'Connection reset by peer')
    20200907/065133.173 - U02000010 Connection to Server 'C105S1449VM011:2220' terminated.

    as/400:

    20200907/065125.119 - U02003044 Invalid 'read' call, socket 'UC4TE#CP005'. Error code: ('104' - 'Connection reset by peer')
    20200907/065125.119 - U02000010 Connection to Server 'C105S1449VM011:2219' terminated.
    20200907/065133.173 - U02003044 Invalid 'read' call, socket 'UC4TE#CP006'. Error code: ('104' - 'Connection reset by peer')
    20200907/065133.173 - U02000010 Connection to Server 'C105S1449VM011:2220' terminated.

    ...and quite similar messages on mainframe agents.
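
    To check whether the disconnects really cluster on one weekday and hour, the log lines above can be bucketed by time. This is only a minimal sketch: the regular expression is based on the timestamp layout visible in the excerpts, and the set of message numbers treated as "disconnects" is my own selection from those excerpts, not an official list.

```python
import re
from collections import Counter
from datetime import datetime

# Log prefix as seen in the excerpts: "YYYYMMDD/HHMMSS.mmm - U0200xxxx message"
LINE_RE = re.compile(r"^\s*(\d{8}/\d{6}\.\d{3}) - (U\d{8}) (.*)$")

# Assumption: message numbers from the excerpts that indicate a lost connection.
DISCONNECT_CODES = {"U02000042", "U02000010", "U02003044"}

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def disconnect_times(lines):
    """Yield a datetime for every log line that reports a lost connection."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) in DISCONNECT_CODES:
            yield datetime.strptime(match.group(1), "%Y%m%d/%H%M%S.%f")


def histogram_by_weekday_hour(lines):
    """Count disconnect events per (weekday name, hour of day)."""
    return Counter((WEEKDAYS[t.weekday()], t.hour)
                   for t in disconnect_times(lines))
```

    Feeding it the lines of several agent logs (each agent in its own local time zone, as in the excerpts) and printing `histogram_by_weekday_hour(lines).most_common(5)` should make a weekly pattern stand out.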

    We haven't found a root cause for this yet. Most of the agents are on v12. What we have tried so far: sending periodic telnet attempts from the agents to the application server - no gaps. Disabling the McAfee software - no help. Allowing all ports on the Windows firewall - no help. Investigating from the DB perspective - no help. Checking the event log entries - no help. The only agents that remain connected are the agents on the localhost or on the same network as the application server. It is very weird that this happens every Monday at a similar time. Could you please suggest what we should try to identify the problem? Is anyone else facing the same issue?
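
    For reference, the periodic telnet-style check can be scripted so that every attempt is timestamped. This is a rough sketch, not the exact test we ran; the function names, the interval and the timeout are assumptions. Note that a probe like this only opens new connections, so it can miss a brief reset of already established sessions.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Try one TCP connection; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"


def monitor(host, port, interval=5.0, rounds=None, log=print):
    """Probe repeatedly and log each result with a timestamp.

    rounds=None keeps running until interrupted. Returns the failure count.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        ok, detail = probe(host, port)
        stamp = datetime.now().isoformat(timespec="milliseconds")
        log(f"{stamp} {host}:{port} {'OK' if ok else 'FAIL'} {detail}")
        if not ok:
            failures += 1
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return failures
```

    Run from an agent host as, for example, `monitor("C105S273VM014", 2219)`, with the output redirected to a file so it can be compared against the agent log afterwards.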

    I also tried changing these values in UC_HOSTCHAR_DEFAULT:

    KEEP_ALIVE 60
    RECONNECT_TIME 180

    No help at all.

    I am getting very frustrated with this issue and would appreciate any kind of help. Thanks for your suggestions!

    PS: Broadcom support closed the case, stating that the issue is not caused by Automic (which seems to be obvious).

    From my perspective, it looks like some routine outside of Automic (backup, monitoring, vulnerability scan, ...) is running and causing this. I would love to know how to identify what it is.

    Thanks again for any input!


  • 2.  RE: Agents repeatedly disconnecting every Monday from Automic

    Posted Sep 08, 2020 09:04 AM
    I would check in with your systems admins to find out if there is any scheduled work like you'd mentioned (backups, scans, etc).

    At one point we had a similar issue, except it was nightly.  It corresponded with VMware backups of our database server, which would cause momentary connection drops.  In our very specific case, we had to live with it until VMware could be updated.


  • 3.  RE: Agents repeatedly disconnecting every Monday from Automic

    Posted Sep 08, 2020 11:22 AM
    Same experience here with VMware backups.  They run a snapshot during the backup process, and when it deleted the snapshot there was a several-second stun of the server (until they fixed it).

    ------------------------------
    Pete Wirfs
    SAIF Corporation
    Salem Oregon USA
    ------------------------------



  • 4.  RE: Agents repeatedly disconnecting every Monday from Automic
    Best Answer

    Broadcom Employee
    Posted Sep 08, 2020 04:54 PM
    Hi @Vojtech Bures, you can also check with your system admin whether there is any network port scanner that could be breaking the connection between the AE and the agents.


  • 5.  RE: Agents repeatedly disconnecting every Monday from Automic

    Posted Nov 03, 2020 02:01 PM
    Did you find what was causing this issue?  Did you find a solution?


  • 6.  RE: Agents repeatedly disconnecting every Monday from Automic

    Posted Nov 04, 2020 01:37 PM
    Hello everyone. The situation is now solved. Using Wireshark, we found that one particular network device (a load balancer shared by multiple services) was receiving over 8 million connections every Monday, caused by the vulnerability scanner, and the LB's memory was exhausted. That led the LB to restart and drop all connections.
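
    For anyone who wants to repeat this kind of analysis on exported capture data, here is a rough sketch of counting connection attempts per source. The record format (dicts with 'time', 'src', 'syn' and 'ack' keys) is purely hypothetical and would need adapting to however the capture is exported; it is not the exact procedure we followed.

```python
from collections import Counter


def syn_counts(packets):
    """Count connection attempts (SYN set, ACK clear) per (minute, source).

    `packets` is a hypothetical iterable of dicts with the keys
    'time' (epoch seconds), 'src' (source address), 'syn' and 'ack' (bools).
    """
    counts = Counter()
    for pkt in packets:
        if pkt["syn"] and not pkt["ack"]:
            counts[(int(pkt["time"] // 60), pkt["src"])] += 1
    return counts


def top_talkers(packets, n=5):
    """Return the n sources with the most connection attempts overall."""
    per_source = Counter()
    for (_minute, src), count in syn_counts(packets).items():
        per_source[src] += count
    return per_source.most_common(n)
```

    A single source that dominates `top_talkers(...)` during the Monday window is a good candidate for a scanner.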

    Thanks to Wireshark - the tool is really awesome.

    Again, thanks for your help!