Hello, dear community members,
I am rather lost with an issue we have been facing for over 2 months. Let me briefly describe our environment first. We have two separate instances of Automic, running on Windows Server 2016 R2 with the following configuration:
Automic v12.3
MS SQL Server 2014 Standard Edition
We are currently facing a very odd scenario in both environments. Every Monday morning, at approximately the same time, all connected agents disconnect from the environment and reconnect after a short period (usually a couple of minutes). In the agent log files, we observe the following messages:
Win agents:
20200907/163926.447 - U02000042 Connection aborted. Error code '10053', error description: 'An established connection was aborted by the software in your host machine.'.
20200907/163926.447 - U02000010 Connection to Server 'ILMARINEN_1' terminated.
20200907/163926.447 - U02000072 Connection to system 'UC4PE' initiated.
20200907/163926.447 - U02000011 Connection to Server '172.23.248.35:2217' initiated.
20200907/163926.447 - U02000011 Connection to Server '172.23.248.36:2218' initiated.
20200907/163926.463 - U02000011 Connection to Server 'C105S273VM014:2219' initiated.
20200907/163931.088 - U02001040 Error in function 'Connect', error code '10022', error description: 'An invalid argument was supplied.'.
20200907/163931.088 - U02000012 Connection to Server 'C105S273VM014:2219' denied.
20200907/163931.088 - U02000011 Connection to Server 'C105S273VM014:2220' initiated.
20200907/163935.666 - U02001040 Error in function 'Connect', error code '10022', error description: 'An invalid argument was supplied.'.
20200907/163935.666 - U02000012 Connection to Server 'C105S273VM014:2220' denied.
UX agents:
20200907/071115.727 - U02003044 Invalid 'send' call, socket '0'. Error code: ('88' - 'Socket operation on non-socket')
or
20200907/065125.119 - U02003044 Invalid 'read' call, socket 'UC4TE#CP005'. Error code: ('104' - 'Connection reset by peer')
20200907/065125.119 - U02000010 Connection to Server 'C105S1449VM011:2219' terminated.
20200907/065133.173 - U02003044 Invalid 'read' call, socket 'UC4TE#CP006'. Error code: ('104' - 'Connection reset by peer')
20200907/065133.173 - U02000010 Connection to Server 'C105S1449VM011:2220' terminated.
AS/400 agents:
20200907/065125.119 - U02003044 Invalid 'read' call, socket 'UC4TE#CP005'. Error code: ('104' - 'Connection reset by peer')
20200907/065125.119 - U02000010 Connection to Server 'C105S1449VM011:2219' terminated.
20200907/065133.173 - U02003044 Invalid 'read' call, socket 'UC4TE#CP006'. Error code: ('104' - 'Connection reset by peer')
20200907/065133.173 - U02000010 Connection to Server 'C105S1449VM011:2220' terminated.
...and quite similar messages on mainframe agents.
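To compare the exact disconnect times across agents, the timestamps can be pulled out of the agent logs automatically. The following is only a minimal sketch, assuming every log line follows the `YYYYMMDD/HHMMSS.mmm - Unnnnnnnn text` layout seen in the excerpts above. The function names and the set of message numbers are illustrative choices taken from those excerpts, not an Automic API.

```python
import re
import sys
from datetime import datetime

# Layout assumed from the excerpts above, e.g.
# 20200907/065125.119 - U02000010 Connection to Server '...' terminated.
LINE_RE = re.compile(r"^(\d{8}/\d{6}\.\d{3}) - (U\d{8}) (.*)$")

# Message numbers that appear next to the disconnects in the excerpts above.
# This selection is an assumption; extend it as needed.
DISCONNECT_CODES = {"U02000010", "U02000042", "U02003044"}


def parse_disconnects(lines):
    """Return a list of (timestamp, code, text) for disconnect-related lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        stamp, code, text = match.groups()
        if code in DISCONNECT_CODES:
            ts = datetime.strptime(stamp, "%Y%m%d/%H%M%S.%f")
            events.append((ts, code, text))
    return events


def first_disconnect(lines):
    """Return the earliest disconnect timestamp, or None if there is none."""
    events = parse_disconnects(lines)
    return min(events, key=lambda e: e[0])[0] if events else None


if __name__ == "__main__":
    # Usage: python disconnects.py agent1.log agent2.log ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, first_disconnect(handle))
```

Note that the timestamps are local to each agent host, so time zones and clock drift need to be accounted for before comparing agents. If the earliest timestamps then line up to within a few seconds, that would suggest a single event on the network path rather than something on the individual agent hosts.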
We haven't found a root cause yet. Most of the agents are on v12. What we have tried so far:
Periodic telnet attempts from the agents to the application server - no gaps.
Disabled the McAfee software - no help.
Opened all ports in the Windows Firewall - no help.
Investigated from the DB perspective - no help.
Checked the event log entries - no help.
The only agents that remain connected are those on localhost or on the same network as the application server. It is very odd that this happens every Monday at a similar time. Could you please suggest what we should try in order to identify the problem? Is anyone else facing the same issue?
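The periodic telnet check could be made more precise with a small script that records a millisecond timestamp and the exact socket error for every attempt. This is a rough sketch only: `probe` and `run` are hypothetical helper names, and the host and port used in the usage note are copied from the log excerpts purely as an illustration.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers refused, reset, timeout, DNS failure
        return False, f"{type(exc).__name__}: {exc}"


def run(targets, interval=5.0, rounds=None):
    """Probe every (host, port) in targets repeatedly and print one line each.

    rounds=None keeps probing until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        for host, port in targets:
            ok, detail = probe(host, port)
            stamp = datetime.now().isoformat(timespec="milliseconds")
            status = "OK" if ok else "FAIL"
            print(f"{stamp} {host}:{port} {status} {detail}", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

For example, `run([("C105S273VM014", 2219)], interval=5.0)` started on an agent host before the Monday window would log one line every five seconds. One caveat: a probe like this only tests new connections. If new connections keep succeeding while the established agent sessions are reset, that might point to something on the path dropping existing sessions rather than blocking the port, but that is an assumption to verify, not a conclusion.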
I also tried changing the following values in UC_HOSTCHAR_DEFAULT:
KEEP_ALIVE 60
RECONNECT_TIME 180
No help at all.
I am getting very frustrated with this issue and would appreciate any kind of help. Thanks for your suggestions!
PS: Broadcom support closed the case, stating that the issue is not caused by Automic (which seems obvious).
From my perspective, it looks like some routine outside of Automic (backup, monitoring, vulnerability scan, ...) is running at that time and causing this. I would love to know how to identify what it is.
Thanks again for any input!