Hello
Since V12 we have had the problem that agents do not disconnect when there are network problems; instead they remain online in the AE for a very long time.
Jobs are started on the agent (because it is officially online) but then get stuck in the 'Start initiated' status.
In earlier tests (network switched off for a single agent), we adjusted the 4 config files listed below, which (according to support) are now involved in agent handling.
In my opinion, the decisive factor was the value in tcp_retries2; this was the one setting with which we could actually influence disconnect times.
A value of 4 caused a disconnect after about 60 seconds.
/proc/sys/net/ipv4/tcp_keepalive_time = 1200
/proc/sys/net/ipv4/tcp_keepalive_intvl = 30
/proc/sys/net/ipv4/tcp_keepalive_probes = 3
/proc/sys/net/ipv4/tcp_retries2 = 4
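For reference, here is a rough sketch of what these four values should mean in terms of timing, assuming standard Linux behaviour. The function names are made up for this sketch, and the RTO limits of 0.2 s and 120 s are the usual kernel defaults, not values taken from our systems. The keepalive estimate only applies to connections that have SO_KEEPALIVE enabled on the socket.

```python
# Rough timing estimate for the Linux TCP settings listed above.
# Assumptions (not taken from our systems): kernel defaults
# TCP_RTO_MIN = 0.2 s and TCP_RTO_MAX = 120 s. Function names are hypothetical.

def keepalive_detection_seconds(idle, interval, probes):
    """Approximate time until an idle, dead connection is dropped.

    Only applies to sockets that have SO_KEEPALIVE enabled.
    """
    return idle + interval * probes


def retries2_lower_bound_seconds(retries2, rto_min=0.2, rto_max=120.0):
    """Hypothetical lower bound for the retransmission timeout.

    Sums the exponentially backed-off RTO (capped at rto_max) over
    retries2 + 1 intervals. The real timeout is longer whenever the
    measured round-trip time pushes the RTO above rto_min.
    """
    return sum(min(rto_min * 2 ** k, rto_max) for k in range(retries2 + 1))


if __name__ == "__main__":
    print(keepalive_detection_seconds(1200, 30, 3))    # 1290
    print(round(retries2_lower_bound_seconds(4), 1))   # 6.2
    print(round(retries2_lower_bound_seconds(15), 1))  # 924.6 (kernel default)
```

With our values, keepalive alone would only notice a dead idle connection after roughly 21.5 minutes. The tcp_retries2 figure is only a lower bound, so it does not contradict the roughly 60 seconds we measured.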
Last weekend we ran another failsafe test (loc.1 isolated, everything running on loc.2).
1700 agents are connected; about 800 of them should have gone offline.
After 2 hours only about 100 were gone. The rest were still shown as active in the AE, although their servers were unreachable (as expected).
The entries in the tcp config files had no effect.
One of our systems:
1700 agents, mixed versions 12.2.1, 12.2.4, 12.3.0, 12.3.1 ...
AE: 12.3.1+build.157046751939
Does anyone know this problem, and has anyone perhaps already solved it?
Thanks in advance
Carsten
------------------------------
Application Analyst Senior
Postbank Systems
------------------------------