In order to figure out some random FAIL blips we are seeing in our net_connect.log files, we have been picking net_connect apport and trying to decipher how it actually work. And in the process, we have discovered something deserving of a major facepalm.
It seems the threading net_connect is split up into something like this:
1x MAIN SEND (ALARM)
1x MAIN RECV (ALARM)
N PING QOS
1 x QOS DISPATCHER
1 x RESOLVER
N x ConnectToServiceSrv (that connects to tcp services)
And we notices that the both the MAIN SEND and assorted PING QOS threads are doing pings to our servers:
Jul 30 08:30:46:217 [47955888007488] nc: send_ping - 31: SEND 80.76.x.x, ident=19481, seq=1
Jul 30 08:30:46:217 [47955888007488] nc: send_ping - 31: SEND 80.76.x.x, ident=19481, seq=2
Jul 30 08:30:46:217 [47955888007488] nc: send_ping - 31: SEND 80.76.x.x, ident=19481, seq=3
Jul 30 08:30:48:656 [47955888007488] nc: send_ping - 9: SEND 80.76.x.x, ident=20029, seq=1
Jul 30 08:30:48:656 [47955888007488] nc: send_ping - 9: SEND 80.76.x.x, ident=20029, seq=2
Jul 30 08:30:48:656 [47955888007488] nc: send_ping - 9: SEND 80.76.x.x, ident=20029, seq=3
Jul 30 08:31:20:289 [47954818042176] nc: send_ping - 7: SEND 80.76.x.x, ident=20270, seq=1
Jul 30 08:31:20:289 [47954818042176] nc: send_ping - 7: SEND 80.76.x.x, ident=20270, seq=2
Jul 30 08:31:20:289 [47954818042176] nc: send_ping - 7: SEND 80.76.x.x, ident=20270, seq=3
In this case, 47955888007488 is the MAIN SEND thread, and 47954818042176 is one of a handful of "PING QOS" threads.
In my config, we have interval set to 2min, and burst set to 3 packages.
So the first 3 entries at 08:30:46:217 is the alarm check running at interval, and it gets a random FAIL due to what appears to be a different problem in the net_connect probe (but we'll get to that later). The next 3 at 08:30:48:656 is the retry 2sec later (I have a timeout of 2min). And then the 3rd part 08:31:20:289 is something else which actually appears to be another ping used for qos also collecting every 2 min:
Jul 30 08:31:20:289 [47954818042176] nc: ptNetPingWait: host = 80.76.x.x socket, 7 is aquired
Jul 30 08:31:20:289 [47954818042176] nc: send_ping - 7: SEND 80.76.x.x, ident=20270, seq=1
Jul 30 08:31:20:289 [47954818042176] nc: send_ping - 7: SEND 80.76.x.x, ident=20270, seq=2
Jul 30 08:31:20:289 [47954818042176] nc: send_ping - 7: SEND 80.76.x.x, ident=20270, seq=3
Jul 30 08:31:20:307 [47954818042176] nc: (recv_ping) 7: RECV 80.76.x.x, ident=20270[20270], seq=1
Jul 30 08:31:20:307 [47954818042176] nc: ConnectForPingQos - 80.76.x.x [18ms], status=18)
And we find all 9 packets with tcpdump as well.
So, it REALLY does seem like net_connect does DOUBLE pings against each device. One used for alarms and one used for qos, which basically doubles the work net_connect has to do.
Good thing they resolved that in any one of these:
v2.90: "Revamped probe for scalability
v2.03: "Performance improvements done to handle 5000+ profiles."
Probably gonna put up a new thread "Dissecting net_connect" with all the findings I end up with today, but the executive summary is mainly that net_connect is horribly broken.