VMware Tanzu Greenplum

 interconnect encountered a network error, please check your network

J Z posted Apr 29, 2023 10:45 PM

We have installed Greenplum Database server version 6.24.2.

We have configured one standalone master node and 18 segment nodes.

 

Sometimes an SQL statement fails with the error below, but the same statement may succeed when executed again.

 

The error is:

interconnect encountered a network error, please check your network (seg8 slice11 192.168.1.210:40008 pid=1627342). Failed to send packet(seq 1) to 192.168.1.211:25366(pid 1630839 cid 12) after 303 retries in 300 seconds.

 

We have already adjusted OS parameter settings such as ipfrag_high_thresh, ipfrag_low_thresh, and ipfrag_time.
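
For reference, the kind of sysctl changes we made look roughly like the following; the values shown are only illustrative, not necessarily the exact ones we applied:

# Illustrative IP fragment reassembly settings (example values, not our exact configuration)
sysctl -w net.ipv4.ipfrag_high_thresh=41943040
sysctl -w net.ipv4.ipfrag_low_thresh=31457280
sysctl -w net.ipv4.ipfrag_time=60
# Persisted by adding the same keys to /etc/sysctl.conf and reloading with: sysctl -p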

 

We used iperf3 to test UDP transmission (iperf3 -c xx.xx.xx.xx -u -b 5000M -f M -t 50), and the loss rate is about 0.5%.
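
The matching iperf3 server was running on the receiving host during the test; roughly like this (the IP address is a placeholder):

# On the receiving segment host
iperf3 -s
# On the sending host
iperf3 -c xx.xx.xx.xx -u -b 5000M -f M -t 50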

Kevin Huang

This error is a fairly generic message that suggests some issue with the network. However, a specific query can intermittently fail, as in your case, because it requires an extra Broadcast or Redistribute Motion.

 

1.) Is 192.168.1.211:25366 another segment host? If so, this suggests that the motion (where data is sent from one segment to another) is failing.

2.) Is your setup on bare metal or is it on a cloud provider like AWS or Azure?

3.) You can check whether packets are being dropped using netstat -i on 192.168.1.210 and 192.168.1.211 (see the example just after this list). Another question: does it always fail on those two segment hosts?

4.) Are there a lot of queries running when you get these error messages?
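
As a rough sketch of what I mean in 3.), run something like this on both hosts (interface names and exact counter labels vary by OS):

# Per-interface counters; watch the RX-DRP / RX-ERR and TX-DRP / TX-ERR columns
netstat -i
# Protocol-level counters; fragment reassembly failures show up here
netstat -s | grep -i -E 'reassembl|fragment'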

 

Hope that helps. If none of those really narrow it down, I would suggest opening a support case so they can dig further.

J Z

Thank you for your answer.

 

1.) 192.168.1.211 is another segment host.

2.) It is on bare metal.

3.) Failures also occur on other segment hosts. And, as I mentioned, the failure is intermittent; sometimes the same query succeeds. netstat -s shows some 'packet reassembles failed' errors.

4.) Not many queries are running.

 

What puzzles me is that even though UDP does lose some packets, there shouldn't be anything like '303 retries in 300 seconds', right?

Kevin Huang

I would check those values against these two GUCs:

gp_interconnect_transmit_timeout and gp_interconnect_min_retries_before_timeout
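
A quick way to see what those are currently set to, as a sketch assuming you run it as gpadmin on the master host and that both GUCs are visible on your build:

# Report the current cluster-wide setting of each GUC
gpconfig -s gp_interconnect_transmit_timeout
gpconfig -s gp_interconnect_min_retries_before_timeout
# Or from a psql session:
# SHOW gp_interconnect_transmit_timeout;
# SHOW gp_interconnect_min_retries_before_timeout;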

 

This GitHub issue may explain the retries value and how it increments for a similar problem. Just to be clear, it may not necessarily be the same issue you are facing: https://github.com/greenplum-db/gpdb/issues/12961