VMware Tanzu Greenplum


 Spark Connector - JDBC Connection Pooling Timeout - How low is too low when closing

Dallas Hall posted Jun 26, 2019 02:15 AM

# Details

 

We are using the Spark Connector to send data from a custom version of Spark to Greenplum (version details below). During this process I sometimes hit the "too many connections" issue described at https://greenplum-spark.docs.pivotal.io/1-6/troubleshooting.html#tsh_conn_errors

 

To try to stop this, I have been reducing the time the connector waits before killing idle connections, changing it from the default of 10 seconds down to 1 second. However, I noticed that https://greenplum-spark.docs.pivotal.io/1-6/using_the_connector.html#jdbcconnpool says: "If you decrease the idle timeout for connections in the pool, the Connector closes idle connections sooner. Very short timeout values may defeat the purpose of connection pooling."

 

When idle (<IDLE>) connections are killed every second, the active connection count between Spark and Greenplum ranges between 50 and 150. At the end of the data transfer it sometimes spikes to around 240, with around 100 connections still showing a status of <IDLE>; these do not appear to be actively killed every second.
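
For anyone wanting to watch this from the database side, below is a minimal sketch (assuming psql access to the Greenplum master; spark_user is a placeholder for the role the job runs as). On Greenplum 4.3 / PostgreSQL 8.2, idle sessions show <IDLE> in pg_stat_activity.current_query:

```sql
-- Count total vs. idle connections for the role the Spark job runs as
-- ('spark_user' is a placeholder for the actual role name).
SELECT count(*) AS total_connections,
       sum(CASE WHEN current_query = '<IDLE>' THEN 1 ELSE 0 END) AS idle_connections
  FROM pg_stat_activity
 WHERE usename = 'spark_user';
```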

 

# Questions

 

This is where my question comes from. Is 1 second too low an idle timeout when using JDBC connection pooling? Or is it better to increase max_connections rather than killing off idle connections more quickly?

 

# Settings

 

GP Version - PostgreSQL 8.2.15 (Greenplum Database 4.3.7.2 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 17 2016 12:49:03

 

Spark Version - https://github.com/palantir/spark/releases/tag/2.4.0-palantir.36 (forked from Spark 2.4)

 

Spark Connector Version - 1.6.1

 

max_connections: default of 250
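
To confirm the value the cluster is actually running with, the setting can be read on the master, for example via psql (or gpconfig -s max_connections); a small sketch:

```sql
-- From psql on the Greenplum master
SHOW max_connections;                 -- 250 here (the default)
SHOW superuser_reserved_connections;  -- slots held back for superusers (3 by default)
```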

 

Scala connection code:

 

```scala
val greenplumConnectionSettings = Map[String, String](
  "url"                  -> greenplumUrl,
  "connector.port"       -> ports,
  "user"                 -> username,
  "dbschema"             -> dbschema,
  "password"             -> password,
  "dbtable"              -> dbtable,
  "iteratorOptimization" -> "false",
  "truncate"             -> "false"
)

finalDf.write
  .format(driverToUse)
  .options(greenplumConnectionSettings)
  //.mode(SaveMode.Append)   // Add rows to the table
  .mode(SaveMode.Overwrite)  // Controlled by the "truncate" flag in the connection options above
  /*
   * JDBC connection pooling:
   *   https://docs.oracle.com/cd/E13222_01/wls/docs81/ConsoleHelp/jdbc_connection_pools.html
   *   https://www.baeldung.com/java-connection-pooling
   *
   *   https://greenplum-spark.docs.pivotal.io/1-6/using_the_connector.html#jdbcconnpool
   *   https://greenplum-spark.docs.pivotal.io/1-6/options.html#lg2s_conpoolopts
   *   https://greenplum-spark.docs.pivotal.io/1-6/troubleshooting.html#tsh_conn_errors
   *
   * We must reduce pool.timeoutMs to 1000 so we kill idle connections quickly,
   * otherwise we will exceed max_connections of 250 and the job will crash.
   */
  .option("pool.maxSize", "64")     // default is 64
  .option("pool.minIdle", "0")      // default is 0
  .option("pool.timeoutMs", "1000") // default is 10,000
  .save()
```

 

 

Dallas Hall

Still running into this problem when adjusting .option("pool.maxSize", "64") down to 50. Trying 32 now. The last crash was FATAL: (53300) connection limit exceeded for non-superusers, when the connection count from the Spark Connector spiked to around 220 active connections; half were INSERT INTO and the other half were SELECT EXISTS.

 

If the maximum JDBC pool size is 50, how can 200+ active connections be established? Or are these the HTTP connections between the Spark workers and the Greenplum segments? I haven't found any way to control the number of HTTP connections in the Spark Connector.

Broadcom Employee Shawn Yan

Hello Hall,

We will work on this through support ticket 210662.

Dallas Hall

This has been resolved. The issue was a misunderstanding of how it all works together. From what I have learnt, it works like this:

 

  • When submitting a Spark job, you get allocated some number of worker nodes which each have some number of executors within them.
  • The important things to know about Spark are:
    • How many Spark executors you get.
    • How many CPU cores are allocated to your Spark Executors.
    • Without knowing the above you cannot know how many connection attempts will be made; it comes down to the luck of the draw of how busy your Spark cluster is.
  • Greenplum has a default maximum of 250 connections (max_connections).
  • Greenplum by default reserves 3 connections for super users. Thus you have 247 connections to work with.
  • The Spark Connector creates a JDBC pool for each Spark executor.
  • The total number of CPU cores across all Spark executors determines how many connections may be created between Spark and Greenplum.
  • Thus, to calculate how many executors and cores to allocate, you need to:
    • Work out how many connection slots to Greenplum you are willing to sacrifice. For us 160 was fine.
    • Based on that number of connection slots, cap the number of Spark executors and the CPU cores allocated to each.
    • TotalConnections = NumberOfSparkExecutors * CpuCoresPerExecutor (see the sizing sketch after this list).
      • We are using either 80 executors with 2 CPU cores or 40 executors with 4 CPU cores. Both seem to work just as well as each other.
      • You can control this with --num-executors and --total-executor-cores
        • https://spark.apache.org/docs/latest/submitting-applications.html
    • JdbcConnectionPool = CpuCoresPerExecutor
      • The JDBC connection pool size must be at least the number of CPU cores per executor, so that each core can actually obtain a connection. You can increase it, but you risk creating too many connections.
      • The default of 64 seems far too high.
    • JdbcIdleTime = 10000
      • The default value of 10 seconds seems to work fine.
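
As a worked example of the arithmetic above, here is a minimal sketch using our numbers; the value names are illustrative, only the keys in poolOptions are actual Connector options:

```scala
// Sizing sketch for the connection budget described above (illustrative values).
val maxConnections        = 250                       // Greenplum max_connections
val reservedForSuperusers = 3
val availableSlots        = maxConnections - reservedForSuperusers   // 247 usable

val connectionBudget      = 160                       // slots this Spark job may consume

// Pick an executor/core combination that stays within the budget,
// e.g. spark-submit --num-executors 80 --total-executor-cores 160
val numExecutors          = 80
val coresPerExecutor      = 2
val totalConnections      = numExecutors * coresPerExecutor          // 80 * 2 = 160, OK

require(totalConnections <= connectionBudget && totalConnections <= availableSlots,
  s"$totalConnections connections would exceed the available budget")

// Each executor gets its own JDBC pool, so the pool only needs to cover
// the cores inside that executor, not the whole job.
val poolOptions = Map(
  "pool.maxSize"   -> coresPerExecutor.toString,  // instead of the default 64
  "pool.minIdle"   -> "0",
  "pool.timeoutMs" -> "10000"                     // the default works fine
)
```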