VMware Tanzu Greenplum

I am using greenplum-db-6.8.1 with 3 segment servers running on EC2 instances (r6g.8xlarge). We perform reads and writes frequently, and we are seeing all the segments go down almost every day. Kindly suggest how to resolve these segment failures.

swamy reddy posted Jun 30, 2023 03:35 PM

Hi Team,

 

I am using greenplum-db-6.8.1 with 3 segment servers running on EC2 instances (r6g.8xlarge type). We use this database for our reporting functionality, so reads and writes happen very frequently.

The issue is that the segments keep going down and we have to perform a full recovery each time.
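For reference, this is roughly the sequence we run each time segments go down (standard gpstate/gprecoverseg options, shown only to illustrate our current workflow):

# Check which segments are marked down
gpstate -e

# Try incremental recovery first (resyncs only changed data)
gprecoverseg

# If incremental recovery is not possible, run a full recovery
gprecoverseg -F

# Once segments are up and synchronized, rebalance so primaries
# return to their preferred roles
gprecoverseg -r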

 

We need help with this. How do we avoid these frequent segment failures? Do we need to change the default values of gp_fts_probe_interval, gp_fts_probe_timeout, and gp_fts_probe_retries, as we are currently using the defaults?

 

https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/admin_guide-highavail-topics-g-detecting-a-failed-segment.html

 

Below is the configuration of our greenplum cluster:

 

1 Master server -- 32 CPUs and 256 GB memory

3 Segment servers -- 32 CPUs and 256 GB memory

 

Below is the current gpstate output:

20230630:19:26:59:003334 gpstate:-----:gpadmin-[INFO]:-Gathering data from segments...

.

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-Greenplum instance status summary

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-----------------------------------------------------

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Master instance                      = Active

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Master standby                      = No master standby configured

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total segment instance count from metadata        = 48

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-----------------------------------------------------

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Primary Segment Status

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-----------------------------------------------------

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total primary segments                  = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total primary segment valid (at master)          = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[WARNING]:-Total primary segment failures (at master)        = 24

       <<<<<<<<

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid files missing       = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid files found        = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid PIDs missing        = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid PIDs found         = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of /tmp lock files missing          = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of /tmp lock files found           = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number postmaster processes missing         = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number postmaster processes found          = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-----------------------------------------------------

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Mirror Segment Status

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-----------------------------------------------------

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total mirror segments                   = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total mirror segment valid (at master)          = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total mirror segment failures (at master)         = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid files missing       = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid files found        = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid PIDs missing        = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of postmaster.pid PIDs found         = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of /tmp lock files missing          = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number of /tmp lock files found           = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number postmaster processes missing         = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number postmaster processes found          = 24

20230630:19:27:00:003334 gpstate:-----:gpadmin-[WARNING]:-Total number mirror segments acting as primary segments  = 24

       <<<<<<<<

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-  Total number mirror segments acting as mirror segments  = 0

20230630:19:27:00:003334 gpstate:-----:gpadmin-[INFO]:-----------------------------------------------------

[gpadmin@----- local]$

Kevin Huang

Increasing gp_fts_probe_interval, gp_fts_probe_timeout, and gp_fts_probe_retries gives the fault-tolerance service (FTS) a longer grace period before it marks segments down. Raising these values can help prevent spurious segment failures, but it will not address the underlying problem if the issue is caused by too much I/O. You can increase these values and see if it helps; a sketch of how to set them is below.
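For example, the FTS parameters can be raised with gpconfig on the master (the values shown are illustrative only, not recommendations; check the documentation for defaults and allowed ranges):

# Set FTS parameters on the master (example values only)
gpconfig -c gp_fts_probe_interval -v 120 --masteronly
gpconfig -c gp_fts_probe_timeout -v 40 --masteronly
gpconfig -c gp_fts_probe_retries -v 10 --masteronly

# Verify the settings and reload the configuration without a restart
gpconfig -s gp_fts_probe_timeout
gpstop -u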

 

You can also check whether the following OS parameters relating to memory are set properly on all segment hosts (a quick way to verify them across hosts is sketched after the list):

vm.swappiness = 10

vm.zone_reclaim_mode = 0

vm.dirty_expire_centisecs = 500

vm.dirty_writeback_centisecs = 100

vm.dirty_background_ratio = 0 # See System Memory

vm.dirty_ratio = 0

vm.dirty_background_bytes = 1610612736

vm.dirty_bytes = 4294967296
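A minimal way to check these values on every host is with gpssh (the host file path below is just an example; substitute a file listing all of your segment hosts):

# Query the current values on all hosts (hostfile path is an example)
gpssh -f /home/gpadmin/hostfile_all -e 'sysctl vm.swappiness vm.zone_reclaim_mode vm.dirty_background_bytes vm.dirty_bytes'

# After editing /etc/sysctl.conf on each host, reload the settings
gpssh -f /home/gpadmin/hostfile_all -e 'sudo sysctl -p'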

 

Another parameter to look at is vm.min_free_kbytes, which needs to be set to roughly 3% of physical memory. At its default value, which is usually too low, it can cause segment failures.
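On a 256 GB host, 3% works out to roughly 7.7 GB (about 8053064 kB). One way to derive and persist the value on each host (run as root; this appends to /etc/sysctl.conf) is:

# Append vm.min_free_kbytes = 3% of MemTotal to /etc/sysctl.conf
awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print "vm.min_free_kbytes =", $2 * .03;}' /proc/meminfo >> /etc/sysctl.conf

# Apply the change
sysctl -p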

https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/install_guide-prep_os.html