VMware Tanzu Greenplum

 Recurring corruption on >50GB tables

adugatkin adugatkin posted Sep 01, 2018 02:09 AM

Hi,

 

This is a hard issue to pin down and explain. We are experiencing recurring corruption in our Greenplum environment (on Google Cloud). It typically occurs on larger (50GB or more) aggregated tables that we truncate and regenerate at least once a week (or sometimes just append to), and it shows up in one place or another approximately once a month on average. It only affects one table at a time, and the affected segment/segment host varies.

 

This always shows up as nonsense rows in the table, which inflate the row count. In a few of those rows the varchar fields come back with completely random values, while the numeric/time values in those rows cause errors if you attempt to view them. However, the exact error when trying to view this data varies; we have seen the following at different times:

 

'invalid page header in block xxxx of relation base/yyyy/zzzzzzz (seg14 slice1 gp02:40006 pid=5444)'

'Error: Could not temporarily connect to one or more segments'

'Connection was unexpectedly closed'
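Since the segment and host vary between incidents, it may help to pull the segment details out of each 'invalid page header' message and keep a record per incident. Below is a minimal Python sketch of that. The function name and the regular expression are my own assumptions, based only on the single message shape quoted above, so adjust them if your logs differ.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the one error message quoted above (an assumption,
# not an official log format). Block and relation path are matched loosely.
PAGE_HEADER_RE = re.compile(
    r"invalid page header in block (?P<block>\S+) of relation "
    r"(?P<relpath>\S+) "
    r"\(seg(?P<seg>\d+) slice(?P<slice>\d+) "
    r"(?P<host>[^:\s]+):(?P<port>\d+) pid=(?P<pid>\d+)\)"
)


def parse_page_header_error(message: str) -> Optional[Dict[str, str]]:
    """Return the fields of an 'invalid page header' error, or None."""
    match = PAGE_HEADER_RE.search(message)
    return match.groupdict() if match else None


sample = (
    "invalid page header in block xxxx of relation base/yyyy/zzzzzzz "
    "(seg14 slice1 gp02:40006 pid=5444)"
)
fields = parse_page_header_error(sample)
print(fields)
```

Collecting these fields over several incidents would show whether the same host, port, or segment keeps coming up, or whether it really is random.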

 

While I can find reports of other people experiencing this type of corruption when I search for these errors, they universally seem to be rare/disaster cases and not a regular once-a-month occurrence.

 

Google Cloud support cannot find any logs of corruption occurring on their end.

 

Has anyone experienced this before or have advice on how to move forward?

 

Kushal Choubay

Hi,

 

 

It can happen when a bad block gets read from disk and committed back. What Greenplum version are you using? Also, is this happening on both the primary and mirror instances on a specific host?

 

The article below can help check whether the problem exists on both primary and mirror:

 

https://community.pivotal.io/s/article/Queries-fail-with-AO-checksum-errors-on-Pivotal-Greenplum-4-x-and-5-x
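
To see which primary and mirror instances belong to the segment named in an error, one option is to look it up in the gp_segment_configuration catalog table. The small Python sketch below only builds the query text. The helper name is hypothetical, and it assumes the number in "seg14" is the segment's content ID, so please verify that against your own cluster before relying on it.

```python
def primary_mirror_query(content_id: int) -> str:
    """Build a query listing the primary/mirror pair for one content ID.

    Hypothetical helper: it only returns SQL text and does not connect
    to the database. Assumes the standard gp_segment_configuration columns.
    """
    if not isinstance(content_id, int) or isinstance(content_id, bool):
        raise TypeError("content_id must be an integer")
    return (
        "SELECT dbid, content, role, preferred_role, mode, status, "
        "hostname, port "
        "FROM gp_segment_configuration "
        f"WHERE content = {content_id} "
        "ORDER BY role;"
    )


# Example: the error above mentions seg14 (assumed to be content ID 14).
query = primary_mirror_query(14)
print(query)
```

Running the printed query from psql on the master should return one row with role 'p' and one with role 'm', which tells you the host and port of each copy to compare.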

adugatkin adugatkin

Hi Kushal,

 

We are using version 4.3.16. Unfortunately, this appears to be a version that does not include checksums. Also, our tables are not append-optimized, but it does appear that newer versions have checksums for heap tables as well, correct?

 

Would you have any recommendations on how to prevent these errors while staying on the current version?