Fail loudly with corrupted Parquet statistics by zhenxiao · Pull Request #12036 · prestodb/presto

zhenxiao · 2018-12-08T08:17:36Z

electrum

Overall this looks good to me. Something to consider, separately, is adding a data source ID like we have in OrcCorruptionException. When you have corruption, it's really useful to know exactly which file is bad.

electrum · 2018-12-08T08:22:26Z

presto-parquet/src/test/java/com/facebook/presto/parquet/TestTupleDomainParquetPredicate.java

Remove the assertEquals() since it should never be called, as we expect the test to throw.

electrum · 2018-12-08T08:23:06Z

presto-parquet/src/test/java/com/facebook/presto/parquet/TestTupleDomainParquetPredicate.java

This is fine, but you can make testing for exceptions nicer using AssertJ: http://joel-costigliola.github.io/assertj/assertj-core-features-highlight.html#exception-assertion

electrum · 2018-12-08T08:23:14Z

presto-parquet/src/test/java/com/facebook/presto/parquet/TestTupleDomainParquetPredicate.java

Same comments

electrum · 2018-12-08T08:26:28Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java

This catch block seems to be in the wrong place, since this is catching an exception from the close() call. I think this needs to be below as a (e instanceof ParquetCorruptionException) check, similar to the check for PrestoException.

zhenxiao · 2018-12-09T03:11:13Z

thank you, @electrum
I have addressed the comments. I will add the dataSourceId in a follow-up task.

nezihyigitbasi · 2018-12-11T01:31:59Z

In addition to knowing which file is bad, it would be easier to debug if we knew which column's stats are corrupted. By looking at the stack trace of the exception we can see for which type we have corrupted stats, but if we have many columns of that type then that's not useful. In short, an exception message like: Corrupted statistics for column %s in Parquet file %s would be pretty useful.

nezihyigitbasi · 2018-12-11T06:33:13Z

There is also a risk of breaking user queries that scan Parquet files with corrupted stats. Do we want to have a flag to revert to the current behavior just in case (which we can eventually remove and make this new behavior the default behavior)?

findepi · 2018-12-11T08:05:51Z

Do we want to have a flag to revert to the current behavior just in case (which we can eventually remove and make this new behavior the default behavior)?

@nezihyigitbasi IMO we should fail on corrupted stats by default -- unless some widely used Parquet writing program is notorious for writing corrupted stats.
And yes, we should have a safety flag to disable the check, at least for some transition period.

zhenxiao · 2018-12-11T09:15:55Z

@nezihyigitbasi @findepi @electrum thanks for your comments.
Comments addressed: added a dataSourceId for Parquet, and added a configuration property to control whether a query fails when it scans a Parquet file with corrupted statistics. During the transition period the check is disabled; later we could enable it, and probably remove the config, so that queries fail whenever they hit corrupted statistics.

nezihyigitbasi · 2018-12-11T21:06:58Z

IMO we should fail on corrupted stats by default -- unless some widely used Parquet writing program is notorious for writing corrupted stats.

@findepi My comment was confusing probably, sorry. I totally agree with you. We should fail on corrupt stats by default. @zhenxiao looks like this PR sets the flag to false by default btw.

zhenxiao · 2018-12-12T03:26:57Z

Oh, OK. Updated: by default, queries will now fail on corrupted statistics.

nezihyigitbasi

Thanks @zhenxiao. I made a quick pass and left some comments.

Please squash the commits (I don't think we need to have a separate commit for the config flag).
I see tests for failing loudly with corrupted stats. But, I didn't see tests for the case where we want to ignore corrupted stats, please add some tests for that.

nezihyigitbasi · 2018-12-12T23:36:30Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

Fail when scanning Parquet files with corrupted statistics

nezihyigitbasi · 2018-12-12T23:38:11Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java

"Fail when scanning Parquet files with corrupted statistics"

nezihyigitbasi · 2018-12-12T23:39:52Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/HdfsParquetDataSource.java

Don't need the toString call.

nezihyigitbasi · 2018-12-12T23:41:37Z

presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/Predicate.java

when scanning a Parquet file

nezihyigitbasi · 2018-12-12T23:45:21Z

...parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java

You can create a method like this one (feel free to find a better name) to simplify & get rid of repetition:

private static void failWithCorruptionException(boolean failOnCorruptedParquetStatistics, String column, ParquetDataSourceId id, Object min, Object max)
{
    if (failOnCorruptedParquetStatistics) {
        throw new ParquetCorruptionException(format("Corrupted statistics for column %s in Parquet file %s: min %s, max %s", column, id, min, max));
    }
}

Then calls will look like:

failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, intStatistics.genericGetMin(), intStatistics.genericGetMax());
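
For readers following along, here is a minimal, self-contained sketch of how a helper like this might behave with the flag on and off. It is not the code from the PR: the nested ParquetCorruptionException is a stand-in for the real class (assumed here to stay an IOException, hence the throws clauses), a plain String replaces ParquetDataSourceId, and the class name CorruptedStatsSketch and the checkIntRange caller are hypothetical.

```java
import java.io.IOException;

class CorruptedStatsSketch
{
    // Hypothetical stand-in for Presto's ParquetCorruptionException (assumed to be an IOException).
    static class ParquetCorruptionException
            extends IOException
    {
        ParquetCorruptionException(String message)
        {
            super(message);
        }
    }

    // Throws only when the flag is set; with the flag off this is a no-op.
    // A String stands in for ParquetDataSourceId here (assumption for illustration).
    static void failWithCorruptionException(boolean failOnCorruptedParquetStatistics, String column, String id, Object min, Object max)
            throws ParquetCorruptionException
    {
        if (failOnCorruptedParquetStatistics) {
            throw new ParquetCorruptionException(String.format(
                    "Corrupted statistics for column %s in Parquet file %s: min %s, max %s", column, id, min, max));
        }
    }

    // Hypothetical caller: returns true when min/max are usable,
    // false when they are corrupted (min > max) and the flag says to ignore them.
    static boolean checkIntRange(boolean failOnCorruptedParquetStatistics, String column, String id, long min, long max)
            throws ParquetCorruptionException
    {
        if (min > max) {
            failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, min, max);
            return false;
        }
        return true;
    }

    public static void main(String[] args)
            throws Exception
    {
        // Flag off: corrupted statistics are ignored and reported as unusable.
        System.out.println(checkIntRange(false, "DateColumn", "testFile", 200, 100));
        // Flag on: the same statistics fail loudly.
        try {
            checkIntRange(true, "DateColumn", "testFile", 200, 100);
        }
        catch (ParquetCorruptionException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the flag check inside the helper means each per-type branch only has to detect the bad min/max pair and make one call.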

zhenxiao · 2018-12-13T03:33:42Z

thank you @nezihyigitbasi
Comments addressed, commits squashed into one, and test cases added for both ignoring and failing on corrupted statistics.

nezihyigitbasi

@zhenxiao thanks for addressing the comments. I think in general this looks good. The only important issue to address is the unit tests. We only have unit tests for the float and date types for testing this new fail/ignore logic. I think we should have a unit test that tests the fail/ignore logic for all types.

nezihyigitbasi · 2018-12-13T18:59:36Z

presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetCorruptionException.java

I think it makes sense to keep this as IOException.

nezihyigitbasi · 2018-12-13T19:01:45Z

...parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java

I hope all children of Statistics type have toString implemented.

One minor change we can make to the error message is to add quotes/brackets to set the boundaries of the column/file name and stats information:

Corrupted statistics for column "column_name" in Parquet file "file_name": [stats_string]

Currently it looks like the following, which is a little bit hard to follow:

Corrupted statistics for column DateColumn in Parquet file testFile: min: 200, max: 100, num_nulls: 0
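
As a small illustration of the suggested format, the sketch below builds the message with quotes around the column and file names and brackets around the statistics string. The class and method names (CorruptionMessageSketch, corruptionMessage) are hypothetical, and the statistics are passed as a plain String rather than a Parquet Statistics object.

```java
class CorruptionMessageSketch
{
    // Quotes delimit the column and file names; brackets delimit the statistics text.
    static String corruptionMessage(String column, String file, String stats)
    {
        return String.format("Corrupted statistics for column \"%s\" in Parquet file \"%s\": [%s]", column, file, stats);
    }

    public static void main(String[] args)
    {
        // prints: Corrupted statistics for column "DateColumn" in Parquet file "testFile": [min: 200, max: 100, num_nulls: 0]
        System.out.println(corruptionMessage("DateColumn", "testFile", "min: 200, max: 100, num_nulls: 0"));
    }
}
```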

zhenxiao

thank you, @nezihyigitbasi

zhenxiao · 2018-12-14T03:38:12Z

...parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java

yes, Statistics has a toString() abstract method
sure, will add quotes and brackets

zhenxiao · 2018-12-14T04:14:14Z

presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetCorruptionException.java

OK, will keep it

zhenxiao · 2018-12-14T04:16:34Z

thank you, @nezihyigitbasi
Comments addressed; added test cases for all types.

nezihyigitbasi

LGTM % a minor comment.

@dain @electrum @findepi please let @zhenxiao know if you have any comments.

nezihyigitbasi · 2018-12-17T19:31:31Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java

Instead of relying on the error message, can't we just do:

if (e instanceof ParquetCorruptionException) {
    throw new PrestoException(HIVE_BAD_DATA, e);
}
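
To make the idea concrete, here is a self-contained sketch of translating exceptions by type rather than by message text. The nested PrestoException and ParquetCorruptionException are simplified stand-ins for the real Presto classes (the real PrestoException takes an error code supplier, not a String), and the translate method and the HIVE_CANNOT_OPEN_SPLIT fallback are assumptions for illustration, not the code in ParquetPageSourceFactory.

```java
import java.io.IOException;

class ExceptionTranslationSketch
{
    // Hypothetical stand-in for Presto's ParquetCorruptionException.
    static class ParquetCorruptionException
            extends IOException
    {
        ParquetCorruptionException(String message)
        {
            super(message);
        }
    }

    // Hypothetical stand-in for PrestoException; the error code is a plain String here.
    static class PrestoException
            extends RuntimeException
    {
        final String errorCode;

        PrestoException(String errorCode, Throwable cause)
        {
            super(cause.getMessage(), cause);
            this.errorCode = errorCode;
        }
    }

    // Decide the error code from the exception type, never from its message.
    static RuntimeException translate(Exception e)
    {
        if (e instanceof PrestoException) {
            return (PrestoException) e;
        }
        if (e instanceof ParquetCorruptionException) {
            return new PrestoException("HIVE_BAD_DATA", e);
        }
        // Fallback code chosen for illustration only.
        return new PrestoException("HIVE_CANNOT_OPEN_SPLIT", e);
    }

    public static void main(String[] args)
    {
        PrestoException translated = (PrestoException) translate(new ParquetCorruptionException("bad stats"));
        System.out.println(translated.errorCode + ": " + translated.getMessage());
    }
}
```

A type check keeps working if the message wording changes later, for example when quotes and brackets are added to it.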

zhenxiao · 2018-12-18T02:35:41Z

thank you, @nezihyigitbasi
Comments addressed.
@dain @electrum @findepi your comments are welcome

nezihyigitbasi · 2018-12-19T21:08:56Z

Thanks @zhenxiao, merged.

facebook-github-bot added the CLA Signed label Dec 8, 2018

zhenxiao mentioned this pull request Dec 8, 2018

Add support for DATE predicate pushdown with Parquet via min/max and … #10181

Closed

electrum reviewed Dec 8, 2018

zhenxiao force-pushed the parquet-fail branch from 80673bb to 853176a Compare December 9, 2018 03:09

zhenxiao force-pushed the parquet-fail branch from 853176a to 6ecdc56 Compare December 11, 2018 08:16

zhenxiao force-pushed the parquet-fail branch from 21c1908 to 2b69102 Compare December 12, 2018 03:25

nezihyigitbasi requested changes Dec 12, 2018

zhenxiao force-pushed the parquet-fail branch from 2b69102 to 2a8326a Compare December 13, 2018 03:30

nezihyigitbasi requested changes Dec 13, 2018

zhenxiao commented Dec 14, 2018

zhenxiao force-pushed the parquet-fail branch from 2a8326a to 5215fb1 Compare December 14, 2018 04:15

zhenxiao force-pushed the parquet-fail branch 2 times, most recently from 27e61cd to 2022f2c Compare December 14, 2018 06:33

nezihyigitbasi approved these changes Dec 17, 2018

Fail loudly with corrupted Parquet statistics

8ad30cd

zhenxiao force-pushed the parquet-fail branch from 2022f2c to 8ad30cd Compare December 18, 2018 02:34

nezihyigitbasi merged commit 0bfe80b into prestodb:master Dec 19, 2018

nezihyigitbasi mentioned this pull request Dec 19, 2018

Release notes for 0.216 #12081

Closed

14 tasks

zhenxiao deleted the parquet-fail branch March 3, 2020 22:14
