Fail loudly with corrupted Parquet statistics#12036
Fail loudly with corrupted Parquet statistics#12036nezihyigitbasi merged 1 commit intoprestodb:masterfrom
Conversation
electrum
left a comment
There was a problem hiding this comment.
Overall this looks good to me. Something to consider, separately, is adding a data source ID like we have in OrcCorruptionException. When you have corruption, it's really useful to know exactly which file is bad.
There was a problem hiding this comment.
Remove the assertEquals() since it should never be called, as we expect the test to throw.
There was a problem hiding this comment.
This is fine, but you can make testing for exceptions nicer using AssertJ: http://joel-costigliola.github.io/assertj/assertj-core-features-highlight.html#exception-assertion
There was a problem hiding this comment.
This catch block seems to be in the wrong place, since this is catching an exception from the close() call. I think this needs to be below as a (e instanceof ParquetCorruptionException) check, similar to the check for PrestoException.
80673bb to
853176a
Compare
|
thank you, @electrum |
|
In addition to knowing which file is bad, it would be easier to debug if we know which column's stats is corrupted. By looking at the stack trace of the exception we can see for which type we have corrupted stats, but if we have many columns of that type then that's not useful. In short, an exception message like: |
|
There is also a risk of breaking user queries that scan Parquet files with corrupted stats. Do we want to have a flag to revert to the current behavior just in case (which we can eventually remove and make this new behavior the default behavior)? |
@nezihyigitbasi IMO we should fail on corrupted stats by default -- unless some widely used Parquet writing program is notorious for writing corrupted stats. |
853176a to
6ecdc56
Compare
|
@nezihyigitbasi @findepi @electrum thanks ur comments |
@findepi My comment was confusing probably, sorry. I totally agree with you. We should fail on corrupt stats by default. @zhenxiao looks like this PR sets the flag to false by default btw. |
21c1908 to
2b69102
Compare
|
oh, OK. Get it updated. by default, will fail on corrupted statistics. |
There was a problem hiding this comment.
Thanks @zhenxiao. I made a quick pass and left some comments.
- Please squash the commits (I don't think we need to have a separate commit for the config flag).
- I see tests for failing loudly with corrupted stats. But, I didn't see tests for the case where we want to ignore corrupted stats, please add some tests for that.
There was a problem hiding this comment.
Fail when scanning Parquet files with corrupted statistics
There was a problem hiding this comment.
"Fail when scanning Parquet files with corrupted statistics"
There was a problem hiding this comment.
Don't need the toString call.
There was a problem hiding this comment.
when scanning a Parquet file
There was a problem hiding this comment.
You can create a method like this one (feel free to find a better name) to simplify & get rid of repetition:
private static void failWithCorruptionException(boolean failOnCorruptedParquetStatistics, String column, ParquetDataSourceId id, Object min, Object max)
{
if (failOnCorruptedParquetStatistics) {
throw new ParquetCorruptionException(format("Corrupted statistics for column %s in Parquet file %s: min %s, max %s", column, id, min, max));
}
}
Then calls will look like:
failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, intStatistics.genericGetMin(), intStatistics.genericGetMax());
2b69102 to
2a8326a
Compare
|
thank you @nezihyigitbasi |
nezihyigitbasi
left a comment
There was a problem hiding this comment.
@zhenxiao thanks for addressing the comments. I think in general this looks good. The only important issue to address is the unit tests. We only have unit tests for the float and date types for testing this new fail/ignore logic. I think we should have a unit test that tests the fail/ignore logic for all types.
There was a problem hiding this comment.
I think it makes sense to keep this as IOException.
There was a problem hiding this comment.
I hope all children of Statistics type have toString implemented.
There was a problem hiding this comment.
One minor change we can make to the error message is to add quotes/brackets to set the boundaries of the column/file name and stats information:
Corrupted statistics for column "column_name" in Parquet file "file_name": [stats_string]
Currently it looks like the following, which is a little bit hard to follow:
Corrupted statistics for column DateColumn in Parquet file testFile: min: 200, max: 100, num_nulls: 0
There was a problem hiding this comment.
yes, Statistics has a toString() abstract method
sure, will add quotes and brackets
zhenxiao
left a comment
There was a problem hiding this comment.
thank you, @nezihyigitbasi
There was a problem hiding this comment.
yes, Statistics has a toString() abstract method
sure, will add quotes and brackets
2a8326a to
5215fb1
Compare
|
thank you, @nezihyigitbasi |
27e61cd to
2022f2c
Compare
There was a problem hiding this comment.
Instead of relying on the error message, can't we just do:
if (e instanceof ParquetCorruptionException)
throw new PrestoException(HIVE_BAD_DATA, e);
}
2022f2c to
8ad30cd
Compare
|
thank you, @nezihyigitbasi |
|
Thanks @zhenxiao, merged. |
@nezihyigitbasi @dain @findepi @qqibrow