Fail loudly with corrupted Parquet statistics #12036

Merged
nezihyigitbasi merged 1 commit into prestodb:master from zhenxiao:parquet-fail
Dec 19, 2018

Conversation

@zhenxiao
Collaborator

@zhenxiao zhenxiao commented Dec 8, 2018

Contributor

@electrum electrum left a comment

Overall this looks good to me. Something to consider, separately, is adding a data source ID like we have in OrcCorruptionException. When you have corruption, it's really useful to know exactly which file is bad.

Contributor

Remove the assertEquals() since it should never be called, as we expect the test to throw.
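
A minimal plain-Java sketch of that point: when the call under test is expected to throw, an assertion on its return value is dead code. The names `matches` and `expectCorruption` are hypothetical stand-ins, not identifiers from this PR.

```java
class ExpectedFailureSketch {
    // Hypothetical code under test: rejects a min/max range that is inverted.
    static boolean matches(long min, long max) {
        if (min > max) {
            throw new IllegalStateException("Corrupted statistics: min " + min + ", max " + max);
        }
        return true;
    }

    // Runs the call and returns the exception it threw.
    // No assertion on the return value of matches() is needed: if control
    // gets past the call, the check has already failed.
    static IllegalStateException expectCorruption(long min, long max) {
        try {
            matches(min, max);
        }
        catch (IllegalStateException e) {
            return e;
        }
        throw new AssertionError("expected an exception for min " + min + ", max " + max);
    }

    public static void main(String[] args) {
        System.out.println(expectCorruption(200, 100).getMessage());
    }
}
```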

Contributor

This is fine, but you can make testing for exceptions nicer using AssertJ: http://joel-costigliola.github.io/assertj/assertj-core-features-highlight.html#exception-assertion

Contributor

Same comments as above.

Contributor

This catch block seems to be in the wrong place, since this is catching an exception from the close() call. I think this needs to be below as a (e instanceof ParquetCorruptionException) check, similar to the check for PrestoException.
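
A rough sketch of the structure this comment describes, under simplifying assumptions: the exception classes are small stand-ins for the real Presto types, error codes are plain strings, `Opener` is a hypothetical interface, and `OTHER_ERROR` is a placeholder code. The point is only that the corruption check sits next to the `PrestoException` check on the original exception, not in the catch around `close()`.

```java
import java.io.Closeable;
import java.io.IOException;

// Stand-ins for the real exception types (assumed shapes, not the Presto classes).
class ParquetCorruptionException extends IOException {
    ParquetCorruptionException(String message) { super(message); }
}

class PrestoException extends RuntimeException {
    final String errorCode;

    PrestoException(String errorCode, Throwable cause) {
        super(cause.getMessage(), cause);
        this.errorCode = errorCode;
    }
}

class PageSourceFactorySketch {
    // Hypothetical hook standing in for "open the file and build a reader".
    interface Opener {
        String open() throws Exception;
    }

    static String createPageSource(Closeable dataSource, Opener opener) {
        try {
            return opener.open();
        }
        catch (Exception e) {
            try {
                dataSource.close();
            }
            catch (IOException ignored) {
                // This exception comes from close(), not from reading the file,
                // so it is the wrong place to look for corruption.
            }
            if (e instanceof PrestoException) {
                throw (PrestoException) e;
            }
            if (e instanceof ParquetCorruptionException) {
                throw new PrestoException("HIVE_BAD_DATA", e);
            }
            throw new PrestoException("OTHER_ERROR", e); // placeholder code
        }
    }
}
```

With this shape, a corruption error raised while opening the file surfaces as `HIVE_BAD_DATA` whether or not `close()` itself also fails.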

@zhenxiao
Collaborator Author

zhenxiao commented Dec 9, 2018

Thank you, @electrum.
I've addressed the comments. I'll add the dataSourceId in a follow-up task.

@nezihyigitbasi
Contributor

In addition to knowing which file is bad, it would be easier to debug if we knew which column's stats are corrupted. By looking at the stack trace of the exception we can see for which type we have corrupted stats, but if we have many columns of that type then that's not enough to identify the column. In short, an exception message like "Corrupted statistics for column %s in Parquet file %s" would be pretty useful.

@nezihyigitbasi
Contributor

There is also a risk of breaking user queries that scan Parquet files with corrupted stats. Do we want to have a flag to revert to the current behavior just in case (which we can eventually remove and make this new behavior the default behavior)?

@findepi
Contributor

findepi commented Dec 11, 2018

> Do we want to have a flag to revert to the current behavior just in case (which we can eventually remove and make this new behavior the default behavior)?

@nezihyigitbasi IMO we should fail on corrupted stats by default -- unless some widely used Parquet writing program is notorious for writing corrupted stats.
And yes, we should have a safety flag to disable the check, at least for some transition period.

@zhenxiao
Collaborator Author

@nezihyigitbasi @findepi @electrum thanks for your comments.
I've addressed them: added a dataSourceId for Parquet, and added a configuration option to control whether a query fails when it scans a Parquet file with corrupted statistics. During the transition period the check is disabled; later we could enable it, and probably remove the config, so that queries fail whenever they hit corrupted statistics.

@nezihyigitbasi
Contributor

> IMO we should fail on corrupted stats by default -- unless some widely used Parquet writing program is notorious for writing corrupted stats.

@findepi My comment was probably confusing, sorry. I totally agree with you. We should fail on corrupt stats by default. @zhenxiao it looks like this PR sets the flag to false by default, btw.

@zhenxiao
Collaborator Author

Oh, OK. Updated: by default, queries will now fail on corrupted statistics.
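
The agreed default could be sketched as a config holder whose flag starts out `true` and can be switched off during the transition period. The class and method names below are hypothetical, and no real Presto property name or config annotation is implied.

```java
class ParquetStatisticsConfigSketch {
    // Fail by default; the flag is only an escape hatch for the transition period.
    private boolean failOnCorruptedStatistics = true;

    boolean isFailOnCorruptedStatistics() {
        return failOnCorruptedStatistics;
    }

    // Fluent setter so the override can be chained when building a config.
    ParquetStatisticsConfigSketch setFailOnCorruptedStatistics(boolean value) {
        this.failOnCorruptedStatistics = value;
        return this;
    }

    public static void main(String[] args) {
        // Default behavior: fail loudly.
        System.out.println(new ParquetStatisticsConfigSketch().isFailOnCorruptedStatistics());
        // Explicit opt-out: ignore corrupted statistics.
        System.out.println(new ParquetStatisticsConfigSketch()
                .setFailOnCorruptedStatistics(false)
                .isFailOnCorruptedStatistics());
    }
}
```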

Contributor

@nezihyigitbasi nezihyigitbasi left a comment

Thanks @zhenxiao. I made a quick pass and left some comments.

  • Please squash the commits (I don't think we need to have a separate commit for the config flag).
  • I see tests for failing loudly with corrupted stats. But, I didn't see tests for the case where we want to ignore corrupted stats, please add some tests for that.

Contributor

Fail when scanning Parquet files with corrupted statistics

Contributor

"Fail when scanning Parquet files with corrupted statistics"

Contributor

Don't need the toString call.

Contributor

when scanning a Parquet file

Contributor

You can create a method like this one (feel free to find a better name) to simplify & get rid of repetition:

private static void failWithCorruptionException(boolean failOnCorruptedParquetStatistics, String column, ParquetDataSourceId id, Object min, Object max)
{
    if (failOnCorruptedParquetStatistics) {
        throw new ParquetCorruptionException(format("Corrupted statistics for column %s in Parquet file %s: min %s, max %s", column, id, min, max));
    }
}

Then calls will look like:

failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, intStatistics.genericGetMin(), intStatistics.genericGetMax());
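
A self-contained sketch of how such a helper could behave in both modes. The exception class is a stand-in assumed to be a checked `IOException`, the data source ID is simplified to a `String`, and `isUsable` is a hypothetical caller added only for illustration.

```java
import java.io.IOException;

// Stand-in for the real exception type; assumed here to be a checked IOException.
class ParquetCorruptionException extends IOException {
    ParquetCorruptionException(String message) { super(message); }
}

class CorruptedStatisticsSketch {
    // Throws only when the flag is set; otherwise returns and lets the caller
    // treat the statistics as unusable.
    static void failWithCorruptionException(boolean failOnCorruptedParquetStatistics, String column, String id, Object min, Object max)
            throws ParquetCorruptionException {
        if (failOnCorruptedParquetStatistics) {
            throw new ParquetCorruptionException(String.format(
                    "Corrupted statistics for column %s in Parquet file %s: min %s, max %s", column, id, min, max));
        }
    }

    // Hypothetical caller: reports whether the min/max range can be trusted.
    static boolean isUsable(boolean failOnCorrupted, String column, String id, long min, long max)
            throws ParquetCorruptionException {
        if (min > max) {
            failWithCorruptionException(failOnCorrupted, column, id, min, max);
            return false; // ignore mode: skip statistics-based filtering
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Ignore mode: the inverted range is reported as unusable, no exception.
        System.out.println(isUsable(false, "DateColumn", "testFile", 200, 100));
        // Fail mode: the same range raises the corruption exception.
        try {
            isUsable(true, "DateColumn", "testFile", 200, 100);
        }
        catch (ParquetCorruptionException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the flag check inside the helper means each per-type branch needs only one call, which is the repetition the suggestion aims to remove.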

@zhenxiao
Collaborator Author

Thank you, @nezihyigitbasi.
I've addressed the comments, squashed everything into one commit, and added test cases for both ignoring and failing on corrupted statistics.

Contributor

@nezihyigitbasi nezihyigitbasi left a comment

@zhenxiao thanks for addressing the comments. I think in general this looks good. The only important issue to address is the unit tests. We only have unit tests for the float and date types for testing this new fail/ignore logic. I think we should have a unit test that tests the fail/ignore logic for all types.

Contributor

I think it makes sense to keep this as IOException.

Collaborator Author

OK, will keep it

Contributor

I hope all children of Statistics type have toString implemented.

Contributor

One minor change we can make to the error message is to add quotes/brackets to set the boundaries of the column/file name and stats information:

Corrupted statistics for column "column_name" in Parquet file "file_name": [stats_string]

Currently it looks like the following, which is a little bit hard to follow:

Corrupted statistics for column DateColumn in Parquet file testFile: min: 200, max: 100, num_nulls: 0
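
The proposed format can be sketched with a plain format string. The method name is hypothetical, and the statistics string is passed in as already-rendered text (as a `toString()` result would be).

```java
class CorruptionMessageSketch {
    // Quotes delimit the column and file names; brackets delimit the statistics.
    static String corruptedStatisticsMessage(String column, String file, String stats) {
        return String.format(
                "Corrupted statistics for column \"%s\" in Parquet file \"%s\": [%s]",
                column, file, stats);
    }

    public static void main(String[] args) {
        System.out.println(corruptedStatisticsMessage(
                "DateColumn", "testFile", "min: 200, max: 100, num_nulls: 0"));
        // prints: Corrupted statistics for column "DateColumn" in Parquet file "testFile": [min: 200, max: 100, num_nulls: 0]
    }
}
```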

Collaborator Author

Yes, Statistics has an abstract toString() method.
Sure, I'll add quotes and brackets.

Collaborator Author

@zhenxiao zhenxiao left a comment

thank you, @nezihyigitbasi

@zhenxiao
Collaborator Author

Thank you, @nezihyigitbasi.
Comments addressed; added test cases for all types.

@zhenxiao zhenxiao force-pushed the parquet-fail branch 2 times, most recently from 27e61cd to 2022f2c Compare December 14, 2018 06:33
Contributor

@nezihyigitbasi nezihyigitbasi left a comment

LGTM % a minor comment.

@dain @electrum @findepi please let @zhenxiao know if you have any comments.

Contributor

Instead of relying on the error message, can't we just do:

if (e instanceof ParquetCorruptionException) {
    throw new PrestoException(HIVE_BAD_DATA, e);
}

@zhenxiao
Collaborator Author

Thank you, @nezihyigitbasi.
Comments addressed.
@dain @electrum @findepi your comments are welcome.

@nezihyigitbasi nezihyigitbasi merged commit 0bfe80b into prestodb:master Dec 19, 2018
@nezihyigitbasi
Contributor

Thanks @zhenxiao, merged.

@nezihyigitbasi nezihyigitbasi mentioned this pull request Dec 19, 2018
14 tasks
@zhenxiao zhenxiao deleted the parquet-fail branch March 3, 2020 22:14