-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1217: Incorrect handling of missing values in Statistics #458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
@rdblue, would you also take a look if your time allows? Thanks.
| } | ||
|
|
||
| if (hasNulls(meta)) { | ||
| if (stats.isNumNullsSet() && hasNulls(meta)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional) You might consider adding the "stats.isNumNullsSet()" clause to the body of hasNulls() itself, but I don't really have a strong opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would not work correctly because hasNulls is called negated as well.
| (String createdBy, Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder) { | ||
| // create stats object based on the column type | ||
| org.apache.parquet.column.statistics.Statistics stats = org.apache.parquet.column.statistics.Statistics.createStats(type); | ||
| StatisticsBuilder builder = org.apache.parquet.column.statistics.Statistics.getBuilder(type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional, nit): I think stats would be more descriptive regarding the purpose of this variable, even if it is a builder. Alternatively, statsBuilder.
|
|
||
| private static final IntStatistics intStats = new IntStatistics(); | ||
| private static final IntStatistics nullIntStats = new IntStatistics(); | ||
| private static final org.apache.parquet.column.statistics.Statistics<?> missingMinMaxIntStats = org.apache.parquet.column.statistics.Statistics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only man and max are missing from this one, but numNulls as well. Would missingIntStats be a better name? Alternatively, set numNulls.
| getDoubleColumnMeta(doubleStats, 177L)); | ||
|
|
||
| private static final List<ColumnChunkMetaData> missingMinMaxColumnMetas = Arrays.asList( | ||
| getIntColumnMeta(missingMinMaxIntStats, 177L), // missing min/max, no null values => stats is empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify: Is it really the lack of null values or the lack of numNulls?
| public static class AllPositiveUdp extends UserDefinedPredicate<Double> { | ||
| @Override | ||
| public boolean keep(Double value) { | ||
| if (value == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should/could this be an assert instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more an UDP implementation for testing purposes than the test itself. In addition, we have to return a boolean even if an assertion is added.
| w.writeDataPage(2, 4, BytesInput.from(BYTES2), STATS2, BIT_PACKED, BIT_PACKED, PLAIN); | ||
| w.writeDataPage(3, 4, BytesInput.from(BYTES2), STATS2, BIT_PACKED, BIT_PACKED, PLAIN); | ||
| w.writeDataPage(1, 4, BytesInput.from(BYTES2), STATS2, BIT_PACKED, BIT_PACKED, PLAIN); | ||
| w.writeDataPage(2, 4, BytesInput.from(BYTES2), EMPTY_STATS, BIT_PACKED, BIT_PACKED, PLAIN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what purpose STATS1 and STATS2 might have served in the original code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea, why we had two Statistics objects for the same purpose. I think, it is more readable this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I don't remember either.
|
@gszadovszky, can you update the description with a summary of the problem and how this fixes it? It's easier to review a patch if I know the general idea behind it. |
|
@rdblue, until Gabor updates the description, you can read about the error in the JIRA: https://issues.apache.org/jira/browse/PARQUET-1217 |
|
Have any engines actually written stats with |
| /** | ||
| * Builder class to build Statistics objects. Used to read the statistics from the Parquet file. | ||
| */ | ||
| public static class StatisticsBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this to the public API?
Because this is nested, it should be Statistics.Builder, not Statistics.StatisticsBuilder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be public because ParquetMetadataConverter uses it. The concept is that you are only able to initialize num_nulls to -1 (meaning it is not set in the file) by using the builder (read path). If you are creating a new Statistics object in any other way (write path) num_nulls will be initialized to 0.
I'll rename it to Builder.
|
Impala started to write the new |
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @rdblue, are you okay with me merging this tomorrow? Thanks.
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios: - null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs fix: check if min/max is set before comparing to the related values - null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering Author: Gabor Szadovszky <[email protected]> Closes apache#458 from gszadovszky/PARQUET-1217 and squashes the following commits: 9d14090 [Gabor Szadovszky] Updates according to rdblue's comments 116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount 2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios: - null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs fix: check if min/max is set before comparing to the related values - null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering Author: Gabor Szadovszky <[email protected]> Closes apache#458 from gszadovszky/PARQUET-1217 and squashes the following commits: 9d14090 [Gabor Szadovszky] Updates according to rdblue's comments 116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount 2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios: - null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs fix: check if min/max is set before comparing to the related values - null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering Author: Gabor Szadovszky <[email protected]> Closes apache#458 from gszadovszky/PARQUET-1217 and squashes the following commits: 9d14090 [Gabor Szadovszky] Updates according to rdblue's comments 116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount 2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics This change is based on b82d962 but is not a clean cherry-pick.
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios: - null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs fix: check if min/max is set before comparing to the related values - null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering Author: Gabor Szadovszky <[email protected]> Closes apache#458 from gszadovszky/PARQUET-1217 and squashes the following commits: 9d14090 [Gabor Szadovszky] Updates according to rdblue's comments 116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount 2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics This change is based on b82d962 but is not a clean cherry-pick.
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios: - null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs fix: check if min/max is set before comparing to the related values - null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering Author: Gabor Szadovszky <[email protected]> Closes #458 from gszadovszky/PARQUET-1217 and squashes the following commits: 9d14090 [Gabor Szadovszky] Updates according to rdblue's comments 116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount 2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics This change is based on b82d962 but is not a clean cherry-pick.
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios:
fix: check if min/max is set before comparing to the related values
fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering