Contributor

@gszadovszky gszadovszky commented Feb 19, 2018

In parquet-format, every field in Statistics is optional, but parquet-mr does not properly handle these scenarios:

  • null_count is set but min/max or min_value/max_value are not: filtering may fail with an NPE or produce incorrect results
    fix: check whether min/max is set before comparing against them
  • null_count is not set: filtering treats null_count as if it were 0, so incorrect filtering may occur
    fix: introduce a new method in the Statistics object to check whether num_nulls is set, and call it before using the value for filtering
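
The two fixes described above can be illustrated with a minimal sketch. It is not the parquet-mr code: `ChunkStatsSketch`, `canDropEq`, and `canDropIsNull` are hypothetical names, and the representation (a null `Integer` for an unset bound, -1 for an unset null count) is an assumption made for the example. The idea is that a chunk may only be skipped when the statistics that justify skipping it are actually present.

```java
// Hypothetical, simplified stand-in for column chunk statistics.
// This is not the parquet-mr Statistics class; all names are illustrative.
final class ChunkStatsSketch {
    final Integer min;     // null: min was not set in the file
    final Integer max;     // null: max was not set in the file
    final long numNulls;   // -1: null_count was not set in the file
    final long valueCount; // total values in the chunk, nulls included

    ChunkStatsSketch(Integer min, Integer max, long numNulls, long valueCount) {
        this.min = min;
        this.max = max;
        this.numNulls = numNulls;
        this.valueCount = valueCount;
    }

    boolean hasMinMax() {
        return min != null && max != null;
    }

    boolean isNumNullsSet() {
        return numNulls >= 0;
    }

    /** True only if the chunk provably holds no row where column == value. */
    static boolean canDropEq(ChunkStatsSketch s, int value) {
        if (s.isNumNullsSet() && s.numNulls == s.valueCount) {
            return true; // every value is null, so nothing equals a non-null value
        }
        if (!s.hasMinMax()) {
            return false; // no bounds to compare against: keep the chunk
        }
        return value < s.min || value > s.max;
    }

    /** True only if the chunk provably holds no null in the column. */
    static boolean canDropIsNull(ChunkStatsSketch s) {
        // An unset null count must not be read as "zero nulls".
        return s.isNumNullsSet() && s.numNulls == 0;
    }
}
```

With this shape, statistics that carry a null count but no min/max never reach a comparison against a missing bound, and statistics without a null count never cause a chunk to be dropped for an "is null" predicate.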

Contributor

@zivanfi zivanfi left a comment

LGTM, thanks.

@rdblue, would you also take a look if your time allows? Thanks.

}

- if (hasNulls(meta)) {
+ if (stats.isNumNullsSet() && hasNulls(meta)) {
Contributor

(optional) You might consider adding the "stats.isNumNullsSet()" clause to the body of hasNulls() itself, but I don't really have a strong opinion on this.

Contributor Author

It would not work correctly because hasNulls is called negated as well.
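
One reading of this reply, sketched under assumptions: if the "is set" check were folded into `hasNulls()`, then the negated call `!hasNulls(...)` would return true for statistics whose null count is unset, which is the same "treat unset as 0" behavior this PR removes. The class and method names below are hypothetical, and the null count is passed as a plain `long` with -1 meaning "unset" rather than through the real metadata objects.

```java
// Hypothetical sketch; the real hasNulls() in parquet-mr takes column chunk metadata.
final class HasNullsSketch {
    static final long UNSET = -1;

    // Helper as it stands: only meaningful when the null count is set.
    static boolean hasNulls(long numNulls) {
        return numNulls > 0;
    }

    // Variant suggested in review: fold the "is set" check into the helper.
    static boolean hasNullsFolded(long numNulls) {
        return numNulls != UNSET && numNulls > 0;
    }

    // Negated use with the folded helper, e.g. for a "column is null" predicate:
    // drop the chunk when it has no nulls. For UNSET this returns true, which is wrong.
    static boolean canDropIsNullFolded(long numNulls) {
        return !hasNullsFolded(numNulls);
    }

    // Negated use with an explicit guard at the call site.
    static boolean canDropIsNullGuarded(long numNulls) {
        return numNulls != UNSET && !hasNulls(numNulls);
    }
}
```

The two variants agree whenever the null count is set. They differ only for the unset case, where the folded helper would let the chunk be dropped even though nothing is known about its nulls.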

(String createdBy, Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder) {
// create stats object based on the column type
- org.apache.parquet.column.statistics.Statistics stats = org.apache.parquet.column.statistics.Statistics.createStats(type);
+ StatisticsBuilder builder = org.apache.parquet.column.statistics.Statistics.getBuilder(type);
Contributor

(optional, nit): I think stats would be more descriptive regarding the purpose of this variable, even if it is a builder. Alternatively, statsBuilder.


private static final IntStatistics intStats = new IntStatistics();
private static final IntStatistics nullIntStats = new IntStatistics();
private static final org.apache.parquet.column.statistics.Statistics<?> missingMinMaxIntStats = org.apache.parquet.column.statistics.Statistics
Contributor

Not only min and max are missing from this one, but numNulls as well. Would missingIntStats be a better name? Alternatively, set numNulls.

getDoubleColumnMeta(doubleStats, 177L));

private static final List<ColumnChunkMetaData> missingMinMaxColumnMetas = Arrays.asList(
getIntColumnMeta(missingMinMaxIntStats, 177L), // missing min/max, no null values => stats is empty
Contributor

Please clarify: Is it really the lack of null values or the lack of numNulls?

public static class AllPositiveUdp extends UserDefinedPredicate<Double> {
@Override
public boolean keep(Double value) {
if (value == null) {
Contributor

Should/could this be an assert instead?

Contributor Author

This is more a UDP implementation used for testing than the test itself. In addition, we would still have to return a boolean even if an assertion were added.

w.writeDataPage(2, 4, BytesInput.from(BYTES2), STATS2, BIT_PACKED, BIT_PACKED, PLAIN);
w.writeDataPage(3, 4, BytesInput.from(BYTES2), STATS2, BIT_PACKED, BIT_PACKED, PLAIN);
w.writeDataPage(1, 4, BytesInput.from(BYTES2), STATS2, BIT_PACKED, BIT_PACKED, PLAIN);
w.writeDataPage(2, 4, BytesInput.from(BYTES2), EMPTY_STATS, BIT_PACKED, BIT_PACKED, PLAIN);
Contributor

I wonder what purpose STATS1 and STATS2 might have served in the original code.

Contributor Author

I have no idea why we had two Statistics objects for the same purpose. I think it is more readable this way.

Contributor

+1

I don't remember either.

Contributor

rdblue commented Feb 21, 2018

@gszadovszky, can you update the description with a summary of the problem and how this fixes it? It's easier to review a patch if I know the general idea behind it.

Contributor

zivanfi commented Feb 21, 2018

@rdblue, until Gabor updates the description, you can read about the error in the JIRA: https://issues.apache.org/jira/browse/PARQUET-1217

Contributor

rdblue commented Feb 21, 2018

Have any engines actually written stats with num_nulls set, min and max not set, and num_nulls != num_values? That's the case that this is fixing, right? I don't think Parquet MR will write any stats if min and max are not written.

/**
* Builder class to build Statistics objects. Used to read the statistics from the Parquet file.
*/
public static class StatisticsBuilder {
Contributor

Why add this to the public API?

Because this is nested, it should be Statistics.Builder, not Statistics.StatisticsBuilder.

Contributor Author

It needs to be public because ParquetMetadataConverter uses it. The concept is that you are only able to initialize num_nulls to -1 (meaning it is not set in the file) by using the builder (read path). If you are creating a new Statistics object in any other way (write path) num_nulls will be initialized to 0.
I'll rename it to Builder.
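
The read path / write path distinction described here could look roughly like the sketch below. It is an illustration under assumptions, not the parquet-mr API: `StatsSketch`, `create()`, `builder()`, and `withNumNulls()` are hypothetical names. Only the builder can leave the null count at -1, so an object created for writing always starts from a known count of 0.

```java
// Hypothetical sketch of the idea; not the parquet-mr Statistics class.
final class StatsSketch {
    private long numNulls;

    private StatsSketch(long numNulls) {
        this.numNulls = numNulls;
    }

    // Write path: a fresh object starts counting nulls from 0.
    static StatsSketch create() {
        return new StatsSketch(0);
    }

    // Read path: the builder starts from "not set in the file".
    static Builder builder() {
        return new Builder();
    }

    boolean isNumNullsSet() {
        return numNulls >= 0;
    }

    long getNumNulls() {
        return numNulls;
    }

    void incrementNumNulls() {
        numNulls++;
    }

    static final class Builder {
        private long numNulls = -1; // -1 means the file did not carry a null count

        Builder withNumNulls(long numNulls) {
            this.numNulls = numNulls;
            return this;
        }

        StatsSketch build() {
            return new StatsSketch(numNulls);
        }
    }
}
```

A converter reading file metadata would call `withNumNulls` only when the field is present in the file, so `isNumNullsSet()` reflects what was actually written.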

@gszadovszky
Contributor Author

Impala started to write the new min_value/max_value fields with num_nulls set. In the same CDH release we use a parquet-mr release to which the handling of the new min_value/max_value has not been backported, so we see only unset min and max values together with a set num_nulls.
In addition, there was an Impala release that wrote no num_nulls at all. That is correct from the parquet-format point of view, but parquet-mr handled an unset num_nulls as if it were 0.

Contributor

@zivanfi zivanfi left a comment

LGTM. @rdblue, are you okay with me merging this tomorrow? Thanks.

@asfgit asfgit closed this in b82d962 Feb 27, 2018
pbruneton pushed a commit to criteo-forks/parquet-mr that referenced this pull request Mar 22, 2018
In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios:
- null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs
  fix: check if min/max is set before comparing to the related values
- null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur
  fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering

Author: Gabor Szadovszky <[email protected]>

Closes apache#458 from gszadovszky/PARQUET-1217 and squashes the following commits:

9d14090 [Gabor Szadovszky] Updates according to rdblue's comments
116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments
c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount
2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics
legend-hua pushed a commit to legend-hua/parquet-mr that referenced this pull request Mar 23, 2018
gszadovszky pushed a commit to gszadovszky/parquet-mr that referenced this pull request Apr 19, 2018
This change is based on b82d962 but is not a clean cherry-pick.
gszadovszky pushed a commit to gszadovszky/parquet-mr that referenced this pull request Apr 20, 2018
This change is based on b82d962 but is not a clean cherry-pick.
zivanfi pushed a commit that referenced this pull request Apr 24, 2018
This change is based on b82d962 but is not a clean cherry-pick.