-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2261: Implement SizeStatistics #1177
Conversation
I have drafted the POC to read/write SizeStatistics. The feature implementation should be complete and associated tests will be added progressively. Please take a look when you have time. Thanks! @emkornfield cc @mapleFU |
Thanks @wgtmac, this looks great! I'm not sure if this is in scope for this PR, but it would be nice if the CLI was aware of the changes. Specifically, it would be great if the |
Thanks for the suggestion! Yes, it definitely should be done. @etseidl |
@wgtmac took a scan through and this generally seems like what I expected. Thank you for doing it. Agree unit tests are needed. |
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
cc0d75d
to
0acf99f
Compare
26ced88
to
4ff9d3d
Compare
339d397
to
ed3a89e
Compare
I have just rebased on the latest master branch and fixed all CI falures. As this PR gets too large, I will add print cli command and rewriter support for |
cc @ConeyLiu as I have modified mergeColumnStatistics method which you've just refactored. |
Thank @wgtmac for your notification. |
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndex.java
Outdated
Show resolved
Hide resolved
...-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndexBuilder.java
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java
Outdated
Show resolved
Hide resolved
Took another pass through, I'm less familiar with Parquet MR but overall looks ok to me (with the exception of confirming if the one place I found we should be changing length to 0) |
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
eb7a1a5
to
aa69b35
Compare
LGTM @ConeyLiu @etseidl @emkornfield Do you still have pending comments? |
Looks good to me too. I'd still like to see the CLI changed at some point to print the new statistics, but if no one else has cycles to work on that, I could try cleaning up what I have locally. |
@etseidl I have filed https://issues.apache.org/jira/browse/PARQUET-2433 and https://issues.apache.org/jira/browse/PARQUET-2434 as follow-up work items. Will work on them once this PR gets merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 thanks for the great work.
I think all of my suggestions have been addressed. Thanks @wgtmac ! |
@gszadovszky It would be good if you can take a look if possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some comments but LGTM overall.
This change is about writing these new statistics. Are there any benefits in actually using them at reading? Do we plan to implement?
Side note: This is yet another statistics that we gather during writing data. We already have min/max statistics, null counts, bloom filter, and now some additional ones. I think, we should implement a centralized builder that we call once for each value/dl/rl and can generate the statistics we need. We need to implement the gathering part as optimal as it can be. One check less can significantly decrease write performance.
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java
Show resolved
Hide resolved
public void add(int repetitionLevel, int definitionLevel, Binary value) { | ||
add(repetitionLevel, definitionLevel); | ||
if (type.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.BINARY && value != null) { | ||
unencodedByteArrayDataBytes = Math.addExact(unencodedByteArrayDataBytes, value.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think, we shall fear of an overflow while adding an int
to a long
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to keep this check just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, it is good to have checks like this one. But it can significantly hit performance when used in places called regularly. This method is invoked for every values. We shall be as effective as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. I have removed the overflow check.
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java
Outdated
Show resolved
Hide resolved
parquet-column/src/test/java/org/apache/parquet/column/statistics/TestSizeStatistics.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
Thanks for the feedback! I've addressed all the comments and added a new internal |
void write(int value, int repetitionLevel, int definitionLevel) { | ||
statistics.updateStats(value); | ||
sizeStatisticsBuilder.add(repetitionLevel, definitionLevel); | ||
if (bloomFilter != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using a no-op BloomFilter implementation instead of a null-check? I am not sure if would perform better, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Could you please review it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @wgtmac!
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation