-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1211: Column indexes: read/write API #456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Please note that the current |
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change "Parquet 1211" to "PARQUET-1211" in the PR description. I may have more useful suggestions as well once I read through the code. :)
| public abstract class ColumnIndexBuilder { | ||
|
|
||
| static abstract class ColumnIndexBase implements ColumnIndex { | ||
| private static ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| null, | ||
| stringBinary("Slartibartfast"))); | ||
| assertEquals(BoundaryOrder.ASCENDING, columnIndex.getBoundaryOrder()); | ||
| assertCorrectNullCounts(columnIndex, 1, 2, 3, 4, 5, 6, 7, 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check is passed? why not "assertCorrectNullCounts(columnIndex, 11, 21, 31, 41, 51, 61, 71, 81)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the List<Long> is created by using Arrays.asList I have to use long values otherwise a List<Integer> would be created. 1l, 2l etc. are the long literals in java. Others might use capital L as such 1L, 2L etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhh, I got it wrong. It's letter "l", not number "1".
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only got to OffsetIndexBuilder.java, will continue reviewing from that point later. Only minor nits so far.
| static abstract class ColumnIndexBase implements ColumnIndex { | ||
| private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0); | ||
| private static final int MAX_VALUE_LENGTH_FOR_TOSTRING = 40; | ||
| private static final String INNER_ETC = "(...)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I would suggest naming this TOSTRING_TRUNCATION_MARKER.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0); | ||
| private static final int MAX_VALUE_LENGTH_FOR_TOSTRING = 40; | ||
| private static final String INNER_ETC = "(...)"; | ||
| private static final int FIRST_LENGTH = (MAX_VALUE_LENGTH_FOR_TOSTRING - INNER_ETC.length()) / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I would suggest naming this TOSTRING_TRUNCATION_START_POS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| private static final int MAX_VALUE_LENGTH_FOR_TOSTRING = 40; | ||
| private static final String INNER_ETC = "(...)"; | ||
| private static final int FIRST_LENGTH = (MAX_VALUE_LENGTH_FOR_TOSTRING - INNER_ETC.length()) / 2; | ||
| private static final int LAST_LENGTH = MAX_VALUE_LENGTH_FOR_TOSTRING - INNER_ETC.length() - FIRST_LENGTH; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I would suggest naming this TOSTRING_TRUNCATION_END_POS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| public String toString() { | ||
| try (Formatter formatter = new Formatter()) { | ||
| formatter.format("Boudary order: %s\n", boundaryOrder); | ||
| String minMaxPart = " %-" + MAX_VALUE_LENGTH_FOR_TOSTRING + "s %-" + MAX_VALUE_LENGTH_FOR_TOSTRING + "s\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks somewhat scary, but I don't know better either (only in C/C++, where printf supports specifying the desired lengths in parameters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know a better way either. 😞
| String nullCount = nullCounts == null ? "--" : Long.toString(nullCounts[i]); | ||
| String min, max; | ||
| if (nullPages[i]) { | ||
| min = max = "--"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Replace "--"-s with a constant named TOSTRING_MISSING_VALUE_MARKER or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| * the statistics to be added | ||
| */ | ||
| public void add(Statistics<?> stats) { | ||
| if (stats.hasNonNullValue()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this handle that case correctly when we don't have min/max values in spite of non-null values being present? (For int96-s, for example.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we collecting statistics for all the types (even if not supported) only we don't write them to the file. ColumnIndex is working similarly.
| List<ByteBuffer> maxValues) { | ||
| clear(); | ||
| int requiredSize = nullPages.size(); | ||
| if ((nullCounts != null && nullCounts.size() != requiredSize) || minValues.size() != requiredSize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is nullCounts checked for being null but minValues and maxValues being used without a similar check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see now, only nullCounts is optional.
| } | ||
|
|
||
| /** | ||
| * Builds the column index. It also resets all the collected data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether the build method resetting the builder is something that API consumers would expect based on its name. Would it make sense to separate the building and the resetting?
| } | ||
| } | ||
|
|
||
| // min_i <= min_i+1 && max_i <= max_i+1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) It is a bit hard to parse at first whether the +1 is in the index. It would be better as:
// min[i] <= min[i+1] && max[i] <= max[i+1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| * @param firstRowIndex | ||
| * the index of the first row in the page (within the row group) | ||
| */ | ||
| public void add(long offset, int compressedPageSize, long firstRowIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this method public? It seems to me that if one circumvents the other add method that has just 2 parameters, previousRowCount will not be updated thus this add() should only be called by the other add(). Do I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two argument add method is used by the writers as that point we don't have the real file offsets. That's why we have the build(long) to shift the related values at build time.
The three argument add method is used by the metadata converter when reading the Parquet file. In this case the build() method is used as already the correct offsets are read and no shifting is required.
I'll add some comments to make it more clear.
| w.start(); | ||
| w.startBlock(4); | ||
| w.startColumn(C1, 7, CODEC); | ||
| w.writeDataPage(7, 4, BytesInput.from(BYTES3), STATS1, BIT_PACKED, BIT_PACKED, PLAIN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STATS1? no definition?
BinaryStatistics STATS1 = new BinaryStatistics();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess, you've checked my change by rebasing to the actual master where STATS1 and STATS2 are removed and exchanged to EMPTY_STATS. In the current change STATS1 and STATS2 still exist so it is correct as is.
I did not rebase this change yet because it would cause loosing the pointers of the review findings. I'll do the rebase as soon as we'll have the required parquet-format release and this change can be finalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I build the branch which merging your patch, and found the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for drawing my attention to this one. I'll keep in mind when I'm rebasing.
| w.writeDataPage(7, 4, BytesInput.from(BYTES3), STATS1, BIT_PACKED, BIT_PACKED, PLAIN); | ||
| w.endColumn(); | ||
| w.startColumn(C2, 8, CODEC); | ||
| w.writeDataPage(8, 4, BytesInput.from(BYTES4), STATS2, BIT_PACKED, BIT_PACKED, PLAIN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STATS2? no definition?
BinaryStatistics STATS2 = new BinaryStatistics();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See STATS1.
6ee8e3b to
7139654
Compare
| stringBinary("Slartibartfast"), | ||
| null, | ||
| null, | ||
| stringBinary("Perfect"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a serious typo here: The THHGTTG character is called Prefect, not Perfect. :)
| 3000, 3000, 29, | ||
| 6000, 1200, 56); | ||
|
|
||
| builder = OffsetIndexBuilder.getBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be mistaken, but recreating the builder with the same values seems unnecessary to me.
| builder.add(5, 6); | ||
| builder.add(7, 8); | ||
| assertNull(builder.build()); | ||
| builder.add(1, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these repeated adds intentional?
| 48000, 22000, 211, | ||
| 90000, 30000, 361); | ||
|
|
||
| builder = OffsetIndexBuilder.getBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like before, this seems unnecessary.
| builder.add(7, 8, 9); | ||
| builder.add(10, 11, 12); | ||
| assertNull(builder.build()); | ||
| builder.add(1, 2, 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like unintentional duplication.
| /** | ||
| * @param column | ||
| * the column chunk which the column index is to be returned for | ||
| * @return the column index for the specified column chunk or {@code null} if the there is no index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) s/the there/there/ (in this line and in another line below as well)
|
|
||
| assertNull("Should handle null column index", ParquetMetadataConverter | ||
| .toParquetColumnIndex(Types.required(PrimitiveTypeName.INT32).named("test_int32"), null)); | ||
| assertNull("Should handle unsupported types", ParquetMetadataConverter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should ignore unsupported types.
|
|
||
| public class TestColumnChunkPageWriteStore { | ||
|
|
||
| // OutputFile implementation to reach out the PositionOutputStream internally used by the writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/reach out/expose/
| * @deprecated will be removed in 2.0.0. This method does not support writing column indexes; Use | ||
| * {@link #writePage(BytesInput, int, int, Statistics, Encoding, Encoding, Encoding)} instead | ||
| */ | ||
| @Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this requires deprecation. It's part of the column API that is internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, it is not documented anywhere and we do know that at least two different projects use it directly (Spark and Hive for vectorization).
I think, it is better to be on the safe side and handle every java-public API members as Public so we won't break any API consumer.
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch. * PARQUET-1211: Column indexes: read/write API (#456) * PARQUET-1212: Column indexes: Show indexes in tools (#479) * PARQUET-1213: Column indexes: Limit index size (#480) * PARQUET-1214: Column indexes: Truncate min/max values (#481) * PARQUET-1364: Invalid row indexes for pages starting with nulls (#507) * PARQUET-1310: Column indexes: Filtering (#509) * PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515) * PARQUET-1389: Improve value skipping at page synchronization (#514) * PARQUET-1381: Fix missing endRecord after merging columnIndex
| } | ||
|
|
||
| @Override | ||
| int compareMinValues(PrimitiveComparator<Binary> comparator, int index1, int index2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gszadovszky, why does the type of the PrimitiveComparator use Binary for all types? Or just a mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ConeyLiu, PrimitiveComparator<Binary> covers all potential use case since the only non-java-primitive value type in parquet-mr is Binary and the related methods for these primitive types are part of the PrimitiveComparator interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gszadovszky Thanks for the response. It is a little stranger for IntColumnIndexBuilder uses a comparator typed PrimitiveComparator<Binary>.
https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IntColumnIndexBuilder.java#L123
No description provided.