-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1214: Column indexes: Truncate min/max values #481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| // Trying to increment the bytes from the last one to the beginning | ||
| private byte[] increment(byte[] array) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that this only increments bytes if they do not overflow. I think they should be incremented even if they overflow.
An example for the difference between the two approaches:
| Description | Value |
|---|---|
| Input | 00 42 FF FF |
| Only incrementing bytes even if they overflow | 00 43 00 00 |
| Only incrementing bytes that do not overflow | 00 43 FF FF |
Even though 00 43 FF FF is a valid max value as well, 00 43 00 00 is closer to the original value thus it results in better filtering.
| byte prev = array[i]; | ||
| byte inc = prev; | ||
| while (++inc != 0) { // Until overflow: 0xFF -> 0x00 | ||
| array[i] = inc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one seems to increment bytes up to FF, but then not write the overflow back to the array. I.e., if I understand correctly, the array may go through the following changes:
42 FB
42 FC
42 FD
42 FE
42 FF
43 FF
...
I think the last one should be 43 00
| /** | ||
| * Tests for {@link BinaryTruncator} | ||
| */ | ||
| public class TestBinaryTruncator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be worth to have some test cases with specific expected values as well instead of just checking that the actual result satisfies the contract.
| public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min"; | ||
| public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max"; | ||
| public static final String ESTIMATE_PAGE_SIZE_CHECK = "parquet.page.size.check.estimate"; | ||
| public static final String CULOMN_INDEX_TRUNCATE_LENGTH = "parquet.columnindex.truncate.length"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: CULOMN should be COLUMN.
|
|
||
| // Truncate invalid UTF-8 values -> truncate without validity check | ||
| assertEquals(binary(0xFF, 0xFE, 0xFD), truncator.truncateMin(binary(0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA), 3)); | ||
| assertEquals(binary(0xFF, 0xFE, 0xFE), truncator.truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA), 3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test where there is a carry over some bytes? For example, truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFF, 0xFF, 0xFF), 5) should become binary(0xFF, 0xFE, 0xFE) or binary(0xFF, 0xFE, 0xFE, 0x00, 0x00).
| + UTF8_3BYTES_MAX_CHAR | ||
| + UTF8_4BYTES_MAX_CHAR), | ||
| 5)); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test where there is a carry over some bytes? For example, "abc" followed by a few max chars and truncated to 4 or more bytes should become "abd" potentially followed by a few \0-s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized after modifying the code that in case of UTF-8 it is not easy to put \0 to the end for carry on. I'll put back the original algorithm that writes back the original bytes in case the incrementation fails so the last maximum UTF-8 characters will be unchanged.
|
|
||
| // Truncate 1-2 bytes characters | ||
| assertEquals(Binary.fromString("árvízt"), truncator.truncateMin(Binary.fromString("árvíztűrő"), 8)); | ||
| assertEquals(Binary.fromString("árvízu"), truncator.truncateMax(Binary.fromString("árvíztűrő"), 8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test case where the truncation cutoff happens to be inside a single unicode codepoint? For example, byte position 9 of "árvíztűrő" ends up inside the bytes representing "ű" (c5 b1), so you could add a test case that truncates that to a length of 9 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the original idea just miscalculated the length...
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch. * PARQUET-1211: Column indexes: read/write API (#456) * PARQUET-1212: Column indexes: Show indexes in tools (#479) * PARQUET-1213: Column indexes: Limit index size (#480) * PARQUET-1214: Column indexes: Truncate min/max values (#481) * PARQUET-1364: Invalid row indexes for pages starting with nulls (#507) * PARQUET-1310: Column indexes: Filtering (#509) * PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515) * PARQUET-1389: Improve value skipping at page synchronization (#514) * PARQUET-1381: Fix missing endRecord after merging columnIndex
No description provided.