
Conversation

@gszadovszky
Contributor

No description provided.

}

// Trying to increment the bytes from the last one to the beginning
private byte[] increment(byte[] array) {
Contributor

It seems to me that this only increments bytes if they do not overflow. I think they should be incremented even if they overflow.

An example for the difference between the two approaches:

| Description | Value |
| --- | --- |
| Input | 00 42 FF FF |
| Incrementing bytes even if they overflow | 00 43 00 00 |
| Only incrementing bytes that do not overflow | 00 43 FF FF |

Even though 00 43 FF FF is a valid max value as well, 00 43 00 00 is closer to the original value and thus results in better filtering.

byte prev = array[i];
byte inc = prev;
while (++inc != 0) { // Until overflow: 0xFF -> 0x00
array[i] = inc;
Contributor

This one seems to increment a byte up to FF, but then does not write the overflow back to the array. I.e., if I understand correctly, the array may go through the following changes:

42 FB
42 FC
42 FD
42 FE
42 FF
43 FF
...

I think the last one should be 43 00
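
A minimal sketch of the increment-with-carry behaviour suggested in these two comments (an illustration, not the code under review; the method name is made up):

```java
// Propagate the carry from the last byte towards the first, writing the
// wrapped 0x00 back, so e.g. 42 FF becomes 43 00 and 00 42 FF FF becomes
// 00 43 00 00.
private static boolean incrementWithCarry(byte[] array) {
  for (int i = array.length - 1; i >= 0; --i) {
    if (array[i] != (byte) 0xFF) {
      ++array[i];   // no overflow: increment and stop
      return true;
    }
    array[i] = 0;   // overflow: write 0x00 back and carry to the previous byte
  }
  return false;     // every byte was 0xFF; the caller has to handle this case
}
```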

/**
* Tests for {@link BinaryTruncator}
*/
public class TestBinaryTruncator {
Contributor

I think it would be worth having some test cases with specific expected values as well, instead of just checking that the actual result satisfies the contract.
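
For illustration, a hedged sketch of what such a test could look like, written in the style of this test class and relying on its existing imports and setup; the truncator construction and the expected values are assumptions, not taken from the actual implementation:

```java
@Test
public void testTruncateWithExpectedValues() {
  // assumed factory/builder calls; adjust to whatever this class already uses
  BinaryTruncator truncator = BinaryTruncator.getTruncator(
      Types.required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("test_string"));

  // min is a plain prefix; max is the prefix with its last byte incremented
  assertEquals(Binary.fromString("abc"), truncator.truncateMin(Binary.fromString("abcdef"), 3));
  assertEquals(Binary.fromString("abd"), truncator.truncateMax(Binary.fromString("abcdef"), 3));
}
```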

public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
public static final String ESTIMATE_PAGE_SIZE_CHECK = "parquet.page.size.check.estimate";
public static final String CULOMN_INDEX_TRUNCATE_LENGTH = "parquet.columnindex.truncate.length";
Contributor

Typo: CULOMN should be COLUMN.


// Truncate invalid UTF-8 values -> truncate without validity check
assertEquals(binary(0xFF, 0xFE, 0xFD), truncator.truncateMin(binary(0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA), 3));
assertEquals(binary(0xFF, 0xFE, 0xFE), truncator.truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA), 3));
Contributor

Could you also add a test where the carry propagates over several bytes? For example, truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFF, 0xFF, 0xFF), 5) should become binary(0xFF, 0xFE, 0xFE) or binary(0xFF, 0xFE, 0xFE, 0x00, 0x00).
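
In code form, assuming the binary(...) helper used elsewhere in this test class and assuming the zeroed tail is dropped (the FF FE FE 00 00 form would be equally acceptable):

```java
// hedged sketch of the suggested carry-over assertion
assertEquals(binary(0xFF, 0xFE, 0xFE),
    truncator.truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFF, 0xFF, 0xFF), 5));
```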

+ UTF8_3BYTES_MAX_CHAR
+ UTF8_4BYTES_MAX_CHAR),
5));

Contributor

Could you also add a test where the carry propagates over several bytes? For example, "abc" followed by a few max chars and truncated to 4 or more bytes should become "abd", potentially followed by a few \0 characters.

Contributor Author

Just realized after modifying the code that, in the case of UTF-8, it is not easy to put \0 at the end to handle the carry. I'll put back the original algorithm that writes back the original bytes in case the incrementation fails, so the last maximum UTF-8 characters will be left unchanged.
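
A minimal sketch of that fallback with made-up names (the real code additionally has to keep the result valid UTF-8, which is ignored here):

```java
// Walk backwards; bytes that would overflow are left untouched, and if no
// byte can be incremented at all, fall back to the original (untruncated) value.
private static byte[] incrementOrFallback(byte[] original, byte[] truncated) {
  for (int i = truncated.length - 1; i >= 0; --i) {
    if (truncated[i] != (byte) 0xFF) {
      ++truncated[i];
      return truncated;
    }
    // overflow: keep the original byte unchanged and try the previous one
  }
  return original; // incrementation failed for every byte
}
```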


// Truncate 1-2 bytes characters
assertEquals(Binary.fromString("árvízt"), truncator.truncateMin(Binary.fromString("árvíztűrő"), 8));
assertEquals(Binary.fromString("árvízu"), truncator.truncateMax(Binary.fromString("árvíztűrő"), 8));
Contributor

Could you also add a test case where the truncation cutoff happens to fall inside a single Unicode codepoint? For example, byte position 9 of "árvíztűrő" ends up inside the bytes representing "ű" (c5 b1), so you could add a test case that truncates to a length of 9 bytes.
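
A hedged sketch of such a test; the expected values assume the truncator backs off to the last complete codepoint instead of cutting "ű" in half, and are illustrative rather than taken from the implementation:

```java
// "árvízt" is 8 bytes in UTF-8, so a 9-byte cutoff would land inside "ű" (c5 b1)
assertEquals(Binary.fromString("árvízt"), truncator.truncateMin(Binary.fromString("árvíztűrő"), 9));
assertEquals(Binary.fromString("árvízu"), truncator.truncateMax(Binary.fromString("árvíztűrő"), 9));
```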

Contributor Author

That was the original idea; I just miscalculated the length...

@gszadovszky gszadovszky merged commit dc645db into apache:column-indexes Jul 9, 2018
zivanfi pushed a commit that referenced this pull request Oct 18, 2018
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch.

* PARQUET-1211: Column indexes: read/write API (#456)
* PARQUET-1212: Column indexes: Show indexes in tools (#479)
* PARQUET-1213: Column indexes: Limit index size (#480)
* PARQUET-1214: Column indexes: Truncate min/max values (#481)
* PARQUET-1364: Invalid row indexes for pages starting with nulls (#507)
* PARQUET-1310: Column indexes: Filtering (#509)
* PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515)
* PARQUET-1389: Improve value skipping at page synchronization (#514)
* PARQUET-1381: Fix missing endRecord after merging columnIndex