PARQUET-1214: Column indexes: Truncate min/max values #481

gszadovszky · 2018-05-17T11:42:59Z

No description provided.

zivanfi · 2018-06-27T12:52:38Z

...uet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java

+    }
+
+    // Trying to increment the bytes from the last one to the beginning
+    private byte[] increment(byte[] array) {


It seems to me that this only increments bytes if they do not overflow. I think they should be incremented even if they overflow.

An example for the difference between the two approaches:

| Description | Value |
| --- | --- |
| Input | 00 42 FF FF |
| Incrementing bytes even if they overflow | 00 43 00 00 |
| Only incrementing bytes that do not overflow | 00 43 FF FF |

Even though 00 43 FF FF is a valid max value as well, 00 43 00 00 is closer to the original value, so it results in better filtering.
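To make the difference between the two approaches concrete, here is a minimal sketch. It is not the code from `BinaryTruncator`; the class and method names are hypothetical, and it assumes the array is treated as a big-endian unsigned number.

```java
final class CarryIncrementSketch {
  // Hypothetical helper: adds one to the array in place, carrying over 0xFF bytes.
  // Returns false if every byte was 0xFF (the array is then all zeros and
  // the value cannot be incremented without growing it).
  static boolean incrementWithCarry(byte[] array) {
    for (int i = array.length - 1; i >= 0; --i) {
      if (++array[i] != 0) {
        return true; // no overflow at this position, so the carry stops here
      }
      // array[i] wrapped from 0xFF to 0x00; carry into the byte to the left
    }
    return false;
  }

  // Hypothetical helper: increments the last byte that is not 0xFF and
  // leaves the overflowing bytes to its right untouched.
  static boolean incrementSkippingOverflow(byte[] array) {
    for (int i = array.length - 1; i >= 0; --i) {
      if (array[i] != (byte) 0xFF) {
        ++array[i];
        return true;
      }
    }
    return false;
  }

  static String hex(byte[] array) {
    StringBuilder sb = new StringBuilder();
    for (byte b : array) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] withCarry = {0x00, 0x42, (byte) 0xFF, (byte) 0xFF};
    byte[] skipping = withCarry.clone();
    incrementWithCarry(withCarry);
    incrementSkippingOverflow(skipping);
    System.out.println(hex(withCarry)); // 00 43 00 00
    System.out.println(hex(skipping));  // 00 43 FF FF
  }
}
```

Both results are upper bounds for the input, but the carried one is the smallest value of the same length that is greater than the input.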

zivanfi · 2018-06-27T13:00:28Z

...uet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java

+        byte prev = array[i];
+        byte inc = prev;
+        while (++inc != 0) { // Until overflow: 0xFF -> 0x00
+          array[i] = inc;


This one seems to increment bytes up to FF, but then not write the overflow back to the array. I.e., if I understand correctly, the array may go through the following changes:

42 FB
42 FC
42 FD
42 FE
42 FF
43 FF
...

I think the last one should be 43 00

zivanfi · 2018-06-27T13:04:23Z

...column/src/test/java/org/apache/parquet/internal/column/columnindex/TestBinaryTruncator.java

+/**
+ * Tests for {@link BinaryTruncator}
+ */
+public class TestBinaryTruncator {


I think it would be worth having some test cases with specific expected values as well, instead of just checking that the actual result satisfies the contract.

zivanfi · 2018-06-27T13:05:45Z

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java

  public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
  public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
  public static final String ESTIMATE_PAGE_SIZE_CHECK = "parquet.page.size.check.estimate";
+  public static final String CULOMN_INDEX_TRUNCATE_LENGTH = "parquet.columnindex.truncate.length";


Typo: CULOMN should be COLUMN.

zivanfi · 2018-07-03T17:46:51Z

...column/src/test/java/org/apache/parquet/internal/column/columnindex/TestBinaryTruncator.java

+
+    // Truncate invalid UTF-8 values -> truncate without validity check
+    assertEquals(binary(0xFF, 0xFE, 0xFD), truncator.truncateMin(binary(0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA), 3));
+    assertEquals(binary(0xFF, 0xFE, 0xFE), truncator.truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA), 3));


Could you also add a test where there is a carry over some bytes? For example, truncateMax(binary(0xFF, 0xFE, 0xFD, 0xFF, 0xFF, 0xFF), 5) should become binary(0xFF, 0xFE, 0xFE) or binary(0xFF, 0xFE, 0xFE, 0x00, 0x00).
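Both suggested results are valid upper bounds, which can be checked with a small comparison sketch. This assumes values are ordered by unsigned lexicographic byte comparison; the class and helper names are hypothetical and are not part of the PR.

```java
final class UpperBoundCheckSketch {
  // Compares two byte arrays lexicographically, treating each byte as unsigned.
  // A proper prefix sorts before the longer array.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; ++i) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  // Convenience helper so that values above 0x7F need no casts
  static byte[] bytes(int... values) {
    byte[] result = new byte[values.length];
    for (int i = 0; i < values.length; ++i) {
      result[i] = (byte) values[i];
    }
    return result;
  }

  public static void main(String[] args) {
    byte[] original = bytes(0xFF, 0xFE, 0xFD, 0xFF, 0xFF, 0xFF);
    byte[] shortMax = bytes(0xFF, 0xFE, 0xFE);
    byte[] paddedMax = bytes(0xFF, 0xFE, 0xFE, 0x00, 0x00);
    // Both candidates are greater than the original, so both are valid max values
    System.out.println(compareUnsigned(shortMax, original) > 0);
    System.out.println(compareUnsigned(paddedMax, original) > 0);
    // The shorter candidate sorts first, so it is the tighter bound of the two
    System.out.println(compareUnsigned(shortMax, paddedMax) < 0);
  }
}
```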

zivanfi · 2018-07-03T17:49:04Z

...column/src/test/java/org/apache/parquet/internal/column/columnindex/TestBinaryTruncator.java

+                + UTF8_3BYTES_MAX_CHAR
+                + UTF8_4BYTES_MAX_CHAR),
+            5));
+


Could you also add a test where there is a carry over some bytes? For example, "abc" followed by a few max chars and truncated to 4 or more bytes should become "abd" potentially followed by a few \0-s.

gszadovszky (Jul 4, 2018): I just realized after modifying the code that in the case of UTF-8 it is not easy to put \0 at the end for the carry. I'll restore the original algorithm, which writes back the original bytes if the increment fails, so the trailing maximum UTF-8 characters will be left unchanged.
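The behavior described in this reply can be illustrated at the code point level. This is only a sketch under assumptions: the actual implementation works on bytes, whereas the hypothetical `incrementKeepingLength` below works on code points and only accepts an increment that keeps the UTF-8 encoded length unchanged. Characters that cannot be incremented this way are left as they are.

```java
final class Utf8IncrementSketch {
  // Number of bytes the UTF-8 encoding of a code point takes
  static int utf8Length(int codePoint) {
    if (codePoint < 0x80) return 1;
    if (codePoint < 0x800) return 2;
    if (codePoint < 0x10000) return 3;
    return 4;
  }

  // Hypothetical helper: returns a string that is greater than the input and has
  // the same UTF-8 byte length, or null if no code point can be incremented.
  // Assumes the input contains no unpaired surrogates.
  static String incrementKeepingLength(String s) {
    int[] cps = s.codePoints().toArray();
    for (int i = cps.length - 1; i >= 0; --i) {
      int next = cps[i] + 1;
      if (next == 0xD800) {
        next = 0xE000; // skip the surrogate range, which UTF-8 cannot encode
      }
      if (next <= Character.MAX_CODE_POINT && utf8Length(next) == utf8Length(cps[i])) {
        cps[i] = next;
        return new String(cps, 0, cps.length);
      }
      // cps[i] is the largest code point of its encoded length:
      // leave it unchanged and try the previous one
    }
    return null;
  }

  public static void main(String[] args) {
    String max3 = new String(Character.toChars(0xFFFF));   // largest 3-byte character
    String max4 = new String(Character.toChars(0x10FFFF)); // largest 4-byte character
    String result = incrementKeepingLength("abc" + max3 + max4);
    // The trailing max characters are kept and "abc" becomes "abd"
    System.out.println(result.equals("abd" + max3 + max4));
  }
}
```

Because UTF-8 byte order matches code point order, incrementing one code point and keeping everything else yields a greater value under byte-wise comparison as well.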

zivanfi · 2018-07-03T17:52:58Z

...column/src/test/java/org/apache/parquet/internal/column/columnindex/TestBinaryTruncator.java

+
+    // Truncate 1-2 bytes characters
+    assertEquals(Binary.fromString("árvízt"), truncator.truncateMin(Binary.fromString("árvíztűrő"), 8));
+    assertEquals(Binary.fromString("árvízu"), truncator.truncateMax(Binary.fromString("árvíztűrő"), 8));


Could you also add a test case where the truncation cutoff falls inside a single Unicode code point? For example, byte position 9 of "árvíztűrő" ends up inside the bytes representing "ű" (c5 b1), so you could add a test case that truncates that to a length of 9 bytes.

gszadovszky (Jul 4, 2018): That was the original idea; I just miscalculated the length.
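The cutoff case discussed above, where the byte limit lands inside a multi-byte character, can be sketched as follows. This is an illustration rather than the PR's code: `truncateAtCodePointBoundary` is a hypothetical name, and the sketch assumes valid UTF-8 input and a non-negative length.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8TruncateSketch {
  // Hypothetical helper: keeps at most maxLength bytes without splitting a code point
  static byte[] truncateAtCodePointBoundary(byte[] utf8, int maxLength) {
    if (utf8.length <= maxLength) {
      return utf8;
    }
    int end = maxLength;
    // Continuation bytes have the bit pattern 10xxxxxx. If the first dropped byte
    // is one of them, the cut is inside a code point, so step back to its start.
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
      --end;
    }
    return Arrays.copyOf(utf8, end);
  }

  public static void main(String[] args) {
    // "árvíztűrő" written with escapes so the source file encoding does not matter
    byte[] input = "\u00E1rv\u00EDzt\u0171r\u0151".getBytes(StandardCharsets.UTF_8);
    byte[] truncated = truncateAtCodePointBoundary(input, 9);
    // Byte 9 is the second byte of "ű", so the whole character is dropped
    System.out.println(new String(truncated, StandardCharsets.UTF_8));
  }
}
```

With a limit of 8 the cut already falls on a character boundary, which is why the original test with length 8 did not exercise this path.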

This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch.

* PARQUET-1211: Column indexes: read/write API (#456)
* PARQUET-1212: Column indexes: Show indexes in tools (#479)
* PARQUET-1213: Column indexes: Limit index size (#480)
* PARQUET-1214: Column indexes: Truncate min/max values (#481)
* PARQUET-1364: Invalid row indexes for pages starting with nulls (#507)
* PARQUET-1310: Column indexes: Filtering (#509)
* PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515)
* PARQUET-1389: Improve value skipping at page synchronization (#514)
* PARQUET-1381: Fix missing endRecord after merging columnIndex

Gabor Szadovszky and others added 3 commits May 29, 2018 17:36

PARQUET-1214: Column indexes: Truncate min/max values

67237dc

PARQUET-1214: Introduce property for truncate length

3bf0ce9

PARQUET-1214: Fix rebase conflicts

a8630a4

gszadovszky force-pushed the PARQUET-1214 branch from 77aa42f to a8630a4 Compare May 29, 2018 15:59

zivanfi requested changes Jun 27, 2018

PARQUET-1214: Updates according to review comments

347d808

zivanfi requested changes Jul 3, 2018

gszadovszky added 2 commits July 4, 2018 12:00

PARQUET-1214: Review comments vol.2

1cf1451

PARQUET-1214: Remove degug logging

854f57b

zivanfi approved these changes Jul 4, 2018

gszadovszky merged commit dc645db into apache:column-indexes Jul 9, 2018

mapleFU mentioned this pull request Jun 17, 2023

[C++][Parquet] Allow Truncate min-max Statistics apache/arrow#36139

Open

asfimport mentioned this pull request Jun 23, 2024

Column indexes: Truncate min/max values #1508

Closed
