PARQUET-1310: Column indexes: Filtering #509

gszadovszky · 2018-08-03T05:58:22Z

No description provided.

zivanfi

Thanks for this great pull request.

The logic looks fine in general, though I have a lot of suggestions regarding names. Sorry about these nitpicks, but I think misleading names make code much harder to understand than it needs to be.

As naming choices are subject to personal taste, feel free to disregard my suggestions that you don't agree with.

zivanfi · 2018-08-03T16:23:03Z

parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java

+      return this;
+    }
+
+    public Builder useColumnIndexFilter() {


(nit) I would remove this convenience method as it is not only superfluous but also unnecessary on the "convenience" path, since true is already the default. With this method we have 3 ways of setting the value to true: not doing anything, calling useColumnIndexFilter(true) and calling useColumnIndexFilter().

I've followed the pattern of the other options (e.g. useRecordFilter(boolean) and useRecordFilter() etc.). I think it is better to be consistent.

zivanfi · 2018-08-03T16:26:33Z

parquet-column/src/main/java/org/apache/parquet/column/page/DataPage.java

+  /**
+   * @return the index of the first row in this page
+   * @throws IllegalStateException
+   *           if no row synchronization is required


(nit) I would use a different wording in the comment and in the exception as well, for example:

row synchronization not supported

row synchronization not available

row synchronization not possible

row synchronization [mode] not enabled

row synchronization [mode] not active

Could you also give a few hints about when this happens or what this means?

zivanfi · 2018-08-03T16:27:46Z

parquet-column/src/main/java/org/apache/parquet/column/page/DataPage.java

+   * @see PageReadStore#isRowSynchronizationRequired()
+   */
+  public long getFirstRowIndex() {
+    if (firstRowIndex < 0) {


Should there be a way to query this state without relying on an exception being thrown?
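A minimal sketch of what such a query could look like, assuming a simplified stand-in for DataPage. The class `PageSketch` and the accessor `firstRowIndexIfPresent()` are hypothetical names used for illustration only and are not part of the PR:

```java
import java.util.OptionalLong;

// Hypothetical, simplified stand-in for DataPage; only the first-row-index state is modeled.
class PageSketch {
  private final long firstRowIndex; // a negative value means "not available"

  PageSketch(long firstRowIndex) {
    this.firstRowIndex = firstRowIndex;
  }

  // Hypothetical accessor: callers can test for the value instead of catching an exception.
  OptionalLong firstRowIndexIfPresent() {
    return firstRowIndex < 0 ? OptionalLong.empty() : OptionalLong.of(firstRowIndex);
  }

  // Mirrors the behaviour quoted in the diff: throws when the index is not available.
  long getFirstRowIndex() {
    if (firstRowIndex < 0) {
      throw new IllegalStateException("First row index is not available");
    }
    return firstRowIndex;
  }
}
```

With an accessor like this, the throwing getter can remain for callers that know the index must exist, while other callers branch on `isPresent()`.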

zivanfi · 2018-08-03T16:34:28Z

parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java

+   * @return {@code true} if row synchronization is required; {@code false} otherwise
+   * @see DataPage#getFirstRowIndex()
+   */
+  default boolean isRowSynchronizationRequired() {


(nit) Enabled or Active may be a better word than Required.

zivanfi · 2018-08-03T16:36:47Z

...-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java

+  private final List<ColumnPath> columns;
+  private final long rowCount;
+
+  public static RowRanges calculateRowRanges(FilterCompat.Filter filter, ColumnIndexStore columnIndexStore,


Could you describe what this function does?

zivanfi · 2018-08-07T14:20:49Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

+      } else if (right.to + 1 >= left.from) {
+        return new Range(right.from, Math.max(left.to, right.to));
+      }
+      return null;


The union of non-empty ranges cannot be empty, so I guess this null stands for a non-contiguous range that cannot be represented with a single Range. Shouldn't this method return a List<Range> instead and support the case when the input ranges do not intersect?

UPDATE: It seems that this case is handled in the caller instead, so returning null here is fine, although I would document this behaviour to help developers understand the intent.
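To illustrate the behaviour being discussed, here is a hedged sketch of a documented union over inclusive ranges. `RangeSketch` is a hypothetical stand-in for the private Range class; the `else if` branch follows the quoted diff, while the first branch is an assumption, since it is not visible in the hunk:

```java
// Hypothetical stand-in for the private RowRanges.Range class under review.
final class RangeSketch {
  final long from; // inclusive
  final long to;   // inclusive

  RangeSketch(long from, long to) {
    if (from > to) {
      throw new IllegalArgumentException("empty range: [" + from + ", " + to + "]");
    }
    this.from = from;
    this.to = to;
  }

  /**
   * Returns the single range covering both inputs, or null when they neither
   * overlap nor touch, because such a union cannot be expressed as one range.
   * The result does not depend on the order of the arguments.
   */
  static RangeSketch union(RangeSketch left, RangeSketch right) {
    if (left.from <= right.from) {
      // Assumed branch (not shown in the quoted diff): left starts first.
      if (left.to + 1 >= right.from) {
        return new RangeSketch(left.from, Math.max(left.to, right.to));
      }
    } else if (right.to + 1 >= left.from) {
      return new RangeSketch(right.from, Math.max(left.to, right.to));
    }
    return null;
  }
}
```

In this sketch, adjacent ranges such as [1, 3] and [4, 6] merge into [1, 6], whereas [1, 3] and [8, 9] yield null in either argument order, leaving the caller to keep them as separate entries.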

zivanfi · 2018-08-07T14:41:47Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

 public class RowRanges {
  private static class Range implements Comparable<Range> {
+    private static Range union(Range left, Range right) {
+      if (left.from <= right.from) {


I had trouble understanding this logic, but before putting more effort into it, I wanted to raise a different concern. I may be wrong, but it seems to me that this union method depends on the order of its parameters, and even worse, the assumption that left is "to the left" or right is not explicitly called out. In the case of a union, the names left and right can be easily interpreted to refer to the left and right parameters of the union function and not to their position on the number line.

left and right do not mean anything special here. Might not be the best naming. Do you think range1 and range2 would be better?
(The implementation does not rely on the order of the parameters.)

No, no, left and right are fine in this case. It just seemed to me that their order matters. My bad.

zivanfi · 2018-08-07T14:44:19Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

+    }
+
+    private static Range intersection(Range left, Range right) {
+      if (left.from <= right.from) {


This method seems to be fine on the other hand, so I may have just misinterpreted the logic of union.

zivanfi · 2018-08-07T14:58:26Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

+  private RowRanges() {
+  }
+
+  private void add(Range range) {


I would rename this to addRangeAtEnd to make the condition of the assert more explicit.

To me, addRangeAtEnd is no more descriptive than the original name. This is a private method; I will add some comments describing how it works.

zivanfi · 2018-08-07T15:10:34Z

parquet-column/src/main/java/org/apache/parquet/column/impl/SynchronizingColumnReader.java

+class SynchronizingColumnReader extends ColumnReaderBase {
+
+  private final PrimitiveIterator.OfLong rowIndexes;
+  private long actualRow;


s/actual/current

zivanfi · 2018-08-10T11:13:21Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

 public class RowRanges {
  private static class Range implements Comparable<Range> {
+    private static Range union(Range left, Range right) {
+      if (left.from <= right.from) {


No, no, left and right are fine in this case. It just seemed to me that their order matters. My bad.

zivanfi · 2018-08-10T11:42:07Z

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderImpl.java


 /**
- * ColumnReader implementation
+ * ColumnReader implementation for the simple scenario (all values are read)


I think "not using indexes" would be more descriptive than "for the simple scenario"

zivanfi · 2018-08-10T12:21:00Z

parquet-column/src/main/java/org/apache/parquet/column/impl/SynchronizingColumnReader.java

+import org.apache.parquet.io.api.PrimitiveConverter;
+
+/**
+ * A {@link ColumnReader} implementation that synchronize the values for skipped pages.


Suggested javadoc:

A {@link ColumnReader} implementation for utilizing indexes. When filtering using indexes, some of the rows may be loaded only partially, because rows are not synchronized across columns, thus pages containing other fields of a row may have been filtered out. In this case we can't assemble the row, but there is no need to do so either, since getting filtered out in another column means that it cannot match the filter condition.

A RecordReader assembles rows by reading from each ColumnReader. Without filtering, when RecordReader starts reading a row, the ColumnReaders are always positioned at the same row with respect to each other. With filtering, however, due to the misalignment described above, some of the pages read by ColumnReaders may start or end with values that have no corresponding values in other columns. This SynchronizingColumnReader is a column reader implementation that skips such values so that the values returned to RecordReader for the different fields all correspond to a single row.
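The skipping described in this suggested javadoc can be sketched roughly as follows. This is a deliberately simplified model, not the PR's implementation: it assumes a flat column with exactly one value per row, and `SyncSketch`, `Page`, and `readSynchronized` are hypothetical names. Only the idea of driving the reader with a `PrimitiveIterator.OfLong` of row indexes is taken from the diff:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;

final class SyncSketch {
  // Hypothetical page: values of consecutive rows starting at firstRowIndex.
  static final class Page {
    final long firstRowIndex;
    final int[] values;

    Page(long firstRowIndex, int... values) {
      this.firstRowIndex = firstRowIndex;
      this.values = values;
    }
  }

  // Keeps only the values whose row index is produced by rowIndexes (ascending order assumed).
  static List<Integer> readSynchronized(List<Page> pages, PrimitiveIterator.OfLong rowIndexes) {
    List<Integer> kept = new ArrayList<>();
    if (!rowIndexes.hasNext()) {
      return kept;
    }
    long targetRow = rowIndexes.nextLong();
    for (Page page : pages) {
      for (int i = 0; i < page.values.length; i++) {
        long currentRow = page.firstRowIndex + i;
        if (currentRow < targetRow) {
          continue; // value of a row that was filtered out via another column: skip it
        }
        if (currentRow > targetRow) {
          throw new IllegalStateException("Row " + targetRow + " is missing from the pages read");
        }
        kept.add(page.values[i]);
        if (!rowIndexes.hasNext()) {
          return kept; // trailing values of the page are skipped as well
        }
        targetRow = rowIndexes.nextLong();
      }
    }
    return kept;
  }
}
```

For example, if only rows 3 to 5 survive filtering and one column was read as two pages covering rows 0 to 3 and 4 to 8, the sketch drops the values of rows 0 to 2 and 6 to 8, so this column lines up with another column whose single page covers exactly rows 3 to 5.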

zivanfi · 2018-08-10T13:32:47Z

parquet-column/src/main/java/org/apache/parquet/column/page/DataPage.java

+  public long getFirstRowIndex() {
+    if (firstRowIndex < 0) {
+      throw new IllegalStateException(
+          "No row synchronization is required; all pages shall be read.");


I see a large number of IllegalStateException-s scattered around the code with very similar texts about row synchronization not being required (but still being a little bit cryptic about what this means). Could you please create a separate Exception class for these so that it's not repeated all over the code with minor differences in wording? This would also allow a more verbose centralized description in the javadoc of the exception.
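One possible shape for such a centralized exception is sketched below. The class name `RowSyncNotAvailableException`, its message, and the `SyncAwarePageSketch` holder are all hypothetical; the exception class eventually added in the PR may be named and worded differently:

```java
/**
 * Hypothetical exception type, illustrative only. Its javadoc would be the single
 * place explaining when row synchronization information is unavailable, e.g. when
 * no pages were skipped and all pages are read.
 */
class RowSyncNotAvailableException extends IllegalStateException {
  private static final long serialVersionUID = 1L;

  RowSyncNotAvailableException(String requested) {
    super("Row synchronization is not available, so " + requested + " cannot be provided");
  }
}

// Hypothetical, simplified page that reuses the shared exception instead of an ad hoc message.
class SyncAwarePageSketch {
  private final long firstRowIndex; // a negative value means "not available"

  SyncAwarePageSketch(long firstRowIndex) {
    this.firstRowIndex = firstRowIndex;
  }

  long getFirstRowIndex() {
    if (firstRowIndex < 0) {
      throw new RowSyncNotAvailableException("the first row index");
    }
    return firstRowIndex;
  }
}
```

Extending IllegalStateException in the sketch keeps existing `catch (IllegalStateException e)` blocks working while giving every call site the same wording.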

zivanfi · 2018-08-10T13:35:14Z

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BoundaryOrder.java

+    }
+
+    @Override
+    PrimitiveIterator.OfInt gt(ColumnIndexBase<?>.ValueComparator comparator) {


Sorry, my bad.

zivanfi · 2018-08-10T13:49:46Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

+    final long from;
+    final long to;
+
+    Range(long from, long to) {


Please document that from and to are both inclusive. Maybe also mention that the range cannot be empty.

zivanfi · 2018-08-10T14:07:32Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

+      } else if (from > other.to) {
+        return 1;
+      } else {
+        // Equality means the two ranges are overlapping


This implementation violates the third rule of the compareTo() contract, because this concept of equality is not transitive. For example, if

A = [20, 40]
B = [30, 60]
C = [50, 70]

then according to your compareTo implementation:

A = B and
B = C, yet
A < C

Fortunately, you don't really need this class to be Comparable, so I would suggest removing that interface and also renaming this method, because the compareTo name may confuse readers of the code. I would suggest using an enum as the return value. The names of the enum values should properly describe their meaning (e.g., BEFORE, AFTER, OVERLAPPING).
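The example above can be reproduced with a small sketch that contrasts the compareTo-style result with the suggested enum. `RangeOrderSketch`, `overlapCompare`, and `positionRelativeTo` are hypothetical names, and the first branch of `overlapCompare` is an assumption, since only the later branches appear in the quoted diff:

```java
// Hypothetical stand-in for the Range class; both bounds are inclusive.
final class RangeOrderSketch {
  enum Position { BEFORE, AFTER, OVERLAPPING }

  final long from;
  final long to;

  RangeOrderSketch(long from, long to) {
    this.from = from;
    this.to = to;
  }

  // compareTo-style result as in the reviewed code: 0 stands for "overlapping".
  int overlapCompare(RangeOrderSketch other) {
    if (to < other.from) {
      return -1; // assumed branch, not visible in the quoted hunk
    } else if (from > other.to) {
      return 1;
    } else {
      return 0;
    }
  }

  // Enum-based alternative: no claim of a total order, so no contract to violate.
  Position positionRelativeTo(RangeOrderSketch other) {
    if (to < other.from) {
      return Position.BEFORE;
    }
    if (from > other.to) {
      return Position.AFTER;
    }
    return Position.OVERLAPPING;
  }
}
```

With A = [20, 40], B = [30, 60] and C = [50, 70], `overlapCompare` reports 0 for (A, B) and for (B, C) but -1 for (A, C), which is the non-transitive "equality" described above. The enum version states the same three facts as OVERLAPPING, OVERLAPPING and BEFORE without implying an ordering.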

zivanfi · 2018-08-16T12:12:40Z

parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java

-        // Equality means the two ranges are overlapping
-        return 0;
-      }
+    boolean isBeforeThan(Range other) {


(nit) These method names sound a bit strange in English. Some alternatives you could consider:

isBefore/isBehind ("than" is not needed and "behind" is better for non-temporal comparison than "after")

precedes/follows

zivanfi

Please fix the Travis error then feel free to merge the PR as otherwise it looks good. Thanks!

This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch.

* PARQUET-1211: Column indexes: read/write API (#456)
* PARQUET-1212: Column indexes: Show indexes in tools (#479)
* PARQUET-1213: Column indexes: Limit index size (#480)
* PARQUET-1214: Column indexes: Truncate min/max values (#481)
* PARQUET-1364: Invalid row indexes for pages starting with nulls (#507)
* PARQUET-1310: Column indexes: Filtering (#509)
* PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515)
* PARQUET-1389: Improve value skipping at page synchronization (#514)
* PARQUET-1381: Fix missing endRecord after merging columnIndex

gszadovszky added 4 commits August 3, 2018 07:41

PARQUET-1310: New options for column index based filtering

465cca6

PARQUET-1310: Prepare for skipping pages at reading

c00b2c0

PARQUET-1310: Simple implementation of page filtering

3c9fe3e

PARQUET-1310: Glue the whole stuff together; implement proper tests

c471c72

gszadovszky requested a review from zivanfi August 3, 2018 05:58

zivanfi reviewed Aug 7, 2018


PARQUET-1310: Modifications according to zi's comments

c0ce0f3

zivanfi reviewed Aug 10, 2018


gszadovszky added 2 commits August 15, 2018 11:00

PARQUET-1310: Implement binary search for ASCENDING/DESCENDING orders

7d5faea

PARQUET-1310: Modifications according to zi's comments

f9b6ecc

zivanfi reviewed Aug 16, 2018


gszadovszky added 2 commits August 16, 2018 17:06

PARQUET-1310: Update terminology of column index based filtering

e82b9d5

PARQUET-1310: Fix exception naming

325e5f6

zivanfi approved these changes Aug 17, 2018


PARQUET-1310: Fix issue introduced at exception renaming

659c1a0

zivanfi approved these changes Aug 17, 2018


gszadovszky merged commit d8e78eb into apache:column-indexes Aug 17, 2018

asfimport mentioned this pull request Jun 23, 2024

Column indexes: Filtering #2178

Closed
