-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1310: Column indexes: Filtering #509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this great pull request.
The logic looks fine in general, I had a lot of suggestions though regarding names. Sorry about these nitpicks, but I think misleading names make it much harder to understand code than it needs to be.
As naming choices are subject to personal taste, feel free to disregard my suggestions that you don't agree with.
| return this; | ||
| } | ||
|
|
||
| public Builder useColumnIndexFilter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I would remove this convenience method as it is not only superfluous but also unnecessary on the "convenience" path, since true is already the default. With this method we have 3 ways of setting the value to true: not doing anything, calling useColumnIndexFilter(true) and calling useColumnIndexFilter().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've followed the pattern of the other options (e.g. useRecordFilter(boolean) and useRecordFilter() etc.). I think, it is better to be consistent.
| /** | ||
| * @return the index of the first row in this page | ||
| * @throws IllegalStateException | ||
| * if no row synchronization is required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I would use a different wording in the comment and in the exception as well, for example:
- row synchronization not supported
- row synchronization not available
- row synchronization not possible
- row synchronization [mode] not enabled
- row synchronization [mode] not active
Could you also give a few hints about when this happens or what this means?
| * @see PageReadStore#isRowSynchronizationRequired() | ||
| */ | ||
| public long getFirstRowIndex() { | ||
| if (firstRowIndex < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a way to query this state without relying on an exception being thrown?
| * @return {@code true} if row synchronization is required; {@code false} otherwise | ||
| * @see DataPage#getFirstRowIndex() | ||
| */ | ||
| default boolean isRowSynchronizationRequired() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Enabled or Active may be a better word than Required.
| private final List<ColumnPath> columns; | ||
| private final long rowCount; | ||
|
|
||
| public static RowRanges calculateRowRanges(FilterCompat.Filter filter, ColumnIndexStore columnIndexStore, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you describe what this function does?
| } else if (right.to + 1 >= left.from) { | ||
| return new Range(right.from, Math.max(left.to, right.to)); | ||
| } | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The union of non-empty ranges can not be empty, so I guess this null stands for a non-continous range that can not be represented with a single Range. Shouldn't this method return a List<Range> instead and support the case when the input ranges do not intersect?
UPDATE: It seems that this case is handled in the caller instead, so returning null here is fine, although I would document this behaviour to help developers understand the intent.
| public class RowRanges { | ||
| private static class Range implements Comparable<Range> { | ||
| private static Range union(Range left, Range right) { | ||
| if (left.from <= right.from) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had trouble understanding this logic, but before putting more effort into it, I wanted to raise a different concern. I may be wrong, but it seems to me that this union method depends on the order of its parameters, and even worse, the assumption that left is "to the left" or right is not explicitly called out. In the case of a union, the names left and right can be easily interpreted to refer to the left and right parameters of the union function and not to their position on the number line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left and right does not mean anything special here. Might not be the best naming. Do you think range1 and range2 would be better?
(The implementation does not rely on the order of the parameters.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, no, left and right are fine in this case. It just seemed to me that their order matters. My bad.
| } | ||
|
|
||
| private static Range intersection(Range left, Range right) { | ||
| if (left.from <= right.from) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems to be fine on the other hand, so I may have just misinterpreted the logic of union.
| private RowRanges() { | ||
| } | ||
|
|
||
| private void add(Range range) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename this to addRangeAtEnd to make the condition of the assert more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me addRangeAtEnd is not more descriptive than the original one. This is a private method. Will add some comments to describe its working.
| class SynchronizingColumnReader extends ColumnReaderBase { | ||
|
|
||
| private final PrimitiveIterator.OfLong rowIndexes; | ||
| private long actualRow; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/actual/current
| public class RowRanges { | ||
| private static class Range implements Comparable<Range> { | ||
| private static Range union(Range left, Range right) { | ||
| if (left.from <= right.from) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, no, left and right are fine in this case. It just seemed to me that their order matters. My bad.
|
|
||
| /** | ||
| * ColumnReader implementation | ||
| * ColumnReader implementation for the simple scenario (all values are read) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "not using indexes" would be more descriptive than "for the simple scenario"
| import org.apache.parquet.io.api.PrimitiveConverter; | ||
|
|
||
| /** | ||
| * A {@link ColumnReader} implementation that synchronize the values for skipped pages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested javadoc:
A {@link ColumnReader} implementation for utilizing indexes. When filtering using indexes, some of the rows may be loaded only partially, because rows are not synchorined accross columns, thus pages containing other fields of a row may have been filtered out. In this case we can't assemble the row, but there is no need to do so either, since getting filtered out in another column means that it can not match the filter condition.
A RecordReader assembles rows by reading from each ColumnReader. Without filtering, when RecordReader starts reading a row, ColumnReader-s are always positioned at the same row in respect to each other. With filtering, however, due to the misalignment described above, some of the pages read by ColumnReaders may start or end with values that have no corresponding values in other rows. This SynchronizingColumnReader is column reader implementation that skips such values so that the values returned to RecordReader for the different fields all correspond to a single row.
| public long getFirstRowIndex() { | ||
| if (firstRowIndex < 0) { | ||
| throw new IllegalStateException( | ||
| "No row synchronization is required; all pages shall be read."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a large amount of IllegalStateException-s scattered around the code with very similar texts about row synchronization not being required (but still being a little bit cryptic about what this means). Could you please create a separate Exception class for these so that it's not repeated all over the code with minor differences in wording? This would also allow a more verbose centralized description in the javadoc of the exception.
| } | ||
|
|
||
| @Override | ||
| PrimitiveIterator.OfInt gt(ColumnIndexBase<?>.ValueComparator comparator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my bad.
| final long from; | ||
| final long to; | ||
|
|
||
| Range(long from, long to) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document that from and to are both inclusive. Maybe also mention that the range can not be empty.
| } else if (from > other.to) { | ||
| return 1; | ||
| } else { | ||
| // Equality means the two ranges are overlapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation violates the third rule of the compareTo() contract, because this concept of equality is not transitive. For example, if
A = [20, 40]
B = [30, 60]
C = [50, 70]
then according to your compareTo implementation:
A = B and
B = C, yet
A < C
Fortunately, you don't really need this class to be Comparable, so I would suggest to remove that interface and also rename this method because the compareTo name may confuse readers of the code. I would suggest using an enum as the return value. The names of the enum values should properly describe their meaning (e.g., BEFORE, AFTER, OVERLAPPING).
| // Equality means the two ranges are overlapping | ||
| return 0; | ||
| } | ||
| boolean isBeforeThan(Range other) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) These method names sound a bit strange in English. Some alternative you could consider:
- isBefore/isBehind ("than" is not needed and "behind" is better for non-temporal comparison than "after")
- precedes/follows
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the Travis error then feel free to merge the PR as otherwise it looks good. Thanks!
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch. * PARQUET-1211: Column indexes: read/write API (#456) * PARQUET-1212: Column indexes: Show indexes in tools (#479) * PARQUET-1213: Column indexes: Limit index size (#480) * PARQUET-1214: Column indexes: Truncate min/max values (#481) * PARQUET-1364: Invalid row indexes for pages starting with nulls (#507) * PARQUET-1310: Column indexes: Filtering (#509) * PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515) * PARQUET-1389: Improve value skipping at page synchronization (#514) * PARQUET-1381: Fix missing endRecord after merging columnIndex
No description provided.