
Conversation

@gszadovszky (Contributor):

Merging the column-indexes feature branch to master.

@gszadovszky (Contributor, Author):

As the column-index related changes have already been reviewed, we should not do a rebase on the feature branch. I think the best option is to merge the feature branch so that all the changes are kept and remain trackable.

@rdblue (Contributor) commented Sep 30, 2018:

-1 for a merge commit. The feature branch was a good way to break down the work and review it in chunks, but I think we still need to review the final patch that will go in. That's why I asked for a PR for this. I never thought that the branch would be merged without a final review, and that's a good time to take care of rebasing or merging into this branch and then squashing.

List<String> ColumnPaths;

@Parameter(names = { "-b",
"--block" }, description = "Shows the column/offset indexes for the given block (row-group) only; "

Contributor:

User-facing options should always use "row group" and never "block" because block is used in several different contexts and is confusing. Row group is always clear.
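
For illustration, the option could read like this once renamed (the -r short flag, field name, and host class are assumptions of this sketch, not suggestions from the thread):

    import com.beust.jcommander.Parameter;
    import java.util.List;

    public class ShowIndexesCommand { // hypothetical host class
      // user-facing text says "row group"; "block" stays internal-only
      @Parameter(names = { "-r", "--row-group" },
          description = "Shows the column/offset indexes for the given row group only")
      List<String> rowGroupIndexes;
    }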


InputFile in = HadoopInputFile.fromPath(new Path(files.get(0)), new Configuration());
if (!showColumnIndex && !showOffsetIndex) {
showColumnIndex = showOffsetIndex = true;

Contributor:

Nit: it is clearer to use separate assignments, because this chained form is one character away from assigning the result of a boolean comparison.
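
A sketch of the separate-assignment form:

    if (!showColumnIndex && !showOffsetIndex) {
      // two independent statements; cannot be misread as the comparison
      // showColumnIndex = (showOffsetIndex == true);
      showColumnIndex = true;
      showOffsetIndex = true;
    }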

if (blockIndexes == null || blockIndexes.isEmpty()) {
int index = 0;
for (BlockMetaData block : blocks) {
pairs.add(new AbstractMap.SimpleImmutableEntry<>(index++, block));

Contributor:

Nit: Using the return value of a ++ expression makes the code harder to read. Statements that set variables should be independent.
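
A sketch with the increment pulled out into its own statement:

    int index = 0;
    for (BlockMetaData block : blocks) {
      pairs.add(new AbstractMap.SimpleImmutableEntry<>(index, block));
      index += 1; // incremented independently, not consumed inside the add(...) call
    }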


try (ParquetFileReader reader = ParquetFileReader.open(in)) {
boolean firstBlock = true;
for (Entry<Integer, BlockMetaData> entry : getBlocks(reader.getFooter())) {

Contributor:

Minor: It is odd to me that getBlocks iterates through all the blocks and uses a map entry without a map, and then this iterates through the blocks again. I think it would be less code and more straightforward to iterate once with a counter and test the value of that counter against the set of requested indices.
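
A sketch of that single-pass version (showRowGroup is a hypothetical stand-in for the per-row-group output logic; blockIndexes is the parsed --block argument from the hunks above):

    int index = 0;
    for (BlockMetaData block : reader.getFooter().getBlocks()) {
      if (blockIndexes == null || blockIndexes.isEmpty() || blockIndexes.contains(index)) {
        showRowGroup(index, block); // hypothetical: print the column/offset indexes
      }
      index += 1;
    }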

Preconditions.checkArgument(files.size() == 1,
"Cannot process multiple Parquet files.");

InputFile in = HadoopInputFile.fromPath(new Path(files.get(0)), new Configuration());

Contributor:

This should use the helper methods in BaseCommand. Those helpers turn arguments into paths the way users expect a CLI utility to behave. For example, "/tmp/file.parquet" is opened in the local FS, not the default FS.
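
A sketch of what that could look like, assuming BaseCommand exposes helpers along the lines of qualifiedPath and getConf (both names are assumptions here):

    // qualifiedPath(...) would resolve "/tmp/file.parquet" against the local FS,
    // matching CLI expectations, instead of the default FS
    InputFile in = HadoopInputFile.fromPath(qualifiedPath(files.get(0)), getConf());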

}

// Returns the index-block pairs based on the arguments of --block
private List<Entry<Integer, BlockMetaData>> getBlocks(ParquetMetadata meta) {

Contributor:

Minor: It would be better to return a map instead. I think it's bad practice to use Entry outside of a map just because you need a Pair.
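
A sketch of the map-returning variant (a LinkedHashMap keeps the row groups in file order; the blockIndexes filtering is inferred from the hunks above):

    private Map<Integer, BlockMetaData> getBlocks(ParquetMetadata meta) {
      Map<Integer, BlockMetaData> result = new LinkedHashMap<>();
      int index = 0;
      for (BlockMetaData block : meta.getBlocks()) {
        if (blockIndexes == null || blockIndexes.isEmpty() || blockIndexes.contains(index)) {
          result.put(index, block);
        }
        index += 1;
      }
      return result;
    }

Callers would then iterate result.entrySet(), so the Entry stays tied to an actual map.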

if (pageReadStore.isInPageFilteringMode()) {
return new SynchronizingColumnReader(path, pageReader, converter, writerVersion, pageReadStore.getRowIndexes());
} else {
return new ColumnReaderImpl(path, pageReader, converter, writerVersion);

Contributor:

Why doesn't this use newMemColumnReader? Since there are only two uses of that function, I think that either both of them should be inlined like this, or both should continue calling it.

@gszadovszky (Contributor, Author) replied Oct 1, 2018:

newMemColumnReader is used by ParquetFileWriter.merge(List<InputFile>, BytesCompressor, String, long), introduced in PARQUET-1381. The implementation logic is different in the two methods.
getColumnReader(ColumnDescriptor) uses the internal PageReadStore instance to get the PageReader and the row indexes (to create the synchronizing reader if required). On the other hand, newMemColumnReader gets the PageReader as a parameter and the internal PageReadStore is not used (so there is no way to create a synchronizing reader).
Because of these differences, the two pieces of logic cannot be merged.

* if page filtering mode is not active so the related information is not available
* @see #isInPageFilteringMode()
*/
default PrimitiveIterator.OfLong getRowIndexes() {

Contributor:

Most of the code uses fastutil instead of Java 8 primitive iterators. It would be better to use the same one everywhere. Feel free to open an issue to move to Java 8 and eliminate fastutil, but I don't think we should mix them.

@gszadovszky (Contributor, Author):

I used the Java 8 primitive iterators because this interface is public and I did not want to expose fastutil classes. However, this API is internal, so we might use fastutil as well.
Unfortunately, Java 8 introduced only the primitive iterators and none of the other primitive implementations that fastutil offers (primitive lists, sets, maps, etc.), so I don't think we can drop fastutil entirely.
Do you think it is better to use the fastutil primitive iterators here?
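
For reference, the two flavors side by side (a minimal sketch; in recent fastutil versions LongIterator extends PrimitiveIterator.OfLong, so a fastutil iterator can also satisfy the Java 8 signature):

    import java.util.PrimitiveIterator;
    import java.util.stream.LongStream;
    import it.unimi.dsi.fastutil.longs.LongArrayList;
    import it.unimi.dsi.fastutil.longs.LongIterator;

    public class IteratorFlavors {
      public static void main(String[] args) {
        // Java 8 flavor: no fastutil type in the public signature
        PrimitiveIterator.OfLong rows = LongStream.range(0, 10).iterator();
        while (rows.hasNext()) {
          System.out.println(rows.nextLong()); // nextLong() avoids boxing
        }

        // fastutil flavor: matches the rest of the codebase
        LongIterator fastRows = new LongArrayList(new long[] { 1L, 2L, 3L }).iterator();
        while (fastRows.hasNext()) {
          System.out.println(fastRows.nextLong()); // same unboxed access
        }
      }
    }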

* Class representing row ranges in a row-group. These row ranges are calculated as a result of the column index based
* filtering.
*
* @see ColumnIndexFilter#calculateRowRanges(Filter, ColumnIndexStore, Collection, long)

Contributor:

This javadoc is broken.

* @see ColumnIndexFilter#calculateRowRanges(Filter, ColumnIndexStore, Collection, long)
*/
public class RowRanges {
private static class Range {

Contributor:

Why introduce a custom Range class? Guava includes a range implementation that is quite good, so I'd rather see that used.

Contributor:

It also provides convenient methods, like isConnected.

@gszadovszky (Contributor, Author):

Guava Range handles one range at a time, and I found it too generic for our use. RowRanges handles several distinct ranges in sorted order and calculates unions and intersections as well. RowRanges.Range is a private class and is deliberately simple; it only exists to support the functionality of RowRanges.
I think it would take much more work to use the Guava Range here, and we could not drop RowRanges anyway, as Guava's Range does not provide the required functionality.
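
For comparison, what Guava's Range covers and what it leaves out (a minimal sketch using the Guava API):

    import com.google.common.collect.Range;

    public class RangeDemo {
      public static void main(String[] args) {
        Range<Long> a = Range.closed(0L, 9L);
        Range<Long> b = Range.closed(5L, 14L);

        System.out.println(a.isConnected(b));  // true: the intervals overlap or touch
        System.out.println(a.intersection(b)); // [5..9]
        System.out.println(a.span(b));         // [0..14]
        // Range models a single interval; the sorted multi-interval bookkeeping
        // (union/intersection across a set of disjoint ranges) is what RowRanges adds.
      }
    }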

return ranges;
}

static RowRanges build(long rowCount, PrimitiveIterator.OfInt pageIndexes, OffsetIndex offsetIndex) {

Contributor:

I think this class needs more documentation. It isn't clear what these factory methods do without looking closely at the implementation. For example, single(rowCount) doesn't tell me that the resulting range is [0, rowCount). Similarly, it isn't clear what this is doing. Why is it called "build" when it is a factory method and not a builder?
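
A documentation sketch along those lines (javadoc wording only, based on the reviewer's reading of the implementation):

    /**
     * Creates a RowRanges instance that selects the single contiguous range
     * [0, rowCount), i.e. every row of the row-group.
     *
     * @param rowCount the number of rows in the row-group
     */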

/**
* @return the ascending iterator of the row indexes contained in the ranges
*/
public PrimitiveIterator.OfLong allRows() {

Contributor:

It would be clearer if this were named iterator, because it iterates over all of the selected rows, not the span of the set of ranges.

* A {@link ColumnReader} implementation for utilizing indexes. When filtering using column indexes, some of the rows
* may be loaded only partially, because rows are not synchronized across columns, thus pages containing other fields of
* a row may have been filtered out. In this case we can't assemble the row, but there is no need to do so either, since
* getting filtered out in another column means that it can not match the filter condition.

@rdblue (Contributor) commented Sep 30, 2018:

I'm having trouble making sense of this paragraph. Why do the other columns matter to this column reader? This column reader is passed an index of rows it will materialize and should be responsible for materializing those values as if the other values don't exist. Is that not how this works?

}

@Override
boolean skipLevels(int rl, int dl) {

Contributor:

Why is dl passed in?

*/
public long getFirstRowIndex() {
if (firstRowIndex < 0) {
throw new NotInPageFilteringModeException("First row index is not available");

Contributor:

Why does this throw an exception for the filtering mode? The reader mode doesn't matter here: the page either has a starting index and row count or it doesn't.

I think a better API would avoid throwing an exception by using Optional, Long/null, or -1 to signal that the page can't report this. Throwing an exception causes a runtime error when an expectation isn't met, while returning an Optional forces callers to handle the case where the value is absent.
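
A sketch of the Optional-based variant (OptionalLong avoids boxing; firstRowIndex < 0 encoding "unknown" is taken from the hunk above):

    import java.util.OptionalLong;

    public OptionalLong getFirstRowIndex() {
      if (firstRowIndex < 0) {
        return OptionalLong.empty(); // the caller decides how to handle a missing index
      }
      return OptionalLong.of(firstRowIndex);
    }

A caller is then forced to handle the empty case explicitly, e.g. page.getFirstRowIndex().orElse(0L).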

*
* @see PageReadStore#isInPageFilteringMode()
*/
public class NotInPageFilteringModeException extends IllegalStateException {

Contributor:

I don't think this exception is well-defined. Why throw an exception because some configuration isn't set? What this is really trying to represent is the case where column indexes are missing, or similar situations, and the problem is grouping all of those cases together. If an index is missing, Parquet should fall back to normal reads; methods that return index information should return options; and classes that can only be used with indexes should throw UnsupportedOperationException or similar when they are misused.
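
A sketch of that split (getColumnIndexIfPresent, filterWithIndex, and readWithoutFiltering are hypothetical names, used only to illustrate the two patterns):

    // Pattern 1: index lookups report absence via Optional; the caller falls
    // back to a normal, unfiltered read instead of catching an exception.
    Optional<ColumnIndex> index = store.getColumnIndexIfPresent(path);
    RowRanges ranges = index.isPresent()
        ? filterWithIndex(index.get())
        : readWithoutFiltering();

    // Pattern 2: a class usable only with indexes fails loudly when misused.
    if (!pageReadStore.isInPageFilteringMode()) {
      throw new UnsupportedOperationException(
          "SynchronizingColumnReader requires column-index filtering");
    }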

abstract public void skip();

/**
* Skips the next n value in the page

Contributor:

Nit: should be values (plural)

}

@Override
public void skip(int n) {

Contributor:

Skip needs to be tested for all of the readers.
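
A sketch of the kind of test this asks for (createReaderOverValues is a hypothetical helper that writes the integers 0..99 with a concrete ValuesWriter and returns an initialized ValuesReader; the real tests would cover each reader implementation):

    @Test
    public void testSkip() throws Exception {
      ValuesReader reader = createReaderOverValues(100); // hypothetical setup helper
      for (int i = 0; i < 100; i += 2) {
        assertEquals(i, reader.readInteger()); // read the even values
        reader.skip();                         // skip the odd ones
      }
    }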

@zivanfi (Contributor) commented Oct 1, 2018:

> -1 for a merge commit. The feature branch was a good way to break down the work and review it in chunks, but I think we still need to review the final patch that will go in. That's why I asked for a PR for this. I never thought that the branch would be merged without a final review, and that's a good time to take care of rebasing or merging into this branch and then squashing.

If I understand correctly, you have two separate concerns:

  • Merging the feature branch without reviewing the whole change.
  • Actually using a merge commit instead of squashing the individual commits into a single one.

Regarding the first one, I think feature branches in general would be most useful if we reviewed them continuously on the feature branch itself and limited the review of the merge to the commit resolving the merge conflicts. Developers of a branch may put months of effort into it. If only direct commits to the main branch are taken seriously enough for immediate review, people could come to the (justified) conclusion that the best way to keep their work from getting lost is to develop on the main branch and not in a feature branch.

Regarding the latter issue, I think a feature large enough for a separate feature branch is complex enough that the more detailed history a proper merge commit provides outweighs the disadvantage of a slightly more complicated history. Personally, when I try to understand the motivation behind certain code lines, I look up the commit that added them and read the whole commit. Naturally, the smaller these commits are, the easier it is to do this.

Let's discuss this further on the next Parquet sync and update this thread with the outcome.

@vinooganesh (Contributor):

Hey @gszadovszky - is there anything that we're waiting on before we can get this merged?

@zivanfi merged commit e7db9e2 into master on Oct 18, 2018.

@zivanfi (Contributor) commented Oct 18, 2018:

I promised to update this thread with the outcome of the merging vs. rebasing vs. squashing discussion. We decided on squashing after all. See this email for details and motivation.

return total;
}

long getFilteredRecordCount() {

Member:

@gszadovszky Could we make this public?

@gszadovszky (Contributor, Author):

I think there is no problem making this public.
I would suggest creating a JIRA for the changes required for the Spark integration (hopefully not much). If everything works fine, we'll try to do a rapid minor release (1.11.1) with these changes only.

@Fokko deleted the column-indexes branch January 8, 2020 07:36.