
Conversation

@gszadovszky (Contributor):

Merging the column-indexes feature branch to master.

@gszadovszky (Contributor, Author):

As the column-index related changes have already been reviewed, we should not do a rebase on the feature branch. I think the best option is to merge the feature branch so that all the changes are kept and remain trackable.

@rdblue (Contributor) commented Sep 30, 2018:

-1 for a merge commit. The feature branch was a good way to break down the work and review it in chunks, but I think we still need to review the final patch that will go in. That's why I asked for a PR for this. I never thought that the branch would be merged without a final review, and that's a good time to take care of rebasing or merging into this branch and then squashing.

List<String> ColumnPaths;

@Parameter(names = { "-b",
"--block" }, description = "Shows the column/offset indexes for the given block (row-group) only; "

Contributor:

User-facing options should always use "row group" and never "block" because block is used in several different contexts and is confusing. Row group is always clear.
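
For illustration, the option could read like this once renamed (the -r short flag, field name, and host class are assumptions of this sketch, not suggestions from the thread):

    import com.beust.jcommander.Parameter;
    import java.util.List;

    public class ShowIndexesCommand { // hypothetical host class
      // user-facing text says "row group"; "block" stays internal-only
      @Parameter(names = { "-r", "--row-group" },
          description = "Shows the column/offset indexes for the given row group only")
      List<String> rowGroupIndexes;
    }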


InputFile in = HadoopInputFile.fromPath(new Path(files.get(0)), new Configuration());
if (!showColumnIndex && !showOffsetIndex) {
showColumnIndex = showOffsetIndex = true;

Contributor:

Nit: it is clearer to use separate assignments, because this chained form is one character away from assigning the result of a boolean comparison.
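
A sketch of the separate-assignment form:

    if (!showColumnIndex && !showOffsetIndex) {
      // two independent statements; cannot be misread as the comparison
      // showColumnIndex = (showOffsetIndex == true);
      showColumnIndex = true;
      showOffsetIndex = true;
    }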

if (blockIndexes == null || blockIndexes.isEmpty()) {
int index = 0;
for (BlockMetaData block : blocks) {
pairs.add(new AbstractMap.SimpleImmutableEntry<>(index++, block));

Contributor:

Nit: Using the return value of a ++ expression makes the code harder to read. Statements that set variables should be independent.
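
A sketch with the increment pulled out into its own statement:

    int index = 0;
    for (BlockMetaData block : blocks) {
      pairs.add(new AbstractMap.SimpleImmutableEntry<>(index, block));
      index += 1; // incremented independently, not consumed inside the add(...) call
    }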


try (ParquetFileReader reader = ParquetFileReader.open(in)) {
boolean firstBlock = true;
for (Entry<Integer, BlockMetaData> entry : getBlocks(reader.getFooter())) {

Contributor:

Minor: It is odd to me that getBlocks iterates through all the blocks and uses a map entry without a map, and then this iterates through the blocks again. I think it would be less code and more straightforward to iterate once with a counter and test the value of that counter against the set of requested indices.
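
A sketch of that single-pass version (showRowGroup is a hypothetical stand-in for the per-row-group output logic; blockIndexes is the parsed --block argument from the hunks above):

    int index = 0;
    for (BlockMetaData block : reader.getFooter().getBlocks()) {
      if (blockIndexes == null || blockIndexes.isEmpty() || blockIndexes.contains(index)) {
        showRowGroup(index, block); // hypothetical: print the column/offset indexes
      }
      index += 1;
    }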

Preconditions.checkArgument(files.size() == 1,
"Cannot process multiple Parquet files.");

InputFile in = HadoopInputFile.fromPath(new Path(files.get(0)), new Configuration());

Contributor:

This should use the helper methods in BaseCommand. Those helpers turn arguments into paths the way users expect a CLI utility to behave. For example, "/tmp/file.parquet" is opened in the local FS, not the default FS.
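
A sketch of what that could look like, assuming BaseCommand exposes helpers along the lines of qualifiedPath and getConf (both names are assumptions here):

    // qualifiedPath(...) would resolve "/tmp/file.parquet" against the local FS,
    // matching CLI expectations, instead of the default FS
    InputFile in = HadoopInputFile.fromPath(qualifiedPath(files.get(0)), getConf());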

}

// Returns the index-block pairs based on the arguments of --block
private List<Entry<Integer, BlockMetaData>> getBlocks(ParquetMetadata meta) {

Contributor:

Minor: It would be better to return a map instead. I think it's bad practice to use Entry outside of a map just because you need a Pair.
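
A sketch of the map-returning variant (a LinkedHashMap keeps the row groups in file order; the blockIndexes filtering is inferred from the hunks above):

    private Map<Integer, BlockMetaData> getBlocks(ParquetMetadata meta) {
      Map<Integer, BlockMetaData> result = new LinkedHashMap<>();
      int index = 0;
      for (BlockMetaData block : meta.getBlocks()) {
        if (blockIndexes == null || blockIndexes.isEmpty() || blockIndexes.contains(index)) {
          result.put(index, block);
        }
        index += 1;
      }
      return result;
    }

Callers would then iterate result.entrySet(), so the Entry stays tied to an actual map.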

if (pageReadStore.isInPageFilteringMode()) {
return new SynchronizingColumnReader(path, pageReader, converter, writerVersion, pageReadStore.getRowIndexes());
} else {
return new ColumnReaderImpl(path, pageReader, converter, writerVersion);

Contributor:

Why doesn't this use newMemColumnReader? Since there are only two uses of that function, I think that either both of them should be inlined like this, or both should continue calling it.

@gszadovszky (Contributor, Author) replied Oct 1, 2018:

newMemColumnReader is used by ParquetFileWriter.merge(List<InputFile>, BytesCompressor, String, long), introduced in PARQUET-1381. The implementation logic is different in the two methods.
getColumnReader(ColumnDescriptor) uses the internal PageReadStore instance to get the PageReader and the row indexes (to create the synchronizing reader if required). On the other hand, newMemColumnReader gets the PageReader as a parameter and the internal PageReadStore is not used (so there is no way to create a synchronizing reader).
Because of these differences, the two pieces of logic cannot be merged.

* if page filtering mode is not active so the related information is not available
* @see #isInPageFilteringMode()
*/
default PrimitiveIterator.OfLong getRowIndexes() {

Contributor:

Most of the code uses fastutil instead of Java 8 primitive iterators. It would be better to use the same one everywhere. Feel free to open an issue to move to Java 8 and eliminate fastutil, but I don't think we should mix them.

@gszadovszky (Contributor, Author):

I used the Java 8 primitive iterators because this interface is public and I did not want to expose fastutil classes. However, this API is internal, so we might use fastutil as well.
Unfortunately, Java 8 introduced only the primitive iterators and none of the other primitive implementations that fastutil offers (primitive lists, sets, maps, etc.), so I don't think we can drop fastutil entirely.
Do you think it is better to use the fastutil primitive iterators here?
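
For reference, the two flavors side by side (a minimal sketch; in recent fastutil versions LongIterator extends PrimitiveIterator.OfLong, so a fastutil iterator can also satisfy the Java 8 signature):

    import java.util.PrimitiveIterator;
    import java.util.stream.LongStream;
    import it.unimi.dsi.fastutil.longs.LongArrayList;
    import it.unimi.dsi.fastutil.longs.LongIterator;

    public class IteratorFlavors {
      public static void main(String[] args) {
        // Java 8 flavor: no fastutil type in the public signature
        PrimitiveIterator.OfLong rows = LongStream.range(0, 10).iterator();
        while (rows.hasNext()) {
          System.out.println(rows.nextLong()); // nextLong() avoids boxing
        }

        // fastutil flavor: matches the rest of the codebase
        LongIterator fastRows = new LongArrayList(new long[] { 1L, 2L, 3L }).iterator();
        while (fastRows.hasNext()) {
          System.out.println(fastRows.nextLong()); // same unboxed access
        }
      }
    }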

* Class representing row ranges in a row-group. These row ranges are calculated as a result of the column index based
* filtering.
*
* @see ColumnIndexFilter#calculateRowRanges(Filter, ColumnIndexStore, Collection, long)

Contributor:

This javadoc is broken.

* @see ColumnIndexFilter#calculateRowRanges(Filter, ColumnIndexStore, Collection, long)
*/
public class RowRanges {
private static class Range {

Contributor:

Why introduce a custom Range class? Guava includes a range implementation that is quite good, so I'd rather see that used.

Contributor:

It also provides convenient methods, like isConnected.

@gszadovszky (Contributor, Author):

Guava Range handles one range at a time, and I found it too generic for our use. RowRanges handles several distinct ranges in sorted order and calculates unions and intersections as well. RowRanges.Range is a private class and is deliberately simple; it only exists to support the functionality of RowRanges.
I think it would take much more work to use the Guava Range here, and we could not drop RowRanges anyway, as Guava's Range does not provide the required functionality.
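
For comparison, what Guava's Range covers and what it leaves out (a minimal sketch using the Guava API):

    import com.google.common.collect.Range;

    public class RangeDemo {
      public static void main(String[] args) {
        Range<Long> a = Range.closed(0L, 9L);
        Range<Long> b = Range.closed(5L, 14L);

        System.out.println(a.isConnected(b));  // true: the intervals overlap or touch
        System.out.println(a.intersection(b)); // [5..9]
        System.out.println(a.span(b));         // [0..14]
        // Range models a single interval; the sorted multi-interval bookkeeping
        // (union/intersection across a set of disjoint ranges) is what RowRanges adds.
      }
    }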

return ranges;
}

static RowRanges build(long rowCount, PrimitiveIterator.OfInt pageIndexes, OffsetIndex offsetIndex) {

Contributor:

I think this class needs more documentation. It isn't clear what these factory methods do without looking closely at the implementation. For example, single(rowCount) doesn't tell me that the resulting range is [0, rowCount). Similarly, it isn't clear what this is doing. Why is it called "build" when it is a factory method and not a builder?
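
A documentation sketch along those lines (javadoc wording only, based on the reviewer's reading of the implementation):

    /**
     * Creates a RowRanges instance that selects the single contiguous range
     * [0, rowCount), i.e. every row of the row-group.
     *
     * @param rowCount the number of rows in the row-group
     */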

/**
* @return the ascending iterator of the row indexes contained in the ranges
*/
public PrimitiveIterator.OfLong allRows() {

Contributor:

It would be clearer if this were named iterator, because it iterates over all of the selected rows, not the span of the set of ranges.

* A {@link ColumnReader} implementation for utilizing indexes. When filtering using column indexes, some of the rows
* may be loaded only partially, because rows are not synchronized across columns, thus pages containing other fields of
* a row may have been filtered out. In this case we can't assemble the row, but there is no need to do so either, since
* getting filtered out in another column means that it can not match the filter condition.

@rdblue (Contributor) commented Sep 30, 2018:

I'm having trouble making sense of this paragraph. Why do the other columns matter to this column reader? This column reader is passed an index of rows it will materialize and should be responsible for materializing those values as if the other values don't exist. Is that not how this works?

}

@Override
boolean skipLevels(int rl, int dl) {

Contributor:

Why is dl passed in?

*/
public long getFirstRowIndex() {
if (firstRowIndex < 0) {
throw new NotInPageFilteringModeException("First row index is not available");

Contributor:

Why does this throw an exception for the filtering mode? The reader mode doesn't matter here: the page either has a starting index and row count or it doesn't.

I think a better API would avoid throwing an exception by using Optional, Long/null, or -1 to signal that the page can't report this. Throwing an exception causes a runtime error when an expectation isn't met, while returning an Optional forces callers to handle the case where the value is absent.
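
A sketch of the Optional-based variant (OptionalLong avoids boxing; firstRowIndex < 0 encoding "unknown" is taken from the hunk above):

    import java.util.OptionalLong;

    public OptionalLong getFirstRowIndex() {
      if (firstRowIndex < 0) {
        return OptionalLong.empty(); // the caller decides how to handle a missing index
      }
      return OptionalLong.of(firstRowIndex);
    }

A caller is then forced to handle the empty case explicitly, e.g. page.getFirstRowIndex().orElse(0L).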

*
* @see PageReadStore#isInPageFilteringMode()
*/
public class NotInPageFilteringModeException extends IllegalStateException {

Contributor:

I don't think this exception is well-defined. Why throw an exception because some configuration isn't set? What this is really trying to represent is the case where column indexes are missing, or similar situations, and the problem is grouping all of those cases together. If an index is missing, Parquet should fall back to normal reads; methods that return index information should return options; and classes that can only be used with indexes should throw UnsupportedOperationException or similar when they are misused.
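
A sketch of that split (getColumnIndexIfPresent, filterWithIndex, and readWithoutFiltering are hypothetical names, used only to illustrate the two patterns):

    // Pattern 1: index lookups report absence via Optional; the caller falls
    // back to a normal, unfiltered read instead of catching an exception.
    Optional<ColumnIndex> index = store.getColumnIndexIfPresent(path);
    RowRanges ranges = index.isPresent()
        ? filterWithIndex(index.get())
        : readWithoutFiltering();

    // Pattern 2: a class usable only with indexes fails loudly when misused.
    if (!pageReadStore.isInPageFilteringMode()) {
      throw new UnsupportedOperationException(
          "SynchronizingColumnReader requires column-index filtering");
    }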

abstract public void skip();

/**
* Skips the next n value in the page

Contributor:

Nit: should be values (plural)

}

@Override
public void skip(int n) {

Contributor:

Skip needs to be tested for all of the readers.
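
A sketch of the kind of test this asks for (createReaderOverValues is a hypothetical helper that writes the integers 0..99 with a concrete ValuesWriter and returns an initialized ValuesReader; the real tests would cover each reader implementation):

    @Test
    public void testSkip() throws Exception {
      ValuesReader reader = createReaderOverValues(100); // hypothetical setup helper
      for (int i = 0; i < 100; i += 2) {
        assertEquals(i, reader.readInteger()); // read the even values
        reader.skip();                         // skip the odd ones
      }
    }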

@zivanfi (Contributor) commented Oct 1, 2018:

> -1 for a merge commit. The feature branch was a good way to break down the work and review it in chunks, but I think we still need to review the final patch that will go in. That's why I asked for a PR for this. I never thought that the branch would be merged without a final review, and that's a good time to take care of rebasing or merging into this branch and then squashing.

If I understand correctly, you have two separate concerns:

  • Merging the feature branch without reviewing the whole change.
  • Actually using a merge commit instead of squashing the individual commits into a single one.

Regarding the first one, I think feature branches in general would be most useful if we reviewed them continuously on the feature branch itself and limited the review of the merge to the commit resolving the merge conflicts. Developers of a branch may put months of effort into it. If only direct commits to the main branch are taken seriously enough for immediate review, people could come to the (justified) conclusion that the best way to keep their work from getting lost is to develop on the main branch and not in a feature branch.

Regarding the latter issue, I think a feature large enough for a separate feature branch is complex enough that the more detailed history a proper merge commit provides outweighs the disadvantage of a slightly more complicated history. Personally, when I try to understand the motivation behind certain code lines, I look up the commit that added them and read the whole commit. Naturally, the smaller these commits are, the easier it is to do this.

Let's discuss this further on the next Parquet sync and update this thread with the outcome.

@vinooganesh (Contributor):

Hey @gszadovszky - is there anything that we're waiting on before we can get this merged?

@zivanfi merged commit e7db9e2 into master on Oct 18, 2018.

@zivanfi (Contributor) commented Oct 18, 2018:

I promised to update this thread with the outcome of the merging vs. rebasing vs. squashing discussion. We decided on squashing after all. See this email for details and motivation.

return total;
}

long getFilteredRecordCount() {

Member:

@gszadovszky Could we make this public?

@gszadovszky (Contributor, Author):

I think there is no problem making this public.
I would suggest creating a JIRA for the changes required for the Spark integration (hopefully not much). If everything works fine, we'll try to do a rapid minor release (1.11.1) with these changes only.

@Fokko deleted the column-indexes branch January 8, 2020 07:36.