Conversation

@samarthjain (Collaborator) commented May 4, 2020

This refactoring adds a method that lets subclasses customize how identity partition columns are handled in RowDataReader and BatchDataReader.

The bulk of the change moves the code that handles identity partition columns from BaseDataReader into RowDataReader. Currently, BatchDataReader is not used when identity partition columns are projected (see #838).

```java
}

  abstract Iterator<T> open(FileScanTask task);

  abstract Pair<Schema, Iterator<T>> getJoinedSchemaAndIteratorWithIdentityPartition(
```
A contributor commented on this diff:

Since this is part of the base readers API, it would be good to add a docstring on what this method is supposed to do and where it is used.
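The refactor described in the PR description can be sketched as a template method: the base reader keeps the shared iteration logic, while subclasses override a hook that decides how identity partition columns are attached to each row. All class and method names below are simplified stand-ins for illustration, not Iceberg's actual API.

```java
import java.util.Iterator;
import java.util.List;

// Hedged sketch (stand-in names, not Iceberg's real classes) of the
// template-method pattern: the base reader owns the read loop and defers
// identity-partition handling to subclasses.
abstract class ReaderSketch {
    // Hook: subclasses customize how identity partition values are attached.
    abstract String[] withIdentityPartition(String[] fileRow);

    // Shared loop: read every file row, applying the subclass hook.
    final Iterator<String[]> open(List<String[]> fileRows) {
        return fileRows.stream().map(this::withIdentityPartition).iterator();
    }
}

// Row-oriented reader: appends the constant identity partition value to
// each row read from the data file.
class RowReaderSketch extends ReaderSketch {
    private final String partitionValue; // e.g. an identity-partitioned date

    RowReaderSketch(String partitionValue) {
        this.partitionValue = partitionValue;
    }

    @Override
    String[] withIdentityPartition(String[] fileRow) {
        String[] joined = new String[fileRow.length + 1];
        System.arraycopy(fileRow, 0, joined, 0, fileRow.length);
        joined[fileRow.length] = partitionValue;
        return joined;
    }
}
```

A batch-oriented reader could override the same hook differently (or not at all, given that BatchDataReader is not used with identity partition projections).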

@rdblue (Contributor) commented May 5, 2020

@samarthjain, can you describe these changes in the description and give more context about why they are needed?

@samarthjain (Collaborator, Author) commented

@rdblue, @prodeezy - I have updated the PR describing the changes. Also pushed a commit to add doc for the method.

@prodeezy (Contributor) commented May 7, 2020

lgtm @samarthjain

@rdblue (Contributor) commented May 7, 2020

#1004 was merged, so I don't think we need this refactor any longer.

There were only two cases where we needed to project before returning a batch or row to Spark: when extra columns were projected for filtering (fixed by #1004), or when using a JoinedRow to add identity partition values. Since vectorized reads don't support adding identity-partitioned values, there are no longer any cases where we need to project.

Also, when vectorized reads do support identity-partitioned values, we should be able to use the idToConstant map that the Avro and Parquet readers currently use. Then we would create columns in the right order to begin with and wouldn't need a projection at all.

Since this isn't needed and I already talked with @samarthjain, I'm going to close it.
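The idToConstant idea mentioned above can be sketched as follows: a map from field id to a constant value (the identity partition value), consulted while assembling each row so that columns come out in projection order and no post-hoc projection is needed. The field ids and values here are made up for illustration; this is not Iceberg's actual reader code.

```java
import java.util.Map;

// Hedged sketch of an idToConstant lookup: identity partition columns are
// filled from a constant map, everything else from the data file, so the
// output row is already in projection order.
class IdToConstantSketch {
    static Object[] readRow(int[] projectedFieldIds,
                            Map<Integer, Object> idToConstant,
                            Map<Integer, Object> fileRow) {
        Object[] out = new Object[projectedFieldIds.length];
        for (int i = 0; i < projectedFieldIds.length; i++) {
            int id = projectedFieldIds[i];
            // Constants (identity partition values) take precedence over
            // values read from the file.
            out[i] = idToConstant.containsKey(id)
                ? idToConstant.get(id)
                : fileRow.get(id);
        }
        return out;
    }
}
```

Because the constant is substituted as each column is materialized, there is no JoinedRow and no trailing projection step.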

@rdblue rdblue closed this May 7, 2020
@samarthjain samarthjain deleted the refactor-reader branch May 9, 2020 00:24
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
* Internal: DR actions
* Internal: Add DR support of V2 tables without delete date files (apache#894)