Conversation

@samarthjain (Collaborator) commented May 4, 2020

This refactoring adds a method that lets subclasses customize how identity partition columns are handled in RowDataReader and BatchDataReader.

The bulk of the change moves the code that handles identity partition columns from BaseDataReader into RowDataReader. Currently, BatchDataReader is not used when identity partition columns are projected (see #838).

```java
}

  abstract Iterator<T> open(FileScanTask task);

  abstract Pair<Schema, Iterator<T>> getJoinedSchemaAndIteratorWithIdentityPartition(
```
A contributor commented on this diff:

Since this is part of the base readers API, it would be good to add a docstring on what this method is supposed to do and where it is used.
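The refactor described in the PR description can be sketched as a template method: the base reader keeps the shared iteration logic, while subclasses override a hook that decides how identity partition columns are attached to each row. All class and method names below are simplified stand-ins for illustration, not Iceberg's actual API.

```java
import java.util.Iterator;
import java.util.List;

// Hedged sketch (stand-in names, not Iceberg's real classes) of the
// template-method pattern: the base reader owns the read loop and defers
// identity-partition handling to subclasses.
abstract class ReaderSketch {
    // Hook: subclasses customize how identity partition values are attached.
    abstract String[] withIdentityPartition(String[] fileRow);

    // Shared loop: read every file row, applying the subclass hook.
    final Iterator<String[]> open(List<String[]> fileRows) {
        return fileRows.stream().map(this::withIdentityPartition).iterator();
    }
}

// Row-oriented reader: appends the constant identity partition value to
// each row read from the data file.
class RowReaderSketch extends ReaderSketch {
    private final String partitionValue; // e.g. an identity-partitioned date

    RowReaderSketch(String partitionValue) {
        this.partitionValue = partitionValue;
    }

    @Override
    String[] withIdentityPartition(String[] fileRow) {
        String[] joined = new String[fileRow.length + 1];
        System.arraycopy(fileRow, 0, joined, 0, fileRow.length);
        joined[fileRow.length] = partitionValue;
        return joined;
    }
}
```

A batch-oriented reader could override the same hook differently (or not at all, given that BatchDataReader is not used with identity partition projections).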

@rdblue (Contributor) commented May 5, 2020

@samarthjain, can you describe these changes in the description and give more context about why they are needed?

@samarthjain (Collaborator, Author) commented

@rdblue, @prodeezy - I have updated the PR describing the changes. Also pushed a commit to add doc for the method.

@prodeezy (Contributor) commented May 7, 2020

lgtm @samarthjain

@rdblue (Contributor) commented May 7, 2020

#1004 was merged, so I don't think we need this refactor any longer.

There were only two cases where we needed to project before returning a batch or row to Spark: when extra columns were projected for filtering (fixed by #1004), or when using a JoinedRow to add identity partition values. Since vectorized reads don't support adding identity-partitioned values, there are no longer any cases where we need to project.

Also, when vectorized reads do support identity-partitioned values, we should be able to use the idToConstant map that the Avro and Parquet readers currently use. Then we would create columns in the right order to begin with and wouldn't need a projection at all.

Since this isn't needed and I already talked with @samarthjain, I'm going to close it.
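The idToConstant idea mentioned above can be sketched as follows: a map from field id to a constant value (the identity partition value), consulted while assembling each row so that columns come out in projection order and no post-hoc projection is needed. The field ids and values here are made up for illustration; this is not Iceberg's actual reader code.

```java
import java.util.Map;

// Hedged sketch of an idToConstant lookup: identity partition columns are
// filled from a constant map, everything else from the data file, so the
// output row is already in projection order.
class IdToConstantSketch {
    static Object[] readRow(int[] projectedFieldIds,
                            Map<Integer, Object> idToConstant,
                            Map<Integer, Object> fileRow) {
        Object[] out = new Object[projectedFieldIds.length];
        for (int i = 0; i < projectedFieldIds.length; i++) {
            int id = projectedFieldIds[i];
            // Constants (identity partition values) take precedence over
            // values read from the file.
            out[i] = idToConstant.containsKey(id)
                ? idToConstant.get(id)
                : fileRow.get(id);
        }
        return out;
    }
}
```

Because the constant is substituted as each column is materialized, there is no JoinedRow and no trailing projection step.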

@rdblue rdblue closed this May 7, 2020
@samarthjain samarthjain deleted the refactor-reader branch May 9, 2020 00:24
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
* Internal: DR actions
* Internal: Add DR support of V2 tables without delete date files (apache#894)