rdblue (Contributor) commented May 5, 2020

Spark has special handling in the read path for extra filter columns -- columns that are referenced by a filter expression but not included in the schema of rows returned to Spark. This path uses Spark's UnsafeProjection to copy each row into the expected schema, dropping the filter columns, before returning it to Spark.
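For context, here is a minimal sketch of the kind of per-row copy this path performs. The schemas and column names are illustrative, not Iceberg's actual reader code; only the UnsafeProjection usage reflects the mechanism described above.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._

// Rows are read with an extra column that only a pushed filter needs.
val readSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("data", StringType),
  StructField("category", StringType)))  // referenced by a filter only

// The narrower schema Spark actually asked for.
val expectedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("data", StringType)))

// Bind the expected columns to the read schema's attributes.
val attrs = readSchema.toAttributes
val exprs = expectedSchema.fieldNames.toSeq.map(n => attrs.find(_.name == n).get)

// Applying this projection to every row builds a fresh UnsafeRow
// without the filter column -- an extra per-row copy.
val dropFilterColumns = UnsafeProjection.create(exprs, attrs)
```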

This handling isn't needed because the source can tell Spark the schema that it will actually produce. If Spark requests a set of columns, the data source may return rows with additional columns, as long as the reader reports them back through readSchema. Using readSchema this way simplifies the logic in the reader and eliminates another case where rows were copied unnecessarily.
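As a rough illustration of the alternative, here is a hypothetical reader against Spark 2.4's DataSourceV2 interfaces (ExampleReader, tableSchema, and filterColumns are made up for the sketch): the reader keeps filter-only columns in its projection and reports the wider schema through readSchema, so Spark plans its own projection to drop them and the source never copies rows itself.

```scala
import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Hypothetical reader: tableSchema is the table's full schema and
// filterColumns names the columns referenced only by pushed filters.
class ExampleReader(tableSchema: StructType, filterColumns: Set[String])
    extends SupportsPushDownRequiredColumns {

  private var requiredSchema: StructType = tableSchema

  override def pruneColumns(requestedSchema: StructType): Unit = {
    // Keep the requested columns plus any filter-only columns.
    val keep = requestedSchema.fieldNames.toSet ++ filterColumns
    requiredSchema = StructType(tableSchema.fields.filter(f => keep(f.name)))
  }

  // May be wider than what Spark requested in pruneColumns(); Spark
  // adds its own projection to trim the extra columns after filtering.
  override def readSchema(): StructType = requiredSchema

  // Partition planning is elided for this sketch.
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList[InputPartition[InternalRow]]()
}
```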

rdblue requested a review from rdsr on May 5, 2020 at 17:10
rdblue (Contributor, Author) commented May 5, 2020

@rdsr, this affects the work you're doing for ORC identity projections.

@samarthjain, this hopefully makes the vectorized read path a bit easier.

danielcweeks (Contributor) commented

+1, LGTM. You might want to file an issue for ORC to support the constant projection, since I assume ORC will still pay the additional copy cost.

rdblue (Contributor, Author) commented May 7, 2020

Ratandeep is already working on constant projection in #989.

rdblue (Contributor, Author) commented May 7, 2020

I'm not sure why CI didn't run, but I tested locally to verify this change and everything passes. I'm going to merge this.

rdblue merged commit a9b51f0 into apache:master on May 7, 2020