rdblue (Contributor) commented May 5, 2020

Spark has special handling in the read path for extra filter columns -- columns that are referenced by a filter expression but not included in the schema of rows returned to Spark. This path uses Spark's UnsafeProjection to copy each row into the expected schema, dropping the filter columns, before returning it to Spark.
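For context, here is a minimal sketch of the kind of per-row copy this path performs. The schemas and column names are illustrative, not Iceberg's actual reader code; only the UnsafeProjection usage reflects the mechanism described above.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._

// Rows are read with an extra column that only a pushed filter needs.
val readSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("data", StringType),
  StructField("category", StringType)))  // referenced by a filter only

// The narrower schema Spark actually asked for.
val expectedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("data", StringType)))

// Bind the expected columns to the read schema's attributes.
val attrs = readSchema.toAttributes
val exprs = expectedSchema.fieldNames.toSeq.map(n => attrs.find(_.name == n).get)

// Applying this projection to every row builds a fresh UnsafeRow
// without the filter column -- an extra per-row copy.
val dropFilterColumns = UnsafeProjection.create(exprs, attrs)
```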

This handling isn't needed because the source can tell Spark the schema that it will actually produce. If Spark requests a set of columns, the data source may return rows with additional columns, as long as the reader reports them back through readSchema. Using readSchema this way simplifies the logic in the reader and eliminates another case where rows were copied unnecessarily.
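As a rough illustration of the alternative, here is a hypothetical reader against Spark 2.4's DataSourceV2 interfaces (ExampleReader, tableSchema, and filterColumns are made up for the sketch): the reader keeps filter-only columns in its projection and reports the wider schema through readSchema, so Spark plans its own projection to drop them and the source never copies rows itself.

```scala
import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Hypothetical reader: tableSchema is the table's full schema and
// filterColumns names the columns referenced only by pushed filters.
class ExampleReader(tableSchema: StructType, filterColumns: Set[String])
    extends SupportsPushDownRequiredColumns {

  private var requiredSchema: StructType = tableSchema

  override def pruneColumns(requestedSchema: StructType): Unit = {
    // Keep the requested columns plus any filter-only columns.
    val keep = requestedSchema.fieldNames.toSet ++ filterColumns
    requiredSchema = StructType(tableSchema.fields.filter(f => keep(f.name)))
  }

  // May be wider than what Spark requested in pruneColumns(); Spark
  // adds its own projection to trim the extra columns after filtering.
  override def readSchema(): StructType = requiredSchema

  // Partition planning is elided for this sketch.
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList[InputPartition[InternalRow]]()
}
```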

rdblue requested a review from rdsr on May 5, 2020 at 17:10
rdblue (Contributor, Author) commented May 5, 2020

@rdsr, this affects the work you're doing for ORC identity projections.

@samarthjain, this hopefully makes the vectorized read path a bit easier.

danielcweeks (Contributor) commented

+1, LGTM. You might want to file an issue for ORC to support the constant projection, since I assume ORC will still pay the additional copy cost.

rdblue (Contributor, Author) commented May 7, 2020

Ratandeep is already working on constant projection in #989.

rdblue (Contributor, Author) commented May 7, 2020

I'm not sure why CI didn't run, but I tested locally to verify this change and everything passes. I'm going to merge this.

rdblue merged commit a9b51f0 into apache:master on May 7, 2020