Skip to content

Conversation

@c21
Copy link
Contributor

@c21 c21 commented Aug 26, 2021

What changes were proposed in this pull request?

Debugged internally and found a bug where we should disable vectorized reader now based on schema recursively. Currently we check schema.length to be no more than wholeStageMaxNumFields to enable vectorization. schema.length does not take nested columns sub-fields into condition (i.e. view nested column same as primitive column). This check will be wrong when enabling vectorization for nested columns. We should follow same check from WholeStageCodegenExec to check sub-fields recursively. This will not cause correctness issue but will cause performance issue where we may enable vectorization for nested columns by mistake when nested column has a lot of sub-fields.

Why are the changes needed?

Avoid OOM/performance regression when reading ORC table with nested column types.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit test in OrcQuerySuite.scala. Verified test failed without this change.

@github-actions github-actions bot added the SQL label Aug 26, 2021
@c21
Copy link
Contributor Author

c21 commented Aug 26, 2021

cc @cloud-fan and @dongjoon-hyun could you help take a look when you have time? Thanks.
Sorry but It might be a blocker for Spark 3.2.0 release, cc @gengliangwang FYI.

@c21
Copy link
Contributor Author

c21 commented Aug 26, 2021

This cannot be merged into 3.2 branch cleanly as it depends on #33626 which was merged only on master. Will create another PR for 3.2 branch once this looks good.

@SparkQA
Copy link

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47280/

@SparkQA
Copy link

SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47280/

cloud-fan pushed a commit that referenced this pull request Aug 26, 2021
…aximal number of fields

### What changes were proposed in this pull request?

This is the patch on branch-3.2 for #33842. See the description in the other PR.

### Why are the changes needed?

Avoid OOM/performance regression when reading ORC table with nested column types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `OrcSourceSuite.scala`.

Closes #33843 from c21/branch-3.2.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@gengliangwang
Copy link
Member

Merging to master

@SparkQA
Copy link

SparkQA commented Aug 26, 2021

Test build #142780 has finished for PR 33842 at commit 3c7c7ea.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21
Copy link
Contributor Author

c21 commented Aug 26, 2021

Thank you @cloud-fan and @gengliangwang for review!

@c21 c21 deleted the field-fix branch August 26, 2021 19:00
@dongjoon-hyun
Copy link
Member

+1, LGTM.
Thank you, @c21 and all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants