[SPARK-36594][SQL] ORC vectorized reader should properly check maximal number of fields #33842

c21 · 2021-08-26T03:58:26Z

What changes were proposed in this pull request?

While debugging internally, we found a bug: the decision to enable or disable the vectorized reader should take the schema into account recursively. Currently we check that schema.length is no more than wholeStageMaxNumFields to enable vectorization. schema.length does not take the sub-fields of nested columns into account (i.e. it treats a nested column the same as a primitive column), so this check is wrong once vectorization is enabled for nested columns. We should follow the same check as WholeStageCodegenExec and count sub-fields recursively. This does not cause a correctness issue, but it can cause a performance issue: we may enable vectorization for nested columns by mistake when a nested column has a lot of sub-fields.
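To illustrate the difference between the two checks, here is a minimal, self-contained sketch. It does not use Spark's API: the `Primitive`, `Struct`, `Array`, and `Map` classes and all function names are hypothetical stand-ins, and `max_fields` plays the role of wholeStageMaxNumFields. The recursive count is only an approximation of the kind of check this PR description refers to, not the actual Spark code.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy schema model (hypothetical; not Spark's DataType hierarchy).
@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "int", "string"


@dataclass(frozen=True)
class Struct:
    fields: Tuple["DataType", ...]


@dataclass(frozen=True)
class Array:
    element: "DataType"


@dataclass(frozen=True)
class Map:
    key: "DataType"
    value: "DataType"


DataType = Union[Primitive, Struct, Array, Map]


def num_nested_fields(dt: DataType) -> int:
    """Count leaf fields, descending into structs, arrays, and maps."""
    if isinstance(dt, Struct):
        return sum(num_nested_fields(f) for f in dt.fields)
    if isinstance(dt, Array):
        return num_nested_fields(dt.element)
    if isinstance(dt, Map):
        return num_nested_fields(dt.key) + num_nested_fields(dt.value)
    return 1  # primitive column


def vectorize_by_top_level_length(schema: Struct, max_fields: int) -> bool:
    """The old-style check: only top-level columns are counted."""
    return len(schema.fields) <= max_fields


def vectorize_by_nested_count(schema: Struct, max_fields: int) -> bool:
    """The recursive check: every sub-field counts toward the limit."""
    return num_nested_fields(schema) <= max_fields


# A schema with two top-level columns, one of which is a struct
# holding 150 primitive sub-fields.
INT = Primitive("int")
wide_struct = Struct(tuple(INT for _ in range(150)))
schema = Struct((INT, wide_struct))

print(len(schema.fields))         # 2
print(num_nested_fields(schema))  # 151
print(vectorize_by_top_level_length(schema, max_fields=100))  # True
print(vectorize_by_nested_count(schema, max_fields=100))      # False
```

With a limit of 100, the top-level check would enable vectorization for this schema even though it carries 151 leaf fields, while the recursive check would disable it.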

Why are the changes needed?

Avoid OOM and performance regressions when reading ORC tables with nested column types.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit test in OrcQuerySuite.scala. Verified test failed without this change.

c21 · 2021-08-26T03:59:16Z

cc @cloud-fan and @dongjoon-hyun could you help take a look when you have time? Thanks.
Sorry, but it might be a blocker for the Spark 3.2.0 release; cc @gengliangwang FYI.

c21 · 2021-08-26T04:01:11Z

This cannot be merged cleanly into the 3.2 branch, as it depends on #33626, which was merged only to master. I will create another PR for the 3.2 branch once this looks good.

SparkQA · 2021-08-26T04:47:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47280/

SparkQA · 2021-08-26T04:57:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47280/

…aximal number of fields

### What changes were proposed in this pull request?

This is the patch on branch-3.2 for #33842. See the description in the other PR.

### Why are the changes needed?

Avoid OOM/performance regression when reading ORC table with nested column types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `OrcSourceSuite.scala`.

Closes #33843 from c21/branch-3.2.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

gengliangwang · 2021-08-26T11:41:46Z

Merging to master

SparkQA · 2021-08-26T12:20:58Z

Test build #142780 has finished for PR 33842 at commit 3c7c7ea.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2021-08-26T19:00:11Z

Thank you @cloud-fan and @gengliangwang for review!

dongjoon-hyun · 2021-08-26T21:46:12Z

+1, LGTM.
Thank you, @c21 and all.

ORC vectorized reader should properly check maximal number of fields

3c7c7ea

github-actions bot added the SQL label Aug 26, 2021

cloud-fan approved these changes Aug 26, 2021

c21 mentioned this pull request Aug 26, 2021

[SPARK-36594][SQL][3.2] ORC vectorized reader should properly check maximal number of fields #33843

Closed

gengliangwang reviewed Aug 26, 2021

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala

gengliangwang approved these changes Aug 26, 2021

gengliangwang closed this in 400dc7b Aug 26, 2021

c21 deleted the field-fix branch August 26, 2021 19:00

c21 mentioned this pull request Dec 30, 2021

[SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35038

Closed
