[SPARK-36594][SQL] ORC vectorized reader should properly check maximal number of fields #33842
Conversation
|
cc @cloud-fan and @dongjoon-hyun could you help take a look when you have time? Thanks. |
|
This cannot be merged into 3.2 branch cleanly as it depends on #33626 which was merged only on master. Will create another PR for 3.2 branch once this looks good. |
|
Kubernetes integration test starting |
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
|
Kubernetes integration test status failure |
…aximal number of fields

### What changes were proposed in this pull request?

This is the patch on branch-3.2 for #33842. See the description in the other PR.

### Why are the changes needed?

Avoid OOM/performance regression when reading ORC table with nested column types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `OrcSourceSuite.scala`.

Closes #33843 from c21/branch-3.2.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
|
Merging to master |
|
Test build #142780 has finished for PR 33842 at commit
|
|
Thank you @cloud-fan and @gengliangwang for review! |
|
+1, LGTM. |
What changes were proposed in this pull request?
Debugged internally and found a bug: the decision to disable the vectorized reader should be based on the schema recursively. Currently we check that `schema.length` is no more than `wholeStageMaxNumFields` to enable vectorization. `schema.length` does not take the sub-fields of nested columns into account (i.e. it treats a nested column the same as a primitive column). This check is wrong once vectorization is enabled for nested columns. We should follow the same check as `WholeStageCodegenExec` and count sub-fields recursively. This does not cause a correctness issue, but it does cause a performance issue: we may enable vectorization for nested columns by mistake when a nested column has a lot of sub-fields.
Why are the changes needed?
Avoid OOM/performance regression when reading ORC table with nested column types.
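To illustrate the difference between `schema.length` and a recursive sub-field count, here is a minimal, dependency-free sketch. The types below (`DataType`, `StructType`, `ArrayType`, `MapType`, `StructField`) are simplified stand-ins written for this example, not Spark's own classes, and the names `FieldCount`, `numOfNestedFields`, `isTooManyFields` and `wide` are chosen for illustration; the actual patch may differ in detail.

```scala
// Simplified stand-ins for Spark SQL data types (assumption: NOT Spark's classes).
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType
final case class StructField(name: String, dataType: DataType)
final case class StructType(fields: Seq[StructField]) extends DataType {
  // Counts top-level fields only, like the check described above.
  def length: Int = fields.length
}

object FieldCount {
  // Count leaf fields recursively; any non-nested type counts as one field.
  def numOfNestedFields(dt: DataType): Int = dt match {
    case s: StructType => s.fields.map(f => numOfNestedFields(f.dataType)).sum
    case m: MapType    => numOfNestedFields(m.keyType) + numOfNestedFields(m.valueType)
    case a: ArrayType  => numOfNestedFields(a.elementType)
    case _             => 1
  }

  // True when the recursive count exceeds the limit, so vectorization should be disabled.
  def isTooManyFields(maxNumFields: Int, dt: DataType): Boolean =
    numOfNestedFields(dt) > maxNumFields

  // Hypothetical schema: one top-level column holding a struct with 200 sub-fields.
  val wide: StructType = StructType(Seq(
    StructField("s", StructType((1 to 200).map(i => StructField(s"f$i", IntType))))
  ))
}
```

With this schema, `FieldCount.wide.length` is 1, so a top-level check against a limit of 100 would still enable vectorization, while `FieldCount.isTooManyFields(100, FieldCount.wide)` is true because the recursive count is 200.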
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit test in `OrcQuerySuite.scala`. Verified that the test fails without this change.