[SPARK-36404][SQL] Support ORC nested column vectorized reader for data source v2 #33626

c21 · 2021-08-03T23:37:26Z

What changes were proposed in this pull request?

Support for nested columns in the ORC vectorized reader was previously added for data source v1. Data source v1 and v2 share the same underlying vectorized reader implementation (OrcColumnVector), so the same support can be enabled for data source v2 as well.

Why are the changes needed?

Improve query performance for ORC data source v2 when reading nested columns.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test in OrcQuerySuite.scala.

c21 · 2021-08-03T23:39:43Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala

+      withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "true") {
+        val readDf = spark.read.orc(path)
+        val vectorizationEnabled = readDf.queryExecution.executedPlan.find {
+          case scan @ (_: FileSourceScanExec | _: BatchScanExec) => scan.supportsColumnar


Added BatchScanExec here for DS v2; the original test in OrcSourceSuite.scala did not match it. Moved the query into this suite so that it is tested against both DS v1 and DS v2 via OrcV1QuerySuite and OrcV2QuerySuite, defined below.
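
For readers who want to try the same check outside the test suite, here is a minimal sketch. It assumes Spark 3.2 or later on the classpath. The object name, the temporary path, and the sample data are made up for illustration and are not part of this PR. It also assumes that an empty `spark.sql.sources.useV1SourceList` routes ORC reads through DS v2, and that `spark.sql.orc.enableNestedColumnVectorizedReader` is the key behind `SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED`.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

// Hypothetical standalone check; not the test added in this PR.
object OrcNestedVectorizedCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("orc-nested-vectorized-check")
      // Assumption: an empty V1 source list makes ORC reads use DS v2.
      .config("spark.sql.sources.useV1SourceList", "")
      // Assumption: config key for the nested column vectorized reader.
      .config("spark.sql.orc.enableNestedColumnVectorizedReader", "true")
      .getOrCreate()
    import spark.implicits._

    // Write a small ORC file with one struct column (illustrative data).
    val path = Files.createTempDirectory("orc-nested").resolve("data").toString
    Seq((1, (2, "a")), (3, (4, "b"))).toDF("id", "nested").write.orc(path)

    // Same plan inspection as the test: look for a scan node that
    // reports columnar (vectorized) output.
    val readDf = spark.read.orc(path)
    val columnarScan = readDf.queryExecution.executedPlan.find {
      case scan @ (_: FileSourceScanExec | _: BatchScanExec) => scan.supportsColumnar
      case _ => false
    }
    println(s"columnar scan found: ${columnarScan.isDefined}")

    spark.stop()
  }
}
```

Removing the `useV1SourceList` setting should exercise the DS v1 path (FileSourceScanExec) with the same check, which is roughly what OrcV1QuerySuite and OrcV2QuerySuite do between them.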

SparkQA · 2021-08-04T00:37:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46520/

SparkQA · 2021-08-04T01:14:03Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46520/

dongjoon-hyun

+1, LGTM. Thank you, @c21 and @HyukjinKwon.
Merged to master.

c21 · 2021-08-04T04:12:34Z

Thank you @HyukjinKwon and @dongjoon-hyun for review!

SparkQA · 2021-08-04T04:19:09Z

Test build #142008 has finished for PR 33626 at commit f58e31e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Support ORC nested column vectorized reader for data source v2

f58e31e

github-actions bot added the SQL label Aug 3, 2021

HyukjinKwon approved these changes Aug 4, 2021

dongjoon-hyun approved these changes Aug 4, 2021

dongjoon-hyun closed this in de62b5a Aug 4, 2021

c21 deleted the orc-v2 branch August 4, 2021 06:49

c21 mentioned this pull request Aug 26, 2021

[SPARK-36594][SQL] ORC vectorized reader should properly check maximal number of fields #33842

Closed

c21 mentioned this pull request Dec 30, 2021

[SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35038

Closed
