[SPARK-34863][SQL] Support complex types for Parquet vectorized reader #33695

sunchao · 2021-08-10T21:55:50Z

What changes were proposed in this pull request?

This PR adds support for complex types (e.g., list, map, array) for Spark's vectorized Parquet reader. In particular, this introduces the following changes:

Added a new class ParquetType which binds a Spark type with its corresponding Parquet definition & repetition level. This is used when Spark assembles a vector of complex type for Parquet.
Changed ParquetSchemaConverter and added a new method convertTypeInfo which converts a Parquet MessageType to a ParquetType above. The existing conversion logic in the class remains the same but now operates with ParquetType instead of DataType, and annotate the former with extra information such as definition & repetition level, column path, column descriptor, etc.
Added a new class ParquetColumn which encapsulates all the necessary information needed when reading a Parquet column, including the ParquetType for the column, the repetition & definition levels (only allocated for a leaf-node of a complex type), as well as the reader for the column. In addition, it also contains logic for assembling nested columnar batches, via interpreting Parquet repetition & definition levels.
Changes are made in VectorizedParquetRecordReader to initialize a list of ParquetColumn for the columns read.
VectorizedColumnReader now also creates a reader for repetition column. Depending on whether maximum repetition level is 0, the batch read is now split into two code paths, e.g., readBatch versus readBatchNested.
Added logic to handle complex type in VectorizedRleValuesReader. For data types involving only struct or primitive types, it still goes with the old readBatch method which now also saves definition levels into a vector for later assembly. Otherwise, for data types involving array or map, a separate code path readBatchNested is introduced to handle repetition levels.
Added a new config spark.sql.parquet.enableNestedColumnVectorizedReader to turn on or turn off the feature. By default it is true.
Modified WritableColumnVector to better support null structs. Currently it requires populating null entries to all child vectors when there is a null struct, however this will waste space and also doesn't work well with Parquet scan. This adds an extra field structOffsets which records the mapping from a row ID to the position of the row in the child vector, so that child vectors will only need to store real null elements.

To test this, the PR introduced an interface ParquetRowGroupReader in SpecificParquetRecordReaderBase to abstract the Parquet file reading logic. The bulk of the tests are in ParquetVectorizedSuite which covers different batch size & page size, column index, first row index, nulls, etc.

The DataSourceReadBenchmark is extended with two more cases: reading struct fields of primitive types and reading array of struct & map field.

Why are the changes needed?

Whenever read schema containing complex types, at the moment Spark will fallback to the row-based reader in parquet-mr, which is much slower. As benchmark shows, by adding support into the vectorized reader, we can get ~15x on average speed up on reading struct fields, and ~1.5x when reading array of struct and map.

Micro benchmark of reading primitive fields from a struct, over 400m rows:

================================================================================================
SQL Single Numeric Column Scan in Struct
================================================================================================

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single TINYINT Column Scan in Struct:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              77684          78174         692          5.4         185.2       1.0X
SQL ORC Vectorized (Enabled Nested Column)                4137           4226         126        101.4           9.9      18.8X
SQL Parquet Vectorized (Disabled Nested Column)          42095          42193         138         10.0         100.4       1.8X
SQL Parquet Vectorized (Enabled Nested Column)            3317           4147        1174        126.4           7.9      23.4X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single SMALLINT Column Scan in Struct:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              82438          82443           7          5.1         196.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                4746           5022         391         88.4          11.3      17.4X
SQL Parquet Vectorized (Disabled Nested Column)          43689          43761         102          9.6         104.2       1.9X
SQL Parquet Vectorized (Enabled Nested Column)            2894           2986         130        144.9           6.9      28.5X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single INT Column Scan in Struct:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              82749          82774          34          5.1         197.3       1.0X
SQL ORC Vectorized (Enabled Nested Column)                4848           4869          30         86.5          11.6      17.1X
SQL Parquet Vectorized (Disabled Nested Column)          47718          47957         338          8.8         113.8       1.7X
SQL Parquet Vectorized (Enabled Nested Column)            3055           3056           2        137.3           7.3      27.1X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single BIGINT Column Scan in Struct:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              82398          82416          25          5.1         196.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                6562           7010         634         63.9          15.6      12.6X
SQL Parquet Vectorized (Disabled Nested Column)          51007          51032          35          8.2         121.6       1.6X
SQL Parquet Vectorized (Enabled Nested Column)            4300           4358          82         97.6          10.3      19.2X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single FLOAT Column Scan in Struct:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              85791          86323         753          4.9         204.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                7231           7246          21         58.0          17.2      11.9X
SQL Parquet Vectorized (Disabled Nested Column)          48381          48476         134          8.7         115.3       1.8X
SQL Parquet Vectorized (Enabled Nested Column)            2770           2791          29        151.4           6.6      31.0X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single DOUBLE Column Scan in Struct:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              85566          85598          45          4.9         204.0       1.0X
SQL ORC Vectorized (Enabled Nested Column)                8579           8591          17         48.9          20.5      10.0X
SQL Parquet Vectorized (Disabled Nested Column)          56052          56106          77          7.5         133.6       1.5X
SQL Parquet Vectorized (Enabled Nested Column)            4135           4185          70        101.4           9.9      20.7X

Does this PR introduce any user-facing change?

With the PR Spark should now support reading complex types in its vectorized Parquet reader. A new config spark.sql.parquet.enableNestedColumnVectorizedReader is introduced to turn the feature on or off.

How was this patch tested?

Added new unit tests.

SparkQA · 2021-08-10T22:12:44Z

Test build #142294 has finished for PR 33695 at commit 1d86a47.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class ColumnIOUtil
final class ParquetColumn
trait ParquetTypeInfo
case class ParquetPrimitiveTypeInfo(
case class ParquetGroupTypeInfo(

SparkQA · 2021-08-10T22:39:18Z

Test build #142297 has finished for PR 33695 at commit 4422153.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class ColumnIOUtil
final class ParquetColumn
trait ParquetTypeInfo
case class ParquetPrimitiveTypeInfo(
case class ParquetGroupTypeInfo(

SparkQA · 2021-08-10T22:54:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46801/

SparkQA · 2021-08-10T23:03:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46804/

SparkQA · 2021-08-10T23:39:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46804/

SparkQA · 2021-08-10T23:47:12Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46801/

SparkQA · 2021-08-10T23:56:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46805/

SparkQA · 2021-08-11T00:34:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46805/

SparkQA · 2021-08-11T07:29:36Z

Test build #142298 has finished for PR 33695 at commit 430b4b7.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class ColumnIOUtil
final class ParquetColumn
trait ParquetTypeInfo
case class ParquetPrimitiveTypeInfo(
case class ParquetGroupTypeInfo(

SparkQA · 2021-08-11T21:36:06Z

Test build #142351 has finished for PR 33695 at commit 3b2155d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-11T21:47:45Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46859/

SparkQA · 2021-08-11T23:48:40Z

Test build #142356 has finished for PR 33695 at commit ba6d332.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-12T00:21:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46864/

SparkQA · 2021-08-12T01:16:47Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46864/

SparkQA · 2021-08-12T06:22:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46873/

dbtsai · 2021-08-12T06:55:16Z

Thank you, @sunchao for great work. Do we have benchmark result?

SparkQA · 2021-08-12T07:01:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46873/

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

SparkQA · 2021-08-12T13:56:13Z

Test build #142365 has finished for PR 33695 at commit 9aec74a.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/java/org/apache/parquet/io/ColumnIOUtil.java

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.java

viirya · 2021-08-12T20:07:52Z

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetReadState.java

"after the current batch." or "in the current batch."?

it is after the current batch, suppose we have the following repetition levels:

0, 1, 0, 1, 0, 1, 0, 1, 0, 1

and suppose batch size is 4, we'll need to see the 5th 0 (which is the first list of the next batch) before we know that we've completed 4 lists for the batch.

let me know if you have better way to write the comments, it could be confusing.

SparkQA · 2021-08-31T18:14:56Z

Test build #142890 has started for PR 33695 at commit 9aa1212.

SparkQA · 2021-08-31T18:53:34Z

Kubernetes integration test unable to build dist.

exiting with code: 141
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47393/

SparkQA · 2021-08-31T20:54:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47394/

SparkQA · 2021-08-31T21:50:17Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47394/

SparkQA · 2021-08-31T22:36:12Z

Test build #142891 has finished for PR 33695 at commit 2a74691.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sunchao · 2021-08-31T22:37:22Z

I think this is ready for review (although I plan to beef up the test coverage a bit more). cc @viirya @dongjoon-hyun @cloud-fan @vkorukanti @sadikovi @michal-databricks @c21

Let me know also if you prefer to break this into sub-PRs.

sunchao · 2021-08-31T22:40:08Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala

This is borrowed from #31958 by @c21

dongjoon-hyun · 2021-08-31T22:49:04Z

It's great! Thank you, @sunchao .

sadikovi

Thank you for opening a PR. Fantastic work! I left a few initial comments and will review it in more detail soon-ish.

sadikovi · 2021-09-02T19:09:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Should we mention that this is only enabled when vectorised reader is enabled?

Yes good point.

sadikovi · 2021-09-02T19:16:39Z

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.java

nit: Would it be possible to capitalise all of the javadoc comments in this file? I think it would look neater 😄 . Thanks!

sadikovi · 2021-09-02T19:18:39Z

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.java

Does the code handle situations when there are no repetition or definition levels for the type? Would it allocate vectors even when the column is null?

If the column is null (i.e., a missing column), the code will exit at line 74 above, so it won't allocate vectors for these. I'm not quite sure what do you mean by "when there are no repetition or definition levels for the type", can you elaborate?

I might have misunderstood the code, my question was about what happens when you read a primitive type that is non-null and non-repeated. Would the code still allocate the level vectors even if they are not necessarily needed? Again, it probably does not matter that much but maybe it is something we can clarify with a comment.

@sadikovi ah I see what you mean. Good point. Let me see if I can remove these in those cases.

Actually it's a bit tricky to remove these, although possible. Let me mark a TODO here and we can tackle it separately.

Yes, I also thought it might be tricky to fix. It is fine to keep as is, I guess, especially if performance is good anyway.

sadikovi · 2021-09-02T19:20:06Z

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.java

Interesting, isPrimitive == children.isEmpty() would this hold?

I think it doesn't hold for empty structs.

Ah, good point!

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.java

sadikovi · 2021-09-02T19:35:26Z

.../src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1SchemaPruningSuite.scala

nit: Let's remove new line.

sunchao · 2021-09-10T00:15:30Z

Thanks @sadikovi for taking a look! really appreciate it.

SparkQA · 2021-09-10T01:28:49Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47637/

SparkQA · 2021-09-10T02:35:39Z

Test build #143133 has finished for PR 33695 at commit 90f531b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-09-17T00:42:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47880/

SparkQA · 2021-09-17T00:51:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47880/

SparkQA · 2021-09-17T01:57:28Z

Test build #143373 has finished for PR 33695 at commit 5f9f186.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sunchao · 2021-09-28T16:57:19Z

I'm going to break this into smaller PRs for easier review. Closing it now.

agrawaldevesh · 2021-10-28T17:21:46Z

...src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java


  private int readPage() {
    DataPage page = pageReader.readPage();
+    if (page == null) {


Just curious when/why does this case now happen ? Can we run past a column in the middle ?

Oops sorry @agrawaldevesh just saw your comment.

This could happen when we are reading a repeated column (e.g., an int list). In this case, we don't know whether the list is completed or not until we have no more page to read.

I think the newer version of parquet-mr (since 0.11 I think) no longer allow a repeated list to span across multiple pages, but it is allowed in older versions.

It's always great to put these nice details in code comments, as there are so many small tricks people don't know

Good point. I'll add some comments in the new PR.

sunchao · 2021-11-19T02:32:16Z

I've opened a new PR #34659 replacing this one. Will appreciate your review there!

sunchao force-pushed the SPARK-34861-nested branch from 1bf6e10 to 1d86a47 Compare August 10, 2021 22:00

sunchao force-pushed the SPARK-34861-nested branch from 1d86a47 to 4422153 Compare August 10, 2021 22:16

github-actions bot added BUILD SQL labels Aug 10, 2021

sunchao changed the title ~~[SPARK-34861][SQL] Support complex types for Parquet vectorized reader~~ [SPARK-34863][SQL] Support complex types for Parquet vectorized reader Aug 10, 2021

sunchao force-pushed the SPARK-34861-nested branch from 4422153 to 430b4b7 Compare August 10, 2021 22:41

sunchao force-pushed the SPARK-34861-nested branch from 3b2155d to ba6d332 Compare August 11, 2021 23:28

sunchao force-pushed the SPARK-34861-nested branch from ba6d332 to 9aec74a Compare August 12, 2021 05:32

viirya reviewed Aug 12, 2021

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated Show resolved Hide resolved

viirya reviewed Aug 12, 2021

View reviewed changes

sql/core/src/main/java/org/apache/parquet/io/ColumnIOUtil.java Outdated Show resolved Hide resolved

viirya reviewed Aug 12, 2021

View reviewed changes

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.java Outdated Show resolved Hide resolved

viirya reviewed Aug 12, 2021

View reviewed changes

sunchao force-pushed the SPARK-34861-nested branch from 10fc357 to 9aa1212 Compare August 31, 2021 18:12

sunchao force-pushed the SPARK-34861-nested branch from 9aa1212 to 2a74691 Compare August 31, 2021 20:05

sunchao marked this pull request as ready for review August 31, 2021 22:33

sunchao commented Aug 31, 2021

View reviewed changes

sadikovi reviewed Sep 2, 2021

View reviewed changes

sunchao added 5 commits September 9, 2021 17:15

initial commit

ec9d9d9

address comments

55726b6

fix issues after turning config on by default

7081004

fix a bug when reading struct

bf06d8d

address comments

90f531b

sunchao force-pushed the SPARK-34861-nested branch from 2a74691 to 90f531b Compare September 10, 2021 00:15

add comment for TODO

5f9f186

sunchao closed this Sep 28, 2021

agrawaldevesh reviewed Oct 28, 2021

View reviewed changes

hudi-bot mentioned this pull request Nov 30, 2025

Hudi-Spark Support Complex Data Types for Parquet Vectorized Reader apache/hudi#15976

Open

[SPARK-34863][SQL] Support complex types for Parquet vectorized reader #33695

[SPARK-34863][SQL] Support complex types for Parquet vectorized reader #33695

Uh oh!

Conversation

sunchao commented Aug 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 10, 2021

Uh oh!

SparkQA commented Aug 11, 2021

Uh oh!

SparkQA commented Aug 11, 2021

Uh oh!

SparkQA commented Aug 11, 2021

Uh oh!

SparkQA commented Aug 11, 2021

Uh oh!

SparkQA commented Aug 11, 2021

Uh oh!

SparkQA commented Aug 12, 2021

Uh oh!

SparkQA commented Aug 12, 2021

Uh oh!

SparkQA commented Aug 12, 2021

Uh oh!

dbtsai commented Aug 12, 2021

Uh oh!

SparkQA commented Aug 12, 2021

Uh oh!

Uh oh!

SparkQA commented Aug 12, 2021

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Aug 31, 2021

Uh oh!

SparkQA commented Aug 31, 2021

Uh oh!

SparkQA commented Aug 31, 2021

Uh oh!

SparkQA commented Aug 31, 2021

Uh oh!

SparkQA commented Aug 31, 2021

Uh oh!

sunchao commented Aug 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun commented Aug 31, 2021

Uh oh!

sadikovi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sunchao commented Aug 10, 2021 •

edited

Loading

sunchao commented Aug 31, 2021 •

edited

Loading