Skip to content

Conversation

@sunchao
Copy link
Member

@sunchao sunchao commented Aug 10, 2021

What changes were proposed in this pull request?

This PR adds support for complex types (e.g., list, map, array) for Spark's vectorized Parquet reader. In particular, this introduces the following changes:

  1. Added a new class ParquetType which binds a Spark type with its corresponding Parquet definition & repetition level. This is used when Spark assembles a vector of complex type for Parquet.
  2. Changed ParquetSchemaConverter and added a new method convertTypeInfo which converts a Parquet MessageType to a ParquetType above. The existing conversion logic in the class remains the same but now operates with ParquetType instead of DataType, and annotate the former with extra information such as definition & repetition level, column path, column descriptor, etc.
  3. Added a new class ParquetColumn which encapsulates all the necessary information needed when reading a Parquet column, including the ParquetType for the column, the repetition & definition levels (only allocated for a leaf-node of a complex type), as well as the reader for the column. In addition, it also contains logic for assembling nested columnar batches, via interpreting Parquet repetition & definition levels.
  4. Changes are made in VectorizedParquetRecordReader to initialize a list of ParquetColumn for the columns read.
  5. VectorizedColumnReader now also creates a reader for repetition column. Depending on whether maximum repetition level is 0, the batch read is now split into two code paths, e.g., readBatch versus readBatchNested.
  6. Added logic to handle complex type in VectorizedRleValuesReader. For data types involving only struct or primitive types, it still goes with the old readBatch method which now also saves definition levels into a vector for later assembly. Otherwise, for data types involving array or map, a separate code path readBatchNested is introduced to handle repetition levels.
  7. Added a new config spark.sql.parquet.enableNestedColumnVectorizedReader to turn on or turn off the feature. By default it is true.
  8. Modified WritableColumnVector to better support null structs. Currently it requires populating null entries to all child vectors when there is a null struct, however this will waste space and also doesn't work well with Parquet scan. This adds an extra field structOffsets which records the mapping from a row ID to the position of the row in the child vector, so that child vectors will only need to store real null elements.

To test this, the PR introduced an interface ParquetRowGroupReader in SpecificParquetRecordReaderBase to abstract the Parquet file reading logic. The bulk of the tests are in ParquetVectorizedSuite which covers different batch size & page size, column index, first row index, nulls, etc.

The DataSourceReadBenchmark is extended with two more cases: reading struct fields of primitive types and reading array of struct & map field.

Why are the changes needed?

Whenever read schema containing complex types, at the moment Spark will fallback to the row-based reader in parquet-mr, which is much slower. As benchmark shows, by adding support into the vectorized reader, we can get ~15x on average speed up on reading struct fields, and ~1.5x when reading array of struct and map.

Micro benchmark of reading primitive fields from a struct, over 400m rows:

================================================================================================
SQL Single Numeric Column Scan in Struct
================================================================================================

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single TINYINT Column Scan in Struct:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              77684          78174         692          5.4         185.2       1.0X
SQL ORC Vectorized (Enabled Nested Column)                4137           4226         126        101.4           9.9      18.8X
SQL Parquet Vectorized (Disabled Nested Column)          42095          42193         138         10.0         100.4       1.8X
SQL Parquet Vectorized (Enabled Nested Column)            3317           4147        1174        126.4           7.9      23.4X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single SMALLINT Column Scan in Struct:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              82438          82443           7          5.1         196.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                4746           5022         391         88.4          11.3      17.4X
SQL Parquet Vectorized (Disabled Nested Column)          43689          43761         102          9.6         104.2       1.9X
SQL Parquet Vectorized (Enabled Nested Column)            2894           2986         130        144.9           6.9      28.5X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single INT Column Scan in Struct:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              82749          82774          34          5.1         197.3       1.0X
SQL ORC Vectorized (Enabled Nested Column)                4848           4869          30         86.5          11.6      17.1X
SQL Parquet Vectorized (Disabled Nested Column)          47718          47957         338          8.8         113.8       1.7X
SQL Parquet Vectorized (Enabled Nested Column)            3055           3056           2        137.3           7.3      27.1X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single BIGINT Column Scan in Struct:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              82398          82416          25          5.1         196.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                6562           7010         634         63.9          15.6      12.6X
SQL Parquet Vectorized (Disabled Nested Column)          51007          51032          35          8.2         121.6       1.6X
SQL Parquet Vectorized (Enabled Nested Column)            4300           4358          82         97.6          10.3      19.2X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single FLOAT Column Scan in Struct:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              85791          86323         753          4.9         204.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                7231           7246          21         58.0          17.2      11.9X
SQL Parquet Vectorized (Disabled Nested Column)          48381          48476         134          8.7         115.3       1.8X
SQL Parquet Vectorized (Enabled Nested Column)            2770           2791          29        151.4           6.6      31.0X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single DOUBLE Column Scan in Struct:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)              85566          85598          45          4.9         204.0       1.0X
SQL ORC Vectorized (Enabled Nested Column)                8579           8591          17         48.9          20.5      10.0X
SQL Parquet Vectorized (Disabled Nested Column)          56052          56106          77          7.5         133.6       1.5X
SQL Parquet Vectorized (Enabled Nested Column)            4135           4185          70        101.4           9.9      20.7X

Does this PR introduce any user-facing change?

With the PR Spark should now support reading complex types in its vectorized Parquet reader. A new config spark.sql.parquet.enableNestedColumnVectorizedReader is introduced to turn the feature on or off.

How was this patch tested?

Added new unit tests.

@sunchao sunchao force-pushed the SPARK-34861-nested branch from 1bf6e10 to 1d86a47 Compare August 10, 2021 22:00
@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Test build #142294 has finished for PR 33695 at commit 1d86a47.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class ColumnIOUtil
  • final class ParquetColumn
  • trait ParquetTypeInfo
  • case class ParquetPrimitiveTypeInfo(
  • case class ParquetGroupTypeInfo(

@sunchao sunchao force-pushed the SPARK-34861-nested branch from 1d86a47 to 4422153 Compare August 10, 2021 22:16
@sunchao sunchao changed the title [SPARK-34861][SQL] Support complex types for Parquet vectorized reader [SPARK-34863][SQL] Support complex types for Parquet vectorized reader Aug 10, 2021
@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Test build #142297 has finished for PR 33695 at commit 4422153.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class ColumnIOUtil
  • final class ParquetColumn
  • trait ParquetTypeInfo
  • case class ParquetPrimitiveTypeInfo(
  • case class ParquetGroupTypeInfo(

@sunchao sunchao force-pushed the SPARK-34861-nested branch from 4422153 to 430b4b7 Compare August 10, 2021 22:41
@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46801/

@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46804/

@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46804/

@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46801/

@SparkQA
Copy link

SparkQA commented Aug 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46805/

@SparkQA
Copy link

SparkQA commented Aug 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46805/

@SparkQA
Copy link

SparkQA commented Aug 11, 2021

Test build #142298 has finished for PR 33695 at commit 430b4b7.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class ColumnIOUtil
  • final class ParquetColumn
  • trait ParquetTypeInfo
  • case class ParquetPrimitiveTypeInfo(
  • case class ParquetGroupTypeInfo(

@SparkQA
Copy link

SparkQA commented Aug 11, 2021

Test build #142351 has finished for PR 33695 at commit 3b2155d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 11, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46859/

@sunchao sunchao force-pushed the SPARK-34861-nested branch from 3b2155d to ba6d332 Compare August 11, 2021 23:28
@SparkQA
Copy link

SparkQA commented Aug 11, 2021

Test build #142356 has finished for PR 33695 at commit ba6d332.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46864/

@SparkQA
Copy link

SparkQA commented Aug 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46864/

@sunchao sunchao force-pushed the SPARK-34861-nested branch from ba6d332 to 9aec74a Compare August 12, 2021 05:32
@SparkQA
Copy link

SparkQA commented Aug 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46873/

@dbtsai
Copy link
Member

dbtsai commented Aug 12, 2021

Thank you, @sunchao for great work. Do we have benchmark result?

@SparkQA
Copy link

SparkQA commented Aug 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46873/

@SparkQA
Copy link

SparkQA commented Aug 12, 2021

Test build #142365 has finished for PR 33695 at commit 9aec74a.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines +71 to +75
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"after the current batch." or "in the current batch."?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is after the current batch, suppose we have the following repetition levels:

0, 1, 0, 1, 0, 1, 0, 1, 0, 1

and suppose batch size is 4, we'll need to see the 5th 0 (which is the first list of the next batch) before we know that we've completed 4 lists for the batch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if you have better way to write the comments, it could be confusing.

@sunchao sunchao force-pushed the SPARK-34861-nested branch from 10fc357 to 9aa1212 Compare August 31, 2021 18:12
@SparkQA
Copy link

SparkQA commented Aug 31, 2021

Test build #142890 has started for PR 33695 at commit 9aa1212.

@SparkQA
Copy link

SparkQA commented Aug 31, 2021

Kubernetes integration test unable to build dist.

exiting with code: 141
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47393/

@sunchao sunchao force-pushed the SPARK-34861-nested branch from 9aa1212 to 2a74691 Compare August 31, 2021 20:05
@SparkQA
Copy link

SparkQA commented Aug 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47394/

@SparkQA
Copy link

SparkQA commented Aug 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47394/

@sunchao sunchao marked this pull request as ready for review August 31, 2021 22:33
@SparkQA
Copy link

SparkQA commented Aug 31, 2021

Test build #142891 has finished for PR 33695 at commit 2a74691.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Copy link
Member Author

sunchao commented Aug 31, 2021

I think this is ready for review (although I plan to beef up the test coverage a bit more). cc @viirya @dongjoon-hyun @cloud-fan @vkorukanti @sadikovi @michal-databricks @c21

Let me know also if you prefer to break this into sub-PRs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is borrowed from #31958 by @c21

@dongjoon-hyun
Copy link
Member

It's great! Thank you, @sunchao .

Copy link
Contributor

@sadikovi sadikovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for opening a PR. Fantastic work! I left a few initial comments and will review it in more detail soon-ish.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that this is only enabled when vectorised reader is enabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would it be possible to capitalise all of the javadoc comments in this file? I think it would look neater 😄 . Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the code handle situations when there are no repetition or definition levels for the type? Would it allocate vectors even when the column is null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the column is null (i.e., a missing column), the code will exit at line 74 above, so it won't allocate vectors for these. I'm not quite sure what do you mean by "when there are no repetition or definition levels for the type", can you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have misunderstood the code, my question was about what happens when you read a primitive type that is non-null and non-repeated. Would the code still allocate the level vectors even if they are not necessarily needed? Again, it probably does not matter that much but maybe it is something we can clarify with a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadikovi ah I see what you mean. Good point. Let me see if I can remove these in those cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's a bit tricky to remove these, although possible. Let me mark a TODO here and we can tackle it separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also thought it might be tricky to fix. It is fine to keep as is, I guess, especially if performance is good anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, isPrimitive == children.isEmpty() would this hold?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't hold for empty structs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's remove new line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@sunchao
Copy link
Member Author

sunchao commented Sep 10, 2021

Thanks @sadikovi for taking a look! really appreciate it.

@SparkQA
Copy link

SparkQA commented Sep 10, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47637/

@SparkQA
Copy link

SparkQA commented Sep 10, 2021

Test build #143133 has finished for PR 33695 at commit 90f531b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47880/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47880/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Test build #143373 has finished for PR 33695 at commit 5f9f186.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Copy link
Member Author

sunchao commented Sep 28, 2021

I'm going to break this into smaller PRs for easier review. Closing it now.

@sunchao sunchao closed this Sep 28, 2021

private int readPage() {
DataPage page = pageReader.readPage();
if (page == null) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious when/why does this case now happen ? Can we run past a column in the middle ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry @agrawaldevesh just saw your comment.

This could happen when we are reading a repeated column (e.g., an int list). In this case, we don't know whether the list is completed or not until we have no more page to read.

I think the newer version of parquet-mr (since 0.11 I think) no longer allow a repeated list to span across multiple pages, but it is allowed in older versions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always great to put these nice details in code comments, as there are so many small tricks people don't know

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll add some comments in the new PR.

@sunchao
Copy link
Member Author

sunchao commented Nov 19, 2021

I've opened a new PR #34659 replacing this one. Will appreciate your review there!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants