Skip to content

Comments

[SPARK-30413][SQL] Avoid WrappedArray roundtrip in GenericArrayData constructor, plus related optimization in ParquetMapConverter#27088

Closed
JoshRosen wants to merge 6 commits intoapache:masterfrom
JoshRosen:joshrosen/GenericArrayData-optimization
Closed

[SPARK-30413][SQL] Avoid WrappedArray roundtrip in GenericArrayData constructor, plus related optimization in ParquetMapConverter#27088
JoshRosen wants to merge 6 commits intoapache:masterfrom
JoshRosen:joshrosen/GenericArrayData-optimization

Conversation

@JoshRosen
Copy link
Contributor

@JoshRosen JoshRosen commented Jan 3, 2020

What changes were proposed in this pull request?

This PR implements a tiny performance optimization for a GenericArrayData constructor, avoiding an unnecessary roundtrip through WrappedArray when the provided value is already an array of objects.

It also fixes a related performance problem in ParquetRowConverter.

Why are the changes needed?

GenericArrayData has a this(seqOrArray: Any) constructor, which was originally added in #13138 for use in RowEncoder (where we may not know concrete types until runtime) but is also called (perhaps unintentionally) in a few other code paths.

In this constructor's existing implementation, a call to new WrappedArray(Array[Object]("")) is dispatched to the this(seqOrArray: Any) constructor, where we then call this(array.toSeq): this wraps the provided array into a WrappedArray, which is subsequently unwrapped in a this(seq.toArray) call. For an interactive example, see https://scastie.scala-lang.org/7jOHydbNTaGSU677FWA8nA

This PR changes the this(seqOrArray: Any) constructor so that it calls the primary this(array: Array[Any]) constructor, allowing us to save a .toSeq.toArray call; this comes at the cost of one additional case in the match statement (but I believe this has a negligible performance impact relative to the other savings).

As code cleanup, I also reverted the JVM 1.7 workaround from #14271.

I also fixed a related performance problem in ParquetRowConverter: previously, this code called ArrayBasedMapData.apply which, in turn, called the this(Any) constructor for GenericArrayData: this PR's micro-benchmarks show that this is significantly slower than calling the this(Array[Any]) constructor (and I also observed time spent here during other Parquet scan benchmarking work). To fix this performance problem, I replaced the call to the ArrayBasedMapData.apply method with direct calls to the ArrayBasedMapData and GenericArrayData constructors.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I tested this by running code in a debugger and by running microbenchmarks (which I've added to a new GenericArrayDataBenchmark in this PR):

  • With JDK8 benchmarks: this PR's changes more than double the performance of calls to the this(Any) constructor. Even after improvements, however, calls to the this(Array[Any]) constructor are still ~60x faster than calls to this(Any) when passing a non-primitive array (thereby motivating this patch's other change in ParquetRowConverter).
  • With JDK11 benchmarks: the changes more-or-less completely eliminate the performance penalty associated with the this(Any) constructor.

@JoshRosen JoshRosen added the SQL label Jan 3, 2020
@SparkQA
Copy link

SparkQA commented Jan 3, 2020

Test build #116082 has finished for PR 27088 at commit c25bd4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu
Copy link
Member

maropu commented Jan 15, 2020

I tested this by running code in a debugger and by running toy microbenchmarks.

Can you put performance numbers in the PR description if possible?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 17, 2020

Thank you for pinging me, @maropu .

@dongjoon-hyun
Copy link
Member

Hi, @JoshRosen . Could you address @maropu 's comment?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (except @maropu 's comment.)
If it's difficult, we can remove the performance related stuff from the PR description.

cc @dbtsai

@JoshRosen
Copy link
Contributor Author

JoshRosen commented Jan 17, 2020 via email

@dongjoon-hyun
Copy link
Member

Thanks, @JoshRosen .

@JoshRosen
Copy link
Contributor Author

JoshRosen commented Jan 19, 2020

@kiszk, @maropu, @dongjoon-hyun, thank you all for reviewing.

I've pushed a commit adding benchmarks and another to fix a pre-existing performance problem in ParquetRowConverter's map converter (motivated by the benchmarks in this PR plus some ongoing benchmarking work for Parquet scans). I updated the PR description to reflect those changes.

arrayOfAnyAsSeq 25 27 2 398.0 2.5 0.3X
arrayOfInt 613 630 15 16.3 61.3 0.0X
arrayOfIntAsObject 866 872 8 11.5 86.6 0.0X

Copy link
Contributor Author

@JoshRosen JoshRosen Jan 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the JDK8 benchmark results before this patch's changes:

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i5-8210Y CPU @ 1.60GHz
[info] constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] arrayOfAny                                            8             12           4       1281.5           0.8       1.0X
[info] arrayOfAnyAsObject                                  531            605          76         18.8          53.1       0.0X
[info] arrayOfAnyAsSeq                                      26             35          20        384.4           2.6       0.3X
[info] arrayOfInt                                          703            779         117         14.2          70.3       0.0X
[info] arrayOfIntAsObject                                 1336           1397          86          7.5         133.6       0.0X

@JoshRosen JoshRosen changed the title [SPARK-30413][SQL] Avoid WrappedArray roundtrip in GenericArrayData constructor [SPARK-30413][SQL] Avoid WrappedArray roundtrip in GenericArrayData constructor, plus related optimization in ParquetMapConverter Jan 19, 2020
@SparkQA
Copy link

SparkQA commented Jan 19, 2020

Test build #116990 has finished for PR 27088 at commit ad93f68.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor Author

jenkins retest this please

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @JoshRosen . The new result looks impressive.
(If you don't mind, you can run the benchmark on JDK11 in order to generate another result file for JDK11. It's okay to skip in this PR.)

arrayOfAnyAsSeq 5 6 2 2195.5 0.5 1.2X
arrayOfInt 452 469 13 22.1 45.2 0.0X
arrayOfIntAsObject 678 690 11 14.7 67.8 0.0X

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the JDK11 benchmark results before this patch's changes:

[info] OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i5-8210Y CPU @ 1.60GHz
[info] constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] arrayOfAny                                            6              6           1       1776.5           0.6       1.0X
[info] arrayOfAnyAsObject                                  374            386           9         26.7          37.4       0.0X
[info] arrayOfAnyAsSeq                                       5              5           0       2211.1           0.5       1.2X
[info] arrayOfInt                                          472            494          29         21.2          47.2       0.0X
[info] arrayOfIntAsObject                                  798            799           1         12.5          79.8       0.0X

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pleasant discovery is that this patch has an even larger positive impact on JDK11: on JDK11, this PR's changes seem to more-or-less completely eliminate the performance gap between the this(Any) and this(Array[Any]) constructors 🎉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, the result looks nice.

@SparkQA
Copy link

SparkQA commented Jan 20, 2020

Test build #117019 has finished for PR 27088 at commit ad93f68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Merged to master. (The last one is only adding the JDK11 generated result.)
Thank you again!

@SparkQA
Copy link

SparkQA commented Jan 20, 2020

Test build #117032 has finished for PR 27088 at commit 512eb53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants