-
Notifications
You must be signed in to change notification settings - Fork 29.3k
[SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter #26993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
d831414
ed894b4
2ed8ea9
3fb3391
e67327a
fffe72b
e6945e8
4651b2f
0f1af94
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -318,10 +318,32 @@ private[parquet] class ParquetRowConverter( | |
| new ParquetMapConverter(parquetType.asGroupType(), t, updater) | ||
|
|
||
| case t: StructType => | ||
| val wrappedUpdater = { | ||
| if (updater.isInstanceOf[RowUpdater]) { | ||
| // `updater` is a RowUpdater, implying that the parent container is a struct. | ||
| // We do NOT need to perform defensive copying here because either: | ||
| // | ||
| // 1. The path from the schema root to this field consists only of nested | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When we have deeply nested struct inside an array, is it the first case here? I think it is fine because at the element converter the top level struct inside an array element will do the defensive copying. So in nested struct converter, we will see RowUpdater from parent struct so don't need defensive copying too. Just maybe good to also update it in the doc.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's right. After thinking about this some more, I think I've come up with a clearer explanation and have updated the code comment: 4651b2f |
||
| // structs, so this converter will only be invoked once per record and | ||
| // we don't need to copy because copying will be done in the final | ||
| // UnsafeProjection, or | ||
| // 2. The path from the schema root to this field contains a map or array, | ||
| // in which case we will perform a recursive defensive copy via the | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Correctness relies on the copy actually being a deep copy. Looking elsewhere in this file, we have comments like which suggest that certain copying might be shallow, so it's important to double-check and make sure that the copies are indeed deep. Here, the state being copied is an
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the existing comment about
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Update: in #27089 I'm removing these other unnecessary ArrayBuffer copies. |
||
| // `else` branch below. | ||
| updater | ||
| } else { | ||
| // `updater` is NOT a RowUpdater, implying that the parent container is not a struct. | ||
| // Therefore, the parent container must be a map or array. We need to copy the row | ||
| // because this converter might be invoked multiple times per Parquet input record. | ||
| new ParentContainerUpdater { | ||
| override def set(value: Any): Unit = { | ||
| updater.set(value.asInstanceOf[SpecificInternalRow].copy()) | ||
| } | ||
| } | ||
| } | ||
| } | ||
| new ParquetRowConverter( | ||
| schemaConverter, parquetType.asGroupType(), t, convertTz, new ParentContainerUpdater { | ||
| override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy()) | ||
| }) | ||
| schemaConverter, parquetType.asGroupType(), t, convertTz, wrappedUpdater) | ||
|
|
||
| case t => | ||
| throw new RuntimeException( | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -204,6 +204,23 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession | |
| } | ||
| } | ||
|
|
||
| testStandardAndLegacyModes("array of struct") { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we have a test for array of struct of struct?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added a new test case for this in 0f1af94 |
||
| val data = (1 to 4).map { i => | ||
| Tuple1( | ||
| Seq( | ||
| Tuple1(s"1st_val_$i"), | ||
| Tuple1(s"2nd_val_$i") | ||
| ) | ||
| ) | ||
| } | ||
| withParquetDataFrame(data) { df => | ||
| // Structs are converted to `Row`s | ||
| checkAnswer(df, data.map { case Tuple1(array) => | ||
| Row(array.map(struct => Row(struct.productIterator.toSeq: _*))) | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| testStandardAndLegacyModes("nested struct with array of array as field") { | ||
| val data = (1 to 4).map(i => Tuple1((i, Seq(Seq(s"val_$i"))))) | ||
| withParquetDataFrame(data) { df => | ||
|
|
@@ -214,9 +231,34 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession | |
| } | ||
| } | ||
|
|
||
| testStandardAndLegacyModes("nested map with struct as key type") { | ||
| val data = (1 to 4).map { i => | ||
| Tuple1( | ||
| Map( | ||
| (i, s"kA_$i") -> s"vA_$i", | ||
| (i, s"kB_$i") -> s"vB_$i" | ||
| ) | ||
| ) | ||
| } | ||
| withParquetDataFrame(data) { df => | ||
| // Structs are converted to `Row`s | ||
| checkAnswer(df, data.map { case Tuple1(m) => | ||
| Row(m.map { case (k, v) => Row(k.productIterator.toSeq: _*) -> v }) | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| testStandardAndLegacyModes("nested map with struct as value type") { | ||
| val data = (1 to 4).map(i => Tuple1(Map(i -> ((i, s"val_$i"))))) | ||
| val data = (1 to 4).map { i => | ||
| Tuple1( | ||
| Map( | ||
| s"kA_$i" -> ((i, s"vA_$i")), | ||
| s"kB_$i" -> ((i, s"vB_$i")) | ||
| ) | ||
| ) | ||
| } | ||
| withParquetDataFrame(data) { df => | ||
| // Structs are converted to `Row`s | ||
| checkAnswer(df, data.map { case Tuple1(m) => | ||
| Row(m.mapValues(struct => Row(struct.productIterator.toSeq: _*))) | ||
| }) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoshRosen, no big deal at all but how about we put the JIRA ID somewhere in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea: I added a JIRA reference in e6945e8