[SPARK-32511][SQL] Add dropFields method to Column class #29795
Conversation
```scala
 * [[CreateNamedStruct]] respectively inside of [[UpdateFields]].
 */
def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, Expression)]
}
```
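For readability, here is a hedged reconstruction of the full trait this fragment belongs to (the `apply` contract is taken from the doc text above; the trait name comes from the surrounding discussion):

```scala
trait StructFieldsOperation {
  /**
   * Returns an updated list of (StructField, Expression) pairs which will then be used
   * as the fields argument for [[StructType]] and as the children argument for
   * [[CreateNamedStruct]] respectively inside of [[UpdateFields]].
   */
  def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, Expression)]
}
```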
@viirya I'm not quite done with this PR yet, but I wanted to share it with you early because some of the changes I'm making here may be helpful for #29587 (assuming this PR is accepted). Specifically, it would be possible to implement sorting of fields in a struct simply by:
```scala
case class OrderStructFieldsByName() extends StructFieldsOperation {
  override def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, Expression)] =
    values.sortBy { case (field, _) => field.name }
}

UpdateFields(structExpr, OrderStructFieldsByName() :: Nil)
```
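For context, here is a minimal sketch of how the `DropField` operation added in this PR could implement the same trait (the exact body, including how name comparison handles case sensitivity, is an assumption for illustration):

```scala
// Hedged sketch: remove the matching field from the running list of
// (StructField, Expression) pairs. The real implementation may use the
// analyzer's resolver for case-(in)sensitive name comparison.
case class DropField(name: String) extends StructFieldsOperation {
  override def apply(values: Seq[(StructField, Expression)]): Seq[(StructField, Expression)] =
    values.filterNot { case (field, _) => field.name == name }
}
```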
Test build #128840 has finished for PR 29795 at commit
| WithField("a2", UpdateFields(GetStructField('a1, 0), WithField("b3", 2) :: Nil)), | ||
| WithField("a2", UpdateFields(GetStructField('a1, 0), WithField("b3", 2) :: WithField("c3", 3) :: Nil)) | ||
| // scalastyle:on line.size.limit | ||
| )).as("a1")) |
The first WithField here is entirely redundant, and ideally we would optimize it away as well. However, in the interests of keeping this PR simple, I have opted to forgo writing any such optimizer rule. If necessary, we can address this in a future PR.
```scala
UpdateFields('a1, Seq(
  WithField("a2", UpdateFields(GetStructField('a1, 0), Seq(DropField("b3")))),
  WithField("a2", UpdateFields(GetStructField('a1, 0), Seq(DropField("b3"), DropField("c3"))))
)).as("a1"))
```
The first WithField here is likewise entirely redundant, and ideally we would optimize it away too. However, in the interests of keeping this PR simple, I have opted to forgo writing any such optimizer rule. If necessary, we can address this in a future PR; a sketch of what such a rule could do follows.
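To make the idea concrete, here is a minimal sketch (not part of this PR; the `WithField`/`DropField` shapes are assumed from the diffs above) of the kind of pruning such a rule could perform. It ignores the subtlety that an intervening `DropField` on the same name changes the semantics, as well as field-ordering effects:

```scala
// Keep only the last WithField per field name; earlier ones are shadowed.
def pruneShadowedWithFields(ops: Seq[StructFieldsOperation]): Seq[StructFieldsOperation] = {
  // Index of the last WithField for each field name.
  val lastWithFieldIndex: Map[String, Int] = ops.zipWithIndex.collect {
    case (WithField(name, _), i) => name -> i
  }.toMap
  ops.zipWithIndex.filter {
    case (WithField(name, _), i) => lastWithFieldIndex(name) == i
    case _ => true // keep DropField and any other operations untouched
  }.map(_._1)
}
```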
| * df.select($"struct_col".withField("a", $"struct_col.a".withField("c", lit(3)).withField("d", lit(4)))) | ||
| * // result: {"a":{"a":1,"b":2,"c":3,"d":4}} | ||
| * }}} | ||
| * |
One of the issues with the current Column.withField implementation in the master branch is that the size of the parsed logical plan scales non-linearly with the number of directly-added nested columns. This results in the driver spending a considerable amount of time analyzing and optimizing the logical plan (literally minutes, if it ever completes).
Users can avoid this issue entirely by writing their queries in a performant manner.
For example:
```scala
lazy val nullableStructLevel2: DataFrame = spark.createDataFrame(
  sparkContext.parallelize(Row(Row(Row(0))) :: Nil),
  StructType(Seq(
    StructField("a1", StructType(Seq(
      StructField("a2", StructType(Seq(
        StructField("col0", IntegerType, nullable = false))),
        nullable = true))),
      nullable = true))))

val numColsToAdd = 100
val expectedRows = Row(Row(Row(0 to numColsToAdd: _*))) :: Nil
val expectedSchema =
  StructType(Seq(
    StructField("a1", StructType(Seq(
      StructField("a2", StructType((0 to numColsToAdd).map(num =>
        StructField(s"col$num", IntegerType, nullable = false))),
        nullable = true))),
      nullable = true)))

test("good way of writing query") {
  // Spark can easily analyze and optimize the parsed logical plan in seconds
  checkAnswer(
    nullableStructLevel2
      .select(col("a1").withField("a2", (1 to numColsToAdd).foldLeft(col("a1.a2")) {
        (column, num) => column.withField(s"col$num", lit(num))
      }).as("a1")),
    expectedRows,
    expectedSchema)
}

test("bad way of writing the same query that will eventually fail with timeout exception with as little as numColsToAdd = 10") {
  checkAnswer(
    nullableStructLevel2
      .select((1 to numColsToAdd).foldLeft(col("a1")) {
        (column, num) => column.withField(s"a2.col$num", lit(num))
      }.as("a1")),
    expectedRows,
    expectedSchema)
}
```
This issue and its solution are what I've attempted to capture here as part of the method doc.
There are other options here instead of a method-doc note:
- We could potentially write some kind of optimization in `updateFieldsHelper` (I've bashed my head against this for a while but haven't been able to come up with anything satisfactory).
- Remove the ability to change nested fields directly entirely. While this has the advantage that there would be absolutely no way to run into this performance issue, the user experience definitely suffers for more advanced users who would know how to use these methods properly.

I've gone with what made most sense to me (the method-doc note) but am open to hearing other people's thoughts on the matter.
I think the same issue happens in `withColumn` as well. I'm fine with the method doc.
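For comparison, a quick sketch of the analogous `withColumn` pattern (the DataFrame and column names here are illustrative): each chained `withColumn` call wraps the plan in another `Project`, so the analyzed plan also grows with the number of calls, whereas a single `select` adds all the columns in one `Project`:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Chained calls: one Project node per withColumn.
def addOneByOne(df: DataFrame, n: Int): DataFrame =
  (1 to n).foldLeft(df)((d, i) => d.withColumn(s"col$i", lit(i)))

// Single select: all new columns in a single Project.
def addAtOnce(df: DataFrame, n: Int): DataFrame =
  df.select(col("*") +: (1 to n).map(i => lit(i).as(s"col$i")): _*)
```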
| StructField("d", IntegerType, nullable = false), | ||
| StructField("e", IntegerType, nullable = false))), | ||
| nullable = false)))) | ||
| } |
I don't expect anyone will be surprised or feel that this is wrong, but nevertheless I wanted to highlight this behaviour. The same goes for the two tests below.
Test build #128923 has finished for PR 29795 at commit
Test build #128995 has finished for PR 29795 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
```scala
 * represents the next depth.
 * @param nullable: This value is used to set the nullability of StructType columns.
 */
def nestedDf(maxDepth: Int, numColsAtEachDepth: Int, nullable: Boolean): DataFrame = {
```
So does `nullable` only control the top-level column?
No, the `nullable` parameter affects both the top-level and the nested StructType columns.
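As a hypothetical illustration (the field names here are assumptions, not the suite's actual naming), the schema produced with `maxDepth = 2`, one column at each depth, and `nullable = true` would mark the StructType at every depth as nullable:

```scala
import org.apache.spark.sql.types._

val schemaDepth2: StructType = StructType(Seq(
  StructField("depth1Col", StructType(Seq(       // depth 1: nullable
    StructField("depth2Col", StructType(Seq(     // depth 2: also nullable
      StructField("leafCol", IntegerType, nullable = false))),
      nullable = true))),
    nullable = true)))
```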
Test build #129098 has finished for PR 29795 at commit
```
To non-nullable StructTypes using performant method    4595  4927  470  0.0  Infinity  1.0X
To nullable StructTypes using performant method        5185  5516  468  0.0  Infinity  0.9X
```
I changed the benchmark a bit so that we can compare the performant and non-performant methods of updating multiple nested columns.
```
To non-nullable StructTypes using performant method         10     11     2  0.0  Infinity  1.0X
To nullable StructTypes using performant method              9     10     1  0.0  Infinity  1.0X
To non-nullable StructTypes using non-performant method   2457   2464    10  0.0  Infinity  0.0X
To nullable StructTypes using non-performant method      42641  43804  1644  0.0  Infinity  0.0X
```
As expected, this last result isn't great (43 seconds). It's partly because of the non-performant method and partly because the optimizer rules aren't able to perfectly optimize complex nullable StructType scenarios (I've documented these scenarios in this commit).
It should be possible to improve the optimizer rules further in the future. I have a couple of simple ideas I'm toying with, but it will take me a while to reason about and test whether they are safe from a correctness point of view.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129156 has finished for PR 29795 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129251 has finished for PR 29795 at commit
thanks, merging to master!
… Maven

### What changes were proposed in this pull request?
This PR fixes the broken build for Scala 2.13 with Maven. https://github.com/apache/spark/pull/29913/checks?check_run_id=1187826966 #29795 was merged though it doesn't successfully finish the build for Scala 2.13.

### Why are the changes needed?
To fix the build.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
`build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -DskipTests package`

Closes #29954 from sarutak/hotfix-seq.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
- `WithFields` Expression refactored to make it more extensible (now `UpdateFields`).
- Added a `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

Why are the changes needed?
Often Spark users have to work with deeply nested data, e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing). Currently, to drop the missing value, users would have to do something like this:
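The example itself is a minimal sketch; the column and field names, the schema, and the use of a null to stand in for the missing `5` are illustrative assumptions:

```scala
import org.apache.spark.sql.functions._

// Illustrative data: field "e" should hold 5 but the value is missing.
val df = spark.sql(
  """select named_struct(
    |  'a', named_struct('b', 1, 'c', 2, 'd', 3, 'e', cast(null as int), 'f', 6)
    |) as struct_col""".stripMargin)

// With the existing APIs, dropping "e" means rebuilding the entire struct and
// re-listing every field that should be kept:
df.select(struct(
  struct(
    col("struct_col.a.b"),
    col("struct_col.a.c"),
    col("struct_col.a.d"),
    col("struct_col.a.f")).as("a")).as("struct_col"))
```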
As you can see above, with the existing methods users must call the `struct` function and list all fields, including the fields they don't want to change. This is not ideal, as users have to re-specify fields they don't intend to touch, which is verbose and error-prone.

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
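Under the same assumptions as the sketch above, and assuming `dropFields` accepts dot notation for nested fields (as `withField` does):

```scala
// Only the field to remove is named; the rest of the struct is left untouched.
df.select(col("struct_col").dropFields("a.e").as("struct_col"))
```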
This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in SPARK-22231 include `withFieldRenamed`. However, that should be added in a separate PR.
Does this PR introduce any user-facing change?
The documentation for the `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested columns directly.

How was this patch tested?
New unit tests were added. Jenkins must pass them.
Related JIRAs:
More discussion on this topic can be found here: