-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31317][SQL] Add withField method to Column #27066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this since it's already able to work around (e.g., withColumn and struct). Does any of DBMSes have such function as a reference?
|
I would be supportive of this. |
|
ok to test |
|
I didn't notice there are investigations and discussion already made in the JIRA, which I wanted to happen. I am good with it. |
|
Test build #116026 has finished for PR 27066 at commit
|
|
Can we take one step back and come up with the list of functions we want to add, and make sure they are coherent, and then get to PR reviews? |
6ac28c9 to
b768da7
Compare
|
Test build #120564 has finished for PR 27066 at commit
|
|
Test build #120565 has finished for PR 27066 at commit
|
|
retest this please |
|
Test build #120615 has finished for PR 27066 at commit
|
|
Test build #120617 has finished for PR 27066 at commit
|
|
Test build #120640 has finished for PR 27066 at commit
|
|
Updated the code in this PR based on some learnings I had while developing an external library with feature, including:
All tests are passing now. Please feel free to leave a review/comments. |
|
Sorry for asking this, but can you break this into multiple pull requests? cc @cloud-fan for review. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
|
Can you show the plan of this implementation by |
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
Outdated
Show resolved
Hide resolved
|
@rxin no problems, I've removed the Python changes from this PR. |
|
@dbtsai here is an example explain output with the current implementation: |
|
Test build #120652 has finished for PR 27066 at commit
|
|
Test build #120654 has finished for PR 27066 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Outdated
Show resolved
Hide resolved
|
retest this please |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala
Outdated
Show resolved
Hide resolved
|
Test build #124957 has finished for PR 27066 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
|
Test build #125072 has finished for PR 27066 at commit
|
|
Test build #5051 has started for PR 27066 at commit |
|
Not sure why but it looks like someone aborted the build mid-way. |
|
retest this please |
|
Test build #125125 has finished for PR 27066 at commit
|
|
looks like a flaky test failed unfortunately |
|
retest this please |
|
jenkins is very unstable recently... |
|
Test build #125198 has finished for PR 27066 at commit
|
|
thanks, merging to master! |
|
@fqaiser94 can you leave a comment in the JIRA ticket? So that I can assign it to you. Thanks for your great work! |
|
Done, and thank you all for a great experience! |
| * | ||
| * {{{ | ||
| * val df = sql("SELECT named_struct('a', 1, 'b', 2) struct_col") | ||
| * df.select($"struct_col".withField("c", lit(3))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have any of you try to run these examples? The optimizer ConstantFolding rule will break these examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird, we have tests to cover these examples. @fqaiser94 can you take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I failed to write a test case to cover this scenario, my bad.
And yea, I just tried this example again, and I can see that it fails.
The issue is that I override foldable for this Unevaluable Expression. And so, when foldable returns true, Spark tries to evaluate the expression and it fails at that point.
I kind-of realized this as well recently and in my PR for dropFields here, I've fixed the issue (basically i just don't override foldable anymore, which by default returns false).
I guess I should submit a follow-up PR to fix this immediately with associated unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising the issue @gatorsmile
I've created a JIRA and PR to address the issue.
### What changes were proposed in this pull request? This PR adds a `withField` method on the pyspark Column class to call the Scala API method added in #27066. ### Why are the changes needed? To update the Python API to match a new feature in the Scala API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test Closes #29699 from Kimahriman/feature/pyspark-with-field. Authored-by: Adam Binford <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
Added a new
withFieldmethod to theColumnclass. This method should allow users to add or replace aStructFieldin aStructTypecolumn (with very similar semantics to thewithColumnmethod onDataset).Why are the changes needed?
Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing
StructField. To do this with the existing Spark APIs, users have to rebuild the entire struct column.For example, let's say you have the following deeply nested data structure which has a data quality issue (
5is missing):Currently, to replace the missing value users would have to do something like this:
As you can see above, with the existing methods users must call the
structfunction and list all fields, including fields they don't want to change. This is not ideal as:In contrast, with the method added in this PR, a user could simply do something like this:
This is the first of maybe a few methods that could be added to the
Columnclass to make it easier to manipulate nested data. Other methods under discussion in SPARK-22231 includedropandrenameField. However, these should be added in a separate PR.Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit tests were added. Jenkins must pass them.
Related JIRAs: