-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21402][SQL] Fix java array of structs deserialization #22708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Synch with apache:master
|
Can you explain how this happens? Why thhe fields of structs get mixed up? |
|
The original problem is described here: https://issues.apache.org/jira/browse/SPARK-21402 I'll try to explain what happens in detail. Let's consider this data structure: And let's say we have a java bean class with corresponding structure. When building a deserializer for the field intervals in JavaTypeInference.deserializerFor we construct a MapObjects expression to convert structs to java beans: MapObjects requires DataType of array elements. It is extracted from java element type using JavaTypeInference.inferDataType which gets java bean properties and maps them to StructFields. The order of properties in the resulting StructType may not correspond to their declaration order as the declaration order is simply unknown. So the resulting element StructType may look like this: This StructType is passed to MapObjects and then to its loop variable LambdaVariable. For deserialization of single array elements an InitializeJavaBean expression is created. It contains UnresolvedExtractValue expressions for each field, and these expressions have LambdaVariable as a child. They are resolved during analysis: For each field startTime and endTime ordinals are calculated. For that child's DataType is used, and in our case this is StructType of LambdaVariable with incorrect field order. |
|
In a nutshell: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we exclude other changes except this one? This one is very easy to reason about. We did the same thing in ScalaReflection.
We need more time to think about the map case, and fix it in ScalaReflection as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Should I create another pull request with this change only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed other changes from this PR and created a new one with only map case.
As far as I see, everything works fine with scala classes, because StructTypes are generated based on constructor parameters, and they are available in correct order with correct names. Which is hardly achievable with Java beans..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this to unresolved.scala? cc @cloud-fan
Synch with apache:master
4b5d334 to
4103257
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it does not seem to be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, removed it.
|
Please modify the PR title and description accordingly. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also unused now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add the license headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import orders here are not compliant with Spark codebase. You can follow the style in other tests like JavaApplySchemaSuite.
|
And please add [SQL] to the PR title. Like |
|
Corrected all issues. |
|
ok to test |
|
lgtm |
|
Test build #97454 has finished for PR 22708 at commit
|
|
cc @dongjoon-hyun here is another instance of the FileBasedDataSourceSuite flaky test. |
|
retest this please |
|
LGTM |
|
Test build #97458 has finished for PR 22708 at commit
|
|
Thank you for pinging me, @cloud-fan . |
|
|
||
| private int id; | ||
| private List<Interval> intervals; | ||
| private List<Integer> values; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this list of int affect the test? If no, maybe we can get rid of it to simplify the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention was to test a non struct case too. But I think, it's really not critical and we can get rid of it.
| } | ||
| } | ||
|
|
||
| public static class Interval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicate to your another PR. Maybe we can consider put two tests in one Java file so we don't need to have two Interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, I guess, to do that we need to have these both changes in one PR?
Or correct another PR to add the second test in the same Java file later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have this merged first and rebase another PR to have another test in the same file too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sure.
|
Fixed all issues in test class. Thanks a lot for your help and patience. |
|
Test build #97489 has finished for PR 22708 at commit
|
| import org.apache.spark.sql.types.StructField; | ||
| import org.apache.spark.sql.types.StructType; | ||
|
|
||
| public class JavaBeanWithMapSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why include this change here? Don't you want to have it in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's my fault..
b1f74ac to
571a0fe
Compare
|
Test build #97490 has finished for PR 22708 at commit
|
|
Test build #97488 has finished for PR 22708 at commit
|
|
Test build #97491 has finished for PR 22708 at commit
|
|
thanks, merging to master/2.4! |
When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. michalsenkyr cloud-fan marmbrus liancheng Closes #22708 from vofque/SPARK-21402. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit e5b8136) Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, late LGTM. Thank you, @vofque .
|
@cloud-fan and @vofque . |
|
@dongjoon-hyun, sure, I'll create equal pull requests. |
|
Thanks! |
…ation This PR is to backport #22708 to branch 2.2. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. dongjoon-hyun cloud-fan Closes #22768 from vofque/SPARK-21402-2.2. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ation This PR is to backport #22708 to branch 2.3. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. dongjoon-hyun cloud-fan Closes #22767 from vofque/SPARK-21402-2.3. Authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
This is a follow-up PR for #22708. It considers another case of java beans deserialization: java maps with struct keys/values. When deserializing values of MapType with struct keys/values in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? Invocations of "keyArray" and "valueArray" functions are used to extract arrays of keys and values. Struct type of keys or values is also inferred from java bean structure and ends up with mixed up field order. I created a new UnresolvedInvoke expression as a temporary substitution of Invoke expression while no actual data is available. It allows to provide the resulting data type during analysis based on the resolved input data, not on the java bean (similar to UnresolvedMapObjects). Key and value arrays are then fed to MapObjects expression which I replaced with UnresolvedMapObjects, just like in case of ArrayType. Finally I added resolution of UnresolvedInvoke expressions in Analyzer.resolveExpression method as an additional pattern matching case. ## How was this patch tested? Added a test case. Built complete project on travis. viirya kiszk cloud-fan michalsenkyr marmbrus liancheng Closes #22745 from vofque/SPARK-21402-FOLLOWUP. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. michalsenkyr cloud-fan marmbrus liancheng Closes apache#22708 from vofque/SPARK-21402. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This is a follow-up PR for apache#22708. It considers another case of java beans deserialization: java maps with struct keys/values. When deserializing values of MapType with struct keys/values in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? Invocations of "keyArray" and "valueArray" functions are used to extract arrays of keys and values. Struct type of keys or values is also inferred from java bean structure and ends up with mixed up field order. I created a new UnresolvedInvoke expression as a temporary substitution of Invoke expression while no actual data is available. It allows to provide the resulting data type during analysis based on the resolved input data, not on the java bean (similar to UnresolvedMapObjects). Key and value arrays are then fed to MapObjects expression which I replaced with UnresolvedMapObjects, just like in case of ArrayType. Finally I added resolution of UnresolvedInvoke expressions in Analyzer.resolveExpression method as an additional pattern matching case. ## How was this patch tested? Added a test case. Built complete project on travis. viirya kiszk cloud-fan michalsenkyr marmbrus liancheng Closes apache#22745 from vofque/SPARK-21402-FOLLOWUP. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ation This PR is to backport apache#22708 to branch 2.2. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. dongjoon-hyun cloud-fan Closes apache#22768 from vofque/SPARK-21402-2.2. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ation This PR is to backport apache#22708 to branch 2.2. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. dongjoon-hyun cloud-fan Closes apache#22768 from vofque/SPARK-21402-2.2. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up.
I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans.
What changes were proposed in this pull request?
MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean.
How was this patch tested?
Added a test case.
Built complete project on travis.
@michalsenkyr @cloud-fan @marmbrus @liancheng