-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25772][SQL] Fix java map of structs deserialization #22745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Synch with apache:master
Synch with apache:master
|
Is this a separate PR because this part is pretty separable, and you think could be considered separately? if it's all part of one logical change that should go in together or not at all, they can be in the original PR. |
|
It's a different issue, I think it worth a new ticket |
| return true; | ||
| } | ||
|
|
||
| private static <K, V> boolean equals(Map<K, V> aMap, Map<K, V> bMap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, do we need this specific equals? Can't we use Map.equals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure, that's absolutely redundand code...
Synch with apache:master
1896f9a to
fa99c2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this is too general. Maybe we should just create a new expression GetArrayFromMap and resolve it to Invoke later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had such doubts too. OK.
|
Added these classes: UnresolvedGetArrayFromMap in unresolved.scala - an unresolved substitution. Tried to follow the example of UnresolvedExtractValue and ExtractValue classes implementation. |
|
ok to test. |
|
Test build #97668 has finished for PR 22745 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @vofque . Could you run the following in your branch and fix the scala style issues like this?
$ dev/scalastyle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sure, @dongjoon-hyun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun, I've fixed scala style issues in my code.
097412a to
790cfda
Compare
| p => deserializerFor(keyType, Some(p)), | ||
| Invoke(getPath, "keyArray", ArrayType(keyDataType)), | ||
| keyDataType), | ||
| UnresolvedGetArrayFromMap(getPath, GetArrayFromMap.Key())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we don't need to make it unresolved
case class GetArrayFromMap(map: Expression, getKey: Boolean) extends Expression = {
override def inputTypes = Seq(MapType)
override def dataType = {
val MapType(kt, vt) = map.dataType.asInstanceOf[MapType]
if (getKey) kv else vt
}
override def eval...
override def doCodegen...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to replace this unresolved expression with Invoke.
Do you mean to write eval and doGenCode from scratch?
I don't see a way to wrap Invoke, as doGenCode is protected and can't be delegated to Invoke.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea we can write eval and doGenCode from scratch. It's also more efficient since we can omit the useless try-catch in Invoke.
e.g.
// from UnaryExpression
override def nullSafeEval(input: Any) = input.asInstanceOf[MapData].keys
override def doGenCode = defineCodeGen(ctx, ev, c => s"$c.keys")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, OK!
|
Test build #97813 has finished for PR 22745 at commit
|
|
Test build #97819 has finished for PR 22745 at commit
|
|
Test build #97848 has started for PR 22745 at commit |
|
Test build #97853 has started for PR 22745 at commit |
|
Test build #97861 has started for PR 22745 at commit |
|
Test build #97870 has started for PR 22745 at commit |
076e603 to
d9222d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume that result can't be null if the underlying map is not null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought, this adds readability. Is a simple boolean param better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion, but shall we use enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need this check. eval is performance critical and we should assume there is no bug. We don't have this check in other expressions either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well this does reuse the functionName, but performance is more important here. how about
private lazy val arrayGetter: MapData => ArrayData = if (source) ...
def eval...
arrayGetter( input.asInstanceOf[MapData])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this expression extends UnaryExpression, then we can write
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
defineCodeGen(ctx, ev, map => s"map.$functionName")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan, what about this approach: to pass function name, type getter and array getter as parameters of a private constructor + add two objects with specific constructors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's do this check in
override def checkInputDataTypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from Invoke, but it looks like it doesn't really affect anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then let's save it. Otherwise other reviewers may get confused as well.
|
LGTM except 2 minor comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case _: MapType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we just use if-else?
if (isinstanceOf[MapType]) ... else ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure.
|
Test build #97918 has finished for PR 22745 at commit
|
|
Test build #97919 has finished for PR 22745 at commit
|
8b26a5c to
0670bbc
Compare
|
Test build #97920 has finished for PR 22745 at commit
|
|
Test build #97921 has finished for PR 22745 at commit
|
|
Test build #97927 has finished for PR 22745 at commit
|
|
thanks, merging to master! |
…and product type ## What changes were proposed in this pull request? After #22745 , Dataset encoder supports the combination of java bean and map type. This PR is to fix the Scala side. The reason why it didn't work before is, `CatalystToExternalMap` tries to get the data type of the input map expression, while it can be unresolved and its data type is known. To fix it, we can follow `UnresolvedMapObjects`, to create a `UnresolvedCatalystToExternalMap`, and only create `CatalystToExternalMap` when the input map expression is resolved and the data type is known. ## How was this patch tested? enable a old test case Closes #22812 from cloud-fan/map. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This is a follow-up PR for apache#22708. It considers another case of java beans deserialization: java maps with struct keys/values. When deserializing values of MapType with struct keys/values in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? Invocations of "keyArray" and "valueArray" functions are used to extract arrays of keys and values. Struct type of keys or values is also inferred from java bean structure and ends up with mixed up field order. I created a new UnresolvedInvoke expression as a temporary substitution of Invoke expression while no actual data is available. It allows to provide the resulting data type during analysis based on the resolved input data, not on the java bean (similar to UnresolvedMapObjects). Key and value arrays are then fed to MapObjects expression which I replaced with UnresolvedMapObjects, just like in case of ArrayType. Finally I added resolution of UnresolvedInvoke expressions in Analyzer.resolveExpression method as an additional pattern matching case. ## How was this patch tested? Added a test case. Built complete project on travis. viirya kiszk cloud-fan michalsenkyr marmbrus liancheng Closes apache#22745 from vofque/SPARK-21402-FOLLOWUP. Lead-authored-by: Vladimir Kuriatkov <[email protected]> Co-authored-by: Vladimir Kuriatkov <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? In apache#22745 we introduced the `GetArrayFromMap` expression. Later on I realized this is duplicated as we already have `MapKeys` and `MapValues`. This PR removes `GetArrayFromMap` ## How was this patch tested? existing tests Closes apache#22825 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…and product type ## What changes were proposed in this pull request? After apache#22745 , Dataset encoder supports the combination of java bean and map type. This PR is to fix the Scala side. The reason why it didn't work before is, `CatalystToExternalMap` tries to get the data type of the input map expression, while it can be unresolved and its data type is known. To fix it, we can follow `UnresolvedMapObjects`, to create a `UnresolvedCatalystToExternalMap`, and only create `CatalystToExternalMap` when the input map expression is resolved and the data type is known. ## How was this patch tested? enable a old test case Closes apache#22812 from cloud-fan/map. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This is a follow-up PR for #22708. It considers another case of java beans deserialization: java maps with struct keys/values.
When deserializing values of MapType with struct keys/values in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans.
What changes were proposed in this pull request?
Invocations of "keyArray" and "valueArray" functions are used to extract arrays of keys and values. Struct type of keys or values is also inferred from java bean structure and ends up with mixed up field order.
I created a new UnresolvedInvoke expression as a temporary substitution of Invoke expression while no actual data is available. It allows to provide the resulting data type during analysis based on the resolved input data, not on the java bean (similar to UnresolvedMapObjects).
Key and value arrays are then fed to MapObjects expression which I replaced with UnresolvedMapObjects, just like in case of ArrayType.
Finally I added resolution of UnresolvedInvoke expressions in Analyzer.resolveExpression method as an additional pattern matching case.
How was this patch tested?
Added a test case.
Built complete project on travis.
@viirya @kiszk @cloud-fan @michalsenkyr @marmbrus @liancheng