Contributor

@vofque vofque commented Oct 16, 2018

This is a follow-up PR for #22708. It considers another case of Java bean deserialization: Java maps with struct keys/values.

When deserializing values of MapType with struct keys/values in Java beans, the fields of the structs get mixed up. I suggest using the struct data types retrieved from the resolved input data instead of inferring them from the Java beans.
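
The mix-up can be reproduced without Spark. The sketch below is a toy model, not Spark code: `FieldOrderSketch`, `readPositionally`, and the `name`/`id` fields are hypothetical, and the bean-derived field order is assumed to be alphabetical by property name. It reads one stored row positionally, first with the order derived from the bean and then with the order carried by the data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldOrderSketch {
    /** Pairs each field name with the value stored at the same position in the row. */
    static Map<String, Object> readPositionally(List<String> fieldOrder, List<Object> row) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (int i = 0; i < fieldOrder.size(); i++) {
            result.put(fieldOrder.get(i), row.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Order in which the struct was actually written (hypothetical data).
        List<String> dataOrder = Arrays.asList("name", "id");
        List<Object> row = Arrays.<Object>asList("alice", 1);

        // Assumption: the order inferred from the bean class is alphabetical.
        List<String> beanOrder = Arrays.asList("id", "name");

        // Bean-derived order: "id" is paired with "alice", so the fields are mixed up.
        System.out.println(readPositionally(beanOrder, row));
        // Order taken from the resolved data: every value lands on its own field.
        System.out.println(readPositionally(dataOrder, row));
    }
}
```

Only the second call pairs `"alice"` with `name`, which is why the PR takes the struct type from the resolved input.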

What changes were proposed in this pull request?

Invocations of the "keyArray" and "valueArray" functions are used to extract arrays of keys and values. The struct type of the keys or values is also inferred from the Java bean structure and ends up with a mixed-up field order.
I created a new UnresolvedInvoke expression as a temporary substitute for the Invoke expression while no actual data is available. It allows the resulting data type to be provided during analysis based on the resolved input data, not on the Java bean (similar to UnresolvedMapObjects).

The key and value arrays are then fed to a MapObjects expression, which I replaced with UnresolvedMapObjects, just as in the ArrayType case.

Finally, I added resolution of UnresolvedInvoke expressions to the Analyzer.resolveExpression method as an additional pattern-matching case.
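
The placeholder idea can be illustrated outside Spark. The following is a toy model under stated assumptions, not the Catalyst API: `Expr`, `Column`, `UnresolvedKeys`, `Keys`, and `resolve` are hypothetical names, and data types are plain strings. The placeholder refuses to report a type, and an analyzer-like step swaps it for the concrete expression only once the input's type is known from the data.

```java
final class UnresolvedSketch {
    /** Minimal expression interface: a type is available only once resolved. */
    interface Expr {
        boolean resolved();
        String dataType();
    }

    /** Map-typed input column; the types stay null until analysis binds it to real data. */
    static final class Column implements Expr {
        private final String keyType;
        private final String valueType;

        Column(String keyType, String valueType) {
            this.keyType = keyType;
            this.valueType = valueType;
        }

        public boolean resolved() {
            return keyType != null && valueType != null;
        }

        public String dataType() {
            if (!resolved()) throw new IllegalStateException("unresolved column");
            return "map<" + keyType + "," + valueType + ">";
        }

        String keyType() {
            return keyType;
        }
    }

    /** Placeholder: records the intent to extract keys but has no type of its own. */
    static final class UnresolvedKeys implements Expr {
        final Column map;

        UnresolvedKeys(Column map) {
            this.map = map;
        }

        public boolean resolved() {
            return false;
        }

        public String dataType() {
            throw new IllegalStateException("unresolved expression");
        }
    }

    /** Concrete expression: its type is derived from the resolved input, not from a bean. */
    static final class Keys implements Expr {
        final Column map;

        Keys(Column map) {
            this.map = map;
        }

        public boolean resolved() {
            return map.resolved();
        }

        public String dataType() {
            return "array<" + map.keyType() + ">";
        }
    }

    /** Analyzer-like step: replace the placeholder once its input is resolved. */
    static Expr resolve(Expr expr) {
        if (expr instanceof UnresolvedKeys && ((UnresolvedKeys) expr).map.resolved()) {
            return new Keys(((UnresolvedKeys) expr).map);
        }
        return expr;
    }

    public static void main(String[] args) {
        Column bound = new Column("struct<name:string,id:int>", "string");
        Expr resolved = resolve(new UnresolvedKeys(bound));
        System.out.println(resolved.dataType());
    }
}
```

In this model the key struct keeps the field order of the bound column, since nothing in `Keys` consults a bean class.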

How was this patch tested?

Added a test case.
Built the complete project on Travis.

@viirya @kiszk @cloud-fan @michalsenkyr @marmbrus @liancheng

Synch with apache:master
Synch with apache:master
@vofque vofque changed the title from "[SPARK-21402][SQL][FOLLOWUP] Fix java map of structs deserialization" to "[SPARK-21402][SQL][FOLLOW-UP] Fix java map of structs deserialization" on Oct 16, 2018
Member

srowen commented Oct 17, 2018

Is this a separate PR because this part is pretty separable and you think it could be considered separately? If it's all part of one logical change that should go in together or not at all, it can be in the original PR.

@cloud-fan
Contributor

It's a different issue; I think it's worth a new ticket.

return true;
}

private static <K, V> boolean equals(Map<K, V> aMap, Map<K, V> bMap) {
Member

hmm, do we need this specific equals? Can't we use Map.equals?

Contributor Author

Yes, sure, that's absolutely redundant code...
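
For reference, `java.util.Map.equals` is specified to compare the two maps' entry sets, so it already covers struct-like values as long as the value class overrides `equals` and `hashCode`. A minimal sketch, with a hypothetical `Interval` bean standing in for the bean class used in the test:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class MapEqualsSketch {
    /** Hypothetical bean used as a map value; equality is field by field. */
    static final class Interval {
        private final long start;
        private final long end;

        Interval(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Interval)) return false;
            Interval other = (Interval) o;
            return start == other.start && end == other.end;
        }

        @Override
        public int hashCode() {
            return Objects.hash(start, end);
        }
    }

    /** Builds a one-entry map; a helper for this demo only. */
    static Map<String, Interval> mapOf(String key, long start, long end) {
        Map<String, Interval> map = new HashMap<>();
        map.put(key, new Interval(start, end));
        return map;
    }

    public static void main(String[] args) {
        // Distinct Interval instances with equal fields: the maps compare equal.
        System.out.println(mapOf("a", 1, 2).equals(mapOf("a", 1, 2)));
        // A differing value makes the maps unequal.
        System.out.println(mapOf("a", 1, 2).equals(mapOf("a", 1, 3)));
    }
}
```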

Synch with apache:master
@vofque vofque force-pushed the SPARK-21402-FOLLOWUP branch from 1896f9a to fa99c2d Compare October 17, 2018 15:26
Contributor

I feel this is too general. Maybe we should just create a new expression GetArrayFromMap and resolve it to Invoke later.

Contributor Author

I had such doubts too. OK.

Contributor Author

vofque commented Oct 18, 2018

Added these classes:

UnresolvedGetArrayFromMap in unresolved.scala - an unresolved placeholder.
GetArrayFromMap in complexTypeExtractors.scala - the extraction logic (I'm not sure this fits the purpose of the complexTypeExtractors.scala file).

I tried to follow the implementation of the UnresolvedExtractValue and ExtractValue classes.

@vofque vofque changed the title from "[SPARK-21402][SQL][FOLLOW-UP] Fix java map of structs deserialization" to "[SPARK-25772][SQL] Fix java map of structs deserialization" on Oct 18, 2018
@dongjoon-hyun
Member

ok to test.

SparkQA commented Oct 20, 2018

Test build #97668 has finished for PR 22745 at commit 835b6f4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Key() extends Source
  • case class Value() extends Source

Member

Hi, @vofque. Could you run the following in your branch and fix the Scala style issues?

$ dev/scalastyle

Contributor Author

OK, sure, @dongjoon-hyun.

Contributor Author

@dongjoon-hyun, I've fixed the Scala style issues in my code.

@vofque vofque force-pushed the SPARK-21402-FOLLOWUP branch 2 times, most recently from 097412a to 790cfda Compare October 22, 2018 07:15
p => deserializerFor(keyType, Some(p)),
Invoke(getPath, "keyArray", ArrayType(keyDataType)),
keyDataType),
UnresolvedGetArrayFromMap(getPath, GetArrayFromMap.Key())),
Contributor

@cloud-fan cloud-fan Oct 22, 2018

Seems we don't need to make it unresolved

case class GetArrayFromMap(map: Expression, getKey: Boolean) extends Expression {
  override def inputTypes = Seq(MapType)

  // The result is an array of the map's keys or values, so wrap the element type in ArrayType.
  override def dataType = {
    val MapType(kt, vt, _) = map.dataType.asInstanceOf[MapType]
    if (getKey) ArrayType(kt) else ArrayType(vt)
  }

  override def eval...

  override def doGenCode...
}

Contributor Author

@vofque vofque Oct 22, 2018

My idea was to replace this unresolved expression with Invoke.
Do you mean to write eval and doGenCode from scratch?
I don't see a way to wrap Invoke, as doGenCode is protected and can't be delegated to Invoke.

Contributor

Yeah, we can write eval and doGenCode from scratch. It's also more efficient since we can omit the useless try-catch in Invoke.

e.g.

// from UnaryExpression; MapData exposes its keys through keyArray()
override def nullSafeEval(input: Any) = input.asInstanceOf[MapData].keyArray()

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
  defineCodeGen(ctx, ev, c => s"$c.keyArray()")

Contributor Author

I see, OK!

SparkQA commented Oct 22, 2018

Test build #97813 has finished for PR 22745 at commit 790cfda.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Key() extends Source
  • case class Value() extends Source

SparkQA commented Oct 22, 2018

Test build #97819 has finished for PR 22745 at commit 835b6f4.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 22, 2018

Test build #97848 has started for PR 22745 at commit 6f449c4.

SparkQA commented Oct 22, 2018

Test build #97853 has started for PR 22745 at commit 790cfda.

SparkQA commented Oct 22, 2018

Test build #97861 has started for PR 22745 at commit 790cfda.

SparkQA commented Oct 22, 2018

Test build #97870 has started for PR 22745 at commit 6f449c4.

@vofque vofque force-pushed the SPARK-21402-FOLLOWUP branch from 076e603 to d9222d5 Compare October 23, 2018 11:34
Contributor Author

Assume that result can't be null if the underlying map is not null.

Contributor Author

I thought this adds readability. Is a simple boolean param better?

Contributor

I don't have a strong opinion, but shall we use an enum?

Contributor Author

Yeah, probably.

Contributor

We don't need this check. eval is performance critical and we should assume there is no bug. We don't have this check in other expressions either.

Contributor Author

OK, thanks.

Contributor

Well, this does reuse the functionName, but performance is more important here. How about

private lazy val arrayGetter: MapData => ArrayData = if (source) ...

def eval...
  arrayGetter(input.asInstanceOf[MapData])
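
The suggestion is to decide once, when the expression is built, which array to read, so the per-row `eval` does no branching or name lookup. A rough sketch of that shape, using toy stand-ins (`ToyMapData`, `Source`, and `ToyGetArrayFromMap` are hypothetical and are not Spark's `MapData` or the PR's classes):

```java
import java.util.function.Function;

final class ArrayGetterSketch {
    /** Toy stand-in for a map stored as two parallel arrays. */
    static final class ToyMapData {
        private final Object[] keys;
        private final Object[] values;

        ToyMapData(Object[] keys, Object[] values) {
            this.keys = keys;
            this.values = values;
        }

        Object[] keyArray() {
            return keys;
        }

        Object[] valueArray() {
            return values;
        }
    }

    /** Which side of the map to extract; an enum instead of a bare boolean. */
    enum Source { KEY, VALUE }

    static final class ToyGetArrayFromMap {
        // Chosen once at construction time, not on every row.
        private final Function<ToyMapData, Object[]> arrayGetter;

        ToyGetArrayFromMap(Source source) {
            this.arrayGetter = source == Source.KEY
                    ? ToyMapData::keyArray
                    : ToyMapData::valueArray;
        }

        /** Per-row path: a cast and one call, with no extra checks. */
        Object[] eval(Object input) {
            return arrayGetter.apply((ToyMapData) input);
        }
    }

    public static void main(String[] args) {
        ToyMapData map = new ToyMapData(new Object[] {"a", "b"}, new Object[] {1, 2});
        System.out.println(new ToyGetArrayFromMap(Source.KEY).eval(map).length);
    }
}
```

The enum here also reflects the readability point raised elsewhere in the review: `Source.KEY` at a call site says more than `true`.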

Contributor

@cloud-fan cloud-fan Oct 23, 2018

Make this expression extend UnaryExpression, then we can write

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  defineCodeGen(ctx, ev, map => s"$map.$functionName")
}

Contributor Author

Fixed that.

Contributor Author

@vofque vofque Oct 23, 2018

@cloud-fan, what about this approach: pass the function name, type getter and array getter as parameters of a private constructor, and add two objects with specific constructors?

Contributor

let's do this check in

override def checkInputDataTypes
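
Moving a type check out of `eval` means it runs once during analysis instead of once per row. A toy sketch of that split follows. `TypeCheckSketch`, `ToyKeys`, and the string type tags are hypothetical. Spark's real `checkInputDataTypes` returns a `TypeCheckResult`, which is not modelled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class TypeCheckSketch {
    /** Toy expression that extracts the keys of a map-typed input. */
    static final class ToyKeys {
        private final String childType; // hypothetical type tag, e.g. "map" or "int"

        ToyKeys(String childType) {
            this.childType = childType;
        }

        /** Runs once, before any row is processed; empty means the input type is acceptable. */
        Optional<String> checkInputDataTypes() {
            return "map".equals(childType)
                    ? Optional.empty()
                    : Optional.of("expected a map input but got " + childType);
        }

        /** Runs per row and trusts the earlier check, so it casts without validating. */
        Object[] eval(Object input) {
            return ((Map<?, ?>) input).keySet().toArray();
        }
    }

    public static void main(String[] args) {
        // A non-map child is rejected up front, before eval is ever called.
        System.out.println(new ToyKeys("int").checkInputDataTypes().orElse("ok"));

        Map<String, Integer> row = new LinkedHashMap<>();
        row.put("a", 1);
        System.out.println(new ToyKeys("map").eval(row).length);
    }
}
```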

Contributor Author

Done

Contributor

what is this doing?

Contributor Author

I copied this from Invoke, but it looks like it doesn't really affect anything.

Contributor

Then let's save it. Otherwise other reviewers may get confused as well.

@cloud-fan
Contributor

LGTM except 2 minor comments

Contributor

case _: MapType

Contributor

Shall we just use if-else?

if (isInstanceOf[MapType]) ... else ...

Contributor Author

Yes, sure.

SparkQA commented Oct 23, 2018

Test build #97918 has finished for PR 22745 at commit 076e603.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Key() extends Source
  • case class Value() extends Source
  • case class GetArrayFromMap(

SparkQA commented Oct 23, 2018

Test build #97919 has finished for PR 22745 at commit d9222d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Key() extends Source
  • case class Value() extends Source
  • case class GetArrayFromMap(

@vofque vofque force-pushed the SPARK-21402-FOLLOWUP branch from 8b26a5c to 0670bbc Compare October 23, 2018 15:29

SparkQA commented Oct 23, 2018

Test build #97920 has finished for PR 22745 at commit 882fd22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 23, 2018

Test build #97921 has finished for PR 22745 at commit 8b26a5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 23, 2018

Test build #97927 has finished for PR 22745 at commit 0670bbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 584e767 Oct 24, 2018
asfgit pushed a commit that referenced this pull request Oct 28, 2018
…and product type

## What changes were proposed in this pull request?

After #22745, the Dataset encoder supports the combination of Java bean and map type. This PR fixes the Scala side.

It didn't work before because `CatalystToExternalMap` tries to get the data type of the input map expression, while that expression can be unresolved and its data type unknown. To fix it, we can follow `UnresolvedMapObjects` and create an `UnresolvedCatalystToExternalMap`, creating `CatalystToExternalMap` only once the input map expression is resolved and the data type is known.

## How was this patch tested?

Enabled an old test case.

Closes #22812 from cloud-fan/map.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#22745 from vofque/SPARK-21402-FOLLOWUP.

Lead-authored-by: Vladimir Kuriatkov <[email protected]>
Co-authored-by: Vladimir Kuriatkov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

In apache#22745 we introduced the `GetArrayFromMap` expression. Later on I realized it is a duplicate, as we already have `MapKeys` and `MapValues`.

This PR removes `GetArrayFromMap`.

## How was this patch tested?

existing tests

Closes apache#22825 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…and product type
