@vofque
Contributor

@vofque vofque commented Oct 12, 2018

When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up.
I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans.

What changes were proposed in this pull request?

The MapObjects expression is used to map array elements to java beans. The struct type of the elements is inferred from the java bean structure and ends up with a mixed-up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows the element type for MapObjects to be provided during analysis based on the resolved input data, not on the java bean.
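
A toy sketch of the deferral idea described above, not Spark's actual classes: the unresolved placeholder carries no element schema of its own, and the schema is supplied only once the struct type of the input data is known. `UnresolvedMapper`, `ResolvedMapper`, and the list-of-names "schema" are hypothetical stand-ins chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class DeferredElementType {
    /** Hypothetical resolved mapper: reads struct fields by ordinal in a known schema. */
    public static final class ResolvedMapper {
        private final List<String> elementSchema;

        ResolvedMapper(List<String> elementSchema) {
            this.elementSchema = elementSchema;
        }

        public long field(long[] element, String name) {
            int ordinal = elementSchema.indexOf(name);
            if (ordinal < 0) {
                throw new IllegalArgumentException("No such field: " + name);
            }
            return element[ordinal];
        }
    }

    /** Hypothetical unresolved placeholder: holds no element schema of its own. */
    public static final class UnresolvedMapper {
        // The schema comes from the input data at resolution time, not from the bean.
        public ResolvedMapper resolve(List<String> inputElementSchema) {
            return new ResolvedMapper(inputElementSchema);
        }
    }

    public static void main(String[] args) {
        List<String> inputSchema = Arrays.asList("startTime", "endTime");
        ResolvedMapper mapper = new UnresolvedMapper().resolve(inputSchema);
        long[] element = {100L, 200L}; // startTime = 100, endTime = 200
        System.out.println(mapper.field(element, "startTime")); // 100
        System.out.println(mapper.field(element, "endTime"));   // 200
    }
}
```

Because the ordinals are looked up in the schema of the data itself, the order in which the bean exposes its properties no longer matters.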

How was this patch tested?

Added a test case.
Built complete project on travis.

@michalsenkyr @cloud-fan @marmbrus @liancheng

Synch with apache:master
@vofque vofque changed the title from "[Spark 21402] Fix java array/map of structs deserialization" to "[SPARK-21402] Fix java array/map of structs deserialization" Oct 12, 2018
@cloud-fan
Contributor

Can you explain how this happens? Why do the fields of structs get mixed up?

@vofque
Contributor Author

vofque commented Oct 15, 2018

The original problem is described here: https://issues.apache.org/jira/browse/SPARK-21402

I'll try to explain what happens in detail.

Let's consider this data structure:

root
 |-- intervals: array
 |    |-- element: struct
 |    |    |-- startTime: long
 |    |    |-- endTime: long

And let's say we have a java bean class with corresponding structure.

When building a deserializer for the field intervals in JavaTypeInference.deserializerFor we construct a MapObjects expression to convert structs to java beans:

case c if listType.isAssignableFrom(typeToken) =>
  val et = elementType(typeToken)
  MapObjects(
    p => deserializerFor(et, Some(p)),
    getPath,
    inferDataType(et)._1,
    customCollectionCls = Some(c))

MapObjects requires the DataType of the array elements. It is extracted from the java element type using JavaTypeInference.inferDataType, which gets the java bean properties and maps them to StructFields.

case other =>
  // some more code goes here
  val properties = getJavaBeanReadableProperties(other)
  val fields = properties.map { property =>
    val returnType = typeToken.method(property.getReadMethod).getReturnType
    val (dataType, nullable) = inferDataType(returnType, seenTypeSet + other)
    new StructField(property.getName, dataType, nullable)
  }

The order of properties in the resulting StructType may not correspond to their declaration order as the declaration order is simply unknown. So the resulting element StructType may look like this:

root
 |-- endTime: long
 |-- startTime: long
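
To check the ordering locally, a small standalone program (not Spark code) can list a bean's readable properties through `java.beans.Introspector`. That the bean-property lookup above builds on the same introspection API is an assumption here, and `Interval` is a hypothetical bean matching the schema above. The API does not promise declaration order, so the printed order may differ from the order of the fields in the class.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanPropertyOrder {
    /** Hypothetical bean; startTime is declared before endTime. */
    public static class Interval {
        private long startTime;
        private long endTime;

        public long getStartTime() { return startTime; }
        public void setStartTime(long startTime) { this.startTime = startTime; }
        public long getEndTime() { return endTime; }
        public void setEndTime(long endTime) { this.endTime = endTime; }
    }

    /** Readable property names, in whatever order the Introspector returns them. */
    public static List<String> readablePropertyNames(Class<?> beanClass) {
        try {
            // Stopping at Object.class leaves out the synthetic "class" property.
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getReadMethod() != null) {
                    names.add(pd.getName());
                }
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readablePropertyNames(Interval.class));
    }
}
```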

This StructType is passed to MapObjects and then to its loop variable LambdaVariable.

For deserialization of single array elements an InitializeJavaBean expression is created. It contains UnresolvedExtractValue expressions for each field, and these expressions have LambdaVariable as a child. They are resolved during analysis:

case UnresolvedExtractValue(child, fieldName) if child.resolved =>
  ExtractValue(child, fieldName, resolver)

Ordinals are then calculated for the fields startTime and endTime. The child's DataType is used for that, and in our case this is the StructType of the LambdaVariable with the incorrect field order.
As a result we get GetStructField expressions with ordinal = 0 for endTime and ordinal = 1 for startTime, while in the actual data startTime is at ordinal 0 and endTime is at ordinal 1.
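
The effect can be reproduced in a few lines of plain Java. This is a simplified simulation, not the Catalyst code path: a struct is modelled as a `long[]`, a schema as a list of field names, and `ordinalOf` and `read` are hypothetical helpers standing in for ordinal resolution and positional field access.

```java
import java.util.Arrays;
import java.util.List;

public class OrdinalMismatch {
    /** Ordinal of a field within a schema, looked up by name. */
    public static int ordinalOf(List<String> schema, String fieldName) {
        int ordinal = schema.indexOf(fieldName);
        if (ordinal < 0) {
            throw new IllegalArgumentException("No such field: " + fieldName);
        }
        return ordinal;
    }

    /** Positional read: the ordinal comes from the given schema, the value from the row. */
    public static long read(long[] row, List<String> schemaUsedForOrdinals, String fieldName) {
        return row[ordinalOf(schemaUsedForOrdinals, fieldName)];
    }

    public static void main(String[] args) {
        List<String> dataSchema = Arrays.asList("startTime", "endTime"); // order in the actual data
        List<String> beanSchema = Arrays.asList("endTime", "startTime"); // order inferred from the bean
        long[] row = {100L, 200L}; // startTime = 100, endTime = 200

        // Ordinals taken from the bean-inferred schema pick the wrong slot.
        System.out.println("startTime via bean schema: " + read(row, beanSchema, "startTime")); // 200
        // Ordinals taken from the data's own schema pick the right slot.
        System.out.println("startTime via data schema: " + read(row, dataSchema, "startTime")); // 100
    }
}
```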

@vofque
Contributor Author

vofque commented Oct 15, 2018

In a nutshell:
First we calculate field ordinals of array elements using java bean field information (which doesn't guarantee any particular field order).
Then we apply these ordinals to the actual data to retrieve values.

Contributor

Can we exclude all changes except this one? This one is very easy to reason about. We did the same thing in ScalaReflection.

We need more time to think about the map case, and fix it in ScalaReflection as well.

Contributor Author

Sure. Should I create another pull request with this change only?

Contributor

yes please, thanks!

Contributor Author

Removed the other changes from this PR and created a new one with only the map case.
As far as I can see, everything works fine with Scala classes, because StructTypes are generated from constructor parameters, which are available in the correct order with the correct names. That is hardly achievable with Java beans.

Member

Should we move this to unresolved.scala? cc @cloud-fan

Synch with apache:master
@vofque vofque force-pushed the SPARK-21402 branch 2 times, most recently from 4b5d334 to 4103257 Compare October 16, 2018 08:06
Member

nit: it does not seem to be necessary.

Contributor Author

Yep, removed it.

@viirya
Member

viirya commented Oct 16, 2018

Please modify the PR title and description accordingly. Thanks.

@vofque vofque changed the title from "[SPARK-21402] Fix java array/map of structs deserialization" to "[SPARK-21402] Fix java array of structs deserialization" Oct 16, 2018
Member

This is also unused now.

Member

You need to add the license headers.

Member

The import order here does not comply with the Spark codebase style. You can follow the style in other tests like JavaApplySchemaSuite.

@viirya
Member

viirya commented Oct 16, 2018

And please add [SQL] to the PR title. Like [SPARK-21402][SQL]

@vofque vofque changed the title from "[SPARK-21402] Fix java array of structs deserialization" to "[SPARK-21402][SQL] Fix java array of structs deserialization" Oct 16, 2018
@vofque
Contributor Author

vofque commented Oct 16, 2018

Corrected all issues.

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

lgtm

@SparkQA

SparkQA commented Oct 16, 2018

Test build #97454 has finished for PR 22708 at commit 8bc66e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cc @dongjoon-hyun here is another instance of the FileBasedDataSourceSuite flaky test.

@cloud-fan
Contributor

retest this please

@viirya
Member

viirya commented Oct 16, 2018

LGTM

@SparkQA

SparkQA commented Oct 16, 2018

Test build #97458 has finished for PR 22708 at commit 8bc66e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you for pinging me, @cloud-fan .


private int id;
private List<Interval> intervals;
private List<Integer> values;
Member

Will this list of ints affect the test? If not, maybe we can get rid of it to simplify the test.

Contributor Author

The intention was to test a non-struct case too. But I think it's really not critical, and we can get rid of it.

}
}

public static class Interval {
Member

This duplicates the class in your other PR. Maybe we can consider putting the two tests in one Java file so we don't need two Interval classes.

Contributor Author

But I guess to do that we need to have both of these changes in one PR?
Or update the other PR later to add the second test to the same Java file.

Member

We can have this merged first and then rebase the other PR to add its test to the same file too.

Contributor Author

OK, sure.

@vofque
Contributor Author

vofque commented Oct 17, 2018

Fixed all issues in test class. Thanks a lot for your help and patience.
Rebasing was probably a bad idea, sorry about that.

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97489 has finished for PR 22708 at commit f6c40b6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedInvoke(

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JavaBeanWithMapSuite {
Member

Why include this change here? Don't you want to have it in another PR?

Contributor Author

Yeah, that's my fault..

@vofque vofque force-pushed the SPARK-21402 branch 2 times, most recently from b1f74ac to 571a0fe Compare October 17, 2018 09:03
@SparkQA

SparkQA commented Oct 17, 2018

Test build #97490 has finished for PR 22708 at commit b1f74ac.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97488 has finished for PR 22708 at commit 811e45a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97491 has finished for PR 22708 at commit 46e942d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Oct 17, 2018
When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up.
I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans.

## What changes were proposed in this pull request?

MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean.

## How was this patch tested?

Added a test case.
Built complete project on travis.

michalsenkyr cloud-fan marmbrus liancheng

Closes #22708 from vofque/SPARK-21402.

Lead-authored-by: Vladimir Kuriatkov <[email protected]>
Co-authored-by: Vladimir Kuriatkov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e5b8136)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in e5b8136 Oct 17, 2018
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, late LGTM. Thank you, @vofque .

@dongjoon-hyun
Member

@cloud-fan and @vofque .
Can we have this fix in branch-2.3 and branch-2.2, too?
It seems that we need separate backport PRs for them.

@vofque
Contributor Author

vofque commented Oct 17, 2018

@dongjoon-hyun, sure, I'll create equivalent pull requests.

@dongjoon-hyun
Member

Thanks!

asfgit pushed a commit that referenced this pull request Oct 18, 2018
…ation

This PR is to backport #22708 to branch 2.2.

## What changes were proposed in this pull request?

MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean.

## How was this patch tested?

Added a test case.
Built complete project on travis.

dongjoon-hyun cloud-fan

Closes #22768 from vofque/SPARK-21402-2.2.

Lead-authored-by: Vladimir Kuriatkov <[email protected]>
Co-authored-by: Vladimir Kuriatkov <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 18, 2018
…ation

This PR is to backport #22708 to branch 2.3.

## What changes were proposed in this pull request?

MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean.

## How was this patch tested?

Added a test case.
Built complete project on travis.

dongjoon-hyun cloud-fan

Closes #22767 from vofque/SPARK-21402-2.3.

Authored-by: Vladimir Kuriatkov <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 24, 2018
This is a follow-up PR for #22708. It considers another case of java beans deserialization: java maps with struct keys/values.

When deserializing values of MapType with struct keys/values in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans.

## What changes were proposed in this pull request?

Invocations of "keyArray" and "valueArray" functions are used to extract arrays of keys and values. Struct type of keys or values is also inferred from java bean structure and ends up with mixed up field order.
I created a new UnresolvedInvoke expression as a temporary substitution of Invoke expression while no actual data is available. It allows to provide the resulting data type during analysis based on the resolved input data, not on the java bean (similar to UnresolvedMapObjects).

Key and value arrays are then fed to MapObjects expression which I replaced with UnresolvedMapObjects, just like in case of ArrayType.

Finally I added resolution of UnresolvedInvoke expressions in Analyzer.resolveExpression method as an additional pattern matching case.

## How was this patch tested?

Added a test case.
Built complete project on travis.

viirya kiszk cloud-fan michalsenkyr marmbrus liancheng

Closes #22745 from vofque/SPARK-21402-FOLLOWUP.

Lead-authored-by: Vladimir Kuriatkov <[email protected]>
Co-authored-by: Vladimir Kuriatkov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Willymontaz pushed a commit to criteo-forks/spark that referenced this pull request Sep 26, 2019
Willymontaz pushed a commit to criteo-forks/spark that referenced this pull request Sep 27, 2019