
Conversation

@michalsenkyr
Contributor

What changes were proposed in this pull request?

Continuing from #22527, this PR seeks to add support for beans in array, list and map fields when creating DataFrames from Java beans.

How was this patch tested?

Appropriate unit tests were amended.

Also manually tested in Spark shell:

scala> import scala.beans.BeanProperty
import scala.beans.BeanProperty

scala> class Nested(@BeanProperty var i: Int) extends Serializable
defined class Nested

scala> class Test(@BeanProperty var array: Array[Nested], @BeanProperty var list: java.util.List[Nested], @BeanProperty var map: java.util.Map[Integer, Nested]) extends Serializable
defined class Test

scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._

scala> val array = Array(new Nested(1))
array: Array[Nested] = Array(Nested@757ad227)

scala> val list = Seq(new Nested(2), new Nested(3)).asJava
list: java.util.List[Nested] = [Nested@633dce39, Nested@4dd28982]

scala> val map = Map(Int.box(1) -> new Nested(4), Int.box(2) -> new Nested(5)).asJava
map: java.util.Map[Integer,Nested] = {1=Nested@57421e4e, 2=Nested@5a75bad4}

scala> val df = spark.createDataFrame(Seq(new Test(array, list, map)).asJava, classOf[Test])
df: org.apache.spark.sql.DataFrame = [array: array<struct<i:int>>, list: array<struct<i:int>> ... 1 more field]

scala> df.show()
+-----+----------+--------------------+
|array|      list|                 map|
+-----+----------+--------------------+
|[[1]]|[[2], [3]]|[1 -> [4], 2 -> [5]]|
+-----+----------+--------------------+
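
The nested values in such a DataFrame can then be queried with ordinary column expressions. As a quick illustration (not part of the original test; written against the df above):

import org.apache.spark.sql.functions.col

df.select(col("array")(0)("i"), col("map")(1)("i")).show()
// with the data above, array(0).i is 1 and map(1).i is 4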

Previous behavior:

scala> val df = spark.createDataFrame(Seq(new Test(array, list, map)).asJava, classOf[Test])
java.lang.IllegalArgumentException: The value (Nested@3dedc8b8) of the type (Nested) cannot be converted to struct<i:int>
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:162)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:162)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:154)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
  at org.apache.spark.sql.SQLContext$$anonfun$createStructConverter$1$1$$anonfun$apply$1.apply(SQLContext.scala:1114)
  at org.apache.spark.sql.SQLContext$$anonfun$createStructConverter$1$1$$anonfun$apply$1.apply(SQLContext.scala:1113)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.apache.spark.sql.SQLContext$$anonfun$createStructConverter$1$1.apply(SQLContext.scala:1113)
  at org.apache.spark.sql.SQLContext$$anonfun$createStructConverter$1$1.apply(SQLContext.scala:1108)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
  at scala.collection.Iterator$class.toStream(Iterator.scala:1320)
  at scala.collection.AbstractIterator.toStream(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
  at scala.collection.AbstractIterator.toSeq(Iterator.scala:1334)
  at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:423)
  ... 51 elided

@michalsenkyr michalsenkyr changed the title Support for nested JavaBean arrays, lists and maps in createDataFrame [SPARK-25654] Support for nested JavaBean arrays, lists and maps in createDataFrame Oct 5, 2018
@michalsenkyr michalsenkyr changed the title [SPARK-25654] Support for nested JavaBean arrays, lists and maps in createDataFrame [SPARK-25654][SQL] Support for nested JavaBean arrays, lists and maps in createDataFrame Oct 5, 2018
beanClass: Class[_],
attrs: Seq[AttributeReference]): Iterator[InternalRow] = {
import scala.collection.JavaConverters._
import java.lang.reflect.{Type, ParameterizedType, Array => JavaArray}
Member

Why add the imports here? Can we move them to the top?

Contributor Author

I didn't want to needlessly add those to the whole file as the reflection stuff is needed only in this method. Ditto with collection converters. But if you think it is better at the top, I'll move it.

Member

It seems rare to see imports placed like this in the Spark codebase.

def interfaceParameters(t: Type, interface: Class[_]): Array[Type] = t match {
  case parType: ParameterizedType if parType.getRawType == interface =>
    parType.getActualTypeArguments
  case _ => throw new UnsupportedOperationException(s"$t is not an $interface")
Member

@viirya viirya Oct 6, 2018

This exception message looks a bit confusing. We could say that the given type is not supported and that only certain types (java.util.List and java.util.Map) are supported.
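
A minimal sketch of the suggested rewording, based on the interfaceParameters helper quoted above (the exact message text is only illustrative):

def interfaceParameters(t: Type, interface: Class[_]): Array[Type] = t match {
  case parType: ParameterizedType if parType.getRawType == interface =>
    parType.getActualTypeArguments
  case _ => throw new UnsupportedOperationException(
    s"Java bean field type $t is not supported; only java.util.List and java.util.Map are supported here")
}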

value => new GenericArrayData(
  (0 until JavaArray.getLength(value)).map(i =>
    converter(JavaArray.get(value, i))).toArray)
case (_, array: ArrayType) =>
Member

@viirya viirya Oct 6, 2018

Can you add a few code comments explaining why there are two cases for ArrayType?

Contributor Author

Sorry, I should have added a check for cls.isArray in the array case. That would make it clearer. I will also add a comment to each case with the actual type expected for that conversion.
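
A rough sketch of what the clarified array case could look like, assuming the createConverter helper from this PR; the cls.isArray guard, the comment, and the explicit casts are added for illustration only:

// Java array field, e.g. Nested[]
case (cls: Class[_], arrayType: ArrayType) if cls.isArray =>
  val elementConverter = createConverter(cls.getComponentType, arrayType.elementType)
  (value: Any) => new GenericArrayData(
    (0 until JavaArray.getLength(value.asInstanceOf[AnyRef])).map(i =>
      elementConverter(JavaArray.get(value.asInstanceOf[AnyRef], i))).toArray)

The second ArrayType case would then be reserved for java.util.List (or java.lang.Iterable, as discussed below) fields, with its own comment.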

@viirya
Member

viirya commented Oct 6, 2018

The createDataFrame API for Java beans doesn't clearly document which JavaBeans are supported. Can you also update it to document this explicitly?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Oct 6, 2018

Test build #97034 has finished for PR 22646 at commit b477d07.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 7, 2018

Test build #97086 has finished for PR 22646 at commit 095c923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

converter(JavaArray.get(value, i))).toArray)
case (_, array: ArrayType) =>
// java.util.List type
val cls = classOf[java.util.List[_]]
Member

@ueshin ueshin Oct 12, 2018

It seems like JavaTypeInference.inferDataType() supports java.lang.Iterable, not only List, but the serializer/deserializer don't. I'm not sure whether we should change inferDataType(). That issue would belong in a separate PR anyway, though.

Contributor Author

I think you are right. It would be better to change it to avoid confusion. I also agree that should be done in a separate PR.

Member

On second thought, we should use java.lang.Iterable here. We can convert an Iterable to ArrayType, as ArrayConverter already tries to do. If we use java.util.List here, it leads to behavior changes for lists of primitives.
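
A minimal sketch of what an Iterable-based case could look like, reusing the interfaceParameters and createConverter helpers from this PR and the JavaConverters import already present in this method (the derivation of elementConverter is an assumption; also note that fields declared as java.util.List would need more work than the exact-match helper shown above to be resolved against Iterable's type parameter):

case (iterableType, arrayType: ArrayType) =>
  // field declared as java.lang.Iterable[...]
  val Array(elementType) = interfaceParameters(iterableType, classOf[java.lang.Iterable[_]])
  val elementConverter = createConverter(elementType, arrayType.elementType)
  (value: Any) => new GenericArrayData(
    value.asInstanceOf[java.lang.Iterable[Any]].asScala.map(elementConverter).toArray)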

def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match {
case struct: StructType => createStructConverter(cls, struct.map(_.dataType))
case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match {
Member

BTW, how about we put this method in CatalystTypeConverters? It looks like a Catalyst converter for beans. A few Java types such as java.lang.Iterable, java.math.BigDecimal and java.math.BigInteger are already handled there.

Member

I'm okay with moving this to CatalystTypeConverters, but note that unfortunately CatalystTypeConverters doesn't seem to work properly with nested beans, which is exactly what we are trying to support here.

Member

Yeah, I was just thinking of moving this function there; it looks ugly that this file is getting so long.

Contributor Author

I took a quick look at CatalystTypeConverters and I believe there would be a problem with not being able to reliably distinguish Java beans from other arbitrary classes. We might end up calling setters or setting fields directly on objects that are not prepared for such manipulation, potentially creating hard-to-find errors. This method already assumes a Java bean, so that problem is not present here. Isn't that so?

case struct: StructType => createStructConverter(cls, struct.map(_.dataType))
case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match {
case (cls: Class[_], struct: StructType) =>
Member

Wait, can we reuse JavaTypeInference.serializerFor and make a projection, rather than reimplementing the whole logic here?
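
For context, a hedged sketch of that idea: ExpressionEncoder.javaBean (the encoder behind Encoders.bean) wraps JavaTypeInference.serializerFor, so in principle the rows could come from the encoder rather than a hand-written converter. This assumes the Spark 2.x API, a data: Iterator[Any] of bean instances, and the beanClass parameter of this method; whether the encoder covers everything this PR targets is exactly the question raised below:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val encoder = ExpressionEncoder.javaBean(beanClass.asInstanceOf[Class[Any]])
// toRow reuses its buffer, so each row is copied before being handed downstream
val rows: Iterator[InternalRow] = data.map(bean => encoder.toRow(bean).copy())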

Member

// TODO: we should only collect properties that have getter and setter. However, some tests
// pass in scala case class as java bean class which doesn't have getter and setter.

We should drop support for properties that have only a getter or only a setter. Adding @cloud-fan here as well.
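
For reference, a minimal sketch of collecting only the properties backed by both a getter and a setter, via java.beans.Introspector (this is the restriction being proposed, not what the current code does):

import java.beans.Introspector

def readWriteProperties(beanClass: Class[_]) =
  Introspector.getBeanInfo(beanClass).getPropertyDescriptors
    .filter(p => p.getReadMethod != null && p.getWriteMethod != null)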

Member

Reusing JavaTypeInference.serializerFor would be great, but currently it behaves a little differently. At least it doesn't support java.lang.Iterable[_], so we can't use it immediately. We need to extend it to support Iterable (and also deserializerFor).

Member

@HyukjinKwon HyukjinKwon Oct 17, 2018

Hm, how about we fix them together while we are here?

I also checked another difference, which is beans without a getter and/or setter, but I think this is something we should fix in 3.0.

Contributor Author

Frankly, I was not really sure about serializing sets as arrays, since the result stops behaving like a set, but I found a PR (#18416) where this seems to have been permitted, so I will go ahead and add that.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 6, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 6, 2020
@github-actions github-actions bot closed this Jan 7, 2020