[SPARK-17952][SQL] Nested Java beans support in createDataFrame #22527
Conversation
```scala
    .toMap
new GenericInternalRow(structType.map(nestedProperty =>
  invoke(value)(nestedExtractors(nestedProperty.name) -> nestedProperty.dataType)
).toArray)
```
Why should we use a map here while we don't need it for the root bean?
Right, we don't have to. Just checked: JavaTypeInference.inferDataType also uses JavaTypeInference.getJavaBeanReadableProperties, so the order should be the same. Also double-checked manually in the Spark shell with a more complex nested bean to be sure.
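For illustration, a quick way to see that ordering from a shell. This uses a hypothetical Probe bean; as far as I can tell, getJavaBeanReadableProperties is backed by java.beans.Introspector, which enumerates properties sorted by name:

```scala
import java.beans.Introspector
import scala.beans.BeanProperty

// Hypothetical probe bean with properties declared out of alphabetical order
class Probe(@BeanProperty var b: String, @BeanProperty var a: Int)

// Introspector returns properties sorted by name regardless of declaration
// order (Spark filters out the synthetic "class" property)
Introspector.getBeanInfo(classOf[Probe])
  .getPropertyDescriptors.map(_.getName).mkString(", ")
// => "a, b, class"
```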
```java
DataTypes.createStructType(Collections.singletonList(new StructField(
  "a", IntegerType$.MODULE$, false, Metadata.empty()))),
  true, Metadata.empty()),
schema.apply("f"));
```
should be double spaced.
Force-pushed from e9e5749 to d8083cf.
Jenkins, ok to test

Test build #96809 has finished for PR 22527 at commit
```scala
val methodsToTypes = extractors.zip(attrs).map { case (e, attr) =>
  (e, attr.dataType)
}
def invoke(element: Any)(tuple: (Method, DataType)): Any = tuple match {
```
Can we create the converters once, before data.map { ... }, instead of recalculating them for each row?
I mean something like:
```scala
def converter(e: Method, dt: DataType): Any => Any = dt match {
  case StructType(fields) =>
    val nestedExtractors =
      JavaTypeInference.getJavaBeanReadableProperties(e.getReturnType).map(_.getReadMethod)
    val nestedConverters = nestedExtractors.zip(fields).map { case (extractor, field) =>
      converter(extractor, field.dataType)
    }
    element => {
      val value = e.invoke(element)
      new GenericInternalRow(nestedConverters.map(_(value)))
    }
  case _ =>
    val convert = CatalystTypeConverters.createToCatalystConverter(dt)
    element => convert(e.invoke(element))
}
```

and then
```scala
val converters = extractors.zip(attrs).map { case (e, attr) =>
  converter(e, attr.dataType)
}
data.map { element =>
  new GenericInternalRow(converters.map(_(element))): InternalRow
}
```
Good point. Thank you. Changed it in the latest commit.
I restructured the code in this commit to allow easier addition of array/list support in the future.

Test build #96869 has finished for PR 22527 at commit
ueshin left a comment:
LGTM for the current changes except for some comments.
I guess we need to support array/list of beans as you suggested, and map of beans as well.
@michalsenkyr Do you want to address them? We can do that here or in separate PRs.
```scala
  JavaTypeInference.getJavaBeanReadableProperties(beanClass).map(_.getReadMethod)
val methodsToConverts = extractors.zip(attrs).map { case (e, attr) =>
  (e, CatalystTypeConverters.createToCatalystConverter(attr.dataType))
def createStructConverter(cls: Class[_], fieldTypes: Iterator[DataType]): Any => InternalRow = {
```
nit: Seq[DataType] instead of Iterator[DataType]?
I used Iterators instead of Seqs in order to avoid creating intermediate collections. However, I agree it's more concise without that.
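For what it's worth, a toy illustration of that tradeoff (not the PR's code): each map on a Seq materializes a new collection, while an Iterator stays lazy and is consumed in one pass.

```scala
// Eager: builds an intermediate Seq after the first map, then a final one
val viaSeq: Seq[Int] = Seq(1, 2, 3).map(_ + 1).map(_ * 2)

// Lazy: no intermediate collections until the terminal toArray
val viaIterator: Array[Int] = Seq(1, 2, 3).iterator.map(_ + 1).map(_ * 2).toArray
```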
```scala
val method = property.getReadMethod
method -> createConverter(method.getReturnType, fieldType)
}.toArray
value => new GenericInternalRow(
```
We should check whether the value is null or not. Also, could you add a test for that case?
You're right. Added. Thanks.
@ueshin Yes, I am already working on array/list support and will add maps as well. It shouldn't require a rewrite now that the code is restructured, just new cases in the pattern match, so I think it's OK to do in another PR.
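For illustration only, a rough sketch of the kind of extra case being described here, reusing this PR's recursive createConverter(cls, dataType) shape. The names and the exact array handling are my assumptions, not the follow-up PR's code:

```scala
import org.apache.spark.sql.catalyst.util.GenericArrayData
import org.apache.spark.sql.types.{ArrayType, DataType}

// Hypothetical extra case for createConverter's match on the Catalyst type:
// recurse on the element type, then convert each element of the Java array
def arrayOfBeansCase(cls: Class[_], dataType: DataType,
    createConverter: (Class[_], DataType) => Any => Any): Any => Any = dataType match {
  case ArrayType(elementType, _) =>
    val elementConverter = createConverter(cls.getComponentType, elementType)
    (value: Any) =>
      if (value == null) null
      else new GenericArrayData(value.asInstanceOf[Array[AnyRef]].map(elementConverter))
}
```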
Test build #96901 has finished for PR 22527 at commit
ueshin left a comment:
LGTM except for a nit.
@HyukjinKwon Could you take a look again please?
```scala
else new GenericInternalRow(
  methodConverters.map { case (method, converter) =>
    converter(method.invoke(value))
  })
```
nit: please use braces for multi-line if-else.
Done
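For reference, the resulting shape after the nit, paraphrasing the diff context above (the methodConverters name is taken from the quoted diff; this is a sketch, not the exact committed code):

```scala
(value: Any) =>
  if (value == null) {
    null
  } else {
    new GenericInternalRow(methodConverters.map { case (method, converter) =>
      converter(method.invoke(value))
    })
  }
```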
Force-pushed from 5d07254 to e9b5a98.
Test build #96954 has finished for PR 22527 at commit
I'd merge this now so as not to block the follow-up PRs supporting array/list and map of beans.
Thanks! Merging to master.
Sorry, the merge script failed. Let me try again in a while.
Seems like there is a merge commit in the Apache git repo (https://git-wip-us.apache.org/repos/asf?p=spark.git) but not on GitHub yet.
Thanks! I created a new PR with array, list and map support.
Hey @ueshin, sorry I was late. I'm a bit busy these couple of weeks, so please don't let me block this; feel free to ignore me. Thank you for asking.
Looks good to me too!
## What changes were proposed in this pull request?

When constructing a DataFrame from a Java bean, using nested beans throws an error despite [documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection) stating otherwise. This PR aims to add that support.

This PR does not yet add nested bean support in array or List fields. This can be added later or in another PR.

## How was this patch tested?

A nested bean was added to the appropriate unit test.

Also manually tested in the Spark shell on code emulating the referenced JIRA:

```
scala> import scala.beans.BeanProperty
import scala.beans.BeanProperty

scala> class SubCategory(@BeanProperty var id: String, @BeanProperty var name: String) extends Serializable
defined class SubCategory

scala> class Category(@BeanProperty var id: String, @BeanProperty var subCategory: SubCategory) extends Serializable
defined class Category

scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._

scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category])
java.lang.IllegalArgumentException: The value (SubCategory@65130cf2) of the type (SubCategory) cannot be converted to struct<id:string,name:string>
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
  at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108)
  at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1108)
  at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
  at scala.collection.Iterator$class.toStream(Iterator.scala:1320)
  at scala.collection.AbstractIterator.toStream(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
  at scala.collection.AbstractIterator.toSeq(Iterator.scala:1334)
  at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:423)
  ... 51 elided
```

New behavior:

```
scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category])
res0: org.apache.spark.sql.DataFrame = [id: string, subCategory: struct<id: string, name: string>]

scala> res0.show()
+-----+---------------+
|   id|    subCategory|
+-----+---------------+
|s-111|[sc-111, Sub-1]|
+-----+---------------+
```

Closes apache#22527 from michalsenkyr/SPARK-17952.

Authored-by: Michal Senkyr <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>