[SPARK-25654][SQL] Support for nested JavaBean arrays, lists and maps in createDataFrame #22646
Conversation
      beanClass: Class[_],
      attrs: Seq[AttributeReference]): Iterator[InternalRow] = {
    import scala.collection.JavaConverters._
    import java.lang.reflect.{Type, ParameterizedType, Array => JavaArray}
Why add the imports here? Can we move them to the top of the file?
I didn't want to needlessly add those to the whole file as the reflection stuff is needed only in this method. Ditto with collection converters. But if you think it is better at the top, I'll move it.
It is rare to see imports placed like this in the Spark codebase.
    def interfaceParameters(t: Type, interface: Class[_]): Array[Type] = t match {
      case parType: ParameterizedType if parType.getRawType == interface =>
        parType.getActualTypeArguments
      case _ => throw new UnsupportedOperationException(s"$t is not an $interface")
This exception message looks a bit confusing. We could say that the given type is not supported and that we only support certain types (java.util.List and java.util.Map).
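For illustration, an error path along those lines might look like the following; the method shape follows the diff above, but the exact message wording is only a suggestion, not the PR's code:

```scala
import java.lang.reflect.{ParameterizedType, Type}

// Hedged sketch of a clearer message (hypothetical wording).
def interfaceParameters(t: Type, interface: Class[_]): Array[Type] = t match {
  case parType: ParameterizedType if parType.getRawType == interface =>
    parType.getActualTypeArguments
  case _ =>
    throw new UnsupportedOperationException(
      s"Unsupported type $t; only ${interface.getName} " +
        "(e.g. java.util.List or java.util.Map) is supported here.")
}
```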
        value => new GenericArrayData(
          (0 until JavaArray.getLength(value)).map(i =>
            converter(JavaArray.get(value, i))).toArray)
      case (_, array: ArrayType) =>
Can you add a few code comments explaining why there are two cases for ArrayType?
Sorry, I should have added a check for cls.isArray in the array case. That would make it clearer. I will also add a comment to each case with the actual type expected for that conversion.
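For readers of this thread, a rough, self-contained sketch of the distinction the two cases draw; the names below are illustrative, not the PR's code:

```scala
import java.lang.reflect.{Array => JavaArray}
import scala.collection.JavaConverters._

// Why two cases are needed: a Java array field is read through
// java.lang.reflect.Array, while a java.util.List field is walked via its iterator.
def elementsOf(value: AnyRef): Seq[Any] = value match {
  case list: java.util.List[_] =>
    // e.g. a List<Bean> field
    list.asScala.toSeq
  case arr if arr.getClass.isArray =>
    // e.g. a Bean[] or int[] field
    (0 until JavaArray.getLength(arr)).map(i => JavaArray.get(arr, i))
  case other =>
    throw new UnsupportedOperationException(s"Unexpected value: $other")
}
```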
ok to test
Test build #97034 has finished for PR 22646 at commit
f0ca5be to 5db0e2d
5db0e2d to 095c923
Test build #97086 has finished for PR 22646 at commit
            converter(JavaArray.get(value, i))).toArray)
      case (_, array: ArrayType) =>
        // java.util.List type
        val cls = classOf[java.util.List[_]]
It seems JavaTypeInference.inferDataType() supports java.lang.Iterable, not only List, but the serializer/deserializer don't. I'm not sure whether we should change inferDataType(); that issue would belong in a separate PR anyway.
I think you are right. It would be better to change it to avoid confusion. I also agree with doing that in a separate PR.
On second thought, we should use java.lang.Iterable here. We can convert an Iterable to ArrayType, as ArrayConverter is already trying to do. If we use java.util.List here, it leads to behavior changes for lists of primitives.
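A minimal sketch of what element-wise conversion over java.lang.Iterable could look like; `elementConverter` is a hypothetical per-element converter, not a name from the PR:

```scala
import scala.collection.JavaConverters._

// Convert any java.lang.Iterable (not just java.util.List) element by element.
def iterableToCatalyst(
    iterable: java.lang.Iterable[_],
    elementConverter: Any => Any): Seq[Any] =
  iterable.asScala.map(elementConverter).toSeq
```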
    def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match {
      case struct: StructType => createStructConverter(cls, struct.map(_.dataType))
      case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
    def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match {
BTW, how about we put this method in CatalystTypeConverters? It looks like a Catalyst converter for beans. A few Java types, such as java.lang.Iterable, java.math.BigDecimal and java.math.BigInteger, are already handled there.
I'm okay with moving this to CatalystTypeConverters, but note that unfortunately CatalystTypeConverters doesn't seem to work properly with nested beans, which is what we are trying to support here.
Yea, I was just thinking of moving this function there; it looks ugly that this file is getting so long.
I took a quick look at CatalystTypeConverters and I believe there would be a problem in not being able to reliably distinguish Java beans from other arbitrary classes. We might use setters or set fields directly on objects that are not prepared for such manipulation, potentially creating hard-to-find errors. This method already assumes a Java bean, so that problem is not present here. Isn't that so?
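A hypothetical heuristic, just to illustrate why distinguishing beans from arbitrary classes at runtime is unreliable (this is not code from the PR):

```scala
import java.beans.Introspector

// Given only a runtime class, a generic converter would have to guess whether
// it is a Java bean, e.g. by checking for properties with both a getter and a setter.
def looksLikeJavaBean(cls: Class[_]): Boolean = {
  val beanInfo = Introspector.getBeanInfo(cls, classOf[Object])
  beanInfo.getPropertyDescriptors.exists { p =>
    p.getReadMethod != null && p.getWriteMethod != null
  }
}
```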
      case struct: StructType => createStructConverter(cls, struct.map(_.dataType))
      case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
    def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match {
      case (cls: Class[_], struct: StructType) =>
Wait, can we reuse JavaTypeInference.serializerFor and make a projection, rather than reimplementing the whole logic here?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
Lines 131 to 132 in 5264164
  // TODO: we should only collect properties that have getter and setter. However, some tests
  // pass in scala case class as java bean class which doesn't have getter and setter.
We should drop the support for getter-only or setter-only properties. Adding @cloud-fan here as well.
Reusing JavaTypeInference.serializerFor would be great, but currently it behaves a little differently. At least it doesn't support java.lang.Iterable[_], so we can't use it immediately. We need to extend it to support Iterable (and also deserializerFor).
Hm, how about we fix them together while we are here?
I also checked another difference, which is beans without a getter and/or setter, but I think this is something we should fix in 3.0.
Frankly, I was not really sure about serializing sets as arrays, since the result stops behaving like a set, but I found a PR (#18416) where this seems to have been permitted, so I will go ahead and add that.
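A small illustration of that trade-off (illustrative only, not the PR's code):

```scala
import scala.collection.JavaConverters._

// Once a java.util.Set field is stored as an array column, set semantics
// (no duplicates, membership-based lookup) are no longer enforced on the Spark side.
val javaSet = new java.util.HashSet[String]()
javaSet.add("a")
javaSet.add("b")
val asArrayValue: Array[String] = javaSet.asScala.toArray
```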
Can one of the admins verify this patch?

We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Continuing from #22527, this PR seeks to add support for beans in array, list and map fields when creating DataFrames from Java beans.
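For illustration, hypothetical nested beans of the kind this PR targets (not taken from the PR's own test code):

```scala
import scala.beans.BeanProperty

// Beans with array, list and map fields that hold other beans.
class Inner extends java.io.Serializable {
  @BeanProperty var value: Int = 0
}

class Outer extends java.io.Serializable {
  @BeanProperty var items: java.util.List[Inner] = _
  @BeanProperty var byKey: java.util.Map[String, Inner] = _
  @BeanProperty var beans: Array[Inner] = _
}

// In spark-shell (with `spark` in scope), a DataFrame with nested struct,
// array and map columns could then be built roughly like this:
//   val df = spark.createDataFrame(java.util.Arrays.asList(new Outer), classOf[Outer])
//   df.printSchema()
```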
How was this patch tested?
Appropriate unit tests were amended.
Also manually tested in Spark shell:
Previous behavior: