-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32585][SQL] Support scala enumeration in ScalaReflection #29403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #127308 has finished for PR 29403 at commit
|
|
Test build #127319 has finished for PR 29403 at commit
|
|
Test build #127352 has finished for PR 29403 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rednaxelafx would you have any thoughts? you might know of any potential issues in use of Scala enums
|
I don't know if there exists any issue or reason why we not support it. @maropu @cloud-fan @dongjoon-hyun do you have any thought ? |
marmbrus
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems generally useful to me. I would use it and there are blog posts that talk about working around this. A few comments on nullability.
| StructField(fieldName, dataType, nullable) | ||
| }), nullable = true) | ||
| case t if isSubtype(t, localTypeOf[Enumeration#Value]) => | ||
| Schema(StringType, nullable = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is nullable = false, I believe this should be reserved for primitive types that cannot be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it should be true.
| test("SPARK-32585: Support scala enumeration in ScalaReflection") { | ||
| checkDataset( | ||
| Seq(FooClassWithEnum(1, FooEnum.E1), FooClassWithEnum(2, FooEnum.E2)).toDS(), | ||
| Seq(FooClassWithEnum(1, FooEnum.E1), FooClassWithEnum(2, FooEnum.E2)): _* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to the above comment, I would add a test case where the enum value is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the null test.
| assert(deserializerFor[FooWithAnnotation].dataType == ObjectType(classOf[FooWithAnnotation])) | ||
| } | ||
|
|
||
| test("SPARK-32585: Support scala enumeration in ScalaReflection") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider putting a test case in ExpressionEncoderSuite as well as I think that checks various combinations of evaluation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, not realized this test.
| // we can call example.Foo.withName to deserialize string to enumeration. | ||
| val className = t.asInstanceOf[TypeRef].pre.typeSymbol.asClass.fullName | ||
| // this check is for spark-shell which give a default package name like '$line1.$read$$iw' | ||
| if (className.startsWith("$")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually understand this limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually a jline issue that default package is something like $line32.$read.$iw.$iw.$iw.$iw.$iw. We can't find Enum class using reflect. I choose to forbid it, but please tell me if there exists a better way to go.
| s"Enumeration class required package name, but found $className") | ||
| } | ||
|
|
||
| val clazz = Utils.classForName(className) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative:
val parent = t.asInstanceOf[TypeRef].pre.typeSymbol.asClass
val cls = scala.reflect.runtime.universe
.runtimeMirror(getClass.getClassLoader)
.runtimeClass(parent)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this pure scala reflection fix the issues with the REPL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fixed, exists the same issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what is the error? Regardless, I'm not sure this limitation should hold up merging the basic support for enums.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check this again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After deep check some code, I found the magic.
- we can get
FooEnumclass with default package usingmirror. runtimeClass. - we failed in
StaticInvokewhich will re-reflect the class if class name end with$.
The StaticInvoke related code and pr #20753:
val objectName = staticObject.getName.stripSuffix("$")
val cls = if (staticObject.getName == objectName) {
staticObject
} else {
Utils.classForName(objectName)
}
I cann't find more comment about this code, seems we can use staticObject directly instead of re-reflect it, is it ?
cc @kiszk
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129147 has finished for PR 29403 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129158 has finished for PR 29403 at commit
|
| ObjectType(getClassFromType(t)), | ||
| "withName", | ||
| createDeserializerForString(path, false) :: Nil, | ||
| returnNullable = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if I'm correct, it would be good to include a test case for this case. What happens if you actually do operations on the null value stored as an enum? Is the nullability of the resulting schema correct. Do operations like .isNotNull work correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need returnNullable = false since Enumeration.withName will never return null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you actually do operations on the null value stored as an enum
We will get an exception, and the behavior is same as Java enum what we supported.
case other if other.isEnum =>
createSerializerForString(
Invoke(inputObject, "name", ObjectType(classOf[String]), returnNullable = false))
| inputObject, | ||
| "toString", | ||
| ObjectType(classOf[java.lang.String]), | ||
| returnNullable = false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as withName, toString also return a non-null value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I agree.
|
cc @tdas modulo the nullability concerns above, this LGTM. |
| StructField(fieldName, dataType, nullable) | ||
| }), nullable = true) | ||
| case t if isSubtype(t, localTypeOf[Enumeration#Value]) => | ||
| Schema(StringType, nullable = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are returning the schema generated from enum as nullable, but the serializer expression is not nullable (because it does not produce null). Why is that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two side of nullable.
- For schema,
nullablemeans the column can benullvalue. - For
returnNullablewhich used inInvokeandStaticInvokemeans the method promise will never produce anullvalue.
Some code get from Invoke
def invoke(
obj: Any,
method: Method,
arguments: Seq[Expression],
input: InternalRow,
dataType: DataType): Any = {
val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
// return null if one of arguments is null
null
} else {
val ret = method.invoke(obj, args: _*)
val boxedClass = ScalaReflection.typeBoxedJavaMapping.get(dataType)
if (boxedClass.isDefined) {
boxedClass.get.cast(ret)
} else {
ret
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So effectively, in the serialized form, we are allowing nulls to be present in the column mapping to the enum field. But if there is indeed a row with a null in that column and we attempt to deserialize that rows, then it will cause a runtime failure...isnt it?
If this understanding is correct, then serializing will never produce null, and even if there is a null, serializing will fail. Then why keep this column nullable = true? I am not necessarily opposed to it, I am just trying to understand the rationale. cc @marmbrus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if there is indeed a row with a null in that column and we attempt to deserialize that rows, then it will cause a runtime failure...isnt it
Actually we will get null back. It's a trick that if input value is null the serializing will return null.
|
Test build #129216 has finished for PR 29403 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
| // the fullName of tpe is example.Foo.Foo, but we need example.Foo so that | ||
| // we can call example.Foo.withName to deserialize string to enumeration. | ||
| val parent = t.asInstanceOf[TypeRef].pre.typeSymbol.asClass | ||
| val cls = mirror.runtimeClass(parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow scala reflection.
|
LGTM. Merging it to master. |
|
thanks for merging and review ! |
What changes were proposed in this pull request?
Add code in
ScalaReflectionto support scala enumeration and make enumeration type as string type in Spark.Why are the changes needed?
We support java enum but failed with scala enum, it's better to keep the same behavior.
Here is a example.
Before this PR
After this PR
org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]Does this PR introduce any user-facing change?
Yes, user can make case class which include scala enumeration field as dataset.
How was this patch tested?
Add test.