-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-20384][SQL] Support value class in nested schema for Dataset #33205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20384][SQL] Support value class in nested schema for Dataset #33205
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
Outdated
Show resolved
Hide resolved
eejbyfeldt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR does not handle case class with generics types correctly, but this is probably fixable. But will probably require some more complexity in the getConstructorUnwrappedParameters function.
The old PR by @mt40 #22309 implemenation was such that a value class would always be encodeded by the underlying type. But in this PR it looks like in this PR encoded the elements like they are handled in the JVM, meaning that context is such that the AnyVal is represented as the underlying type the schema will just be the underlying type, while if it is boxed in the wrapper type this will also be reflected in the schema.
So with the following types
case class IntWrapper(val i: Int) extends AnyVal
case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)
the schema in the PR: #22309 would be (and is for what is currently in this branch)_
Schema(
StructType(
Seq(
StructField("generic", IntegerType, false),
StructField("value", IntegerType, false)
)
),
nullable = true)
)
But I guess the deseired behavior for the implemenation suggestion in this branch would give a schema
Schema(
StructType(
Seq(
StructField("generic", StructType(Seq(StructField("i", IntegerType, false)))),
StructField("value", IntegerType, false)
)
),
nullable = true)
)
I don't have any strong feeling which is the correct approach to take in the implemenation. This approach gives slightly simpler implementation, while the other one hides some of the ugliness/complexities of how AnyVals work on the JVM.
But it would be good to add some test cases with case classes with genereic fields so their behavior is covered in test cases.
| private def getConstructorUnwrappedParameters(tpe: Type, isTupleType: Boolean): | ||
| Seq[(String, Type)] = { | ||
| val params = getConstructorParameters(tpe) | ||
| if (isTupleType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A scala Tuple is just a case class with generics: https://github.com/scala/scala/blob/2.13.x/src/library/scala/Tuple2.scala#L24
But since the current code looks specifically for scala.Tuple this means that it will not handle user defined case classes with generics.
For example I think adding the following test in the ExpressionEncodeSuite case on this branch will fail:
case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)
encodeDecodeTest(
CaseClassWithGeneric[IntWrapper](IntWrapper(1), IntWrapper(2)),
"case class with generic fields")
|
I pushed in an initial implementation to a branch on my fork here: https://github.com/eejbyfeldt/spark/compare/master...support-value-classes-in-datasets?expand=1 I believe this implementation solves the cases where a case class has a generic member. My branch takes the approach that was in @mt40 original PR that value class are always encoded as the underlying type, since I am starting to think that is the correct way to implement this feature. |
The test cases are from: apache#33205
The test cases are from: apache#33205
|
@mickjermsurawong-stripe I created a WIP PR with my branch here: #33316 I think you can just take the @cloud-fan You previously were involved in reviewing this PR: #22309 Could you provide some input with regards to the importance of backwards compatibility with schema changes for case classes. It does not seem like that was really dicussed when that patch was being proposed. Based on searches in the mailing lists and stackoverflow (and this being broken for so long) it would seem like a lot of people are not using this feature: http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=value+class&days=0 |
|
hi @eejbyfeldt please feel free to drive this forward. Thank you for the work. Happy if you'd like to make a patch here :) |
I pushed a commit here (Since I think I am not allow to push to your branch): eejbyfeldt@cbbe05e it should be based on this branch therefore you can just cherry-pick it to this branch. Yeah, thinking about it I agree that it better to be backwards compatible. Then this change can be seen as a bug fix for "nested value classes" and will not break an existing code, which should be a much "safer" change to do. |
I also remove the unwrapped params naming, as I belive this is not helpful. Replacing the value class with their underlying type is what you have to do to get the correct type signature. Consider the following example: ``` $ cat TestValueClass.scala package example case class ValueClass(value: Int) extends AnyVal case class HasCaseClass(value1: ValueClass, value2: Int) $ scalac TestValueClass.scala $ javap -p example/HasCaseClass.class | grep 'public example.Has' public example.HasCaseClass copy(int, int); public example.HasCaseClass(int, int); ``` The Constructor for `HasCaseClass` does take two `int` and does not mention `ValueClass`.
|
thanks @eejbyfeldt this is much neater! |
eejbyfeldt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this implementation looks good! Would be great if some admin had look.
|
Jenkins test this please |
|
Is there any behavior change you can think of that might affect users? |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #141957 has finished for PR 33205 at commit
|
Hi Sean, thanks for having a look! This only changes is for case class containing value class. e.g Before this patch trying to create a But with this patch it will work like expected. Unless someone explicitly depend on having this failure I don't think there should be any behavior change that is noticeable for users. |
|
Hi Sean, I can also confirm that the impact on users here is rather fixing the currently broken behavior. There's no schema change to the previously working edge case eg. |
|
Just to triple check my understanding, the following isn't a problem because it doesn't work at all now, or something else? "However, what we can change, without breaking, is schema of nested value class, which will fails due to the compile problem, and thus its schema now isn't actually valid. After the change, the schema of this nested value class is now flattened" |
|
That's correct @srowen. Nested AnyVal (value class) generally does not work currently. Value class in nested schema 1) currently does not work because the schema described has AnyVal class 2) but when accessing that nested value actually has unwrapped type To your specific clarification
It does work in one case of value class in parameterized class like This fix will also resolve schema issue SPARK-20384 originally described; the reporter will be able to access the value class in an unwrapped fashion. |
|
Merged to master |
| } | ||
|
|
||
| test("SPARK-20384: schema for tuple_2 of value class") { | ||
| val schema = schemaFor[(IntWrapper, StrWrapper)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit weird that the schema of case class of value classes is not consistent with the schema of tuple of value classes, but there seems no better solution as we need to keep backward compatibility.
What changes were proposed in this pull request?
AnyValvalue class."schema for case class that is a value class". This PR keeps the schema.Tuple2[AnyVal, AnyVal]which is a constructor ("nested class") but is a generic type, so it should not be flattened behaving similarly toSeq[AnyVal]Why are the changes needed?
anyValclass in its unwrapped form, but we encode the type to be the wrapped case class. This results in compile of generated codeFor example,
For a given
AnyValwrapper and its root-level class containerThe problematic part of generated code:
We get this errror: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
From doc on value class: , Given:
class Wrapper(val underlying: Int) extends AnyVal,Wrapper, but at runtime, the representation is anInt". This implies that when our struct has a field of value class, the generated code should support the underlying type during runtime execution.Wrapper"must be instantiated... when a value class is used as a type argument". This implies thatscala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper]will still contain Wrapper as-is in during runtime instead ofInt.Does this PR introduce any user-facing change?
How was this patch tested?