[SPARK-20384][SQL] Support value class in nested schema for Dataset #33205

mickjermsurawong-stripe · 2021-07-05T00:13:20Z

What changes were proposed in this pull request?

This PR revisits #22309 ([SPARK-20384][SQL] Support value class in schema of Dataset) and solves the original problem in SPARK-20384, while additionally preventing a backward-compatibility break in the schema of top-level AnyVal value classes.
Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class: the field of the underlying type is present in the schema. This means any DataFrame SQL filtering on it expects the field name to be present. The previous PR changes this schema and would break current usage. See the test "schema for case class that is a value class". This PR keeps the schema.
Prior to this change we already support collections of value classes, but not case classes with nested value classes. The schema of these collections therefore shouldn't change either, to avoid breaking existing usage.
However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem, so its schema isn't actually valid today. After the change, the schema of this nested value class is flattened.
With this PR, there's flattening only for nested value classes (new), but not for top-level and collection classes (existing behavior).
This PR also revisits #27153 ([SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)) by handling tuples such as Tuple2[AnyVal, AnyVal]: a tuple has a constructor (like a "nested class") but is a generic type, so it should not be flattened and behaves similarly to Seq[AnyVal].
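
The rule above (flatten a value class only when it is a case class field; keep it wrapped at the top level, inside collections, and inside tuples) can be sketched as a toy model. This is only an illustration and not the code in ScalaReflection: the `Ty` hierarchy, the `schema` function, and the string rendering are hypothetical names made up for this sketch, and the field name `s` for `StrWrapper` is an assumption.

```scala
// Toy model of the flattening rule; NOT Spark's ScalaReflection API.
object ValueClassSchemaSketch {
  // Hypothetical description of the Scala types involved.
  sealed trait Ty
  case object IntTy extends Ty
  case object StringTy extends Ty
  final case class ValueClassTy(field: String, underlying: Ty) extends Ty
  final case class CaseClassTy(fields: List[(String, Ty)]) extends Ty
  final case class TupleTy(elems: List[Ty]) extends Ty
  final case class SeqTy(elem: Ty) extends Ty

  // Renders a schema string. `isCaseClassField` is true only when `t` is the
  // declared type of a field of a (non-tuple) case class.
  def schema(t: Ty, isCaseClassField: Boolean = false): String = t match {
    case IntTy => "int"
    case StringTy => "string"
    // New behavior: a value class nested in a case class is flattened.
    case ValueClassTy(_, underlying) if isCaseClassField => schema(underlying)
    // Existing behavior: top-level / type-argument value classes stay wrapped.
    case ValueClassTy(field, underlying) => s"struct<$field:${schema(underlying)}>"
    case CaseClassTy(fields) =>
      fields
        .map { case (name, ty) => s"$name:${schema(ty, isCaseClassField = true)}" }
        .mkString("struct<", ",", ">")
    // Tuples are generic, so their elements are treated like Seq elements.
    case TupleTy(elems) =>
      elems.zipWithIndex
        .map { case (ty, idx) => s"_${idx + 1}:${schema(ty)}" }
        .mkString("struct<", ",", ">")
    case SeqTy(elem) => s"array<${schema(elem)}>"
  }

  val intWrapper: Ty = ValueClassTy("i", IntTy)
  val strWrapper: Ty = ValueClassTy("s", StringTy) // field name assumed

  def main(args: Array[String]): Unit = {
    println(schema(intWrapper))                                // top-level
    println(schema(CaseClassTy(List("c" -> intWrapper))))      // nested
    println(schema(SeqTy(intWrapper)))                         // collection
    println(schema(TupleTy(List(intWrapper, strWrapper))))     // tuple
  }
}
```

In this model only the nested case yields a bare `int` field; the other three keep the one-field struct, which mirrors the backward-compatibility constraint described above.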

Why are the changes needed?

Currently, nested value classes aren't supported. This is because the generated code treats the AnyVal class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
For example, given an AnyVal wrapper and its root-level container class:

case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)

The problematic part of generated code:

    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care about
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We declare its member with the wrapper type (the case class extending `AnyVal`)
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: the compiled `c()` however returns `int`, hence the error
                value_43 = value_46.c();
            }
        }

We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

From the doc on value classes, given: class Wrapper(val underlying: Int) extends AnyVal,

"The type at compile time is Wrapper, but at runtime, the representation is an Int". This implies that when our struct has a field that is a value class, the generated code should support the underlying type during runtime execution.
Wrapper "must be instantiated... when a value class is used as a type argument". This implies that scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], and Option[Wrapper] will still contain Wrapper as-is at runtime instead of Int.
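
The second point can be checked with plain Scala, without Spark. The following is a minimal sketch under the assumption that looking at the runtime class of stored elements is a fair proxy for "contains Wrapper as-is"; the object name and the `storedClass` helper are made up for illustration.

```scala
// Plain-Scala check of how a value class is stored when used as a type argument.
object ValueClassAsTypeArgument {
  final case class Wrapper(underlying: Int) extends AnyVal

  // Hypothetical helper: the collection is upcast (covariantly) to
  // Iterable[Any], so the element is read back exactly as it was stored.
  def storedClass(elems: Iterable[Any]): Class[_] = elems.head.getClass

  // Elements of Seq[Wrapper] and values of Map[String, Wrapper].
  val seqElem: Class[_] = storedClass(Seq[Wrapper](Wrapper(1)))
  val mapValue: Class[_] = storedClass(Map[String, Wrapper]("k" -> Wrapper(2)).values)
  // First element of a Tuple2[Wrapper, String].
  val tupleElem: Class[_] = (Wrapper(3), "x").productElement(0).getClass
  // For contrast: a plain Int in a Seq is stored as java.lang.Integer.
  val intElem: Class[_] = storedClass(Seq[Int](1))

  def main(args: Array[String]): Unit = {
    println(s"Seq[Wrapper] element:       $seqElem")
    println(s"Map[String, Wrapper] value: $mapValue")
    println(s"Tuple2 element:             $tupleElem")
    println(s"Seq[Int] element:           $intElem")
  }
}
```

In each generic container the stored object is an instance of `Wrapper` rather than a boxed `Integer`, which is why those schemas can keep the wrapper struct.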

Does this PR introduce any user-facing change?

Yes, this will allow support for the nested value class.

How was this patch tested?

Added unit tests to illustrate
- raw schema
- projection
- round-trip encode/decode

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

eejbyfeldt

I think this PR does not handle case classes with generic types correctly. This is probably fixable, but will likely require some more complexity in the getConstructorUnwrappedParameters function.

The implementation in the old PR #22309 by @mt40 was such that a value class would always be encoded as its underlying type. This PR, however, appears to encode elements the way they are handled on the JVM: where the context is such that the AnyVal is represented as its underlying type, the schema will just be the underlying type, while if it is boxed in the wrapper type, this will also be reflected in the schema.

So with the following types

case class IntWrapper(val i: Int) extends AnyVal
case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)

the schema in PR #22309 would be (and is for what is currently in this branch):

Schema(
  StructType(
    Seq(
      StructField("generic", IntegerType, false),
      StructField("value", IntegerType, false)
    )
  ),
  nullable = true
)

But I guess the desired behavior for the implementation suggested in this branch would be to give the schema

Schema(
  StructType(
    Seq(
      StructField("generic", StructType(Seq(StructField("i", IntegerType, false)))),
      StructField("value", IntegerType, false)
    )
  ),
  nullable = true
)

I don't have a strong feeling about which approach is the correct one to take in the implementation. This approach gives a slightly simpler implementation, while the other one hides some of the ugliness/complexities of how AnyVals work on the JVM.

But it would be good to add some test cases with case classes with generic fields so their behavior is covered.

eejbyfeldt · 2021-07-05T12:25:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

+  private def getConstructorUnwrappedParameters(tpe: Type, isTupleType: Boolean):
+  Seq[(String, Type)] = {
+    val params = getConstructorParameters(tpe)
+    if (isTupleType) {


A scala Tuple is just a case class with generics: https://github.com/scala/scala/blob/2.13.x/src/library/scala/Tuple2.scala#L24

But since the current code looks specifically for scala.Tuple, it will not handle user-defined case classes with generics.

For example, I think adding the following test to ExpressionEncoderSuite on this branch will fail:

case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)

encodeDecodeTest(
  CaseClassWithGeneric[IntWrapper](IntWrapper(1), IntWrapper(2)),
  "case class with generic fields")
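
At the JVM level, such a user-defined generic case class mixes both representations, much like a tuple does for its elements. The sketch below uses only Java reflection on plain Scala classes to show this; it says nothing about which schema Spark derives, and the object and val names are made up for illustration.

```scala
// Plain-Scala look at the JVM shape of a generic case class with a value class field.
object GenericFieldErasureSketch {
  final case class IntWrapper(i: Int) extends AnyVal
  final case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)

  // JVM parameter types of the (single) public constructor:
  // the generic parameter erases to Object, the IntWrapper parameter to int.
  val ctorParams: List[Class[_]] =
    classOf[CaseClassWithGeneric[_]].getConstructors.head.getParameterTypes.toList

  private val instance =
    CaseClassWithGeneric[IntWrapper](IntWrapper(1), IntWrapper(2))

  // What is actually stored in the generic field: a boxed IntWrapper.
  val genericFieldClass: Class[_] =
    classOf[CaseClassWithGeneric[_]].getMethod("generic").invoke(instance).getClass

  // JVM return type of the accessor for the non-generic field: int.
  val valueAccessorType: Class[_] =
    classOf[CaseClassWithGeneric[_]].getMethod("value").getReturnType

  def main(args: Array[String]): Unit = {
    println(s"constructor parameters: $ctorParams")
    println(s"generic field holds:    $genericFieldClass")
    println(s"value accessor returns: $valueAccessorType")
  }
}
```

So a check that only special-cases scala.Tuple would treat `generic` as if it were unwrapped, even though the instance holds a boxed `IntWrapper` there.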

eejbyfeldt · 2021-07-10T10:10:28Z

I pushed an initial implementation to a branch on my fork here: https://github.com/eejbyfeldt/spark/compare/master...support-value-classes-in-datasets?expand=1 I believe this implementation solves the cases where a case class has a generic member. My branch takes the approach from @mt40's original PR, where value classes are always encoded as the underlying type, since I am starting to think that is the correct way to implement this feature.

The test cases are from: apache#33205

eejbyfeldt · 2021-07-13T06:50:55Z

@mickjermsurawong-stripe I created a WIP PR with my branch here: #33316 I think you can just take the getConstructorParameters from there to make this patch handle case classes and tuples correctly. Let me know if you want me to make a patch.

@cloud-fan You were previously involved in reviewing PR #22309. Could you provide some input regarding the importance of backwards compatibility with schema changes for case classes? It does not seem like that was really discussed when that patch was being proposed.

Based on searches in the mailing lists and stackoverflow (and this being broken for so long) it would seem like a lot of people are not using this feature:

http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=value+class&days=0
http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=value+class&days=0
https://stackoverflow.com/search?tab=newest&q=%5Bapache-spark%5D%20%22value%20class%22

mickjermsurawong-stripe · 2021-07-13T16:20:12Z

Hi @eejbyfeldt, please feel free to drive this forward. Thank you for the work. Happy if you'd like to make a patch here :)
On backward-compat, I think one problem is that historical data written with List[ValueClass] will have the wrapped schema. This has been working, and we likely won't get a good grasp of its usage from Stack Overflow or the mailing lists.

eejbyfeldt · 2021-07-15T11:10:28Z

"Hi @eejbyfeldt, please feel free to drive this forward. Thank you for the work. Happy if you'd like to make a patch here :)
On backward-compat, I think one problem is that historical data written with List[ValueClass] will have the wrapped schema. This has been working, and we likely won't get a good grasp of its usage from Stack Overflow or the mailing lists."

I pushed a commit here (since I think I am not allowed to push to your branch): eejbyfeldt@cbbe05e It should be based on this branch, so you can just cherry-pick it onto this branch.

Yeah, thinking about it, I agree that it is better to be backwards compatible. Then this change can be seen as a bug fix for "nested value classes" and will not break any existing code, which should be a much "safer" change to make.

I also removed the "unwrapped params" naming, as I believe it is not helpful. Replacing the value classes with their underlying types is what you have to do to get the correct type signature. Consider the following example:

```
$ cat TestValueClass.scala
package example

case class ValueClass(value: Int) extends AnyVal
case class HasCaseClass(value1: ValueClass, value2: Int)
$ scalac TestValueClass.scala
$ javap -p example/HasCaseClass.class | grep 'public example.Has'
  public example.HasCaseClass copy(int, int);
  public example.HasCaseClass(int, int);
```

The constructor for `HasCaseClass` takes two `int`s and does not mention `ValueClass`.

mickjermsurawong-stripe · 2021-07-16T23:19:14Z

thanks @eejbyfeldt this is much neater!

eejbyfeldt

To me this implementation looks good! It would be great if some admin had a look.

srowen · 2021-08-02T14:19:54Z

Jenkins test this please

srowen · 2021-08-02T14:20:52Z

Is there any behavior change you can think of that might affect users?

SparkQA · 2021-08-02T15:40:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46468/

SparkQA · 2021-08-02T16:33:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46468/

SparkQA · 2021-08-02T19:28:00Z

Test build #141957 has finished for PR 33205 at commit 23d0e94.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

eejbyfeldt · 2021-08-03T06:05:06Z

"Is there any behavior change you can think of that might affect users?"

Hi Sean, thanks for having a look!

This only changes behavior for a case class containing a value class, e.g.

case class IntWrapper(value: Int) extends AnyVal
case class DatasetModel(wrappedInt: IntWrapper)

Before this patch, trying to create a Dataset using DatasetModel would result in a runtime error like:

21/08/03 07:50:01 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 1: Assignment conversion not possible from type "int" to type "example.IntWrapper"
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 1: Assignment conversion not possible from type "int" to type "example.IntWrapper"
	at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021)
	at org.codehaus.janino.UnitCompiler.assignmentConversion(UnitCompiler.java:10851)
	...

But with this patch it will work as expected. Unless someone explicitly depends on this failure, I don't think there should be any behavior change that is noticeable for users.

mickjermsurawong-stripe · 2021-08-03T17:59:17Z

Hi Sean, I can also confirm that the impact on users here is just a fix for the currently broken behavior. There's no schema change to the previously working edge case, e.g. List[AnyValClass].

srowen · 2021-08-08T19:07:19Z

Just to triple check my understanding, the following isn't a problem because it doesn't work at all now, or something else?

"However, what we can change, without breaking, is schema of nested value class, which will fails due to the compile problem, and thus its schema now isn't actually valid. After the change, the schema of this nested value class is now flattened"

mickjermsurawong-stripe · 2021-08-09T00:25:57Z

That's correct @srowen. Nested AnyVal (value class) generally does not work currently.

A value class in a nested schema (1) currently does not work because the schema describes the field as the AnyVal class (2), but accessing that nested value actually yields the unwrapped type int (3), resulting in the exception shown after the snippet. Essentially, we currently describe the schema in a way that is incompatible with how an AnyVal class operates: "The type at compile time is Wrapper, but at runtime, the representation is an Int". (doc)

    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care about
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We declare its member with the wrapper type (the case class extending `AnyVal`)
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: the compiled `c()` however returns `int`, hence the error
                value_43 = value_46.c();
            }
        }

java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

To your specific clarification

"the following isn't a problem because it doesn't work at all now, or something else?"

It does work in one case: a value class inside a parameterized class like Seq[AnyVal]. This is because there is no unwrapping, and the wrapper remains as-is. From the same Scala doc reference, Wrapper "must be instantiated... when a value class is used as a type argument". This implies that scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], and Option[Wrapper] will still contain Wrapper as-is at runtime instead of Int.

This fix will also resolve the schema issue originally described in SPARK-20384; the reporter will be able to access the value class in an unwrapped fashion.

srowen · 2021-08-09T13:49:21Z

Merged to master

cloud-fan · 2021-08-09T17:18:44Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala

+  }
+
+  test("SPARK-20384: schema for tuple_2 of value class") {
+    val schema = schemaFor[(IntWrapper, StrWrapper)]


It's a bit weird that the schema of a case class of value classes is not consistent with the schema of a tuple of value classes, but there seems to be no better solution as we need to keep backward compatibility.

mickjermsurawong-stripe added 10 commits July 4, 2021 12:36

failing test on nested/complex value

67e39fb

assert current spark schema for collection of values

2f95fbc

unwrap params passing encoder suite

69c3c8c

clean up

fix test: inner value class unwrapped

aa89c6c

clean up case class instantiation

4f95068

add more test for nested collection

7181446

fix nits

4872ed9

add filter tests for explicitness in sql schema

fcc88f4

handle tuple

418a048

add enc/decoder test

7f6931d

github-actions bot added the SQL label Jul 5, 2021

mickjermsurawong-stripe mentioned this pull request Jul 5, 2021

[SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) #27153

Closed

eejbyfeldt reviewed Jul 5, 2021

eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Jul 10, 2021

Add and adopt DataFrameSuite tests cases

7fdff8c

The test cases are from: apache#33205

eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Jul 12, 2021

Add and adopt DataFrameSuite tests cases

cf1a5d6

The test cases are from: apache#33205

eejbyfeldt mentioned this pull request Jul 13, 2021

[WIP][SPARK-20384][SQL] Support value classes and always encoded as underlying type #33316

Closed

eejbyfeldt approved these changes Jul 18, 2021

srowen closed this in 33c6d11 Aug 9, 2021

cloud-fan reviewed Aug 9, 2021

eejbyfeldt mentioned this pull request May 19, 2022

[SPARK-38681][SQL] Support nested generic case classes #36004

Closed
