@mickjermsurawong-stripe
Contributor

What changes were proposed in this pull request?

  • This PR revisits [SPARK-20384][SQL] Support value class in schema of Dataset #22309 and solves the original problem in SPARK-20384, while additionally preventing a backward-compatibility break in the schema of a top-level AnyVal value class.
  • Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class: the field of the underlying type is present in the schema. Any DataFrame SQL filter on such a class therefore expects the field name to be present. The previous PR changed this schema and would have broken current usage. See the test "schema for case class that is a value class". This PR keeps the schema.
  • Collections of value classes are already supported prior to this change, but a case class with a nested value class is not. The schema of these collections therefore shouldn't change either, to avoid breaking existing usage.
  • However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem, so its current schema isn't actually valid. After the change, the schema of this nested value class is flattened.
  • With this PR, flattening applies only to nested value classes (new), not to top-level value classes or collections of value classes (existing behavior).
  • This PR also revisits [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) #27153 by handling tuples such as Tuple2[AnyVal, AnyVal]: a tuple has a constructor (a "nested class") but is a generic type, so it should not be flattened, behaving similarly to Seq[AnyVal].

Why are the changes needed?

  • Currently, nested value classes aren't supported. The generated code handles an AnyVal class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
    For example, given an AnyVal wrapper and its root-level container class:
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)

The problematic part of generated code:

    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care about
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We declare its member as the wrapper case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: the compiled `c()` however returns `int`, hence the error
                value_43 = value_46.c();
            }
        }

We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

From the Scala doc on value classes, given class Wrapper(val underlying: Int) extends AnyVal:

  1. "The type at compile time is Wrapper, but at runtime, the representation is an Int". This implies that when our struct has a field of value class, the generated code should support the underlying type during runtime execution.
  2. Wrapper "must be instantiated... when a value class is used as a type argument". This implies that scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper] will still contain Wrapper as-is at runtime instead of Int.
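
Both points can be observed with plain Java reflection, independent of Spark. The following is only a minimal sketch: the enclosing object ValueClassRuntimeDemo and the vals accessorReturnType and seqElementClass are names made up for this illustration, while the two case classes mirror the ones defined earlier in this description.

```scala
object ValueClassRuntimeDemo {
  // Same shapes as the classes used in the description above.
  case class IntWrapper(i: Int) extends AnyVal
  case class ComplexValueClassContainer(c: IntWrapper)

  // Point 1: the erased return type of the accessor `c()` is the
  // underlying primitive, not the wrapper class.
  val accessorReturnType: Class[_] =
    classOf[ComplexValueClassContainer].getMethod("c").getReturnType

  // Point 2: used as a type argument, the value class is instantiated,
  // so the element stored in the Seq is a real IntWrapper object.
  val wrappers: Seq[IntWrapper] = Seq(IntWrapper(1))
  val seqElementClass: Class[_] = (wrappers: Seq[Any]).head.getClass

  def main(args: Array[String]): Unit = {
    println(s"c() returns:       $accessorReturnType")
    println(s"Seq element class: $seqElementClass")
  }
}
```

Here accessorReturnType is int, which matches the `c()` call that the generated code cannot assign to an IntWrapper, while seqElementClass is the IntWrapper class itself.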

Does this PR introduce any user-facing change?

  • Yes, this will allow support for the nested value class.

How was this patch tested?

  • Added unit tests to illustrate
    • raw schema
    • projection
    • round-trip encode/decode

Contributor

@eejbyfeldt eejbyfeldt left a comment

I think this PR does not handle case classes with generic types correctly. This is probably fixable, but it will probably require some more complexity in the getConstructorUnwrappedParameters function.

The implementation in the old PR by @mt40, #22309, always encoded a value class as its underlying type. This PR instead appears to encode elements the way they are handled on the JVM: where the context is such that the AnyVal is represented as its underlying type, the schema is just the underlying type, while if it is boxed in the wrapper type, this is also reflected in the schema.

So with the following types

case class IntWrapper(val i: Int) extends AnyVal
case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)

the schema in PR #22309 would be (and is, for what is currently in this branch):

Schema(
  StructType(
    Seq(
      StructField("generic", IntegerType, false),
      StructField("value", IntegerType, false)
    )
  ),
  nullable = true
)

But I guess the desired behavior for the implementation suggested in this branch would give the schema:

Schema(
  StructType(
    Seq(  
      StructField("generic", StructType(Seq(StructField("i", IntegerType, false)))),
      StructField("value", IntegerType, false)
    )
  ),
  nullable = true
)
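
The difference between these two schemas mirrors how the JVM erases the two fields. As a rough, Spark-independent sketch (the object name GenericValueClassDemo and its vals are hypothetical, introduced only for this illustration), Java reflection shows that the generic field is erased to Object and holds a boxed IntWrapper, while the non-generic field is a plain int:

```scala
object GenericValueClassDemo {
  case class IntWrapper(i: Int) extends AnyVal
  case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)

  val instance: CaseClassWithGeneric[IntWrapper] =
    CaseClassWithGeneric[IntWrapper](IntWrapper(1), IntWrapper(2))

  // Erased parameter types of the case class's single constructor.
  val constructorParamTypes: Seq[Class[_]] =
    instance.getClass.getConstructors.head.getParameterTypes.toSeq

  // Runtime class of the object returned by the erased `generic()` accessor.
  val genericFieldClass: Class[_] =
    instance.getClass.getMethod("generic").invoke(instance).getClass

  def main(args: Array[String]): Unit = {
    println(constructorParamTypes.mkString("constructor(", ", ", ")"))
    println(s"generic field holds: $genericFieldClass")
  }
}
```

The constructor parameters come out as (java.lang.Object, int), and the object stored in `generic` is an IntWrapper instance, so an encoder that follows the JVM representation sees a wrapper for `generic` and an underlying Int for `value`.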

I don't have any strong feelings about which is the correct approach to take in the implementation. This approach gives a slightly simpler implementation, while the other one hides some of the ugliness/complexities of how AnyVals work on the JVM.

But it would be good to add some test cases for case classes with generic fields so that their behavior is covered by tests.

private def getConstructorUnwrappedParameters(tpe: Type, isTupleType: Boolean):
    Seq[(String, Type)] = {
  val params = getConstructorParameters(tpe)
  if (isTupleType) {
Contributor

A Scala tuple is just a case class with generics: https://github.com/scala/scala/blob/2.13.x/src/library/scala/Tuple2.scala#L24

But since the current code looks specifically for scala.Tuple, it will not handle user-defined case classes with generics.

For example, I think adding the following test case to ExpressionEncoderSuite on this branch will fail:

case class CaseClassWithGeneric[T](generic: T, value: IntWrapper)
encodeDecodeTest(
  CaseClassWithGeneric[IntWrapper](IntWrapper(1), IntWrapper(2)),
  "case class with generic fields")

@eejbyfeldt
Contributor

I pushed an initial implementation to a branch on my fork here: https://github.com/eejbyfeldt/spark/compare/master...support-value-classes-in-datasets?expand=1 I believe this implementation solves the cases where a case class has a generic member. My branch takes the approach from @mt40's original PR, in which value classes are always encoded as the underlying type, since I am starting to think that is the correct way to implement this feature.

eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Jul 10, 2021
eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Jul 12, 2021
@eejbyfeldt
Contributor

@mickjermsurawong-stripe I created a WIP PR with my branch here: #33316. I think you can just take getConstructorParameters from there to make this patch handle case classes and tuples correctly. Let me know if you want me to make a patch.

@cloud-fan You were previously involved in reviewing PR #22309. Could you provide some input on the importance of backwards compatibility with regard to schema changes for case classes? It does not seem like that was really discussed when that patch was proposed.

Based on searches of the mailing lists and Stack Overflow (and this having been broken for so long), it would seem that a lot of people are not using this feature:

http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=value+class&days=0
http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=value+class&days=0
https://stackoverflow.com/search?tab=newest&q=%5Bapache-spark%5D%20%22value%20class%22

@mickjermsurawong-stripe
Contributor Author

Hi @eejbyfeldt, please feel free to drive this forward. Thank you for the work. Happy if you'd like to make a patch here :)
On backward compatibility, I think one problem is that historical data written with List[ValueClass] will have the wrapped schema. This has been working, and we likely won't get a good grasp of its usage from Stack Overflow or the mailing lists.

@eejbyfeldt
Contributor

Hi @eejbyfeldt, please feel free to drive this forward. Thank you for the work. Happy if you'd like to make a patch here :)
On backward compatibility, I think one problem is that historical data written with List[ValueClass] will have the wrapped schema. This has been working, and we likely won't get a good grasp of its usage from Stack Overflow or the mailing lists.

I pushed a commit here (since I think I am not allowed to push to your branch): eejbyfeldt@cbbe05e. It should be based on this branch, so you can just cherry-pick it onto this branch.

Yeah, thinking about it, I agree that it is better to be backwards compatible. Then this change can be seen as a bug fix for "nested value classes" and will not break any existing code, which should be a much "safer" change to make.

I also removed the unwrapped params naming, as I believe it is not
helpful. Replacing value classes with their underlying type is what
you have to do to get the correct type signature. Consider the following
example:
```
$ cat TestValueClass.scala
package example

case class ValueClass(value: Int) extends AnyVal
case class HasCaseClass(value1: ValueClass, value2: Int)
$ scalac TestValueClass.scala
$ javap -p example/HasCaseClass.class | grep 'public example.Has'
  public example.HasCaseClass copy(int, int);
  public example.HasCaseClass(int, int);
```
The constructor for `HasCaseClass` takes two `int` parameters and does
not mention `ValueClass`.
@mickjermsurawong-stripe
Contributor Author

thanks @eejbyfeldt this is much neater!

Contributor

@eejbyfeldt eejbyfeldt left a comment

To me this implementation looks good! It would be great if some admin had a look.

@srowen
Member

srowen commented Aug 2, 2021

Jenkins test this please

@srowen
Member

srowen commented Aug 2, 2021

Is there any behavior change you can think of that might affect users?

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46468/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46468/

@SparkQA

SparkQA commented Aug 2, 2021

Test build #141957 has finished for PR 33205 at commit 23d0e94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eejbyfeldt
Contributor

Is there any behavior change you can think of that might affect users?

Hi Sean, thanks for having a look!

This only changes behavior for a case class containing a value class, e.g.

case class IntWrapper(value: Int) extends AnyVal
case class DatasetModel(wrappedInt: IntWrapper)

Before this patch, trying to create a Dataset using DatasetModel would result in a runtime error like:

21/08/03 07:50:01 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 1: Assignment conversion not possible from type "int" to type "example.IntWrapper"
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 1: Assignment conversion not possible from type "int" to type "example.IntWrapper"
	at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021)
	at org.codehaus.janino.UnitCompiler.assignmentConversion(UnitCompiler.java:10851)
	...

But with this patch it will work as expected. Unless someone explicitly depends on this failure, I don't think there should be any behavior change that is noticeable to users.

@mickjermsurawong-stripe
Contributor Author

Hi Sean, I can also confirm that the impact on users here is a fix for currently broken behavior. There's no schema change for the previously working edge cases, e.g. List[AnyValClass].

@srowen
Member

srowen commented Aug 8, 2021

Just to triple check my understanding, the following isn't a problem because it doesn't work at all now, or something else?

"However, what we can change, without breaking, is schema of nested value class, which will fails due to the compile problem, and thus its schema now isn't actually valid. After the change, the schema of this nested value class is now flattened"

@mickjermsurawong-stripe
Contributor Author

mickjermsurawong-stripe commented Aug 9, 2021

That's correct @srowen. Nested AnyVal (value class) generally does not work currently.

A value class in a nested schema (1) currently does not work because the schema describes the member as the AnyVal wrapper class (2), but accessing that nested value actually yields the unwrapped type int (3), which results in the exception shown below the code (4). Essentially, we currently describe the schema in a way that is incompatible with how an AnyVal class operates: "The type at compile time is Wrapper, but at runtime, the representation is an Int". (doc)

    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);

########################## 1) The root-level case class we care about ##########################

        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;

########################## 2) We declare its member as the wrapper case class extending `AnyVal`

        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {

########################## 3) ******** ERROR: the compiled `c()` however returns `int`, hence the error

                value_43 = value_46.c();
            }
        }
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

To your specific clarification

the following isn't a problem because it doesn't work at all now, or something else?

It does work in one case: a value class used in a parameterized class such as Seq[AnyVal]. This is because there is no unwrapping, and the wrapper remains as-is. From the same Scala doc, Wrapper "must be instantiated... when a value class is used as a type argument". This implies that scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper] will still contain Wrapper as-is at runtime instead of Int.

This fix will also resolve the schema issue originally described in SPARK-20384; the reporter will be able to access the value class in an unwrapped fashion.

@srowen srowen closed this in 33c6d11 Aug 9, 2021
@srowen
Member

srowen commented Aug 9, 2021

Merged to master

}

test("SPARK-20384: schema for tuple_2 of value class") {
  val schema = schemaFor[(IntWrapper, StrWrapper)]
Contributor

It's a bit weird that the schema of a case class of value classes is not consistent with the schema of a tuple of value classes, but there seems to be no better solution as we need to keep backward compatibility.
