Conversation

@viirya
Member

@viirya viirya commented Oct 16, 2018

What changes were proposed in this pull request?

This was inspired while implementing #21732. Currently, ScalaReflection needs to consider how ExpressionEncoder uses generated serializers and deserializers, and ExpressionEncoder has a weird flat flag. After discussion with @cloud-fan, it seems better to refactor ExpressionEncoder. It should make SPARK-24762 easier to do.

To summarize the proposed changes:

  1. serializerFor and deserializerFor return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.
  2. serializerForType and deserializerForType return an expression for serializing/deserializing an object of type T to/from its Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row.

In other words, serializerForType and deserializerForType return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value.

A serializer returned by serializerForType will serialize an object at row(0) to a corresponding Spark SQL representation, e.g. a primitive type, array, map, or struct.

A deserializer returned by deserializerForType will deserialize an input field at row(0) to an object of the given type.

  3. The construction of ExpressionEncoder takes a pair of serializer and deserializer for type T. It uses them to create the serializer and deserializer for T <-> row serialization. Now ExpressionEncoder doesn't need to remember whether the serializer is flat or not. When we need to construct a new ExpressionEncoder based on existing ones, we only need to change the input location in the atomic serializer and deserializer.
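The ordinal-0 contract above can be sketched with a toy expression model. All names below are hypothetical stand-ins and do not mirror Spark's actual Catalyst Expression API:

```scala
// Toy stand-ins for expression nodes, illustrative only.
sealed trait Expr
case class BoundRef(ordinal: Int) extends Expr                  // value at row(ordinal)
case class Serialize(sqlRepr: String, input: Expr) extends Expr
case class Deserialize(jvmType: String, input: Expr) extends Expr

// Both helpers assume the object / SQL value sits at ordinal 0 of a row.
def serializerForType(sqlRepr: String): Expr = Serialize(sqlRepr, BoundRef(0))
def deserializerForType(jvmType: String): Expr = Deserialize(jvmType, BoundRef(0))

// Composing encoders then only needs to rewrite that single input location.
def atOrdinal(e: Expr, ordinal: Int): Expr = e match {
  case Serialize(r, BoundRef(_))   => Serialize(r, BoundRef(ordinal))
  case Deserialize(t, BoundRef(_)) => Deserialize(t, BoundRef(ordinal))
  case other                       => other
}
```

Because the input location is the only per-encoder detail, derived encoders (e.g. for tuples) can reuse the atomic serializer/deserializer unchanged apart from the ordinal.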

How was this patch tested?

Existing tests.

@viirya viirya force-pushed the SPARK-24762-refactor branch from d755e84 to 84f3ce0 Compare October 16, 2018 15:34
@SparkQA

SparkQA commented Oct 16, 2018

Test build #97459 has finished for PR 22749 at commit d755e84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new RuntimeException(s"class $clsName has unexpected serializer: $objSerializer")

@SparkQA

SparkQA commented Oct 16, 2018

Test build #97460 has finished for PR 22749 at commit 84f3ce0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new RuntimeException(s"class $clsName has unexpected serializer: $objSerializer")

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97479 has finished for PR 22749 at commit 6a6fa45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97480 has finished for PR 22749 at commit 25a6162.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Oct 17, 2018

retest this please.

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97485 has finished for PR 22749 at commit 25a6162.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// We convert the not-serializable TypeTag into StructType and ClassTag.
val mirror = ScalaReflection.mirror
val tpe = typeTag[T].in(mirror).tpe
val tpe = ScalaReflection.localTypeOf[T]
Contributor

why change it from typeTag[T].in(mirror).tpe?

Member Author

localTypeOf is actually doing the same thing. I think it is better to use ScalaReflection for such things.

Contributor

localTypeOf has a dealias at the end.

Member Author

I think it should be fine, but let me revert this change first.

* name/positional binding is preserved.
*/
def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
if (encoders.length > 22) {
Contributor

can we do it in a separate PR with a test?

Member Author

sure. ok.


val newSerializer = enc.serializer.map(_.transformUp {
val newSerializer = enc.objSerializer.transformUp {
case b: BoundReference if b == originalInputObject => newInputObject
Contributor

Since there is only one distinct BoundReference, we can just write case b: BoundReference => newInputObject

Member Author

yes, right.
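The rebinding discussed above can be sketched with a toy bottom-up transform. The node types below are hypothetical, not Catalyst's, but the transformUp semantics (children rewritten first, then the rule applied to the node itself) follow the same shape:

```scala
// Minimal expression tree with a Catalyst-style transformUp.
sealed trait Expr {
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = this match {
      case Wrap(name, child) => Wrap(name, child.transformUp(rule))
      case leaf              => leaf
    }
    rule.applyOrElse(withNewChildren, (e: Expr) => e)
  }
}
case class BoundRef(ordinal: Int) extends Expr
case class Wrap(name: String, child: Expr) extends Expr

val objSerializer: Expr = Wrap("createStruct", Wrap("field", BoundRef(0)))
val newInputObject: Expr = Wrap("getColumn", BoundRef(1))

// Only one distinct BoundRef exists in the serializer, so the match can
// replace any BoundRef without comparing it to the original input object.
val rebound = objSerializer.transformUp { case _: BoundRef => newInputObject }
```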

* N-tuple. Note that these encoders should be unresolved so that information about
* name/positional binding is preserved.
*/
def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
Contributor

cool, this method is simplified a lot with the new abstraction.

AssertNotNull(r, Seq("top level Product or row object"))
}
nullSafeSerializer match {
case If(_, _, s: CreateNamedStruct) => s
Contributor

let's also make sure the if condition is IsNull, which better explains why we strip it (it can't be null)

Member Author

ok.


if (flat) require(serializer.size == 1)
/**
* A set of expressions, one for each top-level field that can be used to
Contributor

set -> sequence

// The schema after converting `T` to a Spark SQL row. This schema is dependent on the given
// serializer.
val schema: StructType = StructType(serializer.map { s =>
StructField(s.name, s.dataType, s.nullable)
Contributor

can we call dataType before serializer is analyzed?

Contributor

nvm, serializer doesn't need analysis

@cloud-fan
Contributor

I like this idea! Waiting for tests to pass.

@SparkQA

SparkQA commented Oct 18, 2018

Test build #97539 has finished for PR 22749 at commit 85a9122.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

hmm, it still has conflicts...

@viirya
Member Author

viirya commented Oct 24, 2018

Let me rebase again.

@SparkQA

SparkQA commented Oct 24, 2018

Test build #97964 has finished for PR 22749 at commit ed4f4c9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Oct 24, 2018

retest this please.


// The input object to `ExpressionEncoder` is located at the first column of a row.
val inputObject = BoundReference(0, dataTypeFor(tpe),
nullable = !cls.isPrimitive)
Contributor

we just check isPrimitive of the given cls, can we check tpe directly?

Member Author

Yes, we can check tpe.typeSymbol.asClass.isPrimitive instead.

Contributor

good, then we don't need cls as a parameter.
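The nullability rule being discussed (a primitive top-level input can never be null) can be illustrated with plain Java reflection; the PR itself answers the same question via Scala reflection (`tpe.typeSymbol.asClass.isPrimitive`). `BoundRefSketch` is a hypothetical stand-in for Spark's BoundReference:

```scala
// Sketch of the nullable flag computation for the top-level input reference.
case class BoundRefSketch(ordinal: Int, nullable: Boolean)

// Primitives (Int, Double, ...) cannot be null; boxed and reference types can.
def topLevelInput(cls: Class[_]): BoundRefSketch =
  BoundRefSketch(ordinal = 0, nullable = !cls.isPrimitive)
```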


/** Helper for extracting internal fields from a case class. */
/**
* Returns an expression for serializing the value of an input expression into Spark SQL
Contributor

do we really need to duplicate the doc in this private method?

Member Author

I did simplify a lot of it.

val serializer = serializerFor(AssertNotNull(inputObject, Seq("top level row object")), schema)
val deserializer = deserializerFor(schema)
val serializer = serializerFor(inputObject, schema)
val deserializer = deserializerFor(GetColumnByOrdinal(0, serializer.dataType), schema)
Contributor

in ScalaReflection, we create GetColumnByOrdinal in deserializerFor, shall we follow it here?

Member Author

Ok. Sounds better.

Member Author

Ah, we need to access serializer.dataType here. So if we want to create GetColumnByOrdinal in deserializerFor, we need to pass this data type too. What do you think?

Contributor

ah i see, then let's leave it.

// side, in cases like outer-join.
val left = {
val combined = if (this.exprEnc.flat) {
val combined = if (!this.exprEnc.objSerializer.dataType.isInstanceOf[StructType]) {
Contributor

shall we create a method in ExpressionEncoder for this check?
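The suggested helper could look roughly like this. The data types and encoder below are toy stand-ins, not Spark's real classes; the actual method would live on ExpressionEncoder:

```scala
// Toy stand-ins for Spark SQL data types.
sealed trait DataType
case object IntegerType extends DataType
case class StructType(fieldNames: List[String]) extends DataType

// Hypothetical encoder sketch: expose the struct check as a method instead of
// repeating the isInstanceOf test at every call site.
case class EncoderSketch(serializerDataType: DataType) {
  def isSerializedAsStruct: Boolean = serializerDataType.isInstanceOf[StructType]
}
```

Call sites like the outer-join path above could then read `!this.exprEnc.isSerializedAsStruct` instead of inspecting the serializer's data type inline.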

@SparkQA

SparkQA commented Oct 24, 2018

Test build #97967 has finished for PR 22749 at commit ed4f4c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* A sequence of expressions, one for each top-level field that can be used to
* extract the values from a raw object into an [[InternalRow]]:
* 1. If `serializer` encodes a raw object to a struct, we directly use the `serializer`.
* 2. For other cases, we create a struct to wrap the `serializer`.
Contributor

Let's make these 2 comments more precise

1. If `serializer` encodes a raw object to a struct, strip the outer if-IsNull and get the CreateNamedStruct
2. For other cases, wrap the single serializer with CreateNamedStruct
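The two cases in this comment can be sketched with toy nodes (hypothetical classes, not Catalyst's):

```scala
sealed trait Expr
case class Input() extends Expr
case class IsNull(child: Expr) extends Expr
case class If(predicate: Expr, trueValue: Expr, falseValue: Expr) extends Expr
case class CreateNamedStruct(fieldNames: List[String]) extends Expr
case class WrapAsStruct(fieldName: String, child: Expr) extends Expr

def topLevelSerializer(objSerializer: Expr): Expr = objSerializer match {
  // 1. Struct case: the top-level object can't be null, so strip the outer
  //    if-IsNull and keep the CreateNamedStruct directly.
  case If(_: IsNull, _, s: CreateNamedStruct) => s
  // 2. Non-struct case: wrap the single serializer into a one-field struct.
  case other => WrapAsStruct("value", other)
}
```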

assert(numberOfCheckedArguments(deserializerFor[(java.lang.Double, Int)]) == 1)
assert(numberOfCheckedArguments(deserializerFor[(java.lang.Integer, java.lang.Integer)]) == 0)
assert(numberOfCheckedArguments(
deserializerForType(ScalaReflection.localTypeOf[(Double, Double)])) == 2)
Contributor

shall we create a deserializerFor method in this test suite to save some code diff?

Member Author

Sounds good.

@cloud-fan
Contributor

LGTM except 2 minor comments

@SparkQA

SparkQA commented Oct 24, 2018

Test build #97969 has finished for PR 22749 at commit 8cb710b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("SPARK-22442: Generate correct field names for special characters") {
val serializer = serializerFor[SpecialCharAsFieldData](BoundReference(
0, ObjectType(classOf[SpecialCharAsFieldData]), nullable = false))
val serializer = serializerForType(ScalaReflection.localTypeOf[SpecialCharAsFieldData])
Contributor

can we replace all the serializerForType with serializerFor in this suite?

Member Author

Ok.

Member Author

Do you mean to create a method serializerFor in this suite? Or replace serializerForType with ScalaReflection.serializerFor?

Contributor

like deserializerFor in this suite, let's also create a serializerFor

Member Author

Ok. I see.

@SparkQA

SparkQA commented Oct 24, 2018

Test build #97971 has finished for PR 22749 at commit 078a071.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 25, 2018

Test build #97991 has finished for PR 22749 at commit c00d5e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
nullSafeSerializer match {
case If(_: IsNull, _, s: CreateNamedStruct) => s
case s: CreateNamedStruct => s
Contributor

when will we hit this?

Member Author

oh, good catch! I think this is a redundant pattern.

Contributor

this is minor, we can update it in another PR. We don't need to wait for another jenkins QA round.

Member Author

Ok. Sounds good to me.

@cloud-fan
Contributor

thanks, merging to master!

@viirya
Member Author

viirya commented Oct 25, 2018

Thanks @cloud-fan

@asfgit asfgit closed this in cb5ea20 Oct 25, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#22749 from viirya/SPARK-24762-refactor.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

a followup of apache#22749.

When we construct the new serializer in `ExpressionEncoder.tuple`, we don't need to add `if(isnull ...)` check for each field. They are either simple expressions that can propagate null correctly(e.g. `GetStructField(GetColumnByOrdinal(0, schema), index)`), or complex expression that already have the isnull check.

## How was this patch tested?

existing tests

Closes apache#22898 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@viirya viirya deleted the SPARK-24762-refactor branch December 27, 2023 18:22