
Conversation

Contributor

@SandishKumarHN SandishKumarHN commented Dec 5, 2022

Oneof fields allow a message to contain one and only one of a defined set of fields, while recursive fields provide a way to define messages that refer to themselves, allowing the creation of complex, nested data structures. With this change, users will be able to use protobuf Oneof fields with spark-protobuf, making it a more complete and useful tool for processing protobuf data.

Support for recursive.fields.max.depth:
The recursive.fields.max.depth option can be specified in from_protobuf to control the maximum allowed recursion depth for a field. Setting recursive.fields.max.depth to 0 drops all recursive fields, setting it to 1 allows a field to be recursed once, and setting it to 2 allows it to be recursed twice. Setting recursive.fields.max.depth to a value greater than 10 is not allowed. If recursive.fields.max.depth is not specified, it defaults to -1 and recursive fields are not permitted. If a protobuf record has more depth for recursive fields than the allowed value, it will be truncated and some fields may be discarded. This check is based on the fully qualified field type.
The SQL schema for the protobuf message

message Person {
    string name = 1;
    Person bff = 2;
}

will vary based on the value of recursive.fields.max.depth.

0: struct<name: string, bff: null>
1: struct<name: string, bff: struct<name: string, bff: null>>
2: struct<name: string, bff: struct<name: string, bff: struct<name: string, bff: null>>>
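The truncation rule in these examples can be sketched as a small recursive function. This is a hedged illustration only, not the Spark implementation; it just renders the schema strings shown above for the `Person` message at a given depth setting.

```scala
// Sketch: render the Spark SQL type produced for
//   message Person { string name = 1; Person bff = 2; }
// at a given recursive.fields.max.depth. When the depth budget is
// exhausted, the recursive field collapses to null.
def personSchema(maxDepth: Int): String =
  if (maxDepth < 0) "null"
  else s"struct<name: string, bff: ${personSchema(maxDepth - 1)}>"
```

For example, `personSchema(1)` yields `struct<name: string, bff: struct<name: string, bff: null>>`, matching the depth-1 case above.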

What changes were proposed in this pull request?

  • Add support for protobuf oneof field
  • Stop recursion at the first level when a recursive field is encountered. (instead of throwing an error)

Why are the changes needed?

Stop recursion at the first level and handle NullType in deserialization.

Does this PR introduce any user-facing change?

NA

How was this patch tested?

Added Unit tests for OneOf field support and recursion checks.
Tested full support for nested OneOf fields and message types using real data from Kafka on a real cluster

cc: @rangadi @mposdev21

nullable = false))
case MESSAGE =>
// Stop recursion at the first level when a recursive field is encountered.
// TODO: The user should be given the option to set the recursion level to 1, 2, or 3
Contributor Author

@SandishKumarHN SandishKumarHN Dec 5, 2022


@rangadi @mposdev21 Instead of limiting the recursion to only one level, the user should be able to choose a recursion level of 1, 2, or 3. Going beyond 3 levels of recursion should not be allowed. Any thoughts?

spark.protobuf.recursion.level


Yeah, I think it is useful. Users may not be able to remove recursive references, but might be willing to limit recursion.
I think the default should be an error with a clear message about how users can set the configuration.
Also, I don't think it should be a Spark config, but rather an option passed in.


Are you planning to add selectable recursion depth here or in a follow up?

Contributor Author


@rangadi planning to add the selectable recursion depth in this PR.


case (null, NullType) => (updater, ordinal, _) => updater.setNullAt(ordinal)

case (MESSAGE, NullType) => (updater, ordinal, _) => updater.setNullAt(ordinal)

What is this for? For handling limited recursion?

Contributor Author


Yes, correct.


Could you add a comment noting that we might be dropping data here? It will not be easy for a future reader to see.
We could have an option to error out if the actual data has more recursion than configured.

nullable = false))
case MESSAGE =>
// Stop recursion at the first level when a recursive field is encountered.
// TODO: The user should be given the option to set the recursion level to 1, 2, or 3

Are you planning to add selectable recursion depth here or in a follow up?

}
}

message OneOfEvent {

Are you testing OneOf and recursion in the same message? Could you split them into separate messages?

Contributor Author


@rangadi I see a lot of use cases for a "payload" Oneof field with recursive fields inside it, so I thought combining Oneof with recursion would be a good test. Will separate them.


The combined one is fine, we could keep it. Better to have simpler separate tests as well.


nice

parameters = Map("descFilePath" -> testFileDescriptor))
}

test("Unit tests for OneOf field support and recursion checks") {

Let's separate these two into separate tests with separate protobuf messages.

Contributor Author


will do that.


message OneOfEvent {
string key = 1;
oneof payload {

What do one-of fields look like in the Spark schema? Could you give an example? I could not see the schema in the unit tests.

Contributor Author


@rangadi the Oneof field is of message type; the Oneof will be converted to a struct type.
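To make this concrete, here is a hedged sketch of the mapping (the message and field names below are hypothetical, not from this PR): each branch of the oneof becomes a nullable struct member, and at most one member is non-null per record.

```scala
// Hypothetical oneof:  oneof payload { string event_a = 1; int32 event_b = 2; }
// Mapped struct: every branch is present, but at most one is populated.
case class PayloadStruct(eventA: Option[String], eventB: Option[Int])

def fromOneof(branch: Either[String, Int]): PayloadStruct = branch match {
  case Left(a)  => PayloadStruct(Some(a), None) // event_a was set
  case Right(b) => PayloadStruct(None, Some(b)) // event_b was set
}
```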

@SandishKumarHN
Contributor Author

#38922 (comment)

@rangadi made the below changes.

  • Added a selectable recursion depth option to from_protobuf.
  • Added two unit tests for the Oneof type: a simple one for a Oneof field, and a complex Oneof field with recursionDepth=2.
  • Existing unit tests should cover foundRecursionInProtobufSchema when recursionDepth is not set and a recursive field is discovered.

@rangadi

rangadi commented Dec 7, 2022

Added selectable recursion depth option to from_protobuf.

Do we need to do this for 'to_protobuf()' too? What would happen in that case?

@SandishKumarHN
Contributor Author

SandishKumarHN commented Dec 7, 2022

Added selectable recursion depth option to from_protobuf.

Do we need to do this for 'to_protobuf()' too? What would happen in that case?

@rangadi
The source dataframe struct field should match the protobuf recursive message for to_protobuf. It will convert until the recursion level is matched, like a struct within a struct to a recursive message. This is true even for existing code.

@rangadi

rangadi commented Dec 7, 2022

The source dataframe struct field should match the protobuf recursive message for to_protobuf. It will convert until the recursion level is matched, like a struct within a struct to a recursive message. This is true even for existing code.

Interesting. So we would make that null after some depth. Could you add a test for this?

@SandishKumarHN
Contributor Author

The source dataframe struct field should match the protobuf recursive message for to_protobuf. It will convert until the recursion level is matched, like a struct within a struct to a recursive message. This is true even for existing code.

Interesting. So we would make that null after some depth. Could you add a test for this?

@rangadi will add a test for the above case.
A Spark dataframe with complex nested structures should typically be convertible to a protobuf message. It is the user's responsibility to specify the right .proto (.desc) file that corresponds to the source dataframe.

@rangadi

rangadi commented Dec 7, 2022

file that corresponds to the source dataframe.

They might have used from_protobuf() to get that schema, which supports recursive fields. They should be able to do to_protobuf() with the same protobuf definition.

@SandishKumarHN
Contributor Author

file that corresponds to the source dataframe.

They might have used from_protobuf() to get that schema, which supports recursive fields. They should be able to do to_protobuf() with the same protobuf definition.

This case is already covered in the unit tests. Will add a unit test for direct struct-to-protobuf conversion.

@AmplabJenkins

Can one of the admins verify this patch?

@baganokodo2022

Hi @SandishKumarHN,

For the recursionDepth option, could we consider naming it CircularReferenceTolerance or CircularReferenceDepth for clarity?
For instance, -1 (the default value) will error out on any circular reference, 0 drops any circular reference field, 1 allows the same field to be entered twice, and so on.

Besides, can we also support a "CircularReferenceType" option with an enum value of [FIELD_NAME, FIELD_TYPE]? The reason is that navigation can go very deep before the same fully qualified FIELD_NAME is encountered again, while FIELD_TYPE stops recursive navigation much faster. We could make FIELD_NAME the default option. In my test cases, with FIELD_TYPE a circular reference could repeat 3 times before the executor hit OOM, while FIELD_NAME hit OOM when CircularReferenceTolerance was set to 1.

Please let me know your thoughts.

cc @rangadi

Thank you

Xinyu Liu

@SandishKumarHN
Contributor Author

@baganokodo2022 Circular types (especially MESSAGE) occur frequently in a single message. The user won't be able to distinguish and fix it (imagine deciding which field the user should keep or remove), because each type will have a unique field name that is valid. In the full field name scenario, the user can verify and fix the circular reference.

Anyway, I have made the initial change per your idea; have a look at it.

cc: @rangadi

if (existingRecordNames.contains(fd.getFullName)) {
throw QueryCompilationErrors.foundRecursionInProtobufSchema(fd.toString())
// User can set circularReferenceDepth of 0 or 1 or 2.
// Going beyond 3 levels of recursion is not allowed.

Could you add a justification for this?

Contributor Author


@rangadi The user can specify the maximum allowed recursion depth for a field by setting the circularReferenceDepth property to 0, 1, or 2. Setting the circularReferenceDepth to 0 allows the field to be recursed once, setting it to 1 allows it to be recursed twice, and setting it to 2 allows it to be recursed thrice. Attempting to set the circularReferenceDepth to a value greater than 2 is not allowed. If the circularReferenceDepth is not specified, it will default to -1, which disables recursive fields.

val parseMode: ParseMode =
parameters.get("mode").map(ParseMode.fromString).getOrElse(FailFastMode)

val circularReferenceType: String = parameters.getOrElse("circularReferenceType", "FIELD_NAME")

@SandishKumarHN @baganokodo2022 moving the discussion here (for threading).

Besides, can we also support a "CircularReferenceType" option with an enum value of [FIELD_NAME, FIELD_TYPE]? The reason is that navigation can go very deep before the same fully qualified FIELD_NAME is encountered again, while FIELD_TYPE stops recursive navigation much faster. ...

I didn't quite follow the motivation here. Could you give concrete examples for the two different cases?

Contributor Author


@rangadi we already know about the field_name recursion check: using fd.getFullName we detect the recursion and throw an error. Another option is to detect recursion through the field type. Example below.

message A {
  B b = 1;
}

message B {
  A c = 1;
}

In the case of the field_name recursion check it is A.B.C, so no recursion is detected.
In the case of the field_type recursion check it is MESSAGE.MESSAGE.MESSAGE, so recursion will be found, and we throw an error or drop beyond a certain recursion depth.
But it will also throw an error for the below case with the field_type check, since it will be MESSAGE.MESSAGE.MESSAGE.MESSAGE:

message A {
  B b = 1;
}

message B {
  D d = 1;
}

message D {
  E e = 1;
}

message E {
  int32 key = 1;
}

@baganokodo2022's argument is that a field_type-based check gives users an option to stop recursion more quickly, because with a complex nested schema the recursive field_name may only be found very deep, and before hitting it we might see an OOM. The field_type-based check finds the circular reference more quickly.

@baganokodo2022 please correct me if I'm wrong.


@rangadi rangadi Dec 8, 2022


in the case of field_name recursive check it is A.B.C no recursion.

The first example is clearly recursion. What is 'C' here?

but it will also throw an error for the below case with the field_type check. since it will be MESSAGE.MESSAGE.MESSAGE.MESSAGE

Why is this recursion?


Are our unit tests showing these cases?

Contributor Author


I would have @baganokodo2022 give more details on the field-type case.

We have not yet added unit tests for the field-type case; I would like to discuss this before adding them.


thread would be A.B.A.aa.D.d.A.aaa.E

What is this thread?


Given this discussion, let's write down the functionality and examples before we implement, so that we are all on the same page.

Contributor Author


@rangadi fd.getFullName is able to detect recursive fields with different field names; added a unit test. Now I'm confused:
Fail for recursion field with different field names


:) yeah, field names should not matter at all.
We can do video chat to clarify all this.

Contributor Author

@SandishKumarHN SandishKumarHN Dec 9, 2022


@rangadi @baganokodo2022 thanks for the quick meet. The meeting conclusion was to use the descriptor type's full name; added unit tests with some complex schemas.

val recordName = fd.getMessageType.getFullName

(protobufOptions.circularReferenceDepth < 0 ||
protobufOptions.circularReferenceDepth >= 3)) {
throw QueryCompilationErrors.foundRecursionInProtobufSchema(fd.toString())
} else if (existingRecordTypes.contains(fd.getType.name()) &&

Name or full name?
Also, what keeps track of the recursion depth?

Contributor Author


@rangadi we have two maps with incremental counters, one for the field_name-based check and one for the field_type-based check.


@baganokodo2022 baganokodo2022 Dec 8, 2022


@SandishKumarHN and @rangadi, should we error out on -1, the default value, unless users specifically override it?
0 (tolerance) -> drop all recursive fields once encountered
1 (tolerance) -> allow the same field name (type) to be entered twice
2 (tolerance) -> allow the same field name (type) to be entered 3 times

thoughts?


In my back-ported branch,

        val recordName = circularReferenceType match {
          case CircularReferenceTypes.FIELD_NAME =>
            fd.getFullName
          case CircularReferenceTypes.FIELD_TYPE =>
            fd.getFullName().substring(0, fd.getFullName().lastIndexOf(".")) 
        }
        
        if (circularReferenceTolerance < 0 && existingRecordNames(recordName) > 0) {
          // no tolerance on circular reference
          logError(s"circular reference in protobuf schema detected [no tolerance] - ${recordName}")
          throw new IllegalStateException(s"circular reference in protobuf schema detected [no tolerance] - ${recordName}")
        }

        if (existingRecordNames(recordName) > (circularReferenceTolerance max 0) ) {
          // stop navigation and drop the repetitive field
          logInfo(s"circular reference in protobuf schema detected [max tolerance breached] field dropped - ${recordName} = ${existingRecordNames(recordName)}")
          Some(NullType)
        } else {
          val newRecordNames: Map[String, Int] = existingRecordNames +  
            (recordName -> (1 + existingRecordNames(recordName)))
          Option(
            fd.getMessageType.getFields.asScala
              .flatMap(structFieldFor(_, newRecordNames, protobufOptions))
              .toSeq)
            .filter(_.nonEmpty)
            .map(StructType.apply)
        }

assert(expectedFields.contains(f.getName))
})

val schema = StructType(Seq(StructField("sample",

Btw, using `val schema = DataType.fromJson("json string")` is a lot more readable.
Optionally, we could update many of these in follow-up PRs.

parameters = Map("descFilePath" -> testFileDescriptor))
}

test("Unit test for Protobuf OneOf field") {

Add a short description of the test at the top. It improves readability. What is this verifying?

Remove "Unit test for", this is already a unit test :).

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.functions.{lit, struct}
import org.apache.spark.sql.protobuf.protos.SimpleMessageProtos.SimpleMessageRepeated
import org.apache.spark.sql.protobuf.protos.SimpleMessageProtos.{EventRecursiveA, EventRecursiveB, OneOfEvent, OneOfEventWithRecursion, SimpleMessageRepeated}

Are there tests for recursive fields?

Contributor Author


@rangadi yes,
Handle recursive fields in Protobuf schema, C->D->Array(C) and
Handle recursive fields in Protobuf schema, A->B->A


Could we move that to different tests?

Contributor Author


@rangadi I didn't understand; these are already two different tests.


@baganokodo2022 baganokodo2022 left a comment


thank you for the PR

val parseMode: ParseMode =
parameters.get("mode").map(ParseMode.fromString).getOrElse(FailFastMode)

val circularReferenceType: String = parameters.getOrElse("circularReferenceType", "FIELD_NAME")


Yes @SandishKumarHN, you are right. That was discovered in a very complex proto schema shared across many microservices.

val parseMode: ParseMode =
parameters.get("mode").map(ParseMode.fromString).getOrElse(FailFastMode)

val circularReferenceType: String = parameters.getOrElse("circularReferenceType", "FIELD_NAME")


Hi @rangadi, under certain circumstances dropping fields with data seems inevitable when dealing with circular references, since we can't tell which fields are intended to be kept. One example is the parent-child relationship in an RDB data model. Consider IC -> EM -> EM2 -> Director -> Senior Director -> VP -> CTO -> CEO, which are all of Employee type, assuming the relationship is bi-directional. The longest path for a level-1 circular reference on FIELD_NAME is IC -> EM -> EM2 -> Director -> Senior Director -> VP -> CTO -> CEO -> CTO -> VP -> Senior Director -> Director -> EM2 -> EM -> IC. In reality, data scientists may just want to keep 2 levels of circular reference on FIELD_TYPE: IC -> EM -> EM2, or EM2 -> Director -> Senior Director. This greatly reduces redundant data in the warehouse.

Hope it makes sense.

Thanks
Xinyu
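The distinction the two options draw shows up in the key used for recursion detection. Here is a hedged sketch following the back-ported snippet quoted earlier in this thread (the fully qualified name below is illustrative, not from this PR):

```scala
// FIELD_NAME keys on the fully qualified field name; FIELD_TYPE drops the
// last path segment, so every field of the same enclosing message type
// shares one key and recursion is detected sooner.
def recursionKey(fieldFullName: String, byFieldType: Boolean): String =
  if (byFieldType) fieldFullName.substring(0, fieldFullName.lastIndexOf('.'))
  else fieldFullName
```

With FIELD_TYPE, two different fields like `Employee.manager` and `Employee.report` would map to the same key and count against the same tolerance budget.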

fd: FieldDescriptor,
existingRecordNames: Set[String]): Option[StructField] = {
existingRecordNames: Map[String, Int],
existingRecordTypes: Map[String, Int],


@SandishKumarHN since it is going to be either FIELD_NAME or FIELD_TYPE, do we need to keep both maps?


@HeartSaVioR
Contributor

cc @cloud-fan. I guess there has been some demand for recursive schemas already. Could you please help look into the proposal and see whether it makes sense to you? Please add more people to the loop if you know others who would be interested.

@HeartSaVioR
Contributor

HeartSaVioR commented Dec 9, 2022

I guess the demand for supporting recursive schemas is not specific to protobuf; it also applies to Avro. If we construct a way to project a recursive schema into Spark SQL's schema, we may want to apply it consistently across components.

The visibility of this PR is too limited; only people interested in protobuf will look into it. Instead of deciding such a thing within this PR, going through a discussion thread on dev@ does not seem like a bad idea. What do you all think?

If you have a proposal, please write it down in doc format, e.g. a Google doc with several examples, and share it in the discussion thread. The description of the PR does not seem to be enough to understand what this PR (or some other) is proposing.

@SandishKumarHN
Contributor Author

SandishKumarHN commented Dec 9, 2022

@baganokodo2022 instead of implementing a check for circular reference type in this PR, can we discuss it further and write a proposal before adding it in a follow-up PR? We can share the proposal with the [email protected] mailing list for feedback and input.

@rangadi @HeartSaVioR In this PR, we will only implement a check for circular references through the full field name. Let me know if any further changes are needed.

@rangadi

rangadi commented Dec 9, 2022

@SandishKumarHN could you keep the discussion in the code review thread? It is hard to piece together multiple messages otherwise. I think it is fairly straightforward what recursion means; there seems to be some confusion about that. Let's discuss it in the thread.

val parseMode: ParseMode =
parameters.get("mode").map(ParseMode.fromString).getOrElse(FailFastMode)

// Setting the `recursive.fields.max.depth` to 0 allows the field to be recurse once,

@rangadi rangadi Dec 17, 2022


'0' disables recursion, right? Why once? This might be a difference in terminology; that's why giving a quick example is better. Could you add this example?

Consider a simple recursive proto: `message Person { string name = 1; Person bff = 2; }`

What would the Spark schema be when recursion is 0, 1, and 2? I think:

  • 0: struct<name: string, bff: null>
  • 1: struct<name: string, bff: struct<name: string, bff: null>>
  • 2: struct<name: string, bff: struct<name: string, bff: struct<name: string, bff: null>>>

Contributor Author


@rangadi Thank you for your suggestion. I have implemented it by adding a comment and a unit test to make the example clearer to users.


@rangadi rangadi left a comment


Overall LGTM. Made a couple of suggestions to clarify how the schema looks with a recursion limit, both in a comment and in a unit test.

}
}

test("Fail for recursion field with different field names without circularReferenceDepth") {

Fix circularReferenceDepth in the name.

}
}

test("recursion field with different field names with circularReferenceDepth") {

Fix the name.


What is this testing? As we discussed, field name does not matter.

eventFromSparkSchema.getDescriptorForType.getFields.asScala.map(f => {
assert(expectedFields.contains(f.getName))
})
}

Could you add test that clearly shows the expected schema similar to my comment here: https://github.com/apache/spark/pull/38922/files#r1051292604

It is not easy to see from these tests what schema 0 or 2 results in.

"RECURSIVE_PROTOBUF_SCHEMA" : {
"message" : [
"Found recursive reference in Protobuf schema, which can not be processed by Spark: <fieldDescriptor>"
"Found recursive reference in Protobuf schema, which can not be processed by Spark by default: <fieldDescriptor>. try setting the option `recursive.fields.max.depth` as 0 or 1 or 2. Going beyond 3 levels of recursion is not allowed."

Why is 3 or above not allowed? That seems pretty low. If a customer wants to set the level, they will be conscious of it. I think it should be at least high single digits to cover most cases. How about 10?

Contributor Author


@rangadi agree.


@rangadi rangadi left a comment


This looks great. Thanks.

@rangadi

rangadi commented Dec 20, 2022

Asking @HeartSaVioR to take a quick look and approve.
@cloud-fan, take a look at the updated PR description for an example of how the Spark schema would look with the different settings for the config.

// specified, the default value is -1; recursive fields are not permitted. If a protobuf
// record has more depth than the allowed value for recursive fields, it will be truncated
// and some fields may be discarded.
val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
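The truncation rule this option implements can be sketched outside Spark. The function below is a hypothetical toy model (not the SchemaConverters implementation) of how the schema for the `Person` message from the PR description varies with `recursive.fields.max.depth`:

```python
def person_schema(max_depth: int, depth: int = 0) -> str:
    """Schema string for `message Person { string name = 1; Person bff = 2; }`
    under a given recursive.fields.max.depth. A toy model of the truncation
    rule described above, not Spark code."""
    if max_depth < 0:
        # Default (-1): recursive fields are not permitted at all.
        raise ValueError("recursive fields are not permitted by default")
    if depth > max_depth:
        return "null"  # recursion limit reached: the field is truncated
    return f"struct<name: string, bff: {person_schema(max_depth, depth + 1)}>"

print(person_schema(0))  # struct<name: string, bff: null>
print(person_schema(1))  # struct<name: string, bff: struct<name: string, bff: null>>
```

Each increment of the setting allows one more level of the recursive `bff` field before it collapses to null.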
Contributor

The option name may need a bit more discussion. Usually data source options do not have long names and don't contain dots. See JSONOptions. How about maxRecursiveFieldDepth?


@cloud-fan this is in line with options for the Kafka source, e.g. the 'kafka.' prefix allows setting Kafka client configs.

In addition, we will be passing more options, e.g. for schema registry auth configs. They will have a prefix like 'confluent.schemaregistry.[actual registry client conf]'.

def toSqlTypeHelper(descriptor: Descriptor): SchemaType = ScalaReflectionLock.synchronized {
def toSqlTypeHelper(
descriptor: Descriptor,
protobufOptions: ProtobufOptions): SchemaType = ScalaReflectionLock.synchronized {
Contributor

not related to this PR, but why would we lock ScalaReflectionLock here?


Yeah, I just noticed. Not sure if we need it.
@SandishKumarHN could we remove this in a follow-up?

def structFieldFor(
fd: FieldDescriptor,
existingRecordNames: Set[String]): Option[StructField] = {
existingRecordNames: Map[String, Int],
Contributor

can we add comments to explain what the map key and value mean here?


+1

Contributor Author

@cloud-fan added a comment.

// 0: struct<name: string, bff: null>
// 1: struct<name string, bff: <name: string, bff: null>>
// 2: struct<name string, bff: <name: string, bff: struct<name: string, bff: null>>> ...
val recordName = fd.getMessageType.getFullName
Contributor

Suggested change
val recordName = fd.getMessageType.getFullName
val recordName = fd.getFullName

are they the same? The previous code uses fd.getFullName.


Good catch. I think the previous code was incorrect. We need to verify whether the same Protobuf type was seen before in this DFS traversal.
@SandishKumarHN what was the unit test that verified recursion?

Contributor Author

@cloud-fan fd.getFullName gives the fully qualified name including the field name; we needed the fully qualified type name. We made this decision above.

Here is the difference:

println(s"${fd.getFullName} : ${fd.getMessageType.getFullName}")

org.apache.spark.sql.protobuf.protos.Employee.ic : org.apache.spark.sql.protobuf.protos.IC
org.apache.spark.sql.protobuf.protos.IC.icManager : org.apache.spark.sql.protobuf.protos.Employee
org.apache.spark.sql.protobuf.protos.Employee.ic : org.apache.spark.sql.protobuf.protos.IC
org.apache.spark.sql.protobuf.protos.IC.icManager : org.apache.spark.sql.protobuf.protos.Employee
org.apache.spark.sql.protobuf.protos.Employee.em : org.apache.spark.sql.protobuf.protos.EM

@rangadi the previous code's fd.getFullName (the fully qualified name including the field name) works for detecting recursion, which is why before this change we simply threw an error on any recursive field.
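The reason for keying the map on the message *type* name can be sketched with a small, self-contained toy model of the DFS. Everything here (`struct_field_for`, the dict-based message encoding, the type names) is an illustrative stand-in, not the actual Spark code; it shows how counting occurrences of a fully qualified type name on the traversal path catches the mutual recursion Employee → IC → Employee from the output above:

```python
def struct_field_for(msg, existing, max_depth):
    """Toy model of schema conversion. `existing` maps a fully qualified
    message *type* name to how many times it appears on the current DFS path;
    once a type would exceed max_depth, the field is truncated to null."""
    parts = []
    for fname, ftype in msg.items():
        if ftype == "string":
            parts.append(f"{fname}: string")
        else:
            type_name, nested = ftype
            count = existing.get(type_name, 0)
            if count > max_depth:
                parts.append(f"{fname}: null")  # recursion truncated here
            else:
                inner = struct_field_for(
                    nested, {**existing, type_name: count + 1}, max_depth)
                parts.append(f"{fname}: {inner}")
    return "struct<" + ", ".join(parts) + ">"

# Mutually recursive messages: Employee has an IC field, IC refers back to Employee.
employee = {}
ic = {"icManager": ("Employee", employee)}
employee.update({"name": "string", "ic": ("IC", ic)})

# The root type already counts as one occurrence of Employee on the path.
print(struct_field_for(employee, {"Employee": 1}, 0))
# struct<name: string, ic: struct<icManager: null>>
```

A Set of names (the old code) could only say "seen before or not"; the Int count is what lets the traversal permit a bounded number of repetitions before truncating.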

Contributor

@cloud-fan cloud-fan left a comment

looks good to me except for some minor comments

@rangadi

rangadi commented Dec 20, 2022

jenkins merge

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d33a59c Dec 21, 2022
cloud-fan pushed a commit that referenced this pull request Dec 22, 2022
…nverters

### What changes were proposed in this pull request?

Following up from PR #38922 to remove unnecessary ScalaReflectionLock from SchemaConvertors file.

cc: cloud-fan

### Why are the changes needed?

removing unnecessary code

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing unit tests

Closes #39147 from SandishKumarHN/SPARK-41639.

Authored-by: SandishKumarHN <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@rangadi

rangadi commented Feb 14, 2023

See #40011 for a follow-up tweak for this config. '0' is not supported. Fixes how the limit is applied.
