
Conversation

@LuciferYang (Contributor) commented Mar 8, 2023

What changes were proposed in this pull request?

This PR adds a new proto message:

```
message Parse {
  // (Required) Input relation to Parse. The input is expected to have a single text column.
  Relation input = 1;
  // (Required) The expected format of the text.
  ParseFormat format = 2;

  // (Optional) DataType representing the schema. If not set, Spark will infer the schema.
  optional DataType schema = 3;

  // Options for the csv/json parser. The map key is case insensitive.
  map<string, string> options = 4;
  enum ParseFormat {
    PARSE_FORMAT_UNSPECIFIED = 0;
    PARSE_FORMAT_CSV = 1;
    PARSE_FORMAT_JSON = 2;
  }
}
```

and implements CSV/JSON parsing functions for the Scala client.
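
For orientation, here is a minimal, hypothetical sketch of how this surfaces on the Scala client (the connection string and sample data are illustrative, not taken from this PR):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical Spark Connect endpoint.
val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()

val ds = spark.createDataset(Seq("""{"name":"Kong","age":73}"""))(Encoders.STRING)

// No schema supplied: the server infers one from the text column.
val inferred = spark.read.json(ds)

// Explicit schema: carried to the server in the new Parse message.
val parsed = spark.read.schema("name STRING, age BIGINT").json(ds)
```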

Why are the changes needed?

This adds API coverage for the Spark Connect JVM client.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Passed GitHub Actions
  • Manually checked with Scala 2.13

@github-actions bot added the PYTHON label Mar 8, 2023
@zhengruifeng (Contributor)

Why not use from_json and from_csv to do this?

@LuciferYang (Contributor, Author)

> Why not use from_json and from_csv to do this?

How would we get the schema?

@zhengruifeng (Contributor)

Not sure whether I'm missing something, but isn't the schema already provided by users?

```scala
        new StructType().add("age", LongType).add("city", StringType).add("name", StringType)))
val ds = Seq("""{"name":"Kong","age":73,"city":'Shandong'}""").toDS()
val result = spark.read.option("allowSingleQuotes", "true").json(ds)
checkSameResult(expected, result)
```
@LuciferYang (Contributor, Author) Mar 8, 2023

```scala
def json(jsonDataset: Dataset[String]): DataFrame = {
  val parsedOptions = new JSONOptions(
    extraOptions.toMap,
    sparkSession.sessionState.conf.sessionLocalTimeZone,
    sparkSession.sessionState.conf.columnNameOfCorruptRecord)
  userSpecifiedSchema.foreach(checkJsonSchema)
  val schema = userSpecifiedSchema.map {
    case s if !SQLConf.get.getConf(
        SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION) => s.asNullable
    case other => other
  }.getOrElse {
    TextInputJsonDataSource.inferFromDataset(jsonDataset, parsedOptions)
  }
```

From the server-side code, userSpecifiedSchema is an Option[StructType] that defaults to None, so I think we can call this function without specifying userSpecifiedSchema? Or is my test case not the correct scenario?

@zhengruifeng (Contributor)

Makes sense, you are right.

@LuciferYang (Contributor, Author)

Thanks ~

(Contributor)

Probably we should add the user-provided schema to the message? Or always discard it?

(Contributor)

Will inferFromDataset trigger a job? If so, I think we'd better skip it if possible.
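
(For context, a hedged sketch using standard Spark reader APIs — spark and jsonDs: Dataset[String] are assumed — of why an explicit schema lets the server skip that inference pass:)

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Without a schema, the server falls back to
// TextInputJsonDataSource.inferFromDataset, which scans the text column --
// the extra job being discussed here.
val inferred = spark.read.json(jsonDs)

// With an explicit schema, inference (and its job) is skipped entirely.
val parsed = spark.read
  .schema(new StructType().add("name", StringType).add("age", LongType))
  .json(jsonDs)
```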

@LuciferYang (Contributor, Author)

Yes, I think you are right; we should add the schema to the message if it exists, thanks ~ I will update it later.

@LuciferYang marked this pull request as draft March 8, 2023 13:14
@LuciferYang changed the title to [WIP][SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client Mar 8, 2023
@LuciferYang changed the title to [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client Mar 8, 2023
@LuciferYang marked this pull request as ready for review March 8, 2023 14:18
```scala
session.read
  .schema(new StructType().add("c1", StringType).add("c2", IntegerType))
  .option("allowSingleQuotes", "true")
  .json(session.createDataset(Seq.empty[String])(StringEncoder))
```
(Contributor)

session.emptyDataset(StringEncoder)?
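
(i.e., a sketch of the suggested simplification, with StringEncoder as in the test above:)

```scala
// Build the empty Dataset[String] directly rather than via Seq.empty.
val emptyDs = session.emptyDataset(StringEncoder)
```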

```
// (Optional) If not set, Spark will infer the schema.
//
// This schema string should be either DDL-formatted or JSON-formatted.
optional string schema = 3;
```
(Contributor)

Can't we use the actual data type?

@LuciferYang (Contributor, Author) Mar 8, 2023

#40332 (comment) & #40332 (comment)

and I think the userSpecifiedSchema can be different from the inferred schema

@LuciferYang (Contributor, Author)

So if userSpecifiedSchema is set, we should pass it

```
@@ -0,0 +1 @@
LogicalRDD [c1#0, c2#0], false
```
(Contributor)

Oh, this makes me sad. Why are we using RDDs here?

@LuciferYang (Contributor, Author) Mar 8, 2023

```scala
val parsed = jsonDataset.rdd.mapPartitions { iter =>
  val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = true)
  val parser = new FailureSafeParser[String](
    input => rawParser.parse(input, createParser, UTF8String.fromString),
    parsedOptions.parseMode,
    schema,
    parsedOptions.columnNameOfCorruptRecord)
  iter.flatMap(parser.parse)
}
sparkSession.internalCreateDataFrame(parsed, schema, isStreaming = jsonDataset.isStreaming)
```

```scala
val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
  val headerChecker = new CSVHeaderChecker(
    actualSchema,
    parsedOptions,
    source = s"CSV source: $csvDataset")
  headerChecker.checkHeaderColumnNames(firstLine)
  filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions))
}.getOrElse(filteredLines.rdd)
val parsed = linesWithoutHeader.mapPartitions { iter =>
  val rawParser = new UnivocityParser(actualSchema, parsedOptions)
  val parser = new FailureSafeParser[String](
    input => rawParser.parse(input),
    parsedOptions.parseMode,
    schema,
    parsedOptions.columnNameOfCorruptRecord)
  iter.flatMap(parser.parse)
}
sparkSession.internalCreateDataFrame(parsed, schema, isStreaming = csvDataset.isStreaming)
```

```scala
private[sql] def internalCreateDataFrame(
    catalystRows: RDD[InternalRow],
    schema: StructType,
    isStreaming: Boolean = false): DataFrame = {
  // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
  // schema differs from the existing schema on any field data type.
  val logicalPlan = LogicalRDD(
    schema.toAttributes,
    catalystRows,
    isStreaming = isStreaming)(self)
  Dataset.ofRows(self, logicalPlan)
}
```

@LuciferYang (Contributor, Author) Mar 8, 2023

On the server side, the input csvDataset and jsonDataset are still LocalRelations, and the code path above (sparkSession.internalCreateDataFrame) converts them to LogicalRDD.

@hvanhovell (Contributor) left a comment

LGTM

@zhengruifeng (Contributor)

Probably not related to this PR:

```scala
/**
 * Specifies the schema by using the input DDL-formatted string. Some data sources (e.g. JSON)
 * can infer the input schema automatically from data. By specifying the schema here, the
 * underlying data source can skip the schema inference step, and thus speed up data loading.
 *
 * {{{
 *   spark.read.schema("a INT, b STRING, c DOUBLE").csv("test.csv")
 * }}}
 *
 * @since 3.4.0
 */
def schema(schemaString: String): DataFrameReader = {
  schema(StructType.fromDDL(schemaString))
}
```

When the user provides a DDL string, it invokes the parser. Here I think we should keep both the StructType and the DDL string, and pass them to the server side.
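
For reference, a minimal illustration of the client-side parsing being discussed (StructType.fromDDL is the standard Spark API the string overload delegates to):

```scala
import org.apache.spark.sql.types.StructType

// The string overload of schema() runs the DDL parser on the client,
// before anything is sent to the server.
val parsed: StructType = StructType.fromDDL("a INT, b STRING, c DOUBLE")
```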

@LuciferYang (Contributor, Author) commented Mar 9, 2023

> Probably not related to this PR:
>
> [the schema(schemaString: String) snippet quoted above]
>
> When the user provides a DDL string, it invokes the parser. Here I think we should keep both the StructType and the DDL string, and pass them to the server side.

The message Read seems to also need to consider this? I think we can discuss this problem further in a separate PR.

```scala
val parseBuilder = builder.getParseBuilder
  .setInput(ds.plan.getRoot)
  .setFormat(format)
userSpecifiedSchema.foreach(schema => parseBuilder.setSchema(schema.toDDL))
```
(Contributor)

As to this PR itself, I think we should probably use DataType schema in the proto message, since schema.toDDL always discards the metadata.
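
A small sketch of that lossiness (standard Spark types API; the metadata key here is illustrative):

```scala
import org.apache.spark.sql.types._

val md = new MetadataBuilder().putString("origin", "upstream").build()
val schema = new StructType().add(StructField("age", LongType, nullable = true, md))

// toDDL renders only names and types; the arbitrary metadata above does not
// survive a round trip through the DDL string.
schema.toDDL                      // "age BIGINT"
StructType.fromDDL(schema.toDDL)  // StructField("age", LongType) with empty metadata
```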

@LuciferYang (Contributor, Author)

What do you think about this? @hvanhovell

@LuciferYang (Contributor, Author)

20f1722 changed this to pass a DataType @zhengruifeng

@LuciferYang (Contributor, Author)

ca6ec7b renamed data_type to schema in the proto message

@LuciferYang (Contributor, Author)

In this scenario, I think DataType schema is used and schema.toDDL is no longer needed.
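
(A hedged sketch of what the builder presumably looks like after ca6ec7b — DataTypeProtoConverter.toConnectProtoType is the converter used elsewhere in the connect client, and the surrounding names come from the snippet quoted earlier, so this is an approximation, not the exact patch:)

```scala
import org.apache.spark.sql.connect.common.DataTypeProtoConverter

val parseBuilder = builder.getParseBuilder
  .setInput(ds.plan.getRoot)
  .setFormat(format)
// Send the user-specified schema as a proto DataType rather than a DDL
// string, so field metadata is preserved.
userSpecifiedSchema.foreach { schema =>
  parseBuilder.setSchema(DataTypeProtoConverter.toConnectProtoType(schema))
}
```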

zhengruifeng pushed a commit that referenced this pull request Mar 9, 2023
… client

### What changes were proposed in this pull request?
This PR adds a new proto message

```
message Parse {
  // (Required) Input relation to Parse. The input is expected to have a single text column.
  Relation input = 1;
  // (Required) The expected format of the text.
  ParseFormat format = 2;

  // (Optional) DataType representing the schema. If not set, Spark will infer the schema.
  optional DataType schema = 3;

  // Options for the csv/json parser. The map key is case insensitive.
  map<string, string> options = 4;
  enum ParseFormat {
    PARSE_FORMAT_UNSPECIFIED = 0;
    PARSE_FORMAT_CSV = 1;
    PARSE_FORMAT_JSON = 2;
  }
}
```

and implements CSV/JSON parsing functions for the Scala client.

### Why are the changes needed?
This adds API coverage for the Spark Connect JVM client.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Passed GitHub Actions
- Manually checked with Scala 2.13

Closes #40332 from LuciferYang/SPARK-42690.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
(cherry picked from commit 07f71d2)
Signed-off-by: Ruifeng Zheng <[email protected]>
@zhengruifeng (Contributor)

@LuciferYang thank you for working on this.

Merged into master/branch-3.4.

@LuciferYang (Contributor, Author)

Thanks @zhengruifeng @hvanhovell @HyukjinKwon ~

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
… client

(Same commit message as the commit above.)

Closes apache#40332 from LuciferYang/SPARK-42690.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
(cherry picked from commit 07f71d2)
Signed-off-by: Ruifeng Zheng <[email protected]>