
Conversation

@SaurabhChawla100
Contributor

What changes were proposed in this pull request?

Until now, schema inference has not supported DateType while reading CSV/JSON. In this PR, a code change is made to infer DateType when inferSchema is set to true in the options.

Why are the changes needed?

Often there are multiple columns that are DateType, but after schema inference they are added as StringType, and to_date must be applied before running any query. After this change, DateType will be added to the schema instead of StringType.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added; the change was also tested manually by running read commands in the Spark shell.

@github-actions github-actions bot added the SQL label May 15, 2021
@SaurabhChawla100 SaurabhChawla100 changed the title [SPARK-35279][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading May 15, 2021
@HyukjinKwon
Member

See also #23202. I think this JIRA is a duplicate of SPARK-26248

@SaurabhChawla100
Contributor Author

See also #23202. I think this JIRA is a duplicate of SPARK-26248

@HyukjinKwon - Yes, it looks like a duplicate of SPARK-26248, which was reverted.

@SaurabhChawla100
Contributor Author

@HyukjinKwon - Regarding the points raised in the reverted PR, I would like to get some clarification:

Problem 1.

#23202 (comment) - I left some examples there.

If there are multiple rows, and the first row is inferred as date type in the same partition,
It will not be able to infer timestamp afterward.

Isn't this the same problem with the timestamp format as well, if we specify input like this?

2,23232,hello,2016-11-11
2,23232,hello1,2016.11.11

 val p =  spark.read.format("csv").option("header", "false").option("delimiter", ",").option("timestampFormat", "yyyy-MM-dd").option("inferSchema", "true").load("/testDir/test.csv")

p.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(_c0,IntegerType,true), StructField(_c1,IntegerType,true), StructField(_c2,StringType,true), StructField(_c3,StringType,true))

Here, for timestampFormat, we get StringType, which is the final fallback type. I believe that in a real-world scenario this is a case of data corruption, where data is saved directly into the file instead of applying filtering criteria that prevent multiple data formats in the same column.
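To make the fallback behaviour concrete, here is a hypothetical sketch (the names inferField and mergeTypes are illustrative, not Spark's actual internals): each row's value is inferred independently, and the column type is the least common type across rows, so a single mismatched row widens the column to StringType.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, ResolverStyle}

// Strict formatter so only exact yyyy-MM-dd values parse as dates
val dateFmt = DateTimeFormatter
  .ofPattern("uuuu-MM-dd")
  .withResolverStyle(ResolverStyle.STRICT)

// Infer a single field's type: DateType if it parses, else StringType
def inferField(v: String): String =
  try { LocalDate.parse(v, dateFmt); "DateType" }
  catch { case _: Exception => "StringType" }

// Merge row-level types into a column type; mismatches widen to string
def mergeTypes(a: String, b: String): String =
  if (a == b) a else "StringType"

val rows = Seq("2016-11-11", "2016.11.11")
val columnType = rows.map(inferField).reduce(mergeTypes)
// The second row fails the date pattern, so the whole column falls back
// to StringType, matching the schema shown above
```

This mirrors why _c3 comes out as StringType in the spark-shell example: the two rows disagree, and string is the only type that fits both.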

Problem 2.

#23202 (comment)

If legacy is on, we have ambiguity about date/timestamp pattern matching, because they can be arbitrarily set by users.
It does not do the exact match, which means it's not going to distinguish yyyy-MM and yyyy-MM-dd for input, for instance, 2010-10-10.

We are able to do this only when spark.sql.legacy.timeParser.enabled is disabled (by default), however, I was thinking it's going to introduce complexity.
I was thinking we could do it later when we remove spark.sql.legacy.timeParser.enabled. Date type inference isn't super important IMHO because we infer timestamps.
I would like to talk about this further if anyone thinks differently. If the change isn't complicated then I thought, it should also be okay to go ahead.

I am not able to understand this point clearly, but I was thinking that if, for the date format validation, we set dtFormat.setLenient(false), it would be able to distinguish between yyyy-MM and yyyy-MM-dd.
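A small standalone sketch of this question (not code from the PR): setLenient(false) on its own does not reject trailing input, because SimpleDateFormat still parses "2010-10-10" against "yyyy-MM" by consuming only the "2010-10" prefix; a ParsePosition check (or java.time's DateTimeFormatter, which requires the whole input to be consumed) is what actually distinguishes the two patterns.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.time.YearMonth
import java.time.format.DateTimeFormatter

// Legacy formatter: non-lenient, but still parses a prefix of the input
val sdf = new SimpleDateFormat("yyyy-MM")
sdf.setLenient(false)
val pos = new ParsePosition(0)
val parsed = sdf.parse("2010-10-10", pos)
// parse succeeds, but stops after "2010-10" (index 7), so an explicit
// full-match check on the parse position is still needed
val fullMatch = parsed != null && pos.getIndex == "2010-10-10".length

// java.time formatter: fails unless the entire input is consumed
val strict = DateTimeFormatter.ofPattern("yyyy-MM")
val exactMatch =
  try { YearMonth.parse("2010-10-10", strict); true }
  catch { case _: Exception => false }
// exactMatch == false: "2010-10-10" is rejected against yyyy-MM
```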

@SaurabhChawla100 SaurabhChawla100 force-pushed the SPARK-34953 branch 2 times, most recently from 62ae145 to 598e38e Compare May 16, 2021 10:10
@SaurabhChawla100 SaurabhChawla100 force-pushed the SPARK-34953 branch 2 times, most recently from ea8057d to 276066b Compare May 17, 2021 12:39
@SaurabhChawla100
Contributor Author

@HyukjinKwon - I made the changes as per the review comments. Could you please review this PR?

@HyukjinKwon
Member

ok to test

Member

cc @MaxGekk FYI

@HyukjinKwon HyukjinKwon changed the title [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV May 31, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV and JSON May 31, 2021
@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43610/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43610/

@SparkQA

SparkQA commented May 31, 2021

Test build #139089 has finished for PR 32558 at commit 29a5e43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the DOCS label May 31, 2021
Member

Let's change the default value to false. This was a mistake from #32204, and @itholic is working on it to fix. BTW, seems like CSV option is missing here.

Contributor Author

Done, added for CSV as well.

@HyukjinKwon
Member

I think we should also add the parameter at csv and json at DataFrameWriter, DataFrameReader, DataStreamReader, DataStreamWriter (and also readwriter.py and streaming.py). @itholic can you help review this one please?

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43619/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43619/

@SaurabhChawla100
Contributor Author

SaurabhChawla100 commented May 31, 2021

I think we should also add the parameter at csv and json at DataFrameWriter, DataFrameReader, DataStreamReader, DataStreamWriter (and also readwriter.py and streaming.py). @itholic can you help review this one please?

I am not able to understand what parameter needs to be added for CSV and JSON in DataFrameWriter, DataFrameReader, DataStreamReader, and DataStreamWriter (and also readwriter.py and streaming.py).

There is already private var extraOptions = CaseInsensitiveMap[String](Map.empty), and any options the user sets are inserted into it by:

def option(key: String, value: String): DataFrameReader = {
    this.extraOptions = this.extraOptions + (key -> value)
    this
  }

So this already works when the option is given:

scala> spark.read.option("inferSchema", "true").option("inferDateType", "true").option("dateFormat", "yyyy-MM-dd").json(Seq("""{"a": {"b": 1, "c": "2021-02-26"}}""").toDS()).schema
res18: org.apache.spark.sql.types.StructType = StructType(StructField(a,StructType(StructField(b,LongType,true), StructField(c,DateType,true)),true))

in Pyspark

spark.read.option("inferSchema", "true").option("inferDateType", "true").option("dateFormat", "yyyy-MM-dd").json("/testDir/test1.json").schema
StructType(List(StructField(a,StructType(List(StructField(b,LongType,true),StructField(c,DateType,true))),true)))

Please let me know if my understanding is incorrect here.

@SparkQA

SparkQA commented Jun 1, 2021

Test build #139172 has finished for PR 32558 at commit 3d46a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

This makes sense to me.

Member

@HyukjinKwon HyukjinKwon left a comment

It would be great if we could get a look from @MaxGekk, though.

@SaurabhChawla100
Contributor Author

@HyukjinKwon - Thank you for reviewing this PR.

@MaxGekk - Please take a look into this PR.

@SparkQA

SparkQA commented Jun 2, 2021

Test build #139204 has finished for PR 32558 at commit 78a7356.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

<tr>
<td><code>inferDateType</code></td>
<td>false</td>
<td>Infers all DateType format for the CSV. If this is not set, it uses the default value, <code>false</code>.</td>
Member

What do you mean by "all DateType format"? I think we should mention the dateFormat option here.

Contributor Author

Made the change.

<tr>
<td><code>inferDateType</code></td>
<td>false</td>
<td>Infers all DateType format for the JSON. If this is not set, it uses the default value, <code>false</code>.</td>
Member

the JSON ? Is it needed?

Contributor Author

This is to specify that this entry applies to JSON in this doc.

case LongType => tryParseLong(field)
case _: DecimalType => tryParseDecimal(field)
case DoubleType => tryParseDouble(field)
case DateType => tryParseDateFormat(field)
Member

Just curious why you try to infer dates before timestamps?

Contributor Author

This is as per the suggestion in one of the review comments:

#32558 (comment)
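One reason to try the date pattern before the timestamp pattern (a standalone sketch, not the PR's actual code -- the name infer is illustrative): a date-only string also satisfies a timestamp parse once a default time-of-day is filled in, so if timestamps were tried first, a pure date could never be inferred as DateType. Trying the narrower date pattern first keeps DateType, while a genuine timestamp fails the date pattern and falls through.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Try the narrower date pattern first, then the timestamp pattern,
// then fall back to string -- mirroring the ordering discussed above
def infer(v: String): String = {
  def ok(p: => Any) = try { p; true } catch { case _: Exception => false }
  if (ok(LocalDate.parse(v, dateFmt))) "DateType"
  else if (ok(LocalDateTime.parse(v, tsFmt))) "TimestampType"
  else "StringType"
}
```

For example, infer("2021-02-26") yields DateType while infer("2021-02-26 10:30:00") yields TimestampType; reversing the order would require the timestamp branch to reject plain dates explicitly.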


private def tryParseDateFormat(field: String): DataType = {
    if (options.inferDateType
      && !dateFormatter.isInstanceOf[LegacySimpleDateFormatter]
Member

I didn't get the check. Could you explain, please, why do you avoid the formatter?

Contributor Author

It has to be LegacyFastDateFormatter; I missed changing it. Previously I was using SimpleDateFormatter, so I added this LegacySimpleDateFormatter check. Now, since we are using FastDateFormatter, it has to be LegacyFastDateFormatter. Making that change.

If legacy is on, we have ambiguity about DateType pattern matching, because the patterns can be arbitrarily set by users.
It does not do the exact match, which means it's not going to distinguish yyyy-MM and yyyy-MM-dd for input, for instance, 2010-10-10.

}

/**
* option to infer date Type in the schema
Member

For me as a user, the relation between the inferSchema option and this one is not clear. Does this option enable inferring independently of inferSchema?

Contributor Author

The inferDateType option is added to keep behaviour in sync with older versions of Spark: if someone wants DateType, they can enable it via this option. This is just to prevent any migration issue when moving to Spark 3.2.0 from an older version. If they don't enable inferDateType, the column is inferred as StringType.

On the other hand, inferSchema enables schema inference as a whole.
If both inferSchema and inferDateType are enabled, then while reading, the column's data type is inferred as DateType instead of StringType.
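The interplay can be sketched as follows (a hypothetical illustration, not the actual implementation -- the option names come from this PR, but ReadOptions and inferColumn are made up here): inferDateType only takes effect while inferSchema is walking the rows, and the default keeps the old StringType behaviour.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustrative option bag for the two flags plus the date pattern
case class ReadOptions(inferSchema: Boolean, inferDateType: Boolean, dateFormat: String)

def inferColumn(values: Seq[String], opts: ReadOptions): String = {
  if (!opts.inferSchema) return "StringType"  // no inference at all
  val fmt = DateTimeFormatter.ofPattern(opts.dateFormat)
  // Date inference is gated on inferDateType; every row must match
  val allDates = opts.inferDateType && values.forall { v =>
    try { LocalDate.parse(v, fmt); true } catch { case _: Exception => false }
  }
  if (allDates) "DateType" else "StringType"  // default stays StringType
}

val col = Seq("2021-02-26", "2021-03-01")
// inferDateType off (the default) behaves like older Spark versions:
//   inferColumn(col, ReadOptions(true, false, "yyyy-MM-dd")) -> StringType
// inferDateType on promotes the column to DateType:
//   inferColumn(col, ReadOptions(true, true,  "yyyy-MM-dd")) -> DateType
```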

Seq("ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0)))
}

test("SPARK-34953 - DateType should be inferred when user defined format are provided") {
Member

The common convention is: SPARK-XXXXX: ...

Suggested change
test("SPARK-34953 - DateType should be inferred when user defined format are provided") {
test("SPARK-34953: DateType should be inferred when user defined format are provided") {

Contributor Author

Done.

checkType(Map("inferTimestamp" -> "false"), json, StringType)
}

test("SPARK-34953 - Allow DateType format while inferring") {
Member

Suggested change
test("SPARK-34953 - Allow DateType format while inferring") {
test("SPARK-34953: Allow DateType format while inferring") {

Contributor Author

Done.

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43727/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43727/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43730/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43735/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43736/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43735/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43736/

@SparkQA

SparkQA commented Jun 2, 2021

Test build #139207 has finished for PR 32558 at commit 2689fd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2021

Test build #139213 has finished for PR 32558 at commit c93b873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SaurabhChawla100
Contributor Author

@MaxGekk - Thanks for reviewing the PR. I made the changes requested in the review comments and also replied to your comments. Could you please take another look at the PR?

@SaurabhChawla100
Contributor Author

@MaxGekk, @HyukjinKwon - Thank you for reviewing this PR. Are we planning to move forward with this PR, or do we need more changes? Please share your thoughts.

@HyukjinKwon
Member

I would defer to @MaxGekk

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 31, 2021
@github-actions github-actions bot closed this Nov 1, 2021
@nate-kuhl

nate-kuhl commented Aug 10, 2022

@SaurabhChawla100 @HyukjinKwon has this feature (specifically DateType inference for JSON) been included in any subsequent releases/PRs? If not, I might try to write the patch myself.

@nate-kuhl

also @MaxGekk ^^^

@HyukjinKwon
Member

It's been added at #37327 and #23202
