
Conversation

@SaurabhChawla100
Contributor

What changes were proposed in this pull request?

Until now, schema inference has not supported DateType while reading CSV/JSON. In this PR, a code change is made to infer DateType when inferSchema is set to true in the options.

Why are the changes needed?

Often there are multiple columns that are DateType, but after schema inference they are added as StringType, and to_date must be applied before running any query. After this change, DateType will be added to the schema instead of StringType.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added; the change was also tested manually by running read commands in the Spark shell.

@github-actions github-actions bot added the SQL label May 15, 2021
@SaurabhChawla100 SaurabhChawla100 changed the title [SPARK-35279][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading May 15, 2021
@HyukjinKwon
Member

See also #23202. I think this JIRA is a duplicate of SPARK-26248

@SaurabhChawla100
Contributor Author

See also #23202. I think this JIRA is a duplicate of SPARK-26248

@HyukjinKwon - Yes, it looks like a duplicate of SPARK-26248, which was reverted.

@SaurabhChawla100
Contributor Author

@HyukjinKwon - Regarding the points raised in the reverted PR, I would like to get some clarification:

Problem 1.

#23202 (comment) - I left some examples there.

If there are multiple rows, and the first row is inferred as date type in the same partition,
It will not be able to infer timestamp afterward.

Isn't this the same problem with the timestamp format as well, if we specify input like this?

2,23232,hello,2016-11-11
2,23232,hello1,2016.11.11

 val p =  spark.read.format("csv").option("header", "false").option("delimiter", ",").option("timestampFormat", "yyyy-MM-dd").option("inferSchema", "true").load("/testDir/test.csv")

p.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(_c0,IntegerType,true), StructField(_c1,IntegerType,true), StructField(_c2,StringType,true), StructField(_c3,StringType,true))

Here, for timestampFormat, we get StringType, which is the final fallback type. I believe that in a real-world scenario this is a case of data corruption, where data is saved directly into the file instead of applying filtering criteria that prevent multiple data formats in the same column.
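To make the fallback behaviour concrete, here is a hypothetical sketch (the names inferField and mergeTypes are illustrative, not Spark's actual internals): each row's value is inferred independently, and the column type is the least common type across rows, so a single mismatched row widens the column to StringType.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, ResolverStyle}

// Strict formatter so only exact yyyy-MM-dd values parse as dates
val dateFmt = DateTimeFormatter
  .ofPattern("uuuu-MM-dd")
  .withResolverStyle(ResolverStyle.STRICT)

// Infer a single field's type: DateType if it parses, else StringType
def inferField(v: String): String =
  try { LocalDate.parse(v, dateFmt); "DateType" }
  catch { case _: Exception => "StringType" }

// Merge row-level types into a column type; mismatches widen to string
def mergeTypes(a: String, b: String): String =
  if (a == b) a else "StringType"

val rows = Seq("2016-11-11", "2016.11.11")
val columnType = rows.map(inferField).reduce(mergeTypes)
// The second row fails the date pattern, so the whole column falls back
// to StringType, matching the schema shown above
```

This mirrors why _c3 comes out as StringType in the spark-shell example: the two rows disagree, and string is the only type that fits both.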

Problem 2.

#23202 (comment)

If legacy is on, we have ambiguity about date/timestamp pattern matching, because they can be arbitrarily set by users.
It does not do the exact match, which means it's not going to distinguish yyyy-MM and yyyy-MM-dd for input, for instance, 2010-10-10.

We are able to do this only when spark.sql.legacy.timeParser.enabled is disabled (by default), however, I was thinking it's going to introduce complexity.
I was thinking we could do it later when we remove spark.sql.legacy.timeParser.enabled. Date type inference isn't super important IMHO because we infer timestamps.
I would like to talk about this further if anyone thinks differently. If the change isn't complicated then I thought, it should also be okay to go ahead.

I am not able to understand this point clearly, but I was thinking that if, for the date format validation, we set dtFormat.setLenient(false), it would be able to distinguish between yyyy-MM and yyyy-MM-dd.
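A small standalone sketch of this question (not code from the PR): setLenient(false) on its own does not reject trailing input, because SimpleDateFormat still parses "2010-10-10" against "yyyy-MM" by consuming only the "2010-10" prefix; a ParsePosition check (or java.time's DateTimeFormatter, which requires the whole input to be consumed) is what actually distinguishes the two patterns.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.time.YearMonth
import java.time.format.DateTimeFormatter

// Legacy formatter: non-lenient, but still parses a prefix of the input
val sdf = new SimpleDateFormat("yyyy-MM")
sdf.setLenient(false)
val pos = new ParsePosition(0)
val parsed = sdf.parse("2010-10-10", pos)
// parse succeeds, but stops after "2010-10" (index 7), so an explicit
// full-match check on the parse position is still needed
val fullMatch = parsed != null && pos.getIndex == "2010-10-10".length

// java.time formatter: fails unless the entire input is consumed
val strict = DateTimeFormatter.ofPattern("yyyy-MM")
val exactMatch =
  try { YearMonth.parse("2010-10-10", strict); true }
  catch { case _: Exception => false }
// exactMatch == false: "2010-10-10" is rejected against yyyy-MM
```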

@SaurabhChawla100 SaurabhChawla100 force-pushed the SPARK-34953 branch 2 times, most recently from 62ae145 to 598e38e Compare May 16, 2021 10:10
@SaurabhChawla100 SaurabhChawla100 force-pushed the SPARK-34953 branch 2 times, most recently from ea8057d to 276066b Compare May 17, 2021 12:39
@SaurabhChawla100
Contributor Author

@HyukjinKwon - I made the changes as per the review comments. Could you please review this PR?

@HyukjinKwon
Member

ok to test

Member

cc @MaxGekk FYI

@HyukjinKwon HyukjinKwon changed the title [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV May 31, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV and JSON May 31, 2021
@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43610/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43610/

@SparkQA

SparkQA commented May 31, 2021

Test build #139089 has finished for PR 32558 at commit 29a5e43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the DOCS label May 31, 2021
Member

Let's change the default value to false. This was a mistake from #32204, and @itholic is working on it to fix. BTW, seems like CSV option is missing here.

Contributor Author

Done, added for CSV as well.

@HyukjinKwon
Member

I think we should also add the parameter at csv and json at DataFrameWriter, DataFrameReader, DataStreamReader, DataStreamWriter (and also readwriter.py and streaming.py). @itholic can you help review this one please?

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43619/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43619/

@SaurabhChawla100
Contributor Author

SaurabhChawla100 commented May 31, 2021

I think we should also add the parameter at csv and json at DataFrameWriter, DataFrameReader, DataStreamReader, DataStreamWriter (and also readwriter.py and streaming.py). @itholic can you help review this one please?

I am not able to understand what parameter needs to be added for CSV and JSON in DataFrameWriter, DataFrameReader, DataStreamReader, and DataStreamWriter (and also readwriter.py and streaming.py).

There is already private var extraOptions = CaseInsensitiveMap[String](Map.empty), and any options the user sets are inserted into it by:

def option(key: String, value: String): DataFrameReader = {
    this.extraOptions = this.extraOptions + (key -> value)
    this
  }

So this already works when the option is given:

scala> spark.read.option("inferSchema", "true").option("inferDateType", "true").option("dateFormat", "yyyy-MM-dd").json(Seq("""{"a": {"b": 1, "c": "2021-02-26"}}""").toDS()).schema
res18: org.apache.spark.sql.types.StructType = StructType(StructField(a,StructType(StructField(b,LongType,true), StructField(c,DateType,true)),true))

in Pyspark

spark.read.option("inferSchema", "true").option("inferDateType", "true").option("dateFormat", "yyyy-MM-dd").json("/testDir/test1.json").schema
StructType(List(StructField(a,StructType(List(StructField(b,LongType,true),StructField(c,DateType,true))),true)))

Please let me know if my understanding is incorrect here.

@SparkQA

SparkQA commented Jun 1, 2021

Test build #139172 has finished for PR 32558 at commit 3d46a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

This makes sense to me.

Member

@HyukjinKwon HyukjinKwon left a comment

It would be great if we could get a look from @MaxGekk, though.

@SaurabhChawla100
Contributor Author

@HyukjinKwon - Thank you for reviewing this PR.

@MaxGekk - Please take a look into this PR.

@SparkQA

SparkQA commented Jun 2, 2021

Test build #139204 has finished for PR 32558 at commit 78a7356.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

<tr>
<td><code>inferDateType</code></td>
<td>false</td>
<td>Infers all DateType format for the CSV. If this is not set, it uses the default value, <code>false</code>.</td>
Member

What do you mean by "all DateType format"? I think we should mention the dateFormat option here.

Contributor Author

Made the change.

<tr>
<td><code>inferDateType</code></td>
<td>false</td>
<td>Infers all DateType format for the JSON. If this is not set, it uses the default value, <code>false</code>.</td>
Member

the JSON ? Is it needed?

Contributor Author

This is to specify that this entry applies to JSON in this doc.

case LongType => tryParseLong(field)
case _: DecimalType => tryParseDecimal(field)
case DoubleType => tryParseDouble(field)
case DateType => tryParseDateFormat(field)
Member

Just curious why you try to infer dates before timestamps?

Contributor Author

This is as per the suggestion in one of the review comments:

#32558 (comment)
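One reason to try the date pattern before the timestamp pattern (a standalone sketch, not the PR's actual code -- the name infer is illustrative): a date-only string also satisfies a timestamp parse once a default time-of-day is filled in, so if timestamps were tried first, a pure date could never be inferred as DateType. Trying the narrower date pattern first keeps DateType, while a genuine timestamp fails the date pattern and falls through.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Try the narrower date pattern first, then the timestamp pattern,
// then fall back to string -- mirroring the ordering discussed above
def infer(v: String): String = {
  def ok(p: => Any) = try { p; true } catch { case _: Exception => false }
  if (ok(LocalDate.parse(v, dateFmt))) "DateType"
  else if (ok(LocalDateTime.parse(v, tsFmt))) "TimestampType"
  else "StringType"
}
```

For example, infer("2021-02-26") yields DateType while infer("2021-02-26 10:30:00") yields TimestampType; reversing the order would require the timestamp branch to reject plain dates explicitly.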


private def tryParseDateFormat(field: String): DataType = {
    if (options.inferDateType
      && !dateFormatter.isInstanceOf[LegacySimpleDateFormatter]
Member

I didn't get the check. Could you explain, please, why do you avoid the formatter?

Contributor Author

It has to be LegacyFastDateFormatter; I missed changing it. Previously I was using SimpleDateFormatter, so I added this LegacySimpleDateFormatter check. Now, since we are using FastDateFormatter, it has to be LegacyFastDateFormatter. Making that change.

If legacy is on, we have ambiguity about DateType pattern matching, because the patterns can be arbitrarily set by users.
It does not do the exact match, which means it's not going to distinguish yyyy-MM and yyyy-MM-dd for input, for instance, 2010-10-10.

}

/**
* option to infer date Type in the schema
Member

For me as a user, the relation between the inferSchema option and this one is not clear. Does this option enable inferring independently of inferSchema?

Contributor Author

The inferDateType option is added to keep behaviour in sync with older versions of Spark: if someone wants DateType, they can enable it via this option. This is just to prevent any migration issue when moving to Spark 3.2.0 from an older version. If they don't enable inferDateType, the column is inferred as StringType.

On the other hand, inferSchema enables schema inference as a whole.
If both inferSchema and inferDateType are enabled, then while reading, the column's data type is inferred as DateType instead of StringType.
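The interplay can be sketched as follows (a hypothetical illustration, not the actual implementation -- the option names come from this PR, but ReadOptions and inferColumn are made up here): inferDateType only takes effect while inferSchema is walking the rows, and the default keeps the old StringType behaviour.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustrative option bag for the two flags plus the date pattern
case class ReadOptions(inferSchema: Boolean, inferDateType: Boolean, dateFormat: String)

def inferColumn(values: Seq[String], opts: ReadOptions): String = {
  if (!opts.inferSchema) return "StringType"  // no inference at all
  val fmt = DateTimeFormatter.ofPattern(opts.dateFormat)
  // Date inference is gated on inferDateType; every row must match
  val allDates = opts.inferDateType && values.forall { v =>
    try { LocalDate.parse(v, fmt); true } catch { case _: Exception => false }
  }
  if (allDates) "DateType" else "StringType"  // default stays StringType
}

val col = Seq("2021-02-26", "2021-03-01")
// inferDateType off (the default) behaves like older Spark versions:
//   inferColumn(col, ReadOptions(true, false, "yyyy-MM-dd")) -> StringType
// inferDateType on promotes the column to DateType:
//   inferColumn(col, ReadOptions(true, true,  "yyyy-MM-dd")) -> DateType
```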

Seq("ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0)))
}

test("SPARK-34953 - DateType should be inferred when user defined format are provided") {
Member

The common convention is: SPARK-XXXXX: ...

Suggested change
test("SPARK-34953 - DateType should be inferred when user defined format are provided") {
test("SPARK-34953: DateType should be inferred when user defined format are provided") {

Contributor Author

Done.

checkType(Map("inferTimestamp" -> "false"), json, StringType)
}

test("SPARK-34953 - Allow DateType format while inferring") {
Member

Suggested change
test("SPARK-34953 - Allow DateType format while inferring") {
test("SPARK-34953: Allow DateType format while inferring") {

Contributor Author

Done.

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43727/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43727/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43730/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43735/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43736/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43735/

@SparkQA

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43736/

@SparkQA

SparkQA commented Jun 2, 2021

Test build #139207 has finished for PR 32558 at commit 2689fd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2021

Test build #139213 has finished for PR 32558 at commit c93b873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SaurabhChawla100
Contributor Author

@MaxGekk - Thanks for reviewing the PR. I made the changes requested in the review comments and also replied to your comments. Could you please take another look at the PR?

@SaurabhChawla100
Contributor Author

@MaxGekk, @HyukjinKwon - Thank you for reviewing this PR. Are we planning to move forward with this PR, or do we need more changes? Please share your thoughts.

@HyukjinKwon
Member

I would defer to @MaxGekk

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 31, 2021
@github-actions github-actions bot closed this Nov 1, 2021
@nate-kuhl

nate-kuhl commented Aug 10, 2022

@SaurabhChawla100 @HyukjinKwon has this feature (specifically DateType inference for JSON) been included in any subsequent releases/PRs? If not, I might try to write the patch myself.

@nate-kuhl

also @MaxGekk ^^^

@HyukjinKwon
Member

It's been added at #37327 and #23202
