[SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV and JSON #32558
Conversation
See also #23202. I think this JIRA is a duplicate of SPARK-26248.

@HyukjinKwon - Yes, it looks like a duplicate of SPARK-26248, which was reverted.

@HyukjinKwon - As per the points mentioned in the reverted PR, I would like to get some understanding. Shouldn't this be the same problem for the timestamp format as well, if we specify it like this? Here for timestampFormat we get StringType, which is the last type preferred. I believe in a real-world scenario it is a case of data corruption if we save such data directly into the file, instead of applying some filtering criteria that prevents multiple data formats in the same column. I am not able to understand this point clearly, but I was thinking that if for the date format we validate with dtFormat.setLenient(false), it will be able to distinguish the two.
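The fallback behavior discussed here — a column whose rows do not all match the configured pattern collapsing to StringType — can be sketched in plain Python. This is an illustration only; `infer_field` and `merge_types` are hypothetical names, not Spark's API, and the format string stands in for the timestampFormat option.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # stand-in for the timestampFormat option

def infer_field(field: str) -> str:
    # Each row is inferred independently against the configured pattern.
    try:
        datetime.strptime(field, TIMESTAMP_FORMAT)
        return "TimestampType"
    except ValueError:
        return "StringType"

def merge_types(a: str, b: str) -> str:
    # When two rows disagree, fall back to the widest common type.
    return a if a == b else "StringType"

def infer_column(rows):
    inferred = "TimestampType"
    for row in rows:
        inferred = merge_types(inferred, infer_field(row))
    return inferred
```

A single row in a different format is enough to drag the whole column down to StringType, which is the "last one preferred" behavior mentioned above.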
@HyukjinKwon - I made the changes as per the review comments. Can you please check this PR?

ok to test
cc @MaxGekk FYI
docs/sql-data-sources-json.md
done, added for CSV also
I think we should also add the parameter at |
Not able to understand what parameter needs to be added for the CSV and JSON. There is already one, so this works when given in the option in PySpark. Please do let me know if my understanding is not correct here.
Looks like it makes sense to me.
HyukjinKwon left a comment
It would be great if we can have a look from @MaxGekk though.
@HyukjinKwon - Thank you for reviewing this PR. @MaxGekk - Please take a look at this PR.
docs/sql-data-sources-csv.md

<tr>
<td><code>inferDateType</code></td>
<td>false</td>
<td>Infers all DateType format for the CSV. If this is not set, it uses the default value, <code>false</code>.</td>
What do you mean by "all DateType format"? I think we should say something about the dateFormat option here.
made the change
docs/sql-data-sources-json.md
<tr>
<td><code>inferDateType</code></td>
<td>false</td>
<td>Infers all DateType format for the JSON. If this is not set, it uses the default value, <code>false</code>.</td>
"the JSON"? Is it needed?
This is to specify that it is for JSON in this doc.
case LongType => tryParseLong(field)
case _: DecimalType => tryParseDecimal(field)
case DoubleType => tryParseDouble(field)
case DateType => tryParseDateFormat(field)
Just curious why you try to infer dates before timestamps?
This is as per the suggestion in one of the review comments to use
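The match cases above try narrower types first, so a field like `2010-10-10` is claimed by the date pattern before a timestamp pattern would be consulted. A minimal plain-Python sketch of such a try-in-order inference chain (hypothetical names and simplified patterns — Spark's actual logic lives in CSVInferSchema.scala):

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"            # stand-in for the dateFormat option
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # stand-in for timestampFormat

def _parses(field: str, fmt: str) -> bool:
    try:
        datetime.strptime(field, fmt)
        return True
    except ValueError:
        return False

def infer_field(field: str) -> str:
    # Try narrower types first; dates are tried before timestamps,
    # and anything unparseable falls through to StringType.
    try:
        int(field)
        return "LongType"
    except ValueError:
        pass
    try:
        float(field)
        return "DoubleType"
    except ValueError:
        pass
    if _parses(field, DATE_FORMAT):
        return "DateType"
    if _parses(field, TIMESTAMP_FORMAT):
        return "TimestampType"
    return "StringType"
```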
private def tryParseDateFormat(field: String): DataType = {
  if (options.inferDateType
    && !dateFormatter.isInstanceOf[LegacySimpleDateFormatter]
I didn't get the check. Could you explain, please, why do you avoid the formatter?
It has to be LegacyFastDateFormatter; I missed changing it. Previously I was using SimpleDateFormatter, so I added this LegacySimpleDateFormatter check. Now that we are using FastDateFormatter, it has to be LegacyFastDateFormatter; making that change.
If legacy mode is on, we have ambiguity in DateType pattern matching, because patterns can be arbitrarily set by users. It does not do an exact match, which means it is not going to distinguish yyyy-MM and yyyy-MM-dd for an input such as 2010-10-10.
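The exact-match point can be illustrated with any strict parser. A plain-Python sketch (standing in for Spark's Java formatters, not reproducing them) contrasting whole-string matching with a lenient, prefix-style parse:

```python
from datetime import datetime

def parses_exactly(field: str, fmt: str) -> bool:
    # strptime requires the entire string to match the pattern,
    # analogous to an exact-match (non-lenient) date formatter.
    try:
        datetime.strptime(field, fmt)
        return True
    except ValueError:
        return False

def parses_prefix(field: str, fmt: str, prefix_len: int) -> bool:
    # Crude emulation of a lenient parser that consumes only a prefix
    # of the input and ignores the rest; prefix_len is part of this
    # illustration, not a real formatter parameter.
    return parses_exactly(field[:prefix_len], fmt)
```

Under exact matching, "2010-10-10" is rejected by a yyyy-MM pattern, while a prefix-style parse accepts it — which is exactly the ambiguity described above.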
}

/**
 * option to infer date Type in the schema
For me as a user, the relation between the inferSchema option and this one is not clear. Does this option enable inferring independently from `inferSchema`?
This option inferDateType is added to keep in sync with older versions of Spark: if someone wants DateType, they can enable it via the option. This is just to prevent any migration issue to spark-3.2.0 from an older version. If they don't enable the inferDateType option, the column will be inferred as StringType.
On the other hand, inferSchema enables schema inference in general.
If inferSchema is enabled and the inferDateType option is also enabled, then on reading, the schema will carry DateType for such columns instead of StringType.
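The interplay of the two options can be sketched as two independent flags, where date inference only takes effect when schema inference itself is on. This is a hypothetical illustration of the gating logic, not Spark's code:

```python
from datetime import datetime

def infer_column_type(field: str, infer_schema: bool, infer_date_type: bool) -> str:
    # Without inferSchema, every column stays StringType regardless
    # of any other option.
    if not infer_schema:
        return "StringType"
    # With inferSchema on, the date type is only attempted when the
    # inferDateType option is also enabled.
    if infer_date_type:
        try:
            datetime.strptime(field, "%Y-%m-%d")
            return "DateType"
        except ValueError:
            pass
    return "StringType"
```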
Seq("ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0)))
}

test("SPARK-34953 - DateType should be inferred when user defined format are provided") {
The common convention is: SPARK-XXXXX: ...

- test("SPARK-34953 - DateType should be inferred when user defined format are provided") {
+ test("SPARK-34953: DateType should be inferred when user defined format are provided") {
done.
checkType(Map("inferTimestamp" -> "false"), json, StringType)
}

test("SPARK-34953 - Allow DateType format while inferring") {
- test("SPARK-34953 - Allow DateType format while inferring") {
+ test("SPARK-34953: Allow DateType format while inferring") {
done
@MaxGekk - Thanks for reviewing the PR. I made the changes requested in the review comments and also replied to your comments. Can you please take a look at the PR?
@MaxGekk, @HyukjinKwon - Thank you for reviewing this PR. Are we planning to move forward with this PR, or do we need more changes? Please share your thoughts.
I would defer to @MaxGekk
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@SaurabhChawla100 @HyukjinKwon has this feature (specifically DateType inference for JSON) been included in any subsequent releases / PRs? If not I might try to write the patch myself...
also @MaxGekk ^^^ |
What changes were proposed in this pull request?
Until now there has been no support for the DateType format in schema inference while reading CSV/JSON. In this PR, the code is changed to support DateType when inferSchema is set to true in the options.
Why are the changes needed?
Often there are multiple columns that hold dates, but after schema inference they are added to the schema as StringType, and then they need to be converted with to_date before running any query. After this change, DateType will be added to the schema instead of StringType.
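For illustration, the manual conversion step this change would remove looks roughly like the following in plain Python, standing in for reading a date column as strings and then applying to_date in Spark:

```python
import csv
import io
from datetime import date, datetime

raw = "event,when\nsignup,2021-05-01\nlogin,2021-05-03\n"

# Without date inference, the "when" column arrives as plain strings.
rows = list(csv.DictReader(io.StringIO(raw)))

# Convert manually, the way to_date would be applied after reading in Spark.
for row in rows:
    row["when"] = datetime.strptime(row["when"], "%Y-%m-%d").date()
```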
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added; testing was also done by running read commands in the spark-shell.