[SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema. #22619

dilipbiswal · 2018-10-03T00:55:00Z

What changes were proposed in this pull request?

Currently, the CSV schema inference code inlines the logic of TypeCoercion.findTightestCommonType. This is a minor refactor that calls the common type coercion code where applicable, so that CSV inference picks up any improvement made to the base method.

Thanks to @MaxGekk for finding this while reviewing another PR.

How was this patch tested?

This is a minor refactor. Existing tests are used to verify the change.

SparkQA · 2018-10-03T04:47:15Z

Test build #96881 has finished for PR 22619 at commit d4e0bdb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2018-10-03T04:51:01Z

cc @HyukjinKwon @MaxGekk

gatorsmile · 2018-10-03T06:04:29Z

Any behavior change? Test cases?

ueshin · 2018-10-03T06:07:51Z

Maybe this is related to #22448.

dilipbiswal · 2018-10-03T06:16:53Z

@gatorsmile There should not be any behaviour change, and I think the existing test cases should suffice. Basically, we used to duplicate the code of TypeCoercion.findTightestCommonType here. Now I am just reusing the common function. This is tested in CSVInferSchemaSuite.

dilipbiswal · 2018-10-03T06:19:38Z

@ueshin

> Maybe this is related to #22448.

Yeah.. Actually @MaxGekk had pointed me to the duplicate code in one of his comments. I am trying to address it here.

HyukjinKwon · 2018-10-03T06:24:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

+          } else {
+            Some(DecimalType(range + scale, scale))
+          }
+        case (_, _) => None


case _ => None

@HyukjinKwon Thanks. Will change.
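For readers following the decimal branch in the hunk above, here is a minimal, self-contained sketch of how `range + scale` is derived when two decimal types are merged. The `SimpleDecimal` and `SimpleDouble` types and the `widenDecimals` name are hypothetical stand-ins, not Spark classes, and the fallback threshold of 38 is an assumption based on Spark's maximum decimal precision rather than something shown in this diff.

```scala
// Toy model of the decimal-merging branch; NOT the real Spark types.
object DecimalWideningSketch {
  sealed trait SimpleType
  case object SimpleDouble extends SimpleType
  final case class SimpleDecimal(precision: Int, scale: Int) extends SimpleType

  // Assumption: 38 mirrors the maximum precision supported by Spark's DecimalType.
  val MaxPrecision = 38

  // Hypothetical helper: merge two decimals into one type that can hold both.
  def widenDecimals(a: SimpleDecimal, b: SimpleDecimal): SimpleType = {
    // Keep the larger number of fractional digits.
    val scale = math.max(a.scale, b.scale)
    // Keep the larger number of integral digits (precision minus scale).
    val range = math.max(a.precision - a.scale, b.precision - b.scale)
    if (range + scale > MaxPrecision) {
      // The merged decimal would not fit, so fall back to a double.
      SimpleDouble
    } else {
      SimpleDecimal(range + scale, scale)
    }
  }
}
```

For example, merging `SimpleDecimal(10, 2)` with `SimpleDecimal(5, 4)` needs 8 integral digits and 4 fractional digits, so the sketch returns `SimpleDecimal(12, 4)`.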

HyukjinKwon · 2018-10-03T06:35:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

-        Some(DecimalType(range + scale, scale))
+  def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
+    TypeCoercion.findTightestCommonType(t1, t2).orElse {
+      (t1, t2) match {


Can we pull this out into a private val like before, and leave a comment that this pattern matching is CSV specific? That will reduce the diff and make the review easier.

BTW, let's keep the comments in the original place.

@HyukjinKwon Do you have any preference or suggestion for the name of the val? findCommonTypeExtended?

not sure. maybe just findCompatibleTypeForCSV
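The structure discussed in this thread can be sketched roughly as follows: try the shared coercion rule first, and only consult the CSV-specific pattern match when it finds nothing. This is a simplified model under stated assumptions. The `Simple*` types are hypothetical stand-ins for Spark's `DataType` hierarchy, the local `findTightestCommonType` only imitates the shared method for a few types, and the string fallback is shown as one illustrative CSV-specific rule.

```scala
// Toy model of the delegation pattern; NOT the real Spark implementation.
object CompatibleTypeSketch {
  sealed trait SimpleType
  case object SimpleBoolean extends SimpleType
  case object SimpleInt extends SimpleType
  case object SimpleLong extends SimpleType
  case object SimpleDouble extends SimpleType
  case object SimpleString extends SimpleType

  // Hypothetical stand-in for the shared TypeCoercion.findTightestCommonType.
  val findTightestCommonType: (SimpleType, SimpleType) => Option[SimpleType] = {
    case (a, b) if a == b => Some(a)
    case (SimpleInt, SimpleLong) | (SimpleLong, SimpleInt) => Some(SimpleLong)
    case (SimpleInt, SimpleDouble) | (SimpleDouble, SimpleInt) => Some(SimpleDouble)
    case (SimpleLong, SimpleDouble) | (SimpleDouble, SimpleLong) => Some(SimpleDouble)
    case _ => None
  }

  // CSV-specific rules, kept separate so the shared rules are not duplicated.
  // Illustrative rule: anything merged with a string is read as a string.
  private val findCompatibleTypeForCSV: (SimpleType, SimpleType) => Option[SimpleType] = {
    case (SimpleString, _) => Some(SimpleString)
    case (_, SimpleString) => Some(SimpleString)
    case _ => None
  }

  // Shared rule first; orElse evaluates the CSV rule only if the first is None.
  def compatibleType(t1: SimpleType, t2: SimpleType): Option[SimpleType] =
    findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
}
```

In this sketch, `compatibleType(SimpleInt, SimpleLong)` is resolved by the shared rule, `compatibleType(SimpleInt, SimpleString)` falls through to the CSV-specific rule, and `compatibleType(SimpleBoolean, SimpleInt)` yields `None` because neither rule applies. Because `orElse` takes its argument by name, the CSV-specific match is not evaluated when the shared rule already succeeds.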

HyukjinKwon · 2018-10-03T06:36:31Z

Looks okay - I checked the cases one by one, but it needs another look.

HyukjinKwon · 2018-10-03T06:37:32Z

Let's just file a JIRA @dilipbiswal BTW.

dilipbiswal · 2018-10-03T06:39:21Z

@HyukjinKwon Okay.

viirya · 2018-10-03T06:39:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

-      findTightestCommonType(t1, DecimalType.forType(t2))
-
-    // Double support larger range than fixed decimal, DecimalType.Maximum should be enough
-    // in most case, also have better precision.


Some comments here were dropped in the change. Shall we keep them?

@viirya Yeah.. we should keep.. sorry.. got dropped inadvertently.

viirya · 2018-10-03T06:40:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

-        Some(DoubleType)
-      } else {
-        Some(DecimalType(range + scale, scale))
+  def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {


nit: findCompatibleType?

@viirya I kept the same name used in JsonInferSchema. Should I change that as well, or only this one?

compatibleType is also fine if it is consistent with JsonInferSchema.

HyukjinKwon · 2018-10-03T08:52:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

-
    case _ => None
  }
+


Let's get rid of the newline changes.

HyukjinKwon · 2018-10-03T08:52:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

+   * is compatible with both input data types.
+   */
+  private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
+    TypeCoercion.findTightestCommonType(t1, t2).orElse (findCompatibleTypeForCSV(t1, t2))


nit: e ( -> e(

SparkQA · 2018-10-03T12:11:59Z

Test build #96890 has finished for PR 22619 at commit ad69a1b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-03T23:36:27Z

Test build #96897 has finished for PR 22619 at commit 9e656a8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2018-10-05T05:04:40Z

@HyukjinKwon Does this look okay now ?

HyukjinKwon · 2018-10-05T05:08:39Z

Yup. Let me leave this open for a few more days just in case.

dilipbiswal · 2018-10-05T05:09:43Z

@HyukjinKwon Sure :-)

HyukjinKwon · 2018-10-06T06:49:27Z

Merged to master.

…Type while inferring CSV schema.

## What changes were proposed in this pull request?

Currently, the CSV schema inference code inlines `TypeCoercion.findTightestCommonType`. This is a minor refactor to make use of the common type coercion code when applicable. This way we can take advantage of any improvement to the base method.

Thanks to MaxGekk for finding this while reviewing another PR.

## How was this patch tested?

This is a minor refactor. Existing tests are used to verify the change.

Closes apache#22619 from dilipbiswal/csv_minor.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

Make use of TypeCoercion.findTightestCommonType while inferring CSV s…

d4e0bdb

…chema

HyukjinKwon reviewed Oct 3, 2018

viirya reviewed Oct 3, 2018

Code review

ad69a1b

dilipbiswal changed the title ~~[SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema.~~ [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema. Oct 3, 2018

HyukjinKwon approved these changes Oct 3, 2018

code review

9e656a8

asfgit closed this in f2f4e7a Oct 6, 2018

6 participants
