@dilipbiswal
Contributor

What changes were proposed in this pull request?

Currently, the CSV schema inference code inlines TypeCoercion.findTightestCommonType. This is a minor refactor that calls the common type coercion code where applicable, so that CSV inference benefits from any improvement to the base method.

Thanks to @MaxGekk for finding this while reviewing another PR.

How was this patch tested?

This is a minor refactor. Existing tests are used to verify the change.
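
For readers skimming the thread, the shape of the refactor can be sketched without Spark on the classpath. This is only an illustration under stated assumptions: the `Toy*` types are hypothetical stand-ins for Spark's `DataType` hierarchy, the bodies of both helpers are invented, and only the names `findTightestCommonType`, `findCompatibleTypeForCSV`, and `compatibleType` are taken from this conversation.

```scala
// Hypothetical stand-ins for Spark's DataType hierarchy (not the real classes).
sealed trait ToyType
case object ToyInt extends ToyType
case object ToyLong extends ToyType
case object ToyDouble extends ToyType
case object ToyString extends ToyType

object InferSketch {
  // Stand-in for the shared helper: resolves only the generic cases.
  // The real TypeCoercion.findTightestCommonType covers far more.
  def findTightestCommonType(a: ToyType, b: ToyType): Option[ToyType] = (a, b) match {
    case (x, y) if x == y => Some(x)
    case (ToyInt, ToyLong) | (ToyLong, ToyInt) => Some(ToyLong)
    case _ => None
  }

  // Assumed CSV-specific fallback: anything mixed with a string widens to string.
  private def findCompatibleTypeForCSV(a: ToyType, b: ToyType): Option[ToyType] = (a, b) match {
    case (ToyString, _) | (_, ToyString) => Some(ToyString)
    case _ => None
  }

  // Try the shared helper first; consult the format-specific rules only on a miss.
  def compatibleType(a: ToyType, b: ToyType): Option[ToyType] =
    findTightestCommonType(a, b).orElse(findCompatibleTypeForCSV(a, b))
}
```

Because `orElse` takes its argument by name, the fallback is evaluated only when the shared helper returns `None`, so the format-specific rules can never override a result from the common code.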

@SparkQA

SparkQA commented Oct 3, 2018

Test build #96881 has finished for PR 22619 at commit d4e0bdb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

cc @HyukjinKwon @MaxGekk

@gatorsmile
Member

Any behavior change? Test cases?

@ueshin
Member

ueshin commented Oct 3, 2018

Maybe this is related to #22448.

@dilipbiswal
Contributor Author

dilipbiswal commented Oct 3, 2018

@gatorsmile There should not be any behaviour change, so I think the existing test cases should suffice. We used to duplicate the code of TypeCoercion.findTightestCommonType here; this change just reuses the common function. It is tested in CSVInferSchemaSuite.

@dilipbiswal
Contributor Author

dilipbiswal commented Oct 3, 2018

@ueshin

> Maybe this is related to #22448.

Yes. @MaxGekk pointed out the duplicated code in one of his comments, and I am trying to address it here.

} else {
Some(DecimalType(range + scale, scale))
}
case (_, _) => None
Member

case _ => None

Contributor Author

@HyukjinKwon Thanks. Will change.

Some(DecimalType(range + scale, scale))
def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
TypeCoercion.findTightestCommonType(t1, t2).orElse {
(t1, t2) match {
Member

Can we leave this out as a private val, as it was before, and add a comment that this pattern matching is CSV-specific? That will reduce the diff and make the review easier.

Member

BTW, let's keep the comments in the original place.

Contributor Author

@HyukjinKwon Do you have any preference or suggestion for the name of the val? findCommonTypeExtended?

Member

Not sure. Maybe just findCompatibleTypeForCSV.

@HyukjinKwon
Member

Looks okay. I checked the cases one by one, but it needs another look.

@HyukjinKwon
Member

Let's just file a JIRA @dilipbiswal BTW.

@dilipbiswal
Contributor Author

@HyukjinKwon Okay.

findTightestCommonType(t1, DecimalType.forType(t2))

// Double support larger range than fixed decimal, DecimalType.Maximum should be enough
// in most case, also have better precision.
Member

Some comments here are ignored in the change. Shall we keep them?

Contributor Author

@viirya Yes, we should keep them. Sorry, they got dropped inadvertently.
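
The diff fragments quoted in this thread show a decimal branch that returns either `DoubleType` or `DecimalType(range + scale, scale)`. The sketch below is one plausible reading of that branch, not a copy of the Spark source. It assumes that `scale` is the larger of the two scales, that `range` is the larger count of integer digits, and that the fallback to double happens when the combined precision exceeds 38. `Dec`, `Dbl`, `MaxPrecision`, and `widen` are hypothetical names introduced for illustration.

```scala
// Toy model of a fixed-point type; Spark's real class is DecimalType.
sealed trait NumType
case class Dec(precision: Int, scale: Int) extends NumType
case object Dbl extends NumType

object DecimalWiden {
  // Assumed precision limit for the illustration.
  val MaxPrecision = 38

  def widen(a: Dec, b: Dec): NumType = {
    // Keep enough fractional digits for both inputs.
    val scale = math.max(a.scale, b.scale)
    // Keep enough integer digits for both inputs.
    val range = math.max(a.precision - a.scale, b.precision - b.scale)
    // If the merged decimal would not fit, fall back to a floating-point type.
    if (range + scale > MaxPrecision) Dbl else Dec(range + scale, scale)
  }
}
```

For example, under these assumptions merging `Dec(5, 2)` with `Dec(7, 4)` needs 3 integer digits and 4 fractional digits, so the result is `Dec(7, 4)`, while merging `Dec(38, 0)` with `Dec(38, 10)` would need 48 digits and falls back to `Dbl`.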

Some(DoubleType)
} else {
Some(DecimalType(range + scale, scale))
def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
Member

nit: findCompatibleType?

Contributor Author

@viirya I kept the same name that is used in JsonInferSchema. Should I change that as well, or only this one?

Member

compatibleType is also fine if it is consistent with JsonInferSchema.

@dilipbiswal changed the title from "[SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema." to "[SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema." on Oct 3, 2018

case _ => None
}

Member

Let's get rid of the newline changes.

* is compatible with both input data types.
*/
private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
TypeCoercion.findTightestCommonType(t1, t2).orElse (findCompatibleTypeForCSV(t1, t2))
Member

nit: orElse ( -> orElse(

@SparkQA

SparkQA commented Oct 3, 2018

Test build #96890 has finished for PR 22619 at commit ad69a1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 3, 2018

Test build #96897 has finished for PR 22619 at commit 9e656a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@HyukjinKwon Does this look okay now?

@HyukjinKwon
Member

Yup. Let me leave this open for a few more days, just in case.

@dilipbiswal
Contributor Author

@HyukjinKwon Sure :-)

@HyukjinKwon
Member

Merged to master.

@asfgit closed this in f2f4e7a on Oct 6, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…Type while inferring CSV schema.

Closes apache#22619 from dilipbiswal/csv_minor.

Authored-by: Dilip Biswal
Signed-off-by: hyukjinkwon