
Conversation

@Udbhav30
Contributor

@Udbhav30 Udbhav30 commented Aug 9, 2019

What changes were proposed in this pull request?

When using INSERT OVERWRITE DIRECTORY with STORED AS file_format, the output files are not compressed.
This PR converts the write to a data source write when the file format is convertible, to make it consistent with the CTAS behavior that was fixed in this PR.

Why are the changes needed?

To make the behavior consistent with CTAS when using STORED AS file_format.

Does this PR introduce any user-facing change?

Yes. After this fix, STORED AS file_format is converted to a data source write when the file format is convertible.

Before: (screenshot)

After: (screenshot)

How was this patch tested?

A new test case was added.

@maropu
Member

maropu commented Aug 10, 2019

Can you add tests?

@maropu maropu changed the title [SPARK-28659] Use data source if convertible in insert overwrite dire… [SPARK-28659] Use data source if convertible in insert overwrite directory Aug 10, 2019
@maropu maropu changed the title [SPARK-28659] Use data source if convertible in insert overwrite directory [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory Aug 10, 2019
@Udbhav30
Contributor Author

Can you add tests?

Yes, I will add tests and update the PR.

@Udbhav30
Contributor Author

@maropu added the test case!

@maropu
Member

maropu commented Aug 10, 2019

ok to test

@maropu
Member

maropu commented Aug 10, 2019

Thanks!

@SparkQA

SparkQA commented Aug 10, 2019

Test build #108912 has finished for PR 25398 at commit 36913da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 10, 2019

Test build #108920 has finished for PR 25398 at commit 68db457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(ctx.LOCAL != null, storage, Some(DDLUtils.HIVE_PROVIDER))
val fileFormat = extractFileFormat(fileStorage.serde)
(ctx.LOCAL != null, storage, Some(fileFormat))
}
Member

@maropu maropu Aug 11, 2019

Are you sure this is correct? It seems a valid value is Some(DDLUtils.HIVE_PROVIDER) or None for the third parameter.

Contributor Author

In the case of Parquet and ORC we can use the respective file format instead of Hive. In the CTAS case we also convert to a data source: https://github.com/viirya/spark-1/blob/839a6ce1732fa37b5f8ec9afa2d51730fc6ca691/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L188
This makes the behavior consistent with that.

sql(
s"""
|INSERT OVERWRITE LOCAL DIRECTORY '$path'
|STORED AS orc
Member

Why did you delete this?

Contributor Author

If we use STORED AS and the file format is ORC or Parquet, the write is converted to the data source flow.

Contributor

I believe the test case name should be modified correspondingly and a new test case for (orc/parquet) should be added.

Contributor Author

@advancedxy I have updated the test case name; test cases for ORC and Parquet were already added. As they are converted to a data source, the data in the directory is compressed.

}

private def extractFileFormat(serde: Option[String]): String = {
if (serde.toString.contains("parquet")) {
Contributor

Although None.toString is "None", this looks like a hack to me.

How about:

serde.map { x =>
  val lowerCaseSerde = x.toLowerCase(Locale.ROOT)
  if (lowerCaseSerde.contains("parquet")) "parquet"
  else if (lowerCaseSerde.contains("orc")) "orc"
  else DDLUtils.HIVE_PROVIDER
}.getOrElse(DDLUtils.HIVE_PROVIDER)
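As a self-contained sketch of that mapping (with `DDLUtils.HIVE_PROVIDER` stood in by a local constant, since the real one lives inside Spark):

```scala
import java.util.Locale

// Stand-in for DDLUtils.HIVE_PROVIDER; the real constant is defined in Spark.
val HiveProvider: String = "hive"

// Map a Hive serde class name to a data source provider when possible,
// falling back to the Hive provider for unknown or absent serdes.
def extractFileFormat(serde: Option[String]): String =
  serde.map { s =>
    val lowerCaseSerde = s.toLowerCase(Locale.ROOT)
    if (lowerCaseSerde.contains("parquet")) "parquet"
    else if (lowerCaseSerde.contains("orc")) "orc"
    else HiveProvider
  }.getOrElse(HiveProvider)

// A Parquet serde class name maps to the "parquet" provider.
println(extractFileFormat(
  Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")))
// No serde at all falls back to the Hive provider.
println(extractFileFormat(None))
```

Because the serde is matched case-insensitively via `Locale.ROOT`, both `ParquetHiveSerDe` and an all-lowercase class name resolve to the same provider.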

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108968 has finished for PR 25398 at commit b432b39.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108972 has finished for PR 25398 at commit 72d6dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Contributor Author

cc @HyukjinKwon

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110698 has finished for PR 25398 at commit 72d6dd4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Copy link
Contributor Author

The failed test doesn't look related to this PR, @HyukjinKwon.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110857 has finished for PR 25398 at commit 72d6dd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Contributor Author

retest this please

@maropu
Member

maropu commented Sep 18, 2019

retest this please

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110880 has finished for PR 25398 at commit 72d6dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Oct 11, 2019

If you only target to fix Hive ser/de to respect compression, why don't you set Hive compression properly?

@Udbhav30
Contributor Author

Udbhav30 commented Oct 11, 2019

If you only target to fix Hive ser/de to respect compression, why don't you set Hive compression properly?

Yes, compression can be achieved by setting the Hive ser/de or USING file_format, but as I mentioned, this PR is more about making the behavior consistent with CTAS and using a data source if it is convertible. Let me know if you have any suggestions :)

Member

@HyukjinKwon HyukjinKwon left a comment

Alright, if we want to do the conversion, it should have a configuration like spark.sql.hive.convertMetastoreCtas which respects spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc.

cc @viirya, @dongjoon-hyun, @cloud-fan
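A minimal sketch of the gating this review suggests, written as a pure predicate; the function and parameter names are illustrative stand-ins, not Spark's actual API (Spark's real checks live around `RelationConversions` in `HiveStrategies`):

```scala
// Hypothetical sketch: the conversion only fires when the CTAS-style
// conversion flag is on AND the per-format conversion flag is on.
def shouldConvert(
    format: String,
    convertMetastoreCtas: Boolean,    // e.g. spark.sql.hive.convertMetastoreCtas
    convertMetastoreParquet: Boolean, // e.g. spark.sql.hive.convertMetastoreParquet
    convertMetastoreOrc: Boolean      // e.g. spark.sql.hive.convertMetastoreOrc
): Boolean =
  convertMetastoreCtas && (format match {
    case "parquet" => convertMetastoreParquet
    case "orc"     => convertMetastoreOrc
    case _         => false // other Hive formats keep the Hive write path
  })

println(shouldConvert("parquet", true, true, true))  // true
println(shouldConvert("orc", true, true, false))     // false
```

The design point being made: users who disabled the per-format conversion confs should not suddenly get data source writes for INSERT OVERWRITE DIRECTORY either.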

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jun 7, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 7, 2020
@github-actions github-actions bot closed this Jun 8, 2020
