
Conversation

@Udbhav30
Contributor

@Udbhav30 Udbhav30 commented Aug 9, 2019

What changes were proposed in this pull request?

When using INSERT OVERWRITE DIRECTORY with STORED AS file_format, the output files are not compressed.
This PR converts the write to a data source write when the file format is convertible, to make it consistent with the CTAS behavior that was fixed in this PR.

Why are the changes needed?

To make the behavior consistent with CTAS when using STORED AS file_format.

Does this PR introduce any user-facing change?

Yes. After this fix, STORED AS file_format is converted to a data source write when the file format is convertible.

Before: (screenshot)

After: (screenshot)

How was this patch tested?

A new test case was added.

@maropu
Member

maropu commented Aug 10, 2019

Can you add tests?

@maropu maropu changed the title [SPARK-28659] Use data source if convertible in insert overwrite dire… [SPARK-28659] Use data source if convertible in insert overwrite directory Aug 10, 2019
@maropu maropu changed the title [SPARK-28659] Use data source if convertible in insert overwrite directory [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory Aug 10, 2019
@Udbhav30
Contributor Author

Can you add tests?

Yes, I will add tests and update the PR.

@Udbhav30
Contributor Author

@maropu added the test case!

@maropu
Member

maropu commented Aug 10, 2019

ok to test

@maropu
Member

maropu commented Aug 10, 2019

Thanks!

@SparkQA

SparkQA commented Aug 10, 2019

Test build #108912 has finished for PR 25398 at commit 36913da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 10, 2019

Test build #108920 has finished for PR 25398 at commit 68db457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(ctx.LOCAL != null, storage, Some(DDLUtils.HIVE_PROVIDER))
val fileFormat = extractFileFormat(fileStorage.serde)
(ctx.LOCAL != null, storage, Some(fileFormat))
}
Member

@maropu maropu Aug 11, 2019

Are you sure this is correct? It seems a valid value is Some(DDLUtils.HIVE_PROVIDER) or None for the third parameter.

Contributor Author

In the case of Parquet and ORC we can use the respective file format instead of Hive. In the CTAS case we also convert to a data source: https://github.com/viirya/spark-1/blob/839a6ce1732fa37b5f8ec9afa2d51730fc6ca691/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L188
This makes the behavior consistent with that.

sql(
s"""
|INSERT OVERWRITE LOCAL DIRECTORY '$path'
|STORED AS orc
Member

Why did you delete this?

Contributor Author

If we use STORED AS and the file format is ORC or Parquet, the write is converted to the data source flow.

Contributor

I believe the test case name should be modified correspondingly and a new test case for (orc/parquet) should be added.

Contributor Author

@advancedxy I have updated the test case name; test cases for ORC and Parquet were already added. As they are converted to a data source, the data in the directory is compressed.

}

private def extractFileFormat(serde: Option[String]): String = {
if (serde.toString.contains("parquet")) {
Contributor

Although None.toString is "None", this looks like a hack to me.

How about:

serde.map { x =>
  val lowerCaseSerde = x.toLowerCase(Locale.ROOT)
  if (lowerCaseSerde.contains("parquet")) "parquet"
  else if (lowerCaseSerde.contains("orc")) "orc"
  else DDLUtils.HIVE_PROVIDER
}.getOrElse(DDLUtils.HIVE_PROVIDER)
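As a self-contained sketch of that mapping (with `DDLUtils.HIVE_PROVIDER` stood in by a local constant, since the real one lives inside Spark):

```scala
import java.util.Locale

// Stand-in for DDLUtils.HIVE_PROVIDER; the real constant is defined in Spark.
val HiveProvider: String = "hive"

// Map a Hive serde class name to a data source provider when possible,
// falling back to the Hive provider for unknown or absent serdes.
def extractFileFormat(serde: Option[String]): String =
  serde.map { s =>
    val lowerCaseSerde = s.toLowerCase(Locale.ROOT)
    if (lowerCaseSerde.contains("parquet")) "parquet"
    else if (lowerCaseSerde.contains("orc")) "orc"
    else HiveProvider
  }.getOrElse(HiveProvider)

// A Parquet serde class name maps to the "parquet" provider.
println(extractFileFormat(
  Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")))
// No serde at all falls back to the Hive provider.
println(extractFileFormat(None))
```

Because the serde is matched case-insensitively via `Locale.ROOT`, both `ParquetHiveSerDe` and an all-lowercase class name resolve to the same provider.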

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108968 has finished for PR 25398 at commit b432b39.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108972 has finished for PR 25398 at commit 72d6dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Contributor Author

cc @HyukjinKwon

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110698 has finished for PR 25398 at commit 72d6dd4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Copy link
Contributor Author

The failed test doesn't look related to this PR, @HyukjinKwon.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110857 has finished for PR 25398 at commit 72d6dd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Contributor Author

retest this please

@maropu
Member

maropu commented Sep 18, 2019

retest this please

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110880 has finished for PR 25398 at commit 72d6dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Oct 11, 2019

If you only target to fix Hive ser/de to respect compression, why don't you set Hive compression properly?

@Udbhav30
Contributor Author

Udbhav30 commented Oct 11, 2019

If you only target to fix Hive ser/de to respect compression, why don't you set Hive compression properly?

Yes, compression can be achieved by setting the Hive ser/de or USING file_format, but as I mentioned, this PR is more about making the behavior consistent with CTAS and using a data source if it is convertible. Let me know if you have any suggestions :)

Member

@HyukjinKwon HyukjinKwon left a comment

Alright, if we want to do the conversion, it should have a configuration like spark.sql.hive.convertMetastoreCtas which respects spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc.

cc @viirya, @dongjoon-hyun, @cloud-fan
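A minimal sketch of the gating this review suggests, written as a pure predicate; the function and parameter names are illustrative stand-ins, not Spark's actual API (Spark's real checks live around `RelationConversions` in `HiveStrategies`):

```scala
// Hypothetical sketch: the conversion only fires when the CTAS-style
// conversion flag is on AND the per-format conversion flag is on.
def shouldConvert(
    format: String,
    convertMetastoreCtas: Boolean,    // e.g. spark.sql.hive.convertMetastoreCtas
    convertMetastoreParquet: Boolean, // e.g. spark.sql.hive.convertMetastoreParquet
    convertMetastoreOrc: Boolean      // e.g. spark.sql.hive.convertMetastoreOrc
): Boolean =
  convertMetastoreCtas && (format match {
    case "parquet" => convertMetastoreParquet
    case "orc"     => convertMetastoreOrc
    case _         => false // other Hive formats keep the Hive write path
  })

println(shouldConvert("parquet", true, true, true))  // true
println(shouldConvert("orc", true, true, false))     // false
```

The design point being made: users who disabled the per-format conversion confs should not suddenly get data source writes for INSERT OVERWRITE DIRECTORY either.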

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jun 7, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 7, 2020
@github-actions github-actions bot closed this Jun 8, 2020
