[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields #22378

dongjoon-hyun · 2018-09-10T05:30:53Z

What changes were proposed in this pull request?

Like the INSERT OVERWRITE DIRECTORY USING syntax, INSERT OVERWRITE DIRECTORY STORED AS should not generate files with duplicate fields, because Spark cannot read those files back.

INSERT OVERWRITE DIRECTORY USING

scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;

INSERT OVERWRITE DIRECTORY STORED AS

scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")
// It generates corrupted files
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
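
To illustrate the kind of check that rejects such a query before any file is written, here is a minimal sketch in plain Scala. It is not the code from this patch: `DuplicateColumnCheck`, `findDuplicates`, and `requireNoDuplicates` are hypothetical names, and the sketch throws `IllegalArgumentException` instead of Spark's `AnalysisException`. The `caseSensitive` flag stands in for the session's case-sensitivity setting.

```scala
import java.util.Locale

// Hypothetical helper for illustration only; not Spark's implementation.
object DuplicateColumnCheck {

  /** Returns the column names that occur more than once, sorted.
    * When the comparison is case-insensitive, names are lower-cased first,
    * so "id" and "ID" count as the same column. */
  def findDuplicates(columnNames: Seq[String], caseSensitive: Boolean): Seq[String] = {
    val normalized =
      if (caseSensitive) columnNames
      else columnNames.map(_.toLowerCase(Locale.ROOT))
    normalized
      .groupBy(identity)
      .collect { case (name, occurrences) if occurrences.size > 1 => name }
      .toSeq
      .sorted
  }

  /** Fails fast if the output schema would contain duplicate column names. */
  def requireNoDuplicates(columnNames: Seq[String], target: String, caseSensitive: Boolean): Unit = {
    val duplicates = findDuplicates(columnNames, caseSensitive)
    require(
      duplicates.isEmpty,
      s"Found duplicate column(s) when inserting into $target: " +
        duplicates.map(n => s"`$n`").mkString(", "))
  }
}
```

In the query above, the unaliased literal 'id' and the alias `id` both produce a column named `id`, so a check along these lines would fail for `Seq("id", "id")` in either mode, and for `Seq("id", "ID")` only when the comparison is case-insensitive.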

How was this patch tested?

Passes Jenkins with the newly added test cases.

SparkQA · 2018-09-10T07:05:02Z

Test build #95860 has finished for PR 22378 at commit 0242576.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-09-10T07:24:33Z

retest this please

SparkQA · 2018-09-10T08:55:31Z

Test build #95865 has finished for PR 22378 at commit 0242576.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-09-10T13:39:05Z

Retest this please.

SparkQA · 2018-09-10T17:31:50Z

Test build #95879 has finished for PR 22378 at commit 0242576.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-09-10T18:24:34Z

Retest this please.

SparkQA · 2018-09-10T21:14:15Z

Test build #95891 has finished for PR 22378 at commit 0242576.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-09-10T23:58:20Z

Could you review this, @gatorsmile, @cloud-fan, and @seancxmao?

cloud-fan · 2018-09-11T08:10:55Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala

+                  s"""
+                     |INSERT OVERWRITE $local DIRECTORY '${dir.toURI}'
+                     |STORED AS $format
+                     |SELECT 'id', 'id2' ${if (caseSensitivity) "id" else "ID"}


Are we missing a comma after 'id2'?

No. 'id2' is the value, and the token that follows it is the alias: id or ID.

cloud-fan · 2018-09-11T08:11:11Z

LGTM

dongjoon-hyun · 2018-09-11T15:55:47Z

Thank you for the review, @cloud-fan and @HyukjinKwon.

dongjoon-hyun · 2018-09-11T15:57:02Z

Merged to master/2.4.

Closes #22378 from dongjoon-hyun/SPARK-25389.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 77579aa)
Signed-off-by: Dongjoon Hyun <[email protected]>

cloud-fan reviewed Sep 11, 2018

HyukjinKwon approved these changes Sep 11, 2018

asfgit closed this in 77579aa Sep 11, 2018

dongjoon-hyun deleted the SPARK-25389 branch September 11, 2018 16:34
