# [SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields #22378
## Conversation
Test build #95860 has finished for PR 22378 at commit

retest this please

Test build #95865 has finished for PR 22378 at commit

Retest this please.

Test build #95879 has finished for PR 22378 at commit

Retest this please.

Test build #95891 has finished for PR 22378 at commit

Could you review this, @gatorsmile, @cloud-fan, and @seancxmao?
| s""" | ||
| |INSERT OVERWRITE $local DIRECTORY '${dir.toURI}' | ||
| |STORED AS $format | ||
| |SELECT 'id', 'id2' ${if (caseSensitivity) "id" else "ID"} |
Are we missing a comma after 'id2'?
`'id2'` is a value, and the token that follows is its alias: `id` or `ID`.
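Rendered, the interpolation yields one of two statements, and in both the aliased literal collides with the first column's default name (an illustration, not code from the patch):

```scala
// caseSensitivity = true  -> SELECT 'id', 'id2' id  (alias `id` clashes with column `id`)
// caseSensitivity = false -> SELECT 'id', 'id2' ID  (alias `ID` clashes case-insensitively)
```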
LGTM

Thank you for the review, @cloud-fan and @HyukjinKwon.

Merged to master/2.4.
## What changes were proposed in this pull request?
Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields because Spark cannot read those files back.
**INSERT OVERWRITE DIRECTORY USING**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
```
**INSERT OVERWRITE DIRECTORY STORED AS**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")
// It generates corrupted files
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
```
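The USING path already rejects this via Spark's duplicate-column check; extending the same guard to the STORED AS command is the essence of the fix. A minimal sketch, assuming Spark's existing `SchemaUtils.checkColumnNameDuplication` helper (the wrapper name and parameters here are illustrative, not the patch's literal code):

```scala
import org.apache.spark.sql.util.SchemaUtils

// Hypothetical wrapper showing where the guard applies; `outputColumnNames`
// and `path` stand in for the command's actual fields.
def assertNoDuplicateColumns(
    outputColumnNames: Seq[String],
    path: String,
    caseSensitive: Boolean): Unit = {
  // Throws AnalysisException("Found duplicate column(s) when inserting into <path>: ...")
  // on a clash, matching the error the USING path already produces.
  SchemaUtils.checkColumnNameDuplication(
    outputColumnNames,
    s"when inserting into $path",
    caseSensitive)
}
```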
## How was this patch tested?
Passes Jenkins with the newly added test cases.
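For illustration, a test case along these lines exercises both case-sensitivity settings; `sql`, `withSQLConf`, and `withTempDir` come from Spark's `SQLTestUtils` and `intercept` from ScalaTest, but the suite wiring here is assumed, not the patch's literal test:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.internal.SQLConf

// Assumed context: a suite mixing in org.apache.spark.sql.test.SQLTestUtils
// (provides sql, withSQLConf, withTempDir) and ScalaTest (provides intercept).
Seq(true, false).foreach { caseSensitivity =>
  withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitivity.toString) {
    withTempDir { dir =>
      val e = intercept[AnalysisException] {
        sql(
          s"""
             |INSERT OVERWRITE LOCAL DIRECTORY '${dir.toURI}'
             |STORED AS parquet
             |SELECT 'id', 'id2' ${if (caseSensitivity) "id" else "ID"}
           """.stripMargin)
      }
      assert(e.getMessage.contains("Found duplicate column(s)"))
    }
  }
}
```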
Closes #22378 from dongjoon-hyun/SPARK-25389.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 77579aa)
Signed-off-by: Dongjoon Hyun <[email protected]>