
Conversation

@TJX2014
Contributor

@TJX2014 TJX2014 commented Jun 12, 2020

What changes were proposed in this pull request?

A unit test is added.
A duplicate partition column check is added in org.apache.spark.sql.execution.datasources.PartitioningUtils#validatePartitionColumn.
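
For context, the change amounts to a fail-fast duplicate-name check at the start of validatePartitionColumn. Below is a sketch of the added call, based on the diff quoted later in this thread; the surrounding method signature is approximate rather than a verbatim copy of the Spark source.

```scala
// org.apache.spark.sql.execution.datasources.PartitioningUtils (sketch; signature approximate)
def validatePartitionColumn(
    schema: StructType,
    partitionColumns: Seq[String],
    caseSensitive: Boolean): Unit = {
  // Added by this PR: reject duplicate names passed to partitionBy(...) at write time.
  SchemaUtils.checkColumnNameDuplication(
    partitionColumns, partitionColumns.mkString(","), caseSensitive)
  // ... existing validation of the partition column types continues here ...
}
```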

Why are the changes needed?

When people write data with duplicate partition columns, loading the written data back fails with an org.apache.spark.sql.AnalysisException: Found duplicate column ... error.

Does this PR introduce any user-facing change?

Yes.
It will prevent people from using duplicate partition columns to write data.

  1. Before the PR:
     df.write.partitionBy("b", "b").csv("file:///tmp/output") appears to succeed,
     but reading the output back fails:
     spark.read.csv("file:///tmp/output").show()
     org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: b;
  2. After the PR:
     df.write.partitionBy("b", "b").csv("file:///tmp/output") fails immediately with:
     org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: b;
     (A reproduction sketch follows below.)
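
For illustration, a minimal end-to-end reproduction sketch (assuming an active SparkSession named spark and a writable local /tmp path; the exception messages are the ones quoted above):

```scala
import spark.implicits._

val df = Seq((3, 2)).toDF("a", "b")

// Before this PR: the write appears to succeed...
df.write.partitionBy("b", "b").csv("file:///tmp/output")
// ...and the problem only surfaces when reading the output back:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `b`;
spark.read.csv("file:///tmp/output").show()

// After this PR, the write itself throws:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: `b`;
```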

How was this patch tested?

Unit test.

@TJX2014 TJX2014 changed the title [SPARK-31968][CORE]Duplicate partition column check when write data [SPARK-31968][SQL]Duplicate partition column check when write data Jun 12, 2020
@TJX2014 TJX2014 changed the title [SPARK-31968][SQL]Duplicate partition column check when write data [SPARK-31968][SQL]Duplicate partition columns check when writing data Jun 12, 2020
      caseSensitive: Boolean): Unit = {

    SchemaUtils.checkColumnNameDuplication(
      partitionColumns, partitionColumns.mkString(","), caseSensitive)
Member

nit: "," -> ", "

}
}

test("SPARK-31968:duplicate partition columns check") {
Member

nit: SPARK-31968:duplicate ... -> SPARK-31968: duplicate ...

    val ds = Seq((3, 2)).toDF("a", "b")
    val e = intercept[AnalysisException](ds
      .write.mode(org.apache.spark.sql.SaveMode.Overwrite)
      .partitionBy("b", "b").csv("/tmp/111"))
Member

We could use withTempPath { f =>.

test("SPARK-31968:duplicate partition columns check") {
val ds = Seq((3, 2)).toDF("a", "b")
val e = intercept[AnalysisException](ds
.write.mode(org.apache.spark.sql.SaveMode.Overwrite)
Member

I suppose it doesn't have to be overwrite mode. Let's also format it properly within the 100-character line length limit. See also https://github.com/databricks/scala-style-guide
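
A sketch of how the test might look after addressing both review comments (using withTempPath from Spark's SQLTestUtils, dropping the explicit save mode, and keeping lines within the 100-character limit); the asserted message fragment is illustrative:

```scala
test("SPARK-31968: duplicate partition columns check") {
  withTempPath { f =>
    val df = Seq((3, 2)).toDF("a", "b")
    val e = intercept[AnalysisException] {
      df.write.partitionBy("b", "b").csv(f.getAbsolutePath)
    }
    assert(e.getMessage.contains("Found duplicate column(s)"))
  }
}
```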

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

HyukjinKwon commented Jun 12, 2020

cc @maropu and @cloud-fan.

@SparkQA

SparkQA commented Jun 12, 2020

Test build #123934 has finished for PR 28814 at commit 533166c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@maropu maropu left a comment

Looks okay except for the existing @HyukjinKwon comments.

@maropu
Member

maropu commented Jun 12, 2020

Ur, for better commit logs, could you add output differences in the PR description above, before/after this PR?

@TJX2014 TJX2014 requested a review from HyukjinKwon June 13, 2020 00:35
@TJX2014
Contributor Author

TJX2014 commented Jun 13, 2020

Ur, for better commit logs, could you add output differences in the PR description above, before/after this PR?

Done

@TJX2014 TJX2014 force-pushed the master-SPARK-31968 branch from 77b16c9 to c3994e1 Compare June 13, 2020 00:57
@SparkQA

SparkQA commented Jun 13, 2020

Test build #123948 has finished for PR 28814 at commit 77b16c9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@TJX2014 TJX2014 force-pushed the master-SPARK-31968 branch from 60d6c51 to 4f0bc9f Compare June 13, 2020 02:10
@SparkQA

SparkQA commented Jun 13, 2020

Test build #123955 has finished for PR 28814 at commit 4f0bc9f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123949 has finished for PR 28814 at commit c3994e1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123954 has finished for PR 28814 at commit 60d6c51.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 13, 2020

retest this please

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123970 has finished for PR 28814 at commit 4f0bc9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you all. Merged to master/3.0/2.4

dongjoon-hyun pushed a commit that referenced this pull request Jun 14, 2020
### What changes were proposed in this pull request?
A unit test is added.
A duplicate partition column check is added in `org.apache.spark.sql.execution.datasources.PartitioningUtils#validatePartitionColumn`.

### Why are the changes needed?
When people write data with duplicate partition columns, loading the written data back fails with an `org.apache.spark.sql.AnalysisException: Found duplicate column ...` error.

### Does this PR introduce _any_ user-facing change?
Yes.
It will prevent people from using duplicate partition columns to write data.
1. Before the PR:
It will look ok at `df.write.partitionBy("b", "b").csv("file:///tmp/output")`,
but get an exception when read:
`spark.read.csv("file:///tmp/output").show()`
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `b`;
2. After the PR:
`df.write.partitionBy("b", "b").csv("file:///tmp/output")` will trigger the exception:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: `b`;

### How was this patch tested?
Unit test.

Closes #28814 from TJX2014/master-SPARK-31968.

Authored-by: TJX2014 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a4ea599)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jun 14, 2020
@dongjoon-hyun
Member

Hi, @TJX2014 . What is your JIRA id?

@TJX2014
Contributor Author

TJX2014 commented Jun 14, 2020

Hi, @TJX2014 . What is your JIRA id?

Thanks all. https://issues.apache.org/jira/browse/SPARK-31968

@HyukjinKwon
Member

@TJX2014, can you leave one arbitrary comment in the JIRA to show your JIRA account? Committers should know your JIRA account so they can assign the ticket to you.

@dongjoon-hyun
Member

@TJX2014 . I mean your JIRA account ID. I need to assign SPARK-31968 to you. :)

@TJX2014
Contributor Author

TJX2014 commented Jun 15, 2020

@TJX2014, can you leave one arbitrary comment in the JIRA to show your JIRA account? Committers should know your JIRA account so they can assign the ticket to you.

Thanks, I am JinxinTang.

@TJX2014
Contributor Author

TJX2014 commented Jun 15, 2020

@TJX2014 . I mean your JIRA account ID. I need to assign SPARK-31968 to you. :)

Thanks, I am JinxinTang.

holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020