
Conversation

@TJX2014
Contributor

@TJX2014 TJX2014 commented Jun 12, 2020

What changes were proposed in this pull request?

A unit test is added.
A duplicate partition column check is added in org.apache.spark.sql.execution.datasources.PartitioningUtils#validatePartitionColumn.
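
For context, the change amounts to a fail-fast duplicate-name check at the start of validatePartitionColumn. Below is a sketch of the added call, based on the diff quoted later in this thread; the surrounding method signature is approximate rather than a verbatim copy of the Spark source.

```scala
// org.apache.spark.sql.execution.datasources.PartitioningUtils (sketch; signature approximate)
def validatePartitionColumn(
    schema: StructType,
    partitionColumns: Seq[String],
    caseSensitive: Boolean): Unit = {
  // Added by this PR: reject duplicate names passed to partitionBy(...) at write time.
  SchemaUtils.checkColumnNameDuplication(
    partitionColumns, partitionColumns.mkString(","), caseSensitive)
  // ... existing validation of the partition column types continues here ...
}
```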

Why are the changes needed?

When people write data with duplicate partition columns, loading the written data back fails with an org.apache.spark.sql.AnalysisException: Found duplicate column ... error.

Does this PR introduce any user-facing change?

Yes.
It will prevent people from using duplicate partition columns to write data.

  1. Before the PR:
     df.write.partitionBy("b", "b").csv("file:///tmp/output") appears to succeed,
     but reading the output back fails:
     spark.read.csv("file:///tmp/output").show()
     org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: b;
  2. After the PR:
     df.write.partitionBy("b", "b").csv("file:///tmp/output") fails immediately with:
     org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: b;
     (A reproduction sketch follows below.)
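
For illustration, a minimal end-to-end reproduction sketch (assuming an active SparkSession named spark and a writable local /tmp path; the exception messages are the ones quoted above):

```scala
import spark.implicits._

val df = Seq((3, 2)).toDF("a", "b")

// Before this PR: the write appears to succeed...
df.write.partitionBy("b", "b").csv("file:///tmp/output")
// ...and the problem only surfaces when reading the output back:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `b`;
spark.read.csv("file:///tmp/output").show()

// After this PR, the write itself throws:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: `b`;
```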

How was this patch tested?

Unit test.

@TJX2014 TJX2014 changed the title [SPARK-31968][CORE]Duplicate partition column check when write data [SPARK-31968][SQL]Duplicate partition column check when write data Jun 12, 2020
@TJX2014 TJX2014 changed the title [SPARK-31968][SQL]Duplicate partition column check when write data [SPARK-31968][SQL]Duplicate partition columns check when writing data Jun 12, 2020
      caseSensitive: Boolean): Unit = {

    SchemaUtils.checkColumnNameDuplication(
      partitionColumns, partitionColumns.mkString(","), caseSensitive)
Member

nit: "," -> ", "

}
}

test("SPARK-31968:duplicate partition columns check") {
Member

nit: SPARK-31968:duplicate ... -> SPARK-31968: duplicate ...

    val ds = Seq((3, 2)).toDF("a", "b")
    val e = intercept[AnalysisException](ds
      .write.mode(org.apache.spark.sql.SaveMode.Overwrite)
      .partitionBy("b", "b").csv("/tmp/111"))
Member

We could use withTempPath { f =>.

test("SPARK-31968:duplicate partition columns check") {
val ds = Seq((3, 2)).toDF("a", "b")
val e = intercept[AnalysisException](ds
.write.mode(org.apache.spark.sql.SaveMode.Overwrite)
Member

I suppose it doesn't have to be overwrite mode. Let's also format it properly within the 100-character line length limit. See also https://github.com/databricks/scala-style-guide
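
A sketch of how the test might look after addressing both review comments (using withTempPath from Spark's SQLTestUtils, dropping the explicit save mode, and keeping lines within the 100-character limit); the asserted message fragment is illustrative:

```scala
test("SPARK-31968: duplicate partition columns check") {
  withTempPath { f =>
    val df = Seq((3, 2)).toDF("a", "b")
    val e = intercept[AnalysisException] {
      df.write.partitionBy("b", "b").csv(f.getAbsolutePath)
    }
    assert(e.getMessage.contains("Found duplicate column(s)"))
  }
}
```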

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

HyukjinKwon commented Jun 12, 2020

cc @maropu and @cloud-fan.

@SparkQA

SparkQA commented Jun 12, 2020

Test build #123934 has finished for PR 28814 at commit 533166c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@maropu maropu left a comment

Looks okay except for the existing @HyukjinKwon comments.

@maropu
Member

maropu commented Jun 12, 2020

Ur, for better commit logs, could you add output differences in the PR description above, before/after this PR?

@TJX2014 TJX2014 requested a review from HyukjinKwon June 13, 2020 00:35
@TJX2014
Contributor Author

TJX2014 commented Jun 13, 2020

Ur, for better commit logs, could you add output differences in the PR description above, before/after this PR?

Done

@TJX2014 TJX2014 force-pushed the master-SPARK-31968 branch from 77b16c9 to c3994e1 Compare June 13, 2020 00:57
@SparkQA

SparkQA commented Jun 13, 2020

Test build #123948 has finished for PR 28814 at commit 77b16c9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@TJX2014 TJX2014 force-pushed the master-SPARK-31968 branch from 60d6c51 to 4f0bc9f Compare June 13, 2020 02:10
@SparkQA

SparkQA commented Jun 13, 2020

Test build #123955 has finished for PR 28814 at commit 4f0bc9f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123949 has finished for PR 28814 at commit c3994e1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123954 has finished for PR 28814 at commit 60d6c51.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 13, 2020

retest this please

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123970 has finished for PR 28814 at commit 4f0bc9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you all. Merged to master/3.0/2.4

dongjoon-hyun pushed a commit that referenced this pull request Jun 14, 2020
### What changes were proposed in this pull request?
A unit test is added.
A duplicate partition column check is added in `org.apache.spark.sql.execution.datasources.PartitioningUtils#validatePartitionColumn`.

### Why are the changes needed?
When people write data with duplicate partition columns, loading the written data back fails with an `org.apache.spark.sql.AnalysisException: Found duplicate column ...` error.

### Does this PR introduce _any_ user-facing change?
Yes.
It will prevent people from using duplicate partition columns to write data.
1. Before the PR:
It will look ok at `df.write.partitionBy("b", "b").csv("file:///tmp/output")`,
but get an exception when read:
`spark.read.csv("file:///tmp/output").show()`
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `b`;
2. After the PR:
`df.write.partitionBy("b", "b").csv("file:///tmp/output")` will trigger the exception:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: `b`;

### How was this patch tested?
Unit test.

Closes #28814 from TJX2014/master-SPARK-31968.

Authored-by: TJX2014 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a4ea599)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jun 14, 2020
@dongjoon-hyun
Member

Hi, @TJX2014 . What is your JIRA id?

@TJX2014
Contributor Author

TJX2014 commented Jun 14, 2020

Hi, @TJX2014 . What is your JIRA id?

Thanks all. https://issues.apache.org/jira/browse/SPARK-31968

@HyukjinKwon
Member

@TJX2014, can you leave one arbitrary comment in the JIRA to show your JIRA account? Committers should know your JIRA account so they can assign the ticket to you.

@dongjoon-hyun
Member

@TJX2014 . I mean your JIRA account ID. I need to assign SPARK-31968 to you. :)

@TJX2014
Contributor Author

TJX2014 commented Jun 15, 2020

@TJX2014, can you leave one arbitrary comment in the JIRA to show your JIRA account? Committers should know your JIRA account so they can assign the ticket to you.

Thanks, I am JinxinTang.

@TJX2014
Contributor Author

TJX2014 commented Jun 15, 2020

@TJX2014 . I mean your JIRA account ID. I need to assign SPARK-31968 to you. :)

Thanks, I am JinxinTang.

holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020