[SPARK-31968][SQL] Duplicate partition columns check when writing data #28814
Conversation
      caseSensitive: Boolean): Unit = {
    SchemaUtils.checkColumnNameDuplication(
      partitionColumns, partitionColumns.mkString(","), caseSensitive)
nit: "," -> ", "
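For context, a minimal excerpt sketch of the same call with the suggested separator applied (names are taken from the diff context above; this is not a standalone program):

```scala
// With ", " the reported column list reads "b, b" rather than "b,b".
SchemaUtils.checkColumnNameDuplication(
  partitionColumns, partitionColumns.mkString(", "), caseSensitive)
```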
      }
    }

    test("SPARK-31968:duplicate partition columns check") {
nit: SPARK-31968:duplicate ... -> SPARK-31968: duplicate ...
      val ds = Seq((3, 2)).toDF("a", "b")
      val e = intercept[AnalysisException](ds
        .write.mode(org.apache.spark.sql.SaveMode.Overwrite)
        .partitionBy("b", "b").csv("/tmp/111"))
We could use `withTempPath { f =>`.
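A hedged sketch of the test rewritten with `withTempPath` in place of the hard-coded `/tmp/111` (assuming the suite mixes in `SQLTestUtils`, which provides that helper; the asserted message follows the error described in this PR):

```scala
test("SPARK-31968: duplicate partition columns check") {
  withTempPath { f =>
    val ds = Seq((3, 2)).toDF("a", "b")
    val e = intercept[AnalysisException](ds
      .write.mode(org.apache.spark.sql.SaveMode.Overwrite)
      .partitionBy("b", "b").csv(f.getCanonicalPath))
    assert(e.getMessage.contains("Found duplicate column(s)"))
  }
}
```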
    test("SPARK-31968:duplicate partition columns check") {
      val ds = Seq((3, 2)).toDF("a", "b")
      val e = intercept[AnalysisException](ds
        .write.mode(org.apache.spark.sql.SaveMode.Overwrite)
I suppose it doesn't have to be overwrite mode. Let's also make it inlined properly within the 100-character line length limit. See also https://github.com/databricks/scala-style-guide
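Combining both review suggestions (no explicit save mode, `withTempPath`, and lines kept under 100 characters), the test might end up roughly like this sketch:

```scala
test("SPARK-31968: duplicate partition columns check") {
  withTempPath { f =>
    val ds = Seq((3, 2)).toDF("a", "b")
    val e = intercept[AnalysisException] {
      ds.write.partitionBy("b", "b").csv(f.getCanonicalPath)
    }
    assert(e.getMessage.contains("Found duplicate column(s) b, b"))
  }
}
```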
ok to test

cc @maropu and @cloud-fan.

Test build #123934 has finished for PR 28814 at commit
maropu left a comment
Looks okay except for the existing @HyukjinKwon comments.
Ur, for better commit logs, could you add the before/after output differences in the PR description above?

Done
Force-pushed from 77b16c9 to c3994e1

Test build #123948 has finished for PR 28814 at commit

Force-pushed from 60d6c51 to 4f0bc9f

Test build #123955 has finished for PR 28814 at commit

Test build #123949 has finished for PR 28814 at commit

Test build #123954 has finished for PR 28814 at commit

retest this please

Test build #123970 has finished for PR 28814 at commit
dongjoon-hyun left a comment
+1, LGTM. Thank you all. Merged to master/3.0/2.4
### What changes were proposed in this pull request?
A duplicate partition column check is added in `org.apache.spark.sql.execution.datasources.PartitioningUtils#validatePartitionColumn`, along with a unit test.
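A minimal sketch of where the new check sits; only the `SchemaUtils.checkColumnNameDuplication` call is taken from this PR's diff, and the rest of the method body is elided:

```scala
def validatePartitionColumn(
    schema: StructType,
    partitionColumns: Seq[String],
    caseSensitive: Boolean): Unit = {
  // Added by this PR: reject duplicate partition column names up front
  // (e.g. partitionBy("b", "b")) before any files are written.
  SchemaUtils.checkColumnNameDuplication(
    partitionColumns, partitionColumns.mkString(", "), caseSensitive)

  // ... existing per-column validation against the schema (elided) ...
}
```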
### Why are the changes needed?
When people write data with duplicate partition columns, the write succeeds but a later read of the written data fails with `org.apache.spark.sql.AnalysisException: Found duplicate column ...`.
### Does this PR introduce _any_ user-facing change?
Yes.
It will prevent people from using duplicate partition columns to write data.
1. Before the PR:
`df.write.partitionBy("b", "b").csv("file:///tmp/output")` appears to succeed, but reading the output back fails:
`spark.read.csv("file:///tmp/output").show()`
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `b`;
2. After the PR:
`df.write.partitionBy("b", "b").csv("file:///tmp/output")` fails immediately with:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: `b`;
(a reproduction sketch follows below)
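A minimal reproduction sketch of the before/after behavior, as if run in spark-shell where `spark.implicits._` is in scope (the path and column names are illustrative):

```scala
import org.apache.spark.sql.AnalysisException

val df = Seq((3, 2)).toDF("a", "b")

// Before this PR the write appears to succeed and only the later read fails with
// "Found duplicate column(s) in the partition schema: `b`".
// After this PR the write itself fails immediately.
try {
  df.write.partitionBy("b", "b").csv("file:///tmp/output")
} catch {
  case e: AnalysisException =>
    println(e.getMessage) // expected after this PR: Found duplicate column(s) b, b: `b`
}
```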
### How was this patch tested?
Unit test.
Closes #28814 from TJX2014/master-SPARK-31968.
Authored-by: TJX2014 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a4ea599)
Signed-off-by: Dongjoon Hyun <[email protected]>
Hi, @TJX2014 . What is your JIRA id?

Thanks all. [jira link](https://issues.apache.org/jira/browse/SPARK-31968)

@TJX2014, can you leave one arbitrary comment in the JIRA to show your JIRA account? Committers should know your JIRA account so they can assign the ticket to you.

@TJX2014 . I mean your JIRA account ID. I need to assign SPARK-31968 to you. :)

Thanks, I am JinxinTang.