
Conversation

@MaxGekk (Member) commented Nov 17, 2022

What changes were proposed in this pull request?

In the PR, I propose to assign the proper name COLUMN_ALREADY_EXISTS to the legacy error class _LEGACY_ERROR_TEMP_1233, and modify the test suites to use checkError(), which checks the error class name, context, etc. This PR also improves the error message.

Why are the changes needed?

A proper name improves the user experience with Spark SQL.

Does this PR introduce any user-facing change?

Yes, the PR changes a user-facing error message.

How was this patch tested?

By running the modified test suites:

$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt -Phive-2.3 "testOnly *HiveSQLInsertTestSuite"

@MaxGekk MaxGekk changed the title [WIP][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS [WIP][SPARK-41206][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS Nov 20, 2022
@MaxGekk MaxGekk changed the title [WIP][SPARK-41206][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS [SPARK-41206][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS Nov 20, 2022
@MaxGekk MaxGekk marked this pull request as ready for review November 20, 2022 13:01
@MaxGekk MaxGekk requested a review from cloud-fan November 20, 2022 16:41
@MaxGekk (Member, Author) commented Nov 20, 2022

@srielau @LuciferYang @panbingkun @itholic @cloud-fan Could you review this PR, please?

@srielau (Contributor) commented Nov 20, 2022

I have some doubts about COLUMN_ALREADY_EXISTS when a column is duplicated within a new list.
That is, it makes a lot of sense for ALTER TABLE ADD COLUMN.
But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS?
How about AS T(c1, c1)?

I think we would want DUPLICATE_COLUMN or COLUMN_DUPLICATED for these cases, and we should also include the table name.

@MaxGekk (Member, Author) commented Nov 20, 2022

> But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS?
> How about AS T(c1, c1)?

@srielau I assumed that we will provide a query context which should point to the problematic part.

@srielau (Contributor) commented Nov 20, 2022

> But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS?
> How about AS T(c1, c1)?
>
> @srielau I assumed that we will provide a query context which should point to the problematic part.

Sure, but will every tool look at it? By that token, we don't need most of the payload for various errors.
Either way, that is not the main point. The main question is whether we should have a distinct error message for a duplicate identifier in "constructors". We apparently allow duplicate attribute names in structs (?).

Should this say MAP_KEY_ALREADY_EXISTS?
spark-sql> select map('a', 5, 'a', 6);
Duplicate map key a was found

I just checked CTE and table alias. Neither enforces unique names, which is curious.

So I suppose the question boils down to CREATE TABLE and CREATE VIEW.

@MaxGekk (Member, Author) commented Nov 21, 2022

> The main question is whether we should have a distinct error message for a duplicate identifier in "constructors".

I don't think this is a significant issue. A column might already exist in a constructor, or in the partition spec itself.

> Should this say MAP_KEY_ALREADY_EXISTS?

Yep, we can say that the key already exists in the provided map.

If we introduce one more error class like DUPLICATED_COLUMN, this could just raise additional questions about what the difference is.

I would follow the existing convention, name the error class *_ALREADY_EXISTS, and refactor later if needed.

@MaxGekk (Member, Author) commented Nov 22, 2022

Merging to master. Thank you, @srielau and @cloud-fan, for the review.

@MaxGekk MaxGekk closed this in a80899f Nov 22, 2022
checkError(
exception = e,
errorClass = "COLUMN_ALREADY_EXISTS",
parameters = Map("columnName" -> "`column1`"))
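For context, a checkError-style assertion compares the structured fields of the error (its class name and message parameters) rather than the rendered message string. Spark's actual helper is the Scala method `SparkFunSuite.checkError` shown above; the following is only a hypothetical, language-agnostic sketch in Python of what such an assertion verifies, with the dict layout (`errorClass`, `messageParameters`) chosen here to mirror its parameters:

```python
# Hypothetical sketch of a checkError-style assertion (not Spark's code).
# It asserts on the structured error fields, not on the full message text,
# so tests stay stable when the message wording is improved.

def check_error(exception, error_class, parameters):
    """Assert that the caught error carries the expected class and parameters."""
    assert exception["errorClass"] == error_class, exception["errorClass"]
    assert exception["messageParameters"] == parameters, exception["messageParameters"]

# A caught duplicate-column error, represented as a plain dict:
e = {
    "errorClass": "COLUMN_ALREADY_EXISTS",
    "messageParameters": {"columnName": "`column1`"},
}
check_error(e, "COLUMN_ALREADY_EXISTS", {"columnName": "`column1`"})
```

Because only the class name and parameters are compared, such a test passes regardless of how the template text of COLUMN_ALREADY_EXISTS is later reworded.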
@HyukjinKwon (Member) commented Nov 23, 2022

Several tests fixed here seem flaky because of the iteration order of the map, e.g.:

- SPARK-8072: Better Exception for Duplicate Columns *** FAILED *** (42 milliseconds)
  Map("columnName" -> "`column3`") did not equal Map("columnName" -> "`column1`") (SparkFunSuite.scala:317)
  Analysis:
  JavaCollectionWrappers$JMapWrapper(columnName: `column3` -> `column1`)
  org.scalatest.exceptions.TestFailedException:
  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
  at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
  at org.apache.spark.SparkFunSuite.checkError(SparkFunSuite.scala:317)
  at org.apache.spark.sql.DataFrameSuite.$anonfun$new$368(DataFrameSuite.scala:1781)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)

See https://github.com/apache/spark/actions/runs/3525051044/jobs/5911287739 and https://github.com/apache/spark/actions/runs/3526328003, where the failure happens under a different JDK or Scala version.
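The flakiness described here can be reproduced in miniature outside Spark. This is a hypothetical Python sketch (not Spark's Scala code, where the function names are invented for illustration): collecting duplicates into a hash-based set and then reporting an arbitrary element of it is order-dependent, whereas reporting the first duplicate in schema order is deterministic:

```python
# Hypothetical sketch of why the reported duplicate column can vary
# between runs, and a deterministic alternative.

def first_duplicate_hash_order(columns):
    # Collect all duplicates into a set, then pick an arbitrary element.
    # Which element comes out depends on hash iteration order -- the
    # analogue of Spark's behavior varying across JDK/Scala versions.
    seen, dups = set(), set()
    for c in columns:
        if c in seen:
            dups.add(c)
        seen.add(c)
    return next(iter(dups))  # iteration order is not guaranteed

def first_duplicate_deterministic(columns):
    # Deterministic alternative: report the first duplicate in the
    # order the columns appear in the schema.
    seen = set()
    for c in columns:
        if c in seen:
            return c
        seen.add(c)
    return None
```

With the deterministic version, a test asserting `columnName -> column1` cannot suddenly see `column3` just because the environment changed.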

Let me check this

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…o `COLUMN_ALREADY_EXISTS`

### What changes were proposed in this pull request?
In the PR, I propose to assign the proper name `COLUMN_ALREADY_EXISTS` to the legacy error class `_LEGACY_ERROR_TEMP_1233`, and modify the test suites to use `checkError()`, which checks the error class name, context, etc. This PR also improves the error message.

### Why are the changes needed?
A proper name improves the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes, the PR changes a user-facing error message.

### How was this patch tested?
By running the modified test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt -Phive-2.3 "testOnly *HiveSQLInsertTestSuite"
```

Closes apache#38685 from MaxGekk/columns-already-exist.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022