
Conversation

@MaxGekk (Member) commented Nov 17, 2022

What changes were proposed in this pull request?

In the PR, I propose to assign the proper name COLUMN_ALREADY_EXISTS to the legacy error class _LEGACY_ERROR_TEMP_1233, and modify the test suites to use checkError(), which checks the error class name, context, etc. This PR also improves the error message.

Why are the changes needed?

A proper name improves the user experience with Spark SQL.

Does this PR introduce any user-facing change?

Yes, the PR changes a user-facing error message.

How was this patch tested?

By running the modified test suites:

$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt -Phive-2.3 "testOnly *HiveSQLInsertTestSuite"

@MaxGekk MaxGekk changed the title [WIP][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS [WIP][SPARK-41206][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS Nov 20, 2022
@MaxGekk MaxGekk changed the title [WIP][SPARK-41206][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS [SPARK-41206][SQL] Rename the error class _LEGACY_ERROR_TEMP_1233 to COLUMN_ALREADY_EXISTS Nov 20, 2022
@MaxGekk MaxGekk marked this pull request as ready for review November 20, 2022 13:01
@MaxGekk MaxGekk requested a review from cloud-fan November 20, 2022 16:41
@MaxGekk (Member, Author) commented Nov 20, 2022

@srielau @LuciferYang @panbingkun @itholic @cloud-fan Could you review this PR, please?

@srielau (Contributor) commented Nov 20, 2022

I have some doubts about COLUMN_ALREADY_EXISTS when a column is duplicated within a new list.
That is, it makes a lot of sense for ALTER TABLE ADD COLUMN.
But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS?
How about AS T(c1, c1)?

I think we would want DUPLICATE_COLUMN or COLUMN_DUPLICATED for these cases, and we should also include the table name.

@MaxGekk (Member, Author) commented Nov 20, 2022

> But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS?
> How about AS T(c1, c1)?

@srielau I assumed that we will provide a query context which should point to the problematic part.

@srielau (Contributor) commented Nov 20, 2022

> But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS?
> How about AS T(c1, c1)?
>
> @srielau I assumed that we will provide a query context which should point to the problematic part.

Sure, but will every tool look at it? By that token, we don't need most of the payload for various errors.
Either way, that is not the main point. The main question is whether we should have a distinct error message for a duplicate identifier in "constructors". We apparently allow duplicate attribute names in structs (?).

Should this say MAP_KEY_ALREADY_EXISTS?
spark-sql> select map('a', 5, 'a', 6);
Duplicate map key a was found

I just checked CTE and table alias. Neither enforces unique names, which is curious.

So I suppose the question boils down to CREATE TABLE and CREATE VIEW.

@MaxGekk (Member, Author) commented Nov 21, 2022

> The main question is whether we should have a distinct error message for a duplicate identifier in "constructors".

I don't think this is a significant issue. A column might already exist in a constructor, or in the partition spec itself.

> Should this say MAP_KEY_ALREADY_EXISTS?

Yep, we can say that the key already exists in the provided map.

If we introduce one more error class like DUPLICATED_COLUMN, this could just raise additional questions about what the difference is.

I would follow the existing convention, name the error class *_ALREADY_EXISTS, and refactor later if needed.

@MaxGekk (Member, Author) commented Nov 22, 2022

Merging to master. Thank you, @srielau and @cloud-fan, for the review.

@MaxGekk MaxGekk closed this in a80899f Nov 22, 2022
checkError(
exception = e,
errorClass = "COLUMN_ALREADY_EXISTS",
parameters = Map("columnName" -> "`column1`"))
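For context, a checkError-style assertion compares the structured fields of the error (its class name and message parameters) rather than the rendered message string. Spark's actual helper is the Scala method `SparkFunSuite.checkError` shown above; the following is only a hypothetical, language-agnostic sketch in Python of what such an assertion verifies, with the dict layout (`errorClass`, `messageParameters`) chosen here to mirror its parameters:

```python
# Hypothetical sketch of a checkError-style assertion (not Spark's code).
# It asserts on the structured error fields, not on the full message text,
# so tests stay stable when the message wording is improved.

def check_error(exception, error_class, parameters):
    """Assert that the caught error carries the expected class and parameters."""
    assert exception["errorClass"] == error_class, exception["errorClass"]
    assert exception["messageParameters"] == parameters, exception["messageParameters"]

# A caught duplicate-column error, represented as a plain dict:
e = {
    "errorClass": "COLUMN_ALREADY_EXISTS",
    "messageParameters": {"columnName": "`column1`"},
}
check_error(e, "COLUMN_ALREADY_EXISTS", {"columnName": "`column1`"})
```

Because only the class name and parameters are compared, such a test passes regardless of how the template text of COLUMN_ALREADY_EXISTS is later reworded.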
@HyukjinKwon (Member) commented Nov 23, 2022

Several tests fixed here seem flaky because of the iteration order of the map, e.g.:

- SPARK-8072: Better Exception for Duplicate Columns *** FAILED *** (42 milliseconds)
  Map("columnName" -> "`column3`") did not equal Map("columnName" -> "`column1`") (SparkFunSuite.scala:317)
  Analysis:
  JavaCollectionWrappers$JMapWrapper(columnName: `column3` -> `column1`)
  org.scalatest.exceptions.TestFailedException:
  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
  at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
  at org.apache.spark.SparkFunSuite.checkError(SparkFunSuite.scala:317)
  at org.apache.spark.sql.DataFrameSuite.$anonfun$new$368(DataFrameSuite.scala:1781)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)

See https://github.com/apache/spark/actions/runs/3525051044/jobs/5911287739 and https://github.com/apache/spark/actions/runs/3526328003, where the failure happens under a different JDK or Scala version.
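The flakiness described here can be reproduced in miniature outside Spark. This is a hypothetical Python sketch (not Spark's Scala code, where the function names are invented for illustration): collecting duplicates into a hash-based set and then reporting an arbitrary element of it is order-dependent, whereas reporting the first duplicate in schema order is deterministic:

```python
# Hypothetical sketch of why the reported duplicate column can vary
# between runs, and a deterministic alternative.

def first_duplicate_hash_order(columns):
    # Collect all duplicates into a set, then pick an arbitrary element.
    # Which element comes out depends on hash iteration order -- the
    # analogue of Spark's behavior varying across JDK/Scala versions.
    seen, dups = set(), set()
    for c in columns:
        if c in seen:
            dups.add(c)
        seen.add(c)
    return next(iter(dups))  # iteration order is not guaranteed

def first_duplicate_deterministic(columns):
    # Deterministic alternative: report the first duplicate in the
    # order the columns appear in the schema.
    seen = set()
    for c in columns:
        if c in seen:
            return c
        seen.add(c)
    return None
```

With the deterministic version, a test asserting `columnName -> column1` cannot suddenly see `column3` just because the environment changed.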

Let me check this

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…o `COLUMN_ALREADY_EXISTS`

### What changes were proposed in this pull request?
In the PR, I propose to assign the proper name `COLUMN_ALREADY_EXISTS` to the legacy error class `_LEGACY_ERROR_TEMP_1233`, and modify the test suites to use `checkError()`, which checks the error class name, context, etc. This PR also improves the error message.

### Why are the changes needed?
A proper name improves the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes, the PR changes a user-facing error message.

### How was this patch tested?
By running the modified test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt -Phive-2.3 "testOnly *HiveSQLInsertTestSuite"
```

Closes apache#38685 from MaxGekk/columns-already-exist.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022