[SPARK-37517][SQL] Keep consistent order of columns with user specify for v1 table #34780

Peng-Lei · 2021-12-02T11:35:55Z

What changes were proposed in this pull request?

Keep the column order as specified by the user instead of moving partition columns to the end.
Modify the partitionSchema and dataSchema implementations accordingly.
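
For illustration only, here is a minimal sketch of the difference between the positional and the name-based approach. It uses a plain case class instead of Spark's StructType/StructField; `Field`, `ColumnOrderSketch`, and the method names are hypothetical stand-ins, not Spark API. Exact string equality on column names is also a simplification of how real column resolution would work.

```scala
// Hypothetical stand-in for StructField; not Spark's real class.
final case class Field(name: String, dataType: String)

object ColumnOrderSketch {
  // Positional approach: assumes partition columns are the trailing fields.
  def partitionSchemaByPosition(schema: Seq[Field], partCols: Seq[String]): Seq[Field] =
    schema.takeRight(partCols.length)

  def dataSchemaByPosition(schema: Seq[Field], partCols: Seq[String]): Seq[Field] =
    schema.dropRight(partCols.length)

  // Name-based approach: works wherever the partition columns sit in the schema.
  def partitionSchemaByName(schema: Seq[Field], partCols: Seq[String]): Seq[Field] =
    partCols.flatMap(p => schema.find(_.name == p))

  def dataSchemaByName(schema: Seq[Field], partCols: Seq[String]): Seq[Field] =
    schema.filterNot(f => partCols.contains(f.name))
}
```

With a schema of (a, b, c) partitioned by a and kept in user order, the positional version would pick c as the partition column, while the name-based version picks a and leaves (b, c) as the data schema.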

Why are the changes needed?

Discussed at #34719.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a test case.

Peng-Lei · 2021-12-02T11:46:39Z

sql/core/src/test/scala/org/apache/spark/sql/ShowCreateTableSuite.scala

I'm a little confused that the table is created successfully even though column a is nullable. It seems to me that partition columns should not be nullable. @cloud-fan

SparkQA · 2021-12-02T12:25:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50330/

SparkQA · 2021-12-02T13:10:32Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50330/

SparkQA · 2021-12-02T13:18:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50333/

SparkQA · 2021-12-02T13:39:45Z

Test build #145855 has finished for PR 34780 at commit 200cd7f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-12-02T14:14:33Z

To understand this issue better, today Spark reorders the user-specified schema in CREATE TABLE and always puts partition columns at the end?

SparkQA · 2021-12-02T14:18:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50333/

SparkQA · 2021-12-02T14:35:56Z

Test build #145858 has finished for PR 34780 at commit be72b42.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

LuciferYang · 2021-12-03T02:49:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

  def partitionSchema: StructType = {
-    val partitionFields = schema.takeRight(partitionColumnNames.length)
+    val partitionFields = partitionColumnNames.map { partCol =>
+      schema.find(_.name == partCol).get


Is this safe? Could None.get throw an exception here?

Is this consistent with the result of

partitionColumnNames.flatMap { partCol => schema.find(_.name == partCol) }

?
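
A small sketch of the two variants being compared, under the assumption that a partition column name might be absent from the schema. `Col` and `LookupSketch` are hypothetical names used only for this example; they are not part of Spark.

```scala
// Hypothetical stand-in for StructField; not Spark's real class.
final case class Col(name: String)

object LookupSketch {
  // map + get: throws NoSuchElementException if a partition column
  // is missing from the schema.
  def strict(schema: Seq[Col], partCols: Seq[String]): Seq[Col] =
    partCols.map(p => schema.find(_.name == p).get)

  // flatMap + find: silently skips partition columns that are
  // missing from the schema.
  def lenient(schema: Seq[Col], partCols: Seq[String]): Seq[Col] =
    partCols.flatMap(p => schema.find(_.name == p))
}
```

The two return the same result whenever every partition column is present in the schema. They differ only in the missing-column case: the first fails fast, the second returns fewer fields than there are partition column names. Which behavior is preferable depends on whether a missing column should be treated as an invariant violation.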

LuciferYang · 2021-12-03T02:58:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

   */
  def dataSchema: StructType = {
-    val dataFields = schema.dropRight(partitionColumnNames.length)
+    val dataFields = schema.filterNot { i =>


The name i is easily mistaken for an index. Should we rename this variable?

OK, I will change it. Thank you.

Peng-Lei · 2021-12-16T12:23:16Z

To understand this issue better, today Spark reorders the user-specified schema in CREATE TABLE and always puts partition columns at the end?

@cloud-fan
I looked into it and found that Spark does reorder the user-specified schema in CREATE TABLE. The reorder logic lives in an analyzer rule, which applies to both data source tables and Hive serde tables. In particular, for CTAS with a FileFormat provider, HadoopFsRelation holds the data schema and the partition schema separately, and its schema is data schema + partition schema - overlapped columns. So even if I remove the reorder logic from the analyzer rule, the resulting schema is still data schema + partition schema - overlapped columns. The same applies to Hive serde tables: when we get the table information from HiveCatalog, the schema is reordered to put the partition columns at the end. Am I wrong?

cloud-fan · 2021-12-17T12:32:33Z

I don't think we need to be limited by the underlying data source/hive metastore. We can always add an extra project to keep the original user-specified column order.
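
As a rough model of the "extra project" idea, the sketch below uses plain Scala collections rather than Spark's logical plans. It assumes the underlying relation exposes data columns first and partition columns last, and computes the index mapping that restores the user-specified order. `ProjectionSketch` and its methods are hypothetical and only illustrate the reordering, not how Spark would implement it.

```scala
object ProjectionSketch {
  // Order in which the underlying relation is assumed to expose columns:
  // data columns first, then partition columns.
  def storageOrder(userCols: Seq[String], partCols: Seq[String]): Seq[String] =
    userCols.filterNot(c => partCols.contains(c)) ++ partCols

  // The "extra project": for each user-specified column, the index to
  // read from a row laid out in storage order.
  def projection(userCols: Seq[String], partCols: Seq[String]): Seq[Int] = {
    val stored = storageOrder(userCols, partCols)
    userCols.map(c => stored.indexOf(c))
  }

  // Apply the projection to one row that is laid out in storage order.
  def project[A](row: Seq[A], indices: Seq[Int]): Seq[A] =
    indices.map(i => row(i))
}
```

For user columns (a, b, c) partitioned by a, the storage order in this model is (b, c, a), and the projection reads the stored row back in the order (a, b, c).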

github-actions · 2022-03-28T00:16:52Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Dec 2, 2021

Peng-Lei commented Dec 2, 2021

add draft

be72b42

Peng-Lei force-pushed the v1-column-order branch from 200cd7f to be72b42 Compare December 2, 2021 11:54

LuciferYang reviewed Dec 3, 2021

Peng-Lei mentioned this pull request Jan 5, 2022

[SPARK-37381][SQL] Unify v1 and v2 SHOW CREATE TABLE tests #34719

Closed

github-actions bot added the Stale label Mar 28, 2022

github-actions bot closed this Mar 29, 2022
