Skip to content

Conversation

@Peng-Lei
Copy link
Contributor

@Peng-Lei Peng-Lei commented Dec 2, 2021

What changes were proposed in this pull request?

  1. keep columns order with user specified instead of put partition columns at last.
  2. Modify the partitionSchema and dataSchema implementation.

Why are the changes needed?

discuss at #34719.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add test case.

@github-actions github-actions bot added the SQL label Dec 2, 2021
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused that the table will be created success although column a is nullable. It seems to me that partition columns should not be nullable. @cloud-fan

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50330/

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50330/

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50333/

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Test build #145855 has finished for PR 34780 at commit 200cd7f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

To understand this issue better, today Spark reorders the user-specified schema in CREATE TABLE and always puts partition columns at the end?

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50333/

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Test build #145858 has finished for PR 34780 at commit be72b42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def partitionSchema: StructType = {
val partitionFields = schema.takeRight(partitionColumnNames.length)
val partitionFields = partitionColumnNames.map { partCol =>
schema.find(_.name == partCol).get
Copy link
Contributor

@LuciferYang LuciferYang Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe? Is there any Exception of None.get here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this consistent with the result of

partitionColumnNames.flatMap { partCol =>
  schema.find(_.name == partCol)
}

?

*/
def dataSchema: StructType = {
val dataFields = schema.dropRight(partitionColumnNames.length)
val dataFields = schema.filterNot { i =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i is easy to associate with index. Should we change this variable name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I will change it. Thank you.

@Peng-Lei
Copy link
Contributor Author

To understand this issue better, today Spark reorders the user-specified schema in CREATE TABLE and always puts partition columns at the end?

@cloud-fan
I try to learn about it. I found that Spark reorders the user-specified schema in CREATE TABLE. Because the reorder logic in a analyzer, which works with both data source tables and hive serde tables. In particular, CTAS, if provider is FileFormat. The HadoopFsRelation have data schema and partition schema individually. The schema of HadoopFsRelation is data schema + partition schema - overlapped, So although I remove the reorder logic in a analyzer rule. The schema also is data schema + partition schema - overlapped. It is same to hive serde tables. when we get information from HiveCatalog, we will reorder the schema to put the partition column at end. Am I wrong?

@cloud-fan
Copy link
Contributor

I don't think we need to be limited by the underlying data source/hive metastore. We can always add an extra project to keep the original user-specified column order.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 28, 2022
@github-actions github-actions bot closed this Mar 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants