@viirya
Member

@viirya viirya commented Oct 12, 2020

What changes were proposed in this pull request?

In Spark 2.3.0 and earlier, the Hive CTAS command writes data through the data source writer when the table is convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, this optimization was dropped by mistake in the PR for SPARK-22977. Since then, the Hive CTAS command has only used Hive SerDe to write data.

This patch restores the optimization for the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses the data source to write data.
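
For readers unfamiliar with the conversion rule, the decision can be sketched roughly as follows. This is a simplified, Spark-free model, not the code in this patch: every type and method name in it is hypothetical, the two config keys are assumed to be the string forms of HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC, and both the "true" defaults and the unpartitioned-table condition are assumptions.

```scala
import java.util.Locale

// Simplified, Spark-free model of the CTAS conversion decision.
// Every name in this object is hypothetical; only the two config keys are Spark's.
object CtasConversionSketch {
  sealed trait WritePath
  case object HiveSerdeWrite extends WritePath   // stands in for CreateHiveTableAsSelectCommand
  case object DataSourceWrite extends WritePath  // stands in for OptimizedCreateHiveTableAsSelectCommand

  final case class TableInfo(serde: Option[String], partitionColumns: Seq[String] = Nil)

  // Assumption: a missing key is treated as enabled.
  private def enabled(conf: Map[String, String], key: String): Boolean =
    conf.getOrElse(key, "true").toBoolean

  // A table counts as convertible when its serde is Parquet or ORC
  // and the matching conversion config is switched on.
  def isConvertible(table: TableInfo, conf: Map[String, String]): Boolean = {
    val serde = table.serde.getOrElse("").toLowerCase(Locale.ROOT)
    (serde.contains("parquet") && enabled(conf, "spark.sql.hive.convertMetastoreParquet")) ||
    (serde.contains("orc") && enabled(conf, "spark.sql.hive.convertMetastoreOrc"))
  }

  // Assumption: only unpartitioned tables take the data source path.
  def choose(table: TableInfo, conf: Map[String, String]): WritePath =
    if (table.partitionColumns.isEmpty && isConvertible(table, conf)) DataSourceWrite
    else HiveSerdeWrite
}
```

Under these assumptions, a Parquet-backed CTAS target takes the data source path unless the Parquet conversion config is set to false, while a text-format table always falls back to Hive SerDe.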

This is to backport #22514 to branch-2.4.

Why are the changes needed?

This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS.

Does this PR introduce any user-facing change?

Yes. Users can use the config to have CTAS write with the built-in data source writer instead of Hive SerDe.

How was this patch tested?

Unit tests.

@viirya
Member Author

viirya commented Oct 12, 2020

cc @dongjoon-hyun

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34304/

@dongjoon-hyun
Member

Thank you, @viirya .

@dongjoon-hyun
Member

cc @anuragmantri

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34304/

@viirya
Member Author

viirya commented Oct 12, 2020

cc @cloud-fan

@SparkQA

SparkQA commented Oct 12, 2020

Test build #129698 has finished for PR 30017 at commit e3ffaaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • trait CreateHiveTableAsSelectBase extends DataWritingCommand
      • case class CreateHiveTableAsSelectCommand(
      • case class OptimizedCreateHiveTableAsSelectCommand(

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @viirya and all.
Merged to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Oct 12, 2020
…it is convertible

### What changes were proposed in this pull request?

In Spark 2.3.0 and earlier, the Hive CTAS command writes data through the data source writer when the table is convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, this optimization was dropped by mistake in the PR for [SPARK-22977](https://github.com/apache/spark/pull/20521/files#r217254430). Since then, the Hive CTAS command has only used Hive SerDe to write data.

This patch restores the optimization for the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses the data source to write data.

This is to backport #22514 to branch-2.4.

### Why are the changes needed?

This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use the config to have CTAS write with the built-in data source writer instead of Hive SerDe.

### How was this patch tested?

Unit tests.

Closes #30017 from viirya/SPARK-25271-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya
Member Author

viirya commented Oct 12, 2020

Thanks!

@viirya viirya deleted the SPARK-25271-2.4 branch December 27, 2023 18:28