[SPARK-25271][SQL][2.4] Hive ctas commands should use data source if it is convertible #30017

viirya · 2020-10-12T16:27:32Z

What changes were proposed in this pull request?

In Spark 2.3.0 and earlier versions, the Hive CTAS command is converted to use a data source to write data into the table when the table is convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, we dropped this optimization by mistake in the PR for SPARK-22977. Since then, the Hive CTAS command has used only Hive SerDe to write data.

This patch adds the optimization back to the Hive CTAS command by introducing OptimizedCreateHiveTableAsSelectCommand, which uses the data source to write data.
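To make the intended dispatch concrete, here is a minimal Python sketch of how a CTAS on a Hive table might be routed to one of the two commands. It is a simplified model, not the code in this patch: the names `ConversionFlags`, `is_convertible`, and `choose_ctas_command` are hypothetical, the substring check on the serde class name is an assumption about what "convertible" means, and the flag defaults are illustrative. The actual planner rule may check further conditions that are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConversionFlags:
    """Hypothetical stand-in for the two conversion configs named in the description."""
    convert_parquet: bool = True  # models HiveUtils.CONVERT_METASTORE_PARQUET (default is illustrative)
    convert_orc: bool = True      # models HiveUtils.CONVERT_METASTORE_ORC (default is illustrative)


def is_convertible(serde: Optional[str], flags: ConversionFlags) -> bool:
    """Assumed rule: a table is convertible when its serde is Parquet or ORC
    and the matching conversion flag is enabled."""
    name = (serde or "").lower()
    return ("parquet" in name and flags.convert_parquet) or (
        "orc" in name and flags.convert_orc
    )


def choose_ctas_command(serde: Optional[str], flags: ConversionFlags) -> str:
    """Return the name of the command that would write the table in this model."""
    if is_convertible(serde, flags):
        # Data source write path, restored by this patch.
        return "OptimizedCreateHiveTableAsSelectCommand"
    # Fallback: plain Hive SerDe write path.
    return "CreateHiveTableAsSelectCommand"
```

In this model, a Parquet-backed table with the Parquet flag enabled is routed to OptimizedCreateHiveTableAsSelectCommand, while turning the matching flag off, or using a non-convertible serde, falls back to CreateHiveTableAsSelectCommand.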

This backports #22514 to branch-2.4.

Why are the changes needed?

This bug was originally reported in 2.3.1 but was only fixed in 3.0. We should have the fix in branch-2.4 because the branch is LTS.

Does this PR introduce any user-facing change?

Yes. Users can use the config to make CTAS use the built-in data source writer instead of Hive SerDe.

How was this patch tested?

Unit tests.

viirya · 2020-10-12T16:28:14Z

cc @dongjoon-hyun

SparkQA · 2020-10-12T16:47:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34304/

dongjoon-hyun · 2020-10-12T16:49:53Z

Thank you, @viirya .

dongjoon-hyun · 2020-10-12T16:51:59Z

cc @anuragmantri

SparkQA · 2020-10-12T17:00:24Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34304/

viirya · 2020-10-12T17:05:30Z

cc @cloud-fan

SparkQA · 2020-10-12T19:38:40Z

Test build #129698 has finished for PR 30017 at commit e3ffaaf.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait CreateHiveTableAsSelectBase extends DataWritingCommand
case class CreateHiveTableAsSelectCommand(
case class OptimizedCreateHiveTableAsSelectCommand(

dongjoon-hyun

+1, LGTM. Thank you, @viirya and all.
Merged to branch-2.4.

…it is convertible

### What changes were proposed in this pull request?

In Spark 2.3.0 and previous versions, Hive CTAS command will convert to use data source to write data into the table when the table is convertible. This behavior is controlled by the configs like HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, we drop this optimization by mistake in the PR [SPARK-22977](https://github.com/apache/spark/pull/20521/files#r217254430). Since that Hive CTAS command only uses Hive Serde to write data.

This patch adds this optimization back to Hive CTAS command. This patch adds OptimizedCreateHiveTableAsSelectCommand which uses data source to write data.

This is to backport #22514 to branch-2.4.

### Why are the changes needed?

This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use the config to use built-in data source writer instead of Hive serde in CTAS.

### How was this patch tested?

Unit tests.

Closes #30017 from viirya/SPARK-25271-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

viirya · 2020-10-12T21:10:44Z

Thanks!

Hive ctas commands should use data source if it is convertible.

e3ffaaf

cloud-fan approved these changes Oct 12, 2020

dongjoon-hyun approved these changes Oct 12, 2020

dongjoon-hyun closed this Oct 12, 2020

viirya deleted the SPARK-25271-2.4 branch December 27, 2023 18:28
