-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25271][SQL][2.4] Hive ctas commands should use data source if it is convertible #30017
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Thank you, @viirya . |
|
Kubernetes integration test status success |
|
cc @cloud-fan |
|
Test build #129698 has finished for PR 30017 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @viirya and all.
Merged to branch-2.4.
…it is convertible ### What changes were proposed in this pull request? In Spark 2.3.0 and previous versions, Hive CTAS command will convert to use data source to write data into the table when the table is convertible. This behavior is controlled by the configs like HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET. In 2.3.1, we drop this optimization by mistake in the PR [SPARK-22977](https://github.com/apache/spark/pull/20521/files#r217254430). Since that Hive CTAS command only uses Hive Serde to write data. This patch adds this optimization back to Hive CTAS command. This patch adds OptimizedCreateHiveTableAsSelectCommand which uses data source to write data. This is to backport #22514 to branch-2.4. ### Why are the changes needed? This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS. ### Does this PR introduce _any_ user-facing change? Yes. Users can use the config to use built-in data source writer instead of Hive serde in CTAS. ### How was this patch tested? Unit tests. Closes #30017 from viirya/SPARK-25271-2.4. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
|
Thanks! |
What changes were proposed in this pull request?
In Spark 2.3.0 and previous versions, Hive CTAS command will convert to use data source to write data into the table when the table is convertible. This behavior is controlled by the configs like HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.
In 2.3.1, we drop this optimization by mistake in the PR SPARK-22977. Since that Hive CTAS command only uses Hive Serde to write data.
This patch adds this optimization back to Hive CTAS command. This patch adds OptimizedCreateHiveTableAsSelectCommand which uses data source to write data.
This is to backport #22514 to branch-2.4.
Why are the changes needed?
This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS.
Does this PR introduce any user-facing change?
Yes. Users can use the config to use built-in data source writer instead of Hive serde in CTAS.
How was this patch tested?
Unit tests.