[SPARK-25271][SQL] Hive ctas commands should use data source if it is convertible #22514
Conversation
Test build #96401 has finished for PR 22514 at commit

retest this please.

Test build #96410 has finished for PR 22514 at commit

retest this please.

Test build #96595 has finished for PR 22514 at commit

retest this please.

Test build #96614 has finished for PR 22514 at commit

retest this please.

Test build #96618 has finished for PR 22514 at commit

retest this please.

Test build #96625 has finished for PR 22514 at commit

Test build #96784 has finished for PR 22514 at commit

retest this please.

Test build #96792 has finished for PR 22514 at commit

Retest this please.

Test build #97352 has finished for PR 22514 at commit
```scala
 * @param tableDesc the metadata of the table to be created.
 * @param mode the data writing mode
 * @param query an optional logical plan representing data to write into the created table.
 * @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
```
This is too hacky. We should not leak Hive-specific knowledge into general logical plans.
This is because all the rules related to conversion to a data source are located in RelationConversions, so I need to set a flag on this logical plan and pass it to CreateHiveTableAsSelectCommand.
If we loosen this requirement, we can avoid the flag and let CreateHiveTableAsSelectCommand decide whether to convert to a data source or not.
Do you think it is better to put all of this Hive CTAS conversion logic into CreateHiveTableAsSelectCommand?
I don't have a clear idea now, but CreateTable is a general logical plan for CREATE TABLE; we may even make it public in data source/catalog APIs in the future, so we should not put Hive-specific concepts here.
@cloud-fan, is this a performance regression that affects users who use Hive serde tables as well?

Yes, this is a performance regression for users who run CTAS on Hive serde tables. It has been a regression since Spark 2.3.1.

@viirya can you explain the high-level idea of how to fix it? It seems hard to fix, and we should get consensus on the approach first.
@cloud-fan The high-level idea is to not expose the conversion details to general logical plans.
sounds like a clean solution. please go ahead, thanks!
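For readers following the thread: based on the PR description below (which says the patch adds OptimizedCreateHiveTableAsSelectCommand), here is a hedged sketch of how RelationConversions might rewrite a convertible Hive CTAS. The rule and command names come from this thread; the exact pattern, imports, and constructor arguments are assumptions, not the merged code:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.command.DDLUtils
import org.apache.spark.sql.execution.datasources.CreateTable
import org.apache.spark.sql.hive.execution.OptimizedCreateHiveTableAsSelectCommand

// Sketch only: inside the RelationConversions rule, a convertible Hive CTAS
// is rewritten into a command that writes through Spark's built-in data
// source path instead of the Hive serde.
object ConvertHiveCtas extends Rule[LogicalPlan] {
  // Stub for this sketch; a possible shape of the real check is discussed
  // later in the thread.
  private def isConvertible(tableDesc: CatalogTable): Boolean = true

  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    case CreateTable(tableDesc, mode, Some(query))
        if DDLUtils.isHiveTable(tableDesc) && isConvertible(tableDesc) =>
      // Constructor arguments are assumed for illustration.
      OptimizedCreateHiveTableAsSelectCommand(
        tableDesc, query, query.output.map(_.name), mode)
  }
}
```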
```scala
withTable(sourceTable, targetTable) {
  sql(s"CREATE TABLE $sourceTable (i int,m map<int, string>) ROW FORMAT DELIMITED FIELDS " +
    "TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '$'")
  sql(s"LOAD DATA LOCAL INPATH '${testData.toURI}' INTO TABLE $sourceTable")
```
can we generate the input data with a temp view? e.g. create a DataFrame with literals and register a temp view.
Ok.
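For reference, a minimal sketch of the suggested setup, with made-up view and table names — the point is just to build the input from literals rather than loading a delimited file with LOAD DATA:

```scala
import spark.implicits._

// Build the source rows from literals and expose them as a temp view.
Seq((1, Map(1 -> "a")), (2, Map(2 -> "b")))
  .toDF("i", "m")
  .createOrReplaceTempView("hive_ctas_source")

// The CTAS under test can then select from the view (target name assumed).
sql("CREATE TABLE target_table STORED AS PARQUET AS SELECT i, m FROM hive_ctas_source")
```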
```scala
val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog

// Whether this table is convertible to data source relation.
val isConvertible = metastoreCatalog.isConvertible(tableDesc)
```
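For context, a hedged sketch of what a check like isConvertible could look like, assuming it keys off the table's serde and the CONVERT_METASTORE_PARQUET / CONVERT_METASTORE_ORC configs mentioned in the PR description (the merged implementation may differ):

```scala
import java.util.Locale

import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.hive.HiveUtils
import org.apache.spark.sql.internal.SQLConf

// Sketch: a Hive table is "convertible" when its serde is Parquet or ORC and
// the corresponding conversion config is enabled. `conf` is the session's SQLConf.
def isConvertible(tableDesc: CatalogTable, conf: SQLConf): Boolean = {
  val serde = tableDesc.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
  (serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET)) ||
    (serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC))
}
```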
another idea: can we move this logic to the RelationConversions rule? e.g.

```scala
case CreateTable(tbl, mode, Some(query)) if DDLUtils.isHiveTable(tbl) && isConvertible(tbl) =>
  Union(CreateTable(tbl, mode, None), InsertIntoTable ...)
```
I feel CreateHiveTableAsSelectCommand is not very useful. It simply creates the table first and then calls InsertIntoHiveTable.run. Maybe we should just remove it and implement Hive table CTAS as Union(CreateTable, InsertIntoTable).
That is an interesting idea. Let me try it.
I gave this idea a try.
There is an issue: convertToLogicalRelation requires the HiveTableRelation to be an existing relation. That works for the InsertIntoTable case.
For CTAS, however, the relation doesn't exist yet. Even though we use a Union and CreateTable would run first, the conversion happens during the analysis stage, when the table has not been created yet.
ah makes sense, thanks for trying!
@cloud-fan I've updated the PR description. Thanks.

Synced with master.
To be safe, let's add a SQL config for it.
I see, we discussed this before. Is it better to add it here or in a follow-up?

Seems like a trivial change; let's do it in this PR.
Test build #99958 has finished for PR 22514 at commit

@cloud-fan Added a SQL config for it.
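For reference, a sketch of how such a SQL config might be declared in HiveUtils, following the pattern of the existing conversion configs. The name spark.sql.hive.convertMetastoreCtas matches the config this PR ultimately added; the doc string and default here are paraphrased assumptions:

```scala
import org.apache.spark.sql.internal.SQLConf.buildConf

// New flag gating the CTAS conversion, alongside CONVERT_METASTORE_PARQUET
// and CONVERT_METASTORE_ORC.
val CONVERT_METASTORE_CTAS = buildConf("spark.sql.hive.convertMetastoreCtas")
  .doc("When true, Spark tries to use the built-in data source writer " +
    "instead of the Hive serde in CTAS, when the target table is convertible.")
  .booleanConf
  .createWithDefault(true)
```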
retest this please
Test build #100309 has finished for PR 22514 at commit

The last commit only updates a comment; I'm merging this to master, thanks!

Test build #100330 has finished for PR 22514 at commit

Great! Thank you all!
… convertible

## What changes were proposed in this pull request?

In Spark 2.3.0 and previous versions, the Hive CTAS command converts to use a data source to write data into the table when the table is convertible. This behavior is controlled by configs like HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, we dropped this optimization by mistake in PR [SPARK-22977](https://github.com/apache/spark/pull/20521/files#r217254430). Since then, the Hive CTAS command has only used Hive Serde to write data.

This patch adds this optimization back to the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses a data source to write data.

## How was this patch tested?

Added test.

Closes apache#22514 from viirya/SPARK-25271-2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Hi, @viirya and @cloud-fan.

It sounds correct to me. As this was reported as a bug in 2.3.1, we should fix it in 2.4 too. I will create a backport PR then.

Thank you, @viirya.
…it is convertible

### What changes were proposed in this pull request?

In Spark 2.3.0 and previous versions, the Hive CTAS command converts to use a data source to write data into the table when the table is convertible. This behavior is controlled by configs like HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, we dropped this optimization by mistake in PR [SPARK-22977](https://github.com/apache/spark/pull/20521/files#r217254430). Since then, the Hive CTAS command has only used Hive Serde to write data.

This patch adds this optimization back to the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses a data source to write data. This is a backport of #22514 to branch-2.4.

### Why are the changes needed?

This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have the fix in branch-2.4 because the branch is LTS.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use the config to use the built-in data source writer instead of Hive serde in CTAS.

### How was this patch tested?

Unit tests.

Closes #30017 from viirya/SPARK-25271-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
In Spark 2.3.0 and previous versions, the Hive CTAS command converts to use a data source to write data into the table when the table is convertible. This behavior is controlled by configs like HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.
In 2.3.1, we dropped this optimization by mistake in PR SPARK-22977. Since then, the Hive CTAS command has only used Hive Serde to write data.
This patch adds this optimization back to the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses a data source to write data.
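To make the effect concrete, a hypothetical usage sketch (table and column names are made up; the conversion config names are the ones discussed in this thread):

```scala
// With the conversion configs enabled, this CTAS on a Parquet Hive table is
// written by Spark's built-in Parquet writer instead of the Hive serde.
spark.sql("SET spark.sql.hive.convertMetastoreParquet=true")
spark.sql("SET spark.sql.hive.convertMetastoreCtas=true")
spark.sql("CREATE TABLE demo_ctas STORED AS PARQUET AS SELECT 1 AS id, 'a' AS name")
```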
How was this patch tested?
Added test.