Conversation

@viirya (Member) commented Sep 21, 2018

What changes were proposed in this pull request?

In Spark 2.3.0 and previous versions, the Hive CTAS command converts the write to use a data source when the target table is convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, this optimization was dropped by mistake in the PR for SPARK-22977. Since then, the Hive CTAS command only uses Hive Serde to write data.

This patch adds the optimization back to the Hive CTAS command by introducing OptimizedCreateHiveTableAsSelectCommand, which writes data through a data source.
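To illustrate the decision described above, here is a minimal, dependency-free sketch of the convertibility check. The config keys mirror the real HiveUtils flags, but the serde matching and helper function are simplified assumptions, not Spark's actual implementation:

```python
# Simplified model of the Hive CTAS conversion decision.
# Config keys mirror HiveUtils.CONVERT_METASTORE_PARQUET / _ORC;
# the serde matching below is a rough assumption, not Spark's real code.

def is_convertible(serde: str, conf: dict) -> bool:
    """Return True if a Hive CTAS on a table with this serde should be
    rewritten to use the built-in data source writer."""
    serde = serde.lower()
    if "parquet" in serde:
        return conf.get("spark.sql.hive.convertMetastoreParquet", True)
    if "orc" in serde:
        return conf.get("spark.sql.hive.convertMetastoreOrc", True)
    # Other serdes always go through the Hive Serde writer.
    return False

conf = {"spark.sql.hive.convertMetastoreParquet": True,
        "spark.sql.hive.convertMetastoreOrc": False}
print(is_convertible("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", conf))  # True
print(is_convertible("org.apache.hadoop.hive.ql.io.orc.OrcSerde", conf))  # False
```

With both flags at their defaults (true), Parquet and ORC tables take the optimized path; any other serde falls back to Hive Serde.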

How was this patch tested?

Added test.

@SparkQA commented Sep 21, 2018

Test build #96401 has finished for PR 22514 at commit 5debc60.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 21, 2018

retest this please.

@SparkQA commented Sep 21, 2018

Test build #96410 has finished for PR 22514 at commit 5debc60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 21, 2018

cc @cloud-fan @HyukjinKwon

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96595 has finished for PR 22514 at commit 5debc60.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96614 has finished for PR 22514 at commit 5debc60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96618 has finished for PR 22514 at commit 5debc60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96625 has finished for PR 22514 at commit 5debc60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 29, 2018

Test build #96784 has finished for PR 22514 at commit 1223178.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 29, 2018

retest this please.

@SparkQA commented Sep 29, 2018

Test build #96792 has finished for PR 22514 at commit 1223178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Retest this please.

@SparkQA commented Oct 14, 2018

Test build #97352 has finished for PR 22514 at commit 1223178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param tableDesc the metadata of the table to be created.
* @param mode the data writing mode
* @param query an optional logical plan representing data to write into the created table.
* @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
Contributor:

This is too hacky. We should not leak Hive-specific knowledge into general logical plans.

@viirya (Member, Author) commented Oct 15, 2018:

This is because all rules related to the data source conversion are located in RelationConversions. So for now I need to set a flag on this logical plan and pass it to CreateHiveTableAsSelectCommand.

If we loosen this requirement, we can avoid the flag and let CreateHiveTableAsSelectCommand decide whether to convert to a data source.

@viirya (Member, Author):

Do you think it is better to put all of this Hive CTAS conversion logic into CreateHiveTableAsSelectCommand?

Contributor:

I don't have a clear idea right now, but CreateTable is a general logical plan for CREATE TABLE; we may even make it public in the data source/catalog APIs in the future, so we should not put Hive-specific concepts here.

@HyukjinKwon (Member) commented:

@cloud-fan, is this a performance regression that affects users who use Hive serde tables as well?

@cloud-fan (Contributor) commented Oct 23, 2018

Yes, this is a performance regression for users who run CTAS on Hive serde tables. It has been a regression since Spark 2.3.1.

@cloud-fan (Contributor) commented:

@viirya can you explain the high-level idea about how to fix it? It seems hard to fix and we should get a consensus on the approach first.

@viirya (Member, Author) commented Oct 24, 2018

@cloud-fan The high-level idea is not to expose conversion details to CreateTable, but to let CreateHiveTableAsSelectCommand decide whether to do the conversion. So nothing changes in the HiveAnalysis rule; we still transform CreateTable into CreateHiveTableAsSelectCommand if it is a Hive table.

In CreateHiveTableAsSelectCommand, we check whether the relation is convertible. If it is, the command performs the conversion and writes into a data source relation.
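The dispatch described here can be sketched abstractly. The class and function names below are hypothetical stand-ins for Spark's command classes, modeling only the decision between the two write paths:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for Spark's command classes. This models only the
# dispatch inside CreateHiveTableAsSelectCommand.run: convertible tables
# take the built-in data source write path, others use the Hive Serde path.

@dataclass
class TableDesc:
    name: str
    serde: str

def run_hive_ctas(table: TableDesc, convertible: bool) -> str:
    """Pick the write path for a Hive CTAS, returning a label for it."""
    if convertible:
        # Analogous to OptimizedCreateHiveTableAsSelectCommand:
        # write with the built-in data source writer.
        return f"InsertIntoHadoopFsRelation({table.name})"
    # Fall back to the Hive Serde writer.
    return f"InsertIntoHiveTable({table.name})"

print(run_hive_ctas(TableDesc("t1", "parquet"), convertible=True))
```

The point of this shape is that the analyzer rule stays unchanged; the conversion decision moves entirely inside the command's execution.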

@cloud-fan (Contributor) commented:

Sounds like a clean solution. Please go ahead, thanks!

withTable(sourceTable, targetTable) {
  sql(s"CREATE TABLE $sourceTable (i int, m map<int, string>) ROW FORMAT DELIMITED FIELDS " +
    "TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '$'")
  sql(s"LOAD DATA LOCAL INPATH '${testData.toURI}' INTO TABLE $sourceTable")
Contributor:

Can we generate the input data with a temp view? E.g., create a DataFrame with literals and register it as a temp view.

@viirya (Member, Author):

Ok.

val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog

// Whether this table is convertible to data source relation.
val isConvertible = metastoreCatalog.isConvertible(tableDesc)
Contributor:

Another idea: can we move this logic into the RelationConversions rule? E.g.:

case CreateTable(tbl, mode, Some(query)) if DDLUtils.isHiveTable(tbl) && isConvertible(tbl) =>
  Union(CreateTable(tbl, mode, None), InsertIntoTable ...)

Contributor:

I feel CreateHiveTableAsSelectCommand is not useful. It simply creates the table first and then calls InsertIntoHiveTable.run. Maybe we should just remove it and implement Hive table CTAS as Union(CreateTable, InsertIntoTable).

@viirya (Member, Author):

That is an interesting idea. Let me try it.

@viirya (Member, Author):

I gave this idea a try.

There is an issue: convertToLogicalRelation requires the HiveTableRelation to be an existing relation. That works for the InsertIntoTable case.

For CTAS, however, the relation doesn't exist yet. Although we use a Union and CreateTable would run first, the conversion happens during the analysis stage, when the table has not been created yet.
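The ordering problem above can be modeled minimally. All names here are hypothetical simplifications; the sketch only assumes that the conversion performs a catalog lookup at analysis time, before any command has executed:

```python
# Minimal model of why Union(CreateTable, InsertIntoTable) fails for CTAS:
# the conversion rule runs during analysis, before any command executes,
# so the catalog lookup for the CTAS target table fails.

catalog = {"existing_table": {"format": "parquet"}}

def convert_to_logical_relation(table_name: str) -> dict:
    """Model of convertToLogicalRelation: requires an existing table."""
    if table_name not in catalog:
        raise LookupError(f"Table {table_name} not found: not created yet at analysis time")
    return {"relation": table_name, **catalog[table_name]}

# INSERT INTO an existing table: the lookup succeeds.
print(convert_to_logical_relation("existing_table"))

# CTAS target: analysis happens before CreateTable runs, so this fails.
try:
    convert_to_logical_relation("ctas_target")
except LookupError as e:
    print(e)
```

This is why the conversion has to be deferred into the command's own execution rather than done as an analyzer rewrite.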

Contributor:

ah makes sense, thanks for trying!

@viirya (Member, Author) commented Dec 6, 2018

@cloud-fan I've updated the PR description. Thanks.

@viirya (Member, Author) commented Dec 11, 2018

Synced with master.

@cloud-fan (Contributor) commented:

To be safe, let's add a HiveUtils.CONVERT_METASTORE_CTAS with default value true in this PR. It's also good practice to have fine-grained optimization flags. I think a migration guide note is not needed here.
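With such a flag, users could opt out of the optimized path at the session level. A sketch of the intended usage follows; the key spelling is inferred from the HiveUtils.CONVERT_METASTORE_CTAS naming convention, so verify it against your Spark version:

```sql
-- Fall back to the Hive Serde writer for CTAS (default is true).
-- Key name inferred from HiveUtils.CONVERT_METASTORE_CTAS.
SET spark.sql.hive.convertMetastoreCtas = false;
```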

@viirya (Member, Author) commented Dec 11, 2018

I see, we discussed this before. Is it better to add it here or in a follow-up?

@cloud-fan (Contributor) commented:

Seems like a trivial change, let's do it in this PR.

@SparkQA commented Dec 11, 2018

Test build #99958 has finished for PR 22514 at commit ef52536.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait SingleValueExecutorMetricType extends ExecutorMetricType
  • class GBTClassifierParams(GBTParams, HasVarianceImpurity):
  • class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
  • class HasDistanceMeasure(Params):
  • class HasValidationIndicatorCol(Params):
  • class HasVarianceImpurity(Params):
  • class TreeRegressorParams(HasVarianceImpurity):
  • class GBTParams(TreeEnsembleParams, HasMaxIter, HasStepSize, HasValidationIndicatorCol):
  • class GBTRegressorParams(GBTParams, TreeRegressorParams):
  • class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
  • class ArrowCollectSerializer(Serializer):
  • class CSVInferSchema(val options: CSVOptions) extends Serializable
  • class InterpretedSafeProjection(expressions: Seq[Expression]) extends Projection
  • sealed trait DateTimeFormatter
  • class Iso8601DateTimeFormatter(
  • class LegacyDateTimeFormatter(
  • class LegacyFallbackDateTimeFormatter(
  • sealed trait DateFormatter
  • class Iso8601DateFormatter(
  • class LegacyDateFormatter(
  • class LegacyFallbackDateFormatter(
  • case class ArrowEvalPython(
  • case class BatchEvalPython(

@viirya (Member, Author) commented Dec 19, 2018

@cloud-fan Added a SQL config for it.

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Dec 19, 2018

Test build #100309 has finished for PR 22514 at commit d949436.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

The last commit only updates a comment, so I'm merging this to master, thanks!

@asfgit closed this in 5ad0360 on Dec 20, 2018
@SparkQA commented Dec 20, 2018

Test build #100330 has finished for PR 22514 at commit 839a6ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Great! Thank you all!

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
… convertible


Closes apache#22514 from viirya/SPARK-25271-2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… convertible


Closes apache#22514 from viirya/SPARK-25271-2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun (Member) commented Oct 7, 2020

Hi, @viirya and @cloud-fan.
This was originally reported as a bug affecting 2.3.1 onward. Can we have this in branch-2.4, since it's LTS?

@viirya (Member, Author) commented Oct 7, 2020

Sounds right to me. As this was reported as a bug in 2.3.1, we should fix it in 2.4 too. I will create a backport PR.

@dongjoon-hyun (Member) commented:

Thank you, @viirya.
cc @anuragmantri

dongjoon-hyun pushed a commit that referenced this pull request Oct 12, 2020
…it is convertible

### What changes were proposed in this pull request?

This is to backport #22514 to branch-2.4.

### Why are the changes needed?

This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use the config to use built-in data source writer instead of Hive serde in CTAS.

### How was this patch tested?

Unit tests.

Closes #30017 from viirya/SPARK-25271-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya deleted the SPARK-25271-2 branch on December 27, 2023