Conversation

@viirya (Member) commented Sep 21, 2018

What changes were proposed in this pull request?

In Spark 2.3.0 and previous versions, the Hive CTAS command converts the write to use a data source when the target table is convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, this optimization was dropped by mistake in the PR for SPARK-22977. Since then, the Hive CTAS command only uses Hive Serde to write data.

This patch adds the optimization back to the Hive CTAS command by introducing OptimizedCreateHiveTableAsSelectCommand, which writes data through a data source.
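To illustrate the decision described above, here is a minimal, dependency-free sketch of the convertibility check. The config keys mirror the real HiveUtils flags, but the serde matching and helper function are simplified assumptions, not Spark's actual implementation:

```python
# Simplified model of the Hive CTAS conversion decision.
# Config keys mirror HiveUtils.CONVERT_METASTORE_PARQUET / _ORC;
# the serde matching below is a rough assumption, not Spark's real code.

def is_convertible(serde: str, conf: dict) -> bool:
    """Return True if a Hive CTAS on a table with this serde should be
    rewritten to use the built-in data source writer."""
    serde = serde.lower()
    if "parquet" in serde:
        return conf.get("spark.sql.hive.convertMetastoreParquet", True)
    if "orc" in serde:
        return conf.get("spark.sql.hive.convertMetastoreOrc", True)
    # Other serdes always go through the Hive Serde writer.
    return False

conf = {"spark.sql.hive.convertMetastoreParquet": True,
        "spark.sql.hive.convertMetastoreOrc": False}
print(is_convertible("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", conf))  # True
print(is_convertible("org.apache.hadoop.hive.ql.io.orc.OrcSerde", conf))  # False
```

With both flags at their defaults (true), Parquet and ORC tables take the optimized path; any other serde falls back to Hive Serde.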

How was this patch tested?

Added test.

@SparkQA commented Sep 21, 2018

Test build #96401 has finished for PR 22514 at commit 5debc60.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 21, 2018

retest this please.

@SparkQA commented Sep 21, 2018

Test build #96410 has finished for PR 22514 at commit 5debc60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 21, 2018

cc @cloud-fan @HyukjinKwon

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96595 has finished for PR 22514 at commit 5debc60.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96614 has finished for PR 22514 at commit 5debc60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96618 has finished for PR 22514 at commit 5debc60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 26, 2018

retest this please.

@SparkQA commented Sep 26, 2018

Test build #96625 has finished for PR 22514 at commit 5debc60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 29, 2018

Test build #96784 has finished for PR 22514 at commit 1223178.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Sep 29, 2018

retest this please.

@SparkQA commented Sep 29, 2018

Test build #96792 has finished for PR 22514 at commit 1223178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Retest this please.

@SparkQA commented Oct 14, 2018

Test build #97352 has finished for PR 22514 at commit 1223178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param tableDesc the metadata of the table to be created.
* @param mode the data writing mode
* @param query an optional logical plan representing data to write into the created table.
* @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
Contributor:

This is too hacky. We should not leak Hive-specific knowledge into general logical plans.

@viirya (Member, Author) commented Oct 15, 2018:

This is because all rules related to the data source conversion are located in RelationConversions. So for now I need to set a flag on this logical plan and pass it to CreateHiveTableAsSelectCommand.

If we loosen this requirement, we can avoid the flag and let CreateHiveTableAsSelectCommand decide whether to convert to a data source.

@viirya (Member, Author):

Do you think it is better to put all of this Hive CTAS conversion logic into CreateHiveTableAsSelectCommand?

Contributor:

I don't have a clear idea right now, but CreateTable is a general logical plan for CREATE TABLE; we may even make it public in the data source/catalog APIs in the future, so we should not put Hive-specific concepts here.

@HyukjinKwon (Member) commented:

@cloud-fan, is this a performance regression that affects users who use Hive serde tables as well?

@cloud-fan (Contributor) commented Oct 23, 2018

Yes, this is a performance regression for users who run CTAS on Hive serde tables. It has been a regression since Spark 2.3.1.

@cloud-fan (Contributor) commented:

@viirya can you explain the high-level idea about how to fix it? It seems hard to fix and we should get a consensus on the approach first.

@viirya (Member, Author) commented Oct 24, 2018

@cloud-fan The high-level idea is not to expose conversion details to CreateTable, but to let CreateHiveTableAsSelectCommand decide whether to do the conversion. So nothing changes in the HiveAnalysis rule; we still transform CreateTable into CreateHiveTableAsSelectCommand if it is a Hive table.

In CreateHiveTableAsSelectCommand, we check whether the relation is convertible. If it is, the command performs the conversion and writes into a data source relation.
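The dispatch described here can be sketched abstractly. The class and function names below are hypothetical stand-ins for Spark's command classes, modeling only the decision between the two write paths:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for Spark's command classes. This models only the
# dispatch inside CreateHiveTableAsSelectCommand.run: convertible tables
# take the built-in data source write path, others use the Hive Serde path.

@dataclass
class TableDesc:
    name: str
    serde: str

def run_hive_ctas(table: TableDesc, convertible: bool) -> str:
    """Pick the write path for a Hive CTAS, returning a label for it."""
    if convertible:
        # Analogous to OptimizedCreateHiveTableAsSelectCommand:
        # write with the built-in data source writer.
        return f"InsertIntoHadoopFsRelation({table.name})"
    # Fall back to the Hive Serde writer.
    return f"InsertIntoHiveTable({table.name})"

print(run_hive_ctas(TableDesc("t1", "parquet"), convertible=True))
```

The point of this shape is that the analyzer rule stays unchanged; the conversion decision moves entirely inside the command's execution.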

@cloud-fan (Contributor) commented:

Sounds like a clean solution. Please go ahead, thanks!

withTable(sourceTable, targetTable) {
  sql(s"CREATE TABLE $sourceTable (i int, m map<int, string>) ROW FORMAT DELIMITED FIELDS " +
    "TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '$'")
  sql(s"LOAD DATA LOCAL INPATH '${testData.toURI}' INTO TABLE $sourceTable")
Contributor:

Can we generate the input data with a temp view? E.g., create a DataFrame with literals and register it as a temp view.

@viirya (Member, Author):

Ok.

val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog

// Whether this table is convertible to data source relation.
val isConvertible = metastoreCatalog.isConvertible(tableDesc)
Contributor:

Another idea: can we move this logic into the RelationConversions rule? E.g.:

case CreateTable(tbl, mode, Some(query)) if DDLUtils.isHiveTable(tbl) && isConvertible(tbl) =>
  Union(CreateTable(tbl, mode, None), InsertIntoTable ...)

Contributor:

I feel CreateHiveTableAsSelectCommand is not useful. It simply creates the table first and then calls InsertIntoHiveTable.run. Maybe we should just remove it and implement Hive table CTAS as Union(CreateTable, InsertIntoTable).

@viirya (Member, Author):

That is an interesting idea. Let me try it.

@viirya (Member, Author):

I gave this idea a try.

There is an issue: convertToLogicalRelation requires the HiveTableRelation to be an existing relation. That works for the InsertIntoTable case.

For CTAS, however, the relation doesn't exist yet. Although we use a Union and CreateTable would run first, the conversion happens during the analysis stage, when the table has not been created yet.
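The ordering problem above can be modeled minimally. All names here are hypothetical simplifications; the sketch only assumes that the conversion performs a catalog lookup at analysis time, before any command has executed:

```python
# Minimal model of why Union(CreateTable, InsertIntoTable) fails for CTAS:
# the conversion rule runs during analysis, before any command executes,
# so the catalog lookup for the CTAS target table fails.

catalog = {"existing_table": {"format": "parquet"}}

def convert_to_logical_relation(table_name: str) -> dict:
    """Model of convertToLogicalRelation: requires an existing table."""
    if table_name not in catalog:
        raise LookupError(f"Table {table_name} not found: not created yet at analysis time")
    return {"relation": table_name, **catalog[table_name]}

# INSERT INTO an existing table: the lookup succeeds.
print(convert_to_logical_relation("existing_table"))

# CTAS target: analysis happens before CreateTable runs, so this fails.
try:
    convert_to_logical_relation("ctas_target")
except LookupError as e:
    print(e)
```

This is why the conversion has to be deferred into the command's own execution rather than done as an analyzer rewrite.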

Contributor:

ah makes sense, thanks for trying!

@viirya (Member, Author) commented Dec 6, 2018

@cloud-fan I've updated the PR description. Thanks.

@viirya (Member, Author) commented Dec 11, 2018

Synced with master.

@cloud-fan (Contributor) commented:

To be safe, let's add a HiveUtils.CONVERT_METASTORE_CTAS with default value true in this PR. It's also good practice to have fine-grained optimization flags. I think a migration guide note is not needed here.
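With such a flag, users could opt out of the optimized path at the session level. A sketch of the intended usage follows; the key spelling is inferred from the HiveUtils.CONVERT_METASTORE_CTAS naming convention, so verify it against your Spark version:

```sql
-- Fall back to the Hive Serde writer for CTAS (default is true).
-- Key name inferred from HiveUtils.CONVERT_METASTORE_CTAS.
SET spark.sql.hive.convertMetastoreCtas = false;
```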

@viirya (Member, Author) commented Dec 11, 2018

I see, we discussed this before. Is it better to add it here or in a follow-up?

@cloud-fan (Contributor) commented:

Seems like a trivial change, let's do it in this PR.

@SparkQA commented Dec 11, 2018

Test build #99958 has finished for PR 22514 at commit ef52536.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait SingleValueExecutorMetricType extends ExecutorMetricType
  • class GBTClassifierParams(GBTParams, HasVarianceImpurity):
  • class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
  • class HasDistanceMeasure(Params):
  • class HasValidationIndicatorCol(Params):
  • class HasVarianceImpurity(Params):
  • class TreeRegressorParams(HasVarianceImpurity):
  • class GBTParams(TreeEnsembleParams, HasMaxIter, HasStepSize, HasValidationIndicatorCol):
  • class GBTRegressorParams(GBTParams, TreeRegressorParams):
  • class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
  • class ArrowCollectSerializer(Serializer):
  • class CSVInferSchema(val options: CSVOptions) extends Serializable
  • class InterpretedSafeProjection(expressions: Seq[Expression]) extends Projection
  • sealed trait DateTimeFormatter
  • class Iso8601DateTimeFormatter(
  • class LegacyDateTimeFormatter(
  • class LegacyFallbackDateTimeFormatter(
  • sealed trait DateFormatter
  • class Iso8601DateFormatter(
  • class LegacyDateFormatter(
  • class LegacyFallbackDateFormatter(
  • case class ArrowEvalPython(
  • case class BatchEvalPython(

@viirya (Member, Author) commented Dec 19, 2018

@cloud-fan Added a SQL config for it.

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Dec 19, 2018

Test build #100309 has finished for PR 22514 at commit d949436.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

The last commit only updates a comment, so I'm merging this to master, thanks!

@asfgit closed this in 5ad0360 on Dec 20, 2018
@SparkQA commented Dec 20, 2018

Test build #100330 has finished for PR 22514 at commit 839a6ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Great! Thank you all!

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
… convertible


Closes apache#22514 from viirya/SPARK-25271-2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… convertible


Closes apache#22514 from viirya/SPARK-25271-2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun (Member) commented Oct 7, 2020

Hi, @viirya and @cloud-fan.
This was originally reported as a bug affecting 2.3.1 onward. Can we have this in branch-2.4, since it's LTS?

@viirya (Member, Author) commented Oct 7, 2020

Sounds right to me. As this was reported as a bug in 2.3.1, we should fix it in 2.4 too. I will create a backport PR.

@dongjoon-hyun (Member) commented:

Thank you, @viirya.
cc @anuragmantri

dongjoon-hyun pushed a commit that referenced this pull request Oct 12, 2020
…it is convertible

### What changes were proposed in this pull request?

This is to backport #22514 to branch-2.4.

### Why are the changes needed?

This bug was originally reported in 2.3.1, but only fixed in 3.0. We should have it in branch-2.4 because the branch is LTS.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use the config to use built-in data source writer instead of Hive serde in CTAS.

### How was this patch tested?

Unit tests.

Closes #30017 from viirya/SPARK-25271-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya deleted the SPARK-25271-2 branch on December 27, 2023