@ulysses-you
Contributor

What changes were proposed in this pull request?

In CreateTableLikeCommand, we merge the source table's tblproperties with the newly specified tblproperties; the new ones take precedence.
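
As a rough illustration of the merge described above, here is a minimal sketch on plain Scala maps. `MergeTblProperties` and `merge` are hypothetical names used only for this example and are not part of Spark; the sketch assumes that properties given in the statement should win when a key appears on both sides.

```scala
object MergeTblProperties {
  // Hypothetical helper, not Spark's API. `++` keeps the right-hand value on
  // key conflicts, so entries in `specified` override entries in `source`.
  def merge(
      source: Map[String, String],
      specified: Map[String, String]): Map[String, String] =
    source ++ specified
}
```

With a source table carrying `parquet.compression` and a statement that sets `'f'='foo'`, the merged map would contain both keys.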

Why are the changes needed?

We should retain the useful tblproperties, e.g. parquet.compression. Hive also retains the tblproperties.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

@ulysses-you
Contributor Author

One point:
Hive retains only the useful properties, which it determines through a serde class annotation.
Should we simply remove properties like transient_lastDdlTime,
or follow Hive's process?

cc @cloud-fan

@SparkQA

SparkQA commented May 27, 2020

Test build #123155 has finished for PR 28647 at commit 37ba3b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(source.properties("a") == "apple")
sql("CREATE TABLE t LIKE s STORED AS parquet TBLPROPERTIES('f'='foo', 'b'='bar')")
val table = catalog.getTableMetadata(TableIdentifier("t"))
assert(table.properties.get("a") === None)
Contributor

Seems like we intentionally don't keep the table properties from the original table. @maropu @dongjoon-hyun @viirya are you OK with the behavior change proposed by this PR?

Member

Based on the description of CreateTableLikeCommand:

The CatalogTable attributes copied from the source table are storage(inputFormat, outputFormat, serde, compressed, properties), schema, provider, partitionColumnNames, bucketSpec by default.

So the description does not say that table properties are copied from the source table. I'm not sure why we didn't copy them.

I think this is OK, since copying the original table properties should not be harmful, and we already copy the storage properties.

But we should update the doc together.

Member

Hm, I have the same opinion as @viirya; I couldn't find any strong reason not to copy the properties. Rather, is this a kind of bug? Anyway, yes, we should clearly describe the behaviour in our doc...

@ulysses-you
Contributor Author

HiveDDLSuite contains this code:

    assert(targetTable.properties.filterKeys(!metastoreGeneratedProperties.contains(_)).isEmpty,
      "the table properties of source tables should not be copied in the created table")

If we adopt this behavior, we should change the test code.

@cloud-fan
Contributor

let's update the doc of CreateTableLikeCommand to make it clear that we copy table properties as well.

@ulysses-you
Contributor Author

I see the property keys enumerated in HiveDDLSuite:

    val metastoreGeneratedProperties = Seq(
      "CreateTime",
      "transient_lastDdlTime",
      "grantTime",
      "lastUpdateTime",
      "last_modified_by",
      "last_modified_time",
      "Owner:",
      "totalNumberFiles",
      "maxFileSize",
      "minFileSize"
    )

Shall we copy the tblproperties without these?
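
To make the question concrete, here is a minimal sketch of what "copy without these" could look like on a plain map. The key set is taken from the list quoted above; `MetastoreProps` and `withoutGenerated` are hypothetical names for illustration, not existing Spark code.

```scala
object MetastoreProps {
  // Keys copied from the HiveDDLSuite list quoted above.
  val generated: Set[String] = Set(
    "CreateTime", "transient_lastDdlTime", "grantTime", "lastUpdateTime",
    "last_modified_by", "last_modified_time", "Owner:",
    "totalNumberFiles", "maxFileSize", "minFileSize")

  // Hypothetical helper: drop metastore-generated keys, keep everything else.
  def withoutGenerated(props: Map[String, String]): Map[String, String] =
    props -- generated
}
```

A user-defined key such as `k1` would survive, while `transient_lastDdlTime` would be dropped.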

@maropu
Member

maropu commented May 29, 2020

CatalogTableType.EXTERNAL
}

val newProperties =
Member

Please leave a comment explaining why the metastore props are removed here. #28647 (comment)

object DDLUtils {
val HIVE_PROVIDER = "hive"

val METASTORE_GENERATED_PROPERTIES: Set[String] = Set(
Member

Just to check; does the latest hive metastore have the same set of these props here?

Contributor Author

I searched around, but I don't think there is a definitive way to check this.

Contributor

Can we reuse HiveClientImpl.HiveStatisticsProperties? They are the hive specific properties we explicitly ignore.

Contributor Author

HiveClientImpl.HiveStatisticsProperties is already applied in HiveClientImpl.getTable(), so we don't need to remove those properties again.

Contributor

I don't feel comfortable maintaining this list here. Is it possible to remove Hive specific properties in HiveClientImpl? So that the hive related stuff remains in hive related classes.

@SparkQA

SparkQA commented May 29, 2020

Test build #123255 has finished for PR 28647 at commit b30d62c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2020

Test build #123256 has finished for PR 28647 at commit 162627a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2020

Test build #123257 has finished for PR 28647 at commit 9eacf1e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val newProperties = sourceTableDesc.tableType match {
case VIEW =>
// For view, we just use new properties
properties
Contributor Author

Keep view behavior as before. Hive also does not copy view properties.

Member

Nit: formatting.


    val newProperties = sourceTableDesc.tableType match {
      case MANAGED | EXTERNAL =>
        // Hive only retain the useful properties through serde class annotation.
        // For better compatible with Hive, we remove the metastore properties.
        sourceTableDesc.properties -- DDLUtils.METASTORE_GENERATED_PROPERTIES ++ properties

      case VIEW =>
        // For view, we just use new properties
        properties
    }
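
For readers without the Spark sources at hand, the suggestion above can be tried as a self-contained sketch. The `TableType` hierarchy below is a stand-in for Spark's `CatalogTableType`, and `CreateTableLikeSketch`, `newProperties`, and the `generatedKeys` parameter are hypothetical names introduced only for this example.

```scala
object CreateTableLikeSketch {
  // Stand-ins for Spark's CatalogTableType values, defined here only so the
  // sketch compiles without Spark on the classpath.
  sealed trait TableType
  case object Managed extends TableType
  case object External extends TableType
  case object View extends TableType

  def newProperties(
      tableType: TableType,
      sourceProps: Map[String, String],
      specified: Map[String, String],
      generatedKeys: Set[String]): Map[String, String] = tableType match {
    case Managed | External =>
      // Copy the source properties minus metastore-generated keys, then let
      // the properties given in the statement win on key conflicts.
      // `--` and `++` have equal precedence and associate to the left.
      sourceProps -- generatedKeys ++ specified
    case View =>
      // Views keep the earlier behaviour: only the specified properties.
      specified
  }
}
```

Because the removal happens before the merge, a property listed in `generatedKeys` is still kept if the statement sets it explicitly.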

@SparkQA

SparkQA commented May 29, 2020

Test build #123272 has finished for PR 28647 at commit 175d0e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"maxFileSize",
"minFileSize"
)
assert(targetTable.properties.filterKeys(!metastoreGeneratedProperties.contains(_)).isEmpty,
Contributor Author

@ulysses-you May 29, 2020

This check is removed. We also can't assert on all metastore properties here: for example, Hive adds transient_lastDdlTime whenever a table is created, so that property always exists.

Tests for some of the metastore properties are added in the new unit test.

@SparkQA

SparkQA commented May 29, 2020

Test build #123291 has finished for PR 28647 at commit ed0877e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Contributor Author

cc @cloud-fan @viirya @maropu

test("SPARK-31828: Retain table properties at CreateTableLikeCommand") {
val catalog = spark.sessionState.catalog
withTable("t1", "t2", "t3") {
sql(s"CREATE TABLE t1(c1 int) TBLPROPERTIES('k1'='v1', 'k2'='v2', 'totalNumberFiles'='meta')")
Member

Just in case, could you test all the props in METASTORE_GENERATED_PROPERTIES? It's probably better to split this into two tests, e.g. by adding a new test("SPARK-31828: Filters out Hive metastore properties in CreateTableLikeCommand") {.
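
One possible way to cover every key without writing each one by hand is to generate the TBLPROPERTIES clause from a sequence of pairs. The helper below is a hypothetical sketch for such a test, not code from this PR; it assumes keys and values contain no single quotes and does no escaping.

```scala
object TblPropertiesClause {
  // Hypothetical test helper: render key/value pairs as a TBLPROPERTIES
  // clause. No escaping is attempted, so quotes in keys or values would
  // produce an invalid statement.
  def render(props: Seq[(String, String)]): String =
    props.map { case (k, v) => s"'$k'='$v'" }.mkString("TBLPROPERTIES(", ", ", ")")
}
```

A test could then map each key in METASTORE_GENERATED_PROPERTIES to a dummy value, pass the pairs to `render`, and append the result to the CREATE TABLE statement for the source table.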

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 19, 2020
@github-actions github-actions bot closed this Oct 20, 2020
@ulysses-you
Contributor Author

I believe this is useful. Can we revisit this PR? cc @maropu @cloud-fan @viirya @dilipbiswal

@dongjoon-hyun
Member

Reopened at the request of the author (@ulysses-you).

@dongjoon-hyun
Member

@ulysses-you, please rebase your PR onto master regularly to avoid the Stale label.

@SparkQA

SparkQA commented Nov 2, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35131/

@SparkQA

SparkQA commented Nov 2, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35131/

@SparkQA

SparkQA commented Nov 2, 2020

Test build #130531 has finished for PR 28647 at commit 03966ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Contributor Author

thank you @dongjoon-hyun

@SparkQA

SparkQA commented Nov 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35135/

@SparkQA

SparkQA commented Nov 3, 2020

Test build #130535 has finished for PR 28647 at commit dc00260.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35135/

@SparkQA

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35369/

@SparkQA

SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35369/

@SparkQA

SparkQA commented Nov 9, 2020

Test build #130760 has finished for PR 28647 at commit 2d470fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Contributor Author

cc @maropu @cloud-fan @viirya, do you have time to review this? Thanks!

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35804/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35804/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131201 has finished for PR 28647 at commit 4b55575.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class NoOpMergedShuffleFileManager implements MergedShuffleFileManager
  • public class RemoteBlockPushResolver implements MergedShuffleFileManager
  • static class PushBlockStreamCallback implements StreamCallbackWithID
  • public static class AppShuffleId
  • public static class AppShufflePartitionInfo
  • trait HasMaxBlockSizeInMB extends Params
  • >>> class VectorAccumulatorParam(AccumulatorParam):
  • fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
  • class HasMaxBlockSizeInMB(Params):
  • trait SQLConfHelper
  • class Analyzer(override val catalogManager: CatalogManager)
  • case class UnresolvedTableOrView(
  • case class UnresolvedPartitionSpec(
  • case class ResolvedPartitionSpec(
  • case class ElementAt(
  • case class GetArrayItem(
  • case class GetMapValue(
  • case class Elt(
  • trait OffsetWindowFunction extends WindowFunction
  • class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logging
  • abstract class AbstractSqlParser extends ParserInterface with SQLConfHelper with Logging
  • class CatalystSqlParser extends AbstractSqlParser
  • case class AnalyzeTable(
  • case class AnalyzeColumn(
  • case class AlterTableAddPartition(
  • case class AlterTableDropPartition(
  • case class LoadData(
  • case class ShowCreateTable(child: LogicalPlan, asSerde: Boolean = false) extends Command
  • abstract class Rule[TreeType <: TreeNode[_]] extends SQLConfHelper with Logging
  • implicit class PartitionSpecsHelper(partSpecs: Seq[PartitionSpec])
  • class SparkPlanner(val session: SparkSession, val experimentalMethods: ExperimentalMethods)
  • class SparkSqlParser extends AbstractSqlParser
  • class SparkSqlAstBuilder extends AstBuilder
  • case class CoalesceShufflePartitions(session: SparkSession) extends Rule[SparkPlan]
  • class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • class FallBackFileSourceV2(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • class ResolveSQLOnFile(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class AlterTableAddPartitionExec(
  • case class AlterTableDropPartitionExec(
  • case class DropTableExec(
  • class V2SessionCatalog(catalog: SessionCatalog)
  • case class PlanDynamicPruningFilters(sparkSession: SparkSession)
  • class HDFSBackedReadStateStore(val version: Long, map: MapType)
  • trait ReadStateStore
  • trait StateStore extends ReadStateStore
  • class WrappedReadStateStore(store: StateStore) extends ReadStateStore
  • abstract class BaseStateStoreRDD[T: ClassTag, U: ClassTag](
  • class ReadStateStoreRDD[T: ClassTag, U: ClassTag](
  • case class PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan]
  • class VariableSubstitution extends SQLConfHelper
  • abstract class JdbcDialect extends Serializable with Logging
  • class ResolveHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]
  • class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan]

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131644 has finished for PR 28647 at commit c45489a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExecutorSource(
  • case class GetShufflePushMergerLocations(numMergersNeeded: Int, hostsToFilter: Set[String])
  • case class RemoveShufflePushMergerLocation(host: String) extends ToBlockManagerMaster
  • case class UnresolvedTable(
  • class SubExprEvaluationRuntime(cacheMaxEntries: Int)
  • case class ExpressionProxy(
  • case class ResultProxy(result: Any)
  • case class CurrentTimeZone() extends LeafExpression with Unevaluable
  • abstract class LikeAllBase extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant
  • case class LikeAll(child: Expression, patterns: Seq[UTF8String]) extends LikeAllBase
  • case class NotLikeAll(child: Expression, patterns: Seq[UTF8String]) extends LikeAllBase
  • case class ParseUrl(children: Seq[Expression], failOnError: Boolean = SQLConf.get.ansiEnabled)
  • implicit class MetadataColumnsHelper(metadata: Array[MetadataColumn])
  • trait PathFilterStrategy extends Serializable
  • trait StrategyBuilder
  • class PathGlobFilter(filePatten: String) extends PathFilterStrategy
  • abstract class ModifiedDateFilter extends PathFilterStrategy
  • class ModifiedBeforeFilter(thresholdTime: Long, val timeZoneId: String)
  • class ModifiedAfterFilter(thresholdTime: Long, val timeZoneId: String)

@github-actions

github-actions bot commented Mar 5, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 5, 2021
@github-actions github-actions bot closed this Mar 6, 2021

7 participants