@liancheng
Contributor

Although lazy caching for in-memory tables seems consistent with the RDD.cache() API, it's relatively confusing for users who mainly work with SQL and are not familiar with Spark internals. The CACHE TABLE t; SELECT COUNT(*) FROM t; pattern is also commonly used just to force materialization and ensure predictable performance.

This PR makes both the CACHE TABLE t [AS SELECT ...] statement and the SQLContext.cacheTable() API eager by default, and adds a new CACHE LAZY TABLE t [AS SELECT ...] syntax to provide lazy in-memory table caching.
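
To make the eager/lazy distinction concrete, here is a minimal toy model. It does not use Spark at all: ToyTable, ToyCache, and their methods are hypothetical names invented for illustration, and count() merely stands in for a SELECT COUNT(*) FROM t query.

```scala
import scala.collection.mutable

// Toy stand-in for a table; NOT a Spark class. `compute` produces the rows.
final class ToyTable(val name: String, compute: () => Seq[Int]) {
  var scans = 0 // how many times the underlying data was computed
  def scan(): Seq[Int] = { scans += 1; compute() }
}

// Toy stand-in for the in-memory table cache; NOT a Spark class.
final class ToyCache {
  private val registered = mutable.Map.empty[String, ToyTable]
  private val materialized = mutable.Map.empty[String, Seq[Int]]

  // isLazy = false mirrors CACHE TABLE t (eager, the proposed default);
  // isLazy = true mirrors CACHE LAZY TABLE t.
  def cacheTable(t: ToyTable, isLazy: Boolean = false): Unit = {
    registered(t.name) = t
    if (!isLazy) materialized(t.name) = t.scan()
  }

  def isMaterialized(name: String): Boolean = materialized.contains(name)

  // Stands in for SELECT COUNT(*) FROM t: the first query on a lazily
  // cached table pays the materialization cost.
  def count(name: String): Int =
    materialized.getOrElseUpdate(name, registered(name).scan()).size
}
```

In this sketch an eagerly cached table is materialized as soon as cacheTable returns, while a lazily cached one is only materialized by the first count() call, which is the extra step the CACHE TABLE t; SELECT COUNT(*) FROM t; pattern works around.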

Also took the chance to do some refactoring: CacheCommand and CacheTableAsSelectCommand are now merged and renamed to CacheTableCommand, since the former is strictly a special case of the latter. A new UncacheTableCommand is added for the UNCACHE TABLE t statement.
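
A rough sketch of why the merge works: with an optional plan, a plain CACHE TABLE t is just the case where no AS SELECT query is given. The field names follow the CacheTableCommand signature reported by SparkQA in this thread; everything else (the simplified LogicalPlan stand-in, the Command trait, the CommandSketch object, and the toSql helper) is assumed for illustration and is not the code in this PR.

```scala
object CommandSketch {
  // Simplified stand-in for Spark's LogicalPlan: just wraps the query text.
  final case class LogicalPlan(query: String)

  sealed trait Command

  // plan = None covers the plain CACHE TABLE t form;
  // plan = Some(...) covers CACHE TABLE t AS SELECT ...
  final case class CacheTableCommand(
      tableName: String,
      plan: Option[LogicalPlan],
      isLazy: Boolean) extends Command

  final case class UncacheTableCommand(tableName: String) extends Command

  // Hypothetical helper: renders a command back to the statement it represents.
  def toSql(cmd: Command): String = cmd match {
    case CacheTableCommand(name, plan, isLazy) =>
      val lazyKw = if (isLazy) "LAZY " else ""
      val asSelect = plan.map(p => s" AS ${p.query}").getOrElse("")
      s"CACHE ${lazyKw}TABLE $name$asSelect"
    case UncacheTableCommand(name) =>
      s"UNCACHE TABLE $name"
  }
}
```

For example, CommandSketch.toSql(CommandSketch.CacheTableCommand("t", None, isLazy = false)) renders the plain eager statement, so a separate command class for it is unnecessary.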

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have started for PR 2513 at commit b72e24e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have finished for PR 2513 at commit b72e24e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(tableName: String, logicalPlan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@SparkQA

SparkQA commented Sep 24, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20735/

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have started for PR 2513 at commit 8d2192d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have finished for PR 2513 at commit 8d2192d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(tableName: String, logicalPlan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@SparkQA

SparkQA commented Sep 24, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20753/

Contributor

This is kind of a nit, but I talked to @aarondav and we are thinking that CACHE LAZY TABLE might be a little more consistent. The reasoning is that CACHE is really the most important verb here and so should go first. This is similar to INSERT INTO TABLE vs INSERT OVERWRITE TABLE.

Contributor Author

Agreed, I wasn't very sure about the syntax either when adding this.

@liancheng
Contributor Author

Updated: replaced LAZY CACHE TABLE with CACHE LAZY TABLE, and also refactored the test suites to check in-memory column RDD materialization.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2513 at commit 39d243e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2513 at commit 39d243e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(tableName: String, logicalPlan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21201/

@liancheng
Contributor Author

Rebased onto master, with the new CACHE LAZY TABLE t syntax.

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2513 at commit fe92287.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have finished for PR 2513 at commit fe92287.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(
    • case class UncacheTableCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21292/

Contributor Author

Added the keyword LAZY and sorted all the keywords here in alphabetical order. This list was once sorted, but the ordering was broken later.

@marmbrus
Contributor

marmbrus commented Oct 6, 2014

I'm going to merge this. Feel free to clean up the minor ";" issue as part of the other parser refactoring you are doing. Thanks :)

@asfgit asfgit closed this in 34b97a0 Oct 6, 2014
Contributor Author

@marmbrus Forgot to confirm this with you: the default value of the blocking argument is true in RDD.unpersist(), so I changed the default value here to keep the semantics consistent. This also makes testing easier (I added assertions to check RDD materialization, and non-blocking unpersisting introduces some subtleties). Did you intend to use non-blocking unpersisting here?

Contributor

No, I mistakenly thought that was the default. We should match the original semantics.
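
The testing subtlety mentioned in this exchange can be illustrated with a small toy model. It is only a sketch under stated assumptions: ToyCachedData is a hypothetical class, the 50 ms sleep simulates removal work, and nothing here reflects how Spark actually implements unpersist().

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Toy stand-in for cached data; NOT a Spark class.
final class ToyCachedData {
  @volatile private var cached = true

  def isCached: Boolean = cached

  // Drops the data on a background thread. With blocking = true the call
  // only returns after removal has finished; with blocking = false it
  // returns immediately and removal completes some time later.
  def unpersist(blocking: Boolean)(implicit ec: ExecutionContext): Future[Unit] = {
    val removal = Future {
      Thread.sleep(50) // simulated removal work
      cached = false
    }
    if (blocking) Await.result(removal, 5.seconds)
    removal
  }
}
```

With blocking = true, an assertion on isCached placed right after the call is deterministic. With blocking = false, the same assertion races with the background removal unless the test first waits on the returned Future, which is the kind of subtlety that a blocking default avoids in tests.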

@liancheng liancheng deleted the eager-caching branch October 9, 2014 05:57
4 participants