-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-3645][SQL] Makes table caching eager by default and adds syntax for lazy caching #2513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
QA tests have started for PR 2513 at commit
|
|
QA tests have finished for PR 2513 at commit
|
|
Test PASSed. |
|
QA tests have started for PR 2513 at commit
|
|
QA tests have finished for PR 2513 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of a nit, but I talked to @aarondav and we are thinking that CACHE TABLE LAZY might be a little more consistent. The reasoning being that CACHE is really the most important verb here and so should go first. This is similar to INSERT INTO TABLE vs INSERT OVERWRITE TABLE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I wasn't very sure about the syntax either when add this.
8d2192d to
39d243e
Compare
|
Updated, replaced |
|
QA tests have started for PR 2513 at commit
|
|
QA tests have finished for PR 2513 at commit
|
|
Test PASSed. |
39d243e to
fe92287
Compare
|
Rebased to the master, with the new |
|
QA tests have started for PR 2513 at commit
|
|
QA tests have finished for PR 2513 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added keyword LAZY and sorted all the keywords in alphabetical order here. This list was once sorted but broken later.
|
I'm going to merge this. Feel free to clean up minor ";" issue as part of the other parser refactoring you are doing. Thanks :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marmbrus Forgot to confirm this with you: default value of the blocking argument is true in RDD.unpersist(), I changed the default value here to keep the semantics consistent. This also makes testing more easily (I added assertions to check RDD materialization, non-blocking unpersisting introduces some subtleties). Did you intend to use non-blocking unpersisting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I mistakenly though that was the default. We should match the original semantics.
Although lazy caching for in-memory table seems consistent with the
RDD.cache()API, it's relatively confusing for users who mainly work with SQL and not familiar with Spark internals. TheCACHE TABLE t; SELECT COUNT(*) FROM t;pattern is also commonly seen just to ensure predictable performance.This PR makes both the
CACHE TABLE t [AS SELECT ...]statement and theSQLContext.cacheTable()API eager by default, and adds a newCACHE LAZY TABLE t [AS SELECT ...]syntax to provide lazy in-memory table caching.Also, took the chance to make some refactoring:
CacheCommandandCacheTableAsSelectCommandare now merged and renamed toCacheTableCommandsince the former is strictly a special case of the latter. A newUncacheTableCommandis added for theUNCACHE TABLE tstatement.