[SPARK-16311][SQL] Improve metadata refresh #13989

petermaxlee wants to merge 3 commits into apache:master
Conversation
cc @rxin

cc @cloud-fan / @liancheng

Before, I tried to merge `invalidateTable` and `refreshTable`. I think maybe we can keep them separate?
 * @group action
 * @since 2.0.0
 */
def refresh(): Unit = {
It will remove the cached data. This is different from what the JIRA describes. CC @rxin
Other refresh methods also remove cached data, so I thought this was better.
This new API has different behaviors from the refreshTable API and the REFRESH TABLE SQL statement. See the following code:
spark/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala, lines 349 to 374 (at commit 02a029d)
IMO, if we are using the word refresh, we have to make them consistent.
Ah, I see - we can't unpersist.
We can unpersist, but should persist it again immediately.
Actually we can and should call unpersist, but we should also call persist()/cache() again so that the Dataset will be cached lazily again with correct data when it gets executed next time. I guess that's also what @gatorsmile meant.
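To make the pattern concrete, here is a minimal sketch, assuming an already-cached Dataset; `recache` is a hypothetical helper for illustration, not an API from this PR:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical helper: drop the stale cached data, then re-mark the
// Dataset for caching so it is lazily re-materialized with fresh data
// the next time an action runs.
def recache[T](ds: Dataset[T]): Dataset[T] = {
  ds.unpersist(blocking = true) // eagerly remove the stale cache entries
  ds.persist()                  // lazy: cache is repopulated on next execution
}
```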
The test cases are not enough to cover metadata refreshing. The current metadata cache is only used for data source tables, but we can still convert Hive tables (for example, Parquet and ORC) to data source tables, so we also need to check the behavior in those cases. Try to design more test cases for metadata refreshing, including both positive and negative cases.
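To make this concrete, here is a hedged sketch of one such positive case, assuming Hive support is enabled and Hive Parquet tables are converted to data source tables (so the metadata cache is exercised); the table name and data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE t (id INT) STORED AS PARQUET")
spark.sql("INSERT INTO t VALUES (1)")
assert(spark.table("t").count() == 1) // populates the metadata cache

// Suppose files under t's location change behind Spark's back
// (e.g. written by another session). The positive case checks that
// an explicit refresh drops the stale cached metadata:
spark.catalog.refreshTable("t")
spark.table("t").count() // should reflect the files currently on disk

// A negative case would verify that, without the refresh, the stale
// cached listing is still observed (or the read fails on missing files).
```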
What do you mean by both positive and negative cases?
For example, I try to refresh the metadata of a DataFrame that has multiple leaf nodes. (Update: just corrected the contents.)
Test build #61524 has finished for PR 13989 at commit
}

/**
 * Invalidates any metadata cached in the plan recursively.
"Refreshes" instead of "Invalidates"?
Would this work? Traverse the logical plan to find whether it references any catalog relation, and if it does, call catalog.refreshTable("...")? For example, see the sketch below.
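A hedged sketch of that traversal, assuming a recent Spark where catalog-backed leaves are `LogicalRelation` (carrying a `catalogTable`) or `HiveTableRelation`; at the time of this PR the Hive leaf was `MetastoreRelation`, so the match arms here are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Walk the plan; whenever a leaf carries catalog table metadata,
// refresh that table through the session catalog.
def refreshReferencedTables(spark: SparkSession, plan: LogicalPlan): Unit = {
  plan.foreach {
    case r: LogicalRelation if r.catalogTable.isDefined =>
      spark.catalog.refreshTable(r.catalogTable.get.identifier.quotedString)
    case h: HiveTableRelation =>
      spark.catalog.refreshTable(h.tableMeta.identifier.quotedString)
    case _ => // not catalog-backed; nothing to refresh
  }
}

// Hypothetical usage:
// refreshReferencedTables(spark, df.queryExecution.analyzed)
```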
One concern of mine is that the analyzed plan, optimized plan, and executed (physical) plan stored in QueryExecution are lazy vals, so they may already be materialized. Say we constructed a DataFrame from a table; next, we add a bunch of files into the directory where the table stores its data. The already-materialized plans can still reference the stale file listing.
In general, I think reconstructing the DataFrame/Dataset is probably the safer approach.
I think @liancheng has a good point. Why don't we take out Dataset.refresh() for now?
Alright, I will do that and submit a new pull request. Note that I think DataFrame refresh is already possible via table refresh, if a DataFrame references a table, or if some view references a DataFrame.
[SPARK-16311][SQL] Improve metadata refresh
Test build #3159 has finished for PR 13989 at commit
## What changes were proposed in this pull request?

This patch fixes the bug that the refresh command does not work on temporary views. It is based on #13989, but removes the public Dataset.refresh() API and improves test coverage. Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just creating a new QueryExecution).

## How was this patch tested?

Re-enabled a previously ignored test, and added a new Hive test suite covering the behavior of temporary views against MetastoreRelation.

Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>

Closes #14009 from rxin/SPARK-16311.
(cherry picked from commit 16a2a7d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
What changes were proposed in this pull request?
This patch implements the 3 things specified in SPARK-16311:
(1) Append a message to the FileNotFoundException saying that a workaround is to explicitly refresh the metadata.
(2) Make metadata refresh work on temporary tables/views (see the sketch after this list).
(3) Make metadata refresh work on Datasets/DataFrames, by introducing a Dataset.refresh() method.
And one additional small change:
(4) Merge invalidateTable and refreshTable.
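As a hedged illustration of (2), and of the table-level calls that (3)'s Dataset.refresh() would sit on top of, assuming a SparkSession `spark` and an illustrative path and view name:

```scala
// A temporary view over files; its file-listing metadata is cached.
spark.read.parquet("/tmp/data").createOrReplaceTempView("tmp_view")

// With this patch, both forms drop the stale cached metadata for the view:
spark.sql("REFRESH TABLE tmp_view")    // SQL statement form
spark.catalog.refreshTable("tmp_view") // Catalog API form
```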
How was this patch tested?
Created a new test suite that creates a temporary directory and then deletes a file from it to verify Spark can read the directory once refresh is called.
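A hedged sketch of the shape of such a test; the names, and the use of `Catalog.refreshByPath` (added in later Spark versions), are assumptions rather than the exact suite added here:

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val dir = Files.createTempDirectory("refresh-test").toString

// Write a few part files, then read the directory once so the file
// listing is cached.
spark.range(100).repartition(4).write.mode("overwrite").parquet(dir)
assert(spark.read.parquet(dir).count() == 100)

// Delete one part file behind Spark's back.
new File(dir).listFiles().filter(_.getName.startsWith("part-")).head.delete()

// After an explicit refresh, reading the directory succeeds again
// (with fewer rows) instead of failing on the missing file.
spark.catalog.refreshByPath(dir)
assert(spark.read.parquet(dir).count() < 100)
```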