[SPARK-16311][SQL] Improve metadata refresh #13989

petermaxlee wants to merge 3 commits into apache:master
Conversation
cc @rxin

cc @cloud-fan / @liancheng

Before, I tried to merge `invalidateTable` and `refreshTable`. I think maybe we can keep them separate?
 * @group action
 * @since 2.0.0
 */
def refresh(): Unit = {
It will remove the cached data. This is different from what the JIRA describes. CC @rxin
Other refresh methods also remove cached data, so I thought this was better.
This new API has different behaviors from the refreshTable API and the REFRESH TABLE SQL statement. See the following code:
spark/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala, lines 349 to 374 (at commit 02a029d)
IMO, if we are using the word refresh, we have to make them consistent.
Ah, I see - we can't unpersist.
We can unpersist, but should persist it again immediately.
Actually we can and should call unpersist, but we should also call persist()/cache() again so that the Dataset will be cached lazily again with correct data when it gets executed next time. I guess that's also what @gatorsmile meant.
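To make the pattern concrete, here is a minimal sketch, assuming an already-cached Dataset; `recache` is a hypothetical helper for illustration, not an API from this PR:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical helper: drop the stale cached data, then re-mark the
// Dataset for caching so it is lazily re-materialized with fresh data
// the next time an action runs.
def recache[T](ds: Dataset[T]): Dataset[T] = {
  ds.unpersist(blocking = true) // eagerly remove the stale cache entries
  ds.persist()                  // lazy: cache is repopulated on next execution
}
```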
The test cases are not enough to cover metadata refreshing. The current metadata cache is only used for data source tables, but we can still convert Hive tables (for example, Parquet and ORC) to data source tables, so we also need to check the behavior in those cases. Try to design more test cases for metadata refreshing, including both positive and negative cases.
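To make this concrete, here is a hedged sketch of one such positive case, assuming Hive support is enabled and Hive Parquet tables are converted to data source tables (so the metadata cache is exercised); the table name and data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE t (id INT) STORED AS PARQUET")
spark.sql("INSERT INTO t VALUES (1)")
assert(spark.table("t").count() == 1) // populates the metadata cache

// Suppose files under t's location change behind Spark's back
// (e.g. written by another session). The positive case checks that
// an explicit refresh drops the stale cached metadata:
spark.catalog.refreshTable("t")
spark.table("t").count() // should reflect the files currently on disk

// A negative case would verify that, without the refresh, the stale
// cached listing is still observed (or the read fails on missing files).
```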
What do you mean by both positive and negative cases?
For example, I try to refresh the metadata of a DataFrame that has multiple leaf nodes. (Update: just corrected the contents.)
Test build #61524 has finished for PR 13989 at commit
}

/**
 * Invalidates any metadata cached in the plan recursively.
"Refreshes" instead of "Invalidates"?
Would this work? Traverse the logical plan to find whether it references any catalog relation, and if it does, call catalog.refreshTable("...")? For example, see the sketch below.
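A hedged sketch of that traversal, assuming a recent Spark where catalog-backed leaves are `LogicalRelation` (carrying a `catalogTable`) or `HiveTableRelation`; at the time of this PR the Hive leaf was `MetastoreRelation`, so the match arms here are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Walk the plan; whenever a leaf carries catalog table metadata,
// refresh that table through the session catalog.
def refreshReferencedTables(spark: SparkSession, plan: LogicalPlan): Unit = {
  plan.foreach {
    case r: LogicalRelation if r.catalogTable.isDefined =>
      spark.catalog.refreshTable(r.catalogTable.get.identifier.quotedString)
    case h: HiveTableRelation =>
      spark.catalog.refreshTable(h.tableMeta.identifier.quotedString)
    case _ => // not catalog-backed; nothing to refresh
  }
}

// Hypothetical usage:
// refreshReferencedTables(spark, df.queryExecution.analyzed)
```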
One concern of mine is that the analyzed plan, optimized plan, and executed (physical) plan stored in QueryExecution are lazy vals, so they may already be materialized. Say we constructed a DataFrame from a table; next, we add a bunch of files into the directory where the table stores its data. The already-materialized plans can still reference the stale file listing.
In general, I think reconstructing the DataFrame/Dataset is probably the safer approach.
I think @liancheng has a good point. Why don't we take out Dataset.refresh() for now?
Alright, I will do that and submit a new pull request. Note that I think DataFrame refresh is already possible via table refresh, if a DataFrame references a table, or if some view references a DataFrame.
[SPARK-16311][SQL] Improve metadata refresh
Test build #3159 has finished for PR 13989 at commit
## What changes were proposed in this pull request?

This patch fixes the bug that the refresh command does not work on temporary views. It is based on #13989, but removes the public Dataset.refresh() API and improves test coverage. Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just creating a new QueryExecution).

## How was this patch tested?

Re-enabled a previously ignored test, and added a new Hive test suite covering the behavior of temporary views against MetastoreRelation.

Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>

Closes #14009 from rxin/SPARK-16311.
(cherry picked from commit 16a2a7d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
What changes were proposed in this pull request?
This patch implements the 3 things specified in SPARK-16311:
(1) Append a message to the FileNotFoundException saying that a workaround is to explicitly refresh the metadata.
(2) Make metadata refresh work on temporary tables/views (see the sketch after this list).
(3) Make metadata refresh work on Datasets/DataFrames, by introducing a Dataset.refresh() method.
And one additional small change:
(4) Merge invalidateTable and refreshTable.
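As a hedged illustration of (2), and of the table-level calls that (3)'s Dataset.refresh() would sit on top of, assuming a SparkSession `spark` and an illustrative path and view name:

```scala
// A temporary view over files; its file-listing metadata is cached.
spark.read.parquet("/tmp/data").createOrReplaceTempView("tmp_view")

// With this patch, both forms drop the stale cached metadata for the view:
spark.sql("REFRESH TABLE tmp_view")    // SQL statement form
spark.catalog.refreshTable("tmp_view") // Catalog API form
```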
How was this patch tested?
Created a new test suite that creates a temporary directory and then deletes a file from it to verify Spark can read the directory once refresh is called.
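A hedged sketch of the shape of such a test; the names, and the use of `Catalog.refreshByPath` (added in later Spark versions), are assumptions rather than the exact suite added here:

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val dir = Files.createTempDirectory("refresh-test").toString

// Write a few part files, then read the directory once so the file
// listing is cached.
spark.range(100).repartition(4).write.mode("overwrite").parquet(dir)
assert(spark.read.parquet(dir).count() == 100)

// Delete one part file behind Spark's back.
new File(dir).listFiles().filter(_.getName.startsWith("part-")).head.delete()

// After an explicit refresh, reading the directory succeeds again
// (with fewer rows) instead of failing on the missing file.
spark.catalog.refreshByPath(dir)
assert(spark.read.parquet(dir).count() < 100)
```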