
[SPARK-19463][SQL] refresh cache after the InsertIntoHadoopFsRelationCommand #16809

Closed
windpiger wants to merge 6 commits into apache:master from windpiger:refreshCacheAfterInsert

Conversation

@windpiger
Contributor

What changes were proposed in this pull request?

If we cache a DataSource table and then insert data into it, the cached data should be refreshed after the insert command.

How was this patch tested?

unit test added
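The behavior change can be illustrated with a toy model (pure Python, not Spark's implementation; all names here are illustrative): a query cache that is not invalidated on insert serves stale rows, while one that refreshes after the insert does not.

```python
# Toy model of the bug this PR fixes: a table cache that is NOT refreshed
# after an insert serves stale data, versus one that is.

class Table:
    def __init__(self, rows):
        self.rows = list(rows)

class Catalog:
    def __init__(self):
        self.tables = {}
        self.cache = {}  # table name -> materialized (cached) rows

    def cache_table(self, name):
        # Materialize the table's current rows, like caching a Spark plan.
        self.cache[name] = list(self.tables[name].rows)

    def scan(self, name):
        # A cached table is answered from the cache, not the base data.
        if name in self.cache:
            return self.cache[name]
        return self.tables[name].rows

    def insert(self, name, rows, refresh_after_insert):
        self.tables[name].rows.extend(rows)
        # What this PR proposes: refresh cached data after the insert.
        if refresh_after_insert and name in self.cache:
            self.cache_table(name)

cat = Catalog()
cat.tables["t"] = Table([1, 2])
cat.cache_table("t")

cat.insert("t", [3], refresh_after_insert=False)
print(cat.scan("t"))   # stale: [1, 2]

cat.insert("t", [4], refresh_after_insert=True)
print(cat.scan("t"))   # refreshed: [1, 2, 3, 4]
```

Without the refresh, the cached result keeps answering queries even though the underlying data has changed — exactly the stale-read scenario the PR description and unit test target.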

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72408 has finished for PR 16809 at commit 8aaef3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72410 has finished for PR 16809 at commit bf2bc1d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 6, 2017

Test build #72419 has finished for PR 16809 at commit 15350f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

cc @cloud-fan @gatorsmile

@cloud-fan
Contributor

This is a behavior change rather than a bug fix, but I think this new behavior makes more sense. cc @gatorsmile to confirm.

@gatorsmile
Member

@cloud-fan The new behavior looks reasonable to me, unless users expect to keep the original cached data.

I went over the change history and found that @sameeragarwal did this on purpose in #13566, even though he had reported the issue in the initial PR (#13419).

@sameeragarwal @hvanhovell @davies what is the reason we did not call refreshByPath automatically after an insert?

@gatorsmile
Member

Found the design doc: https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#

An alternative is to support a new command REFRESH path that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path. This acts as an explicit hammer without modifying the default behavior. Given that it’s fairly late to make significant changes in 2.0, this option might be less intrusive to the default behavior.

Should we revisit what is the expected default behavior in 2.2?
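The `REFRESH path` semantics quoted from the design doc can be sketched in a toy model (not Spark's implementation; the class and method names below are illustrative): invalidate every cached entry whose data source path equals the given path or lies underneath it.

```python
# Toy sketch of path-based cache invalidation: refresh_by_path("/data/t1")
# drops every cached entry backed by /data/t1 or any path below it.
import posixpath

class PathCache:
    def __init__(self):
        self.entries = {}  # cache key -> set of data source paths

    def put(self, key, paths):
        self.entries[key] = set(paths)

    def refresh_by_path(self, path):
        norm = posixpath.normpath(path)
        invalidated = [
            key for key, paths in self.entries.items()
            # Match the path itself or any descendant; the trailing "/"
            # prevents "/data/t1" from matching "/data/t10".
            if any(p == norm or p.startswith(norm + "/")
                   for p in map(posixpath.normpath, paths))
        ]
        for key in invalidated:
            del self.entries[key]
        return invalidated

cache = PathCache()
cache.put("df1", ["/data/t1/part=1"])
cache.put("df2", ["/data/t2"])
print(cache.refresh_by_path("/data/t1"))   # ['df1']
```

This mirrors the "explicit hammer" framing: the command invalidates all cached dataframes containing the given path, without changing what any other operation does by default.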

@windpiger
Contributor Author

windpiger commented Feb 7, 2017

Thanks a lot! It seems the REFRESH command was added precisely to avoid modifying the default behavior: if users want to refresh, they call the command manually.

@gatorsmile @cloud-fan @sameeragarwal @hvanhovell @davies shall we rethink the default behavior? Is it reasonable to refresh automatically after an insert, or should users refresh manually?

@cloud-fan
Contributor

Where do we refresh the table for table insertions? Will we refresh twice (table and path)?

@windpiger
Contributor Author

I found that a table refresh tied to table insertion only happens in DataFrameWriter.saveAsTable with overwrite mode and in InsertIntoHiveTable. Does InsertIntoHadoopFsRelationCommand need to refresh the table as well?


sparkSession.catalog.refreshByPath(outputPath.toString)
Contributor


This is only useful when the fileIndex is None, right?

Contributor Author


Even when fileIndex is not None, we still need to refresh.

Contributor


Why? We call fileIndex.foreach(_.refresh()) at the end; what's the difference between that and refreshByPath?

Contributor Author

@windpiger commented Feb 27, 2017

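The distinction debated in the review thread above can be illustrated with a toy contrast (not Spark's code; all names are illustrative): a file-index refresh only re-lists the files backing a relation, while a refreshByPath-style call also drops cached data derived from those files.

```python
# Toy contrast: index.refresh() re-lists files but leaves cached rows stale;
# refresh_by_path() drops the cached rows AND refreshes the file listing.

class FileIndex:
    def __init__(self, filesystem, root):
        self.filesystem = filesystem  # dict: directory -> list of files
        self.root = root
        self.files = set(filesystem[root])

    def refresh(self):
        # Re-list the directory; does not touch any query cache.
        self.files = set(self.filesystem[self.root])

class CachedRelation:
    def __init__(self, index):
        self.index = index
        self.cached_rows = None

    def scan(self):
        # First scan materializes the cache; later scans reuse it.
        if self.cached_rows is None:
            self.cached_rows = sorted(self.index.files)
        return self.cached_rows

    def refresh_by_path(self):
        # Invalidate cached rows and refresh the file listing together.
        self.cached_rows = None
        self.index.refresh()

fs = {"/data/t": ["a.parquet"]}
rel = CachedRelation(FileIndex(fs, "/data/t"))
print(rel.scan())            # ['a.parquet']

fs["/data/t"].append("b.parquet")   # an insert writes a new file
rel.index.refresh()
print(rel.scan())            # still ['a.parquet']: cached rows are stale

rel.refresh_by_path()
print(rel.scan())            # ['a.parquet', 'b.parquet']
```

In this model, refreshing only the file index is not enough for a cached relation: the materialized rows must also be invalidated, which is why both operations are discussed in the thread.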

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73271 has finished for PR 16809 at commit 12b68a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2017

Test build #73516 has finished for PR 16809 at commit f8ccc2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
