
[SPARK-19463][SQL] refresh cache after the InsertIntoHadoopFsRelationCommand #16809

Closed
windpiger wants to merge 6 commits into apache:master from windpiger:refreshCacheAfterInsert

Conversation

@windpiger
Contributor

What changes were proposed in this pull request?

If we cache a DataSource table and then insert data into it, the cached data should be refreshed after the insert command.

How was this patch tested?

unit test added
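The behavior change can be illustrated with a toy model (pure Python, not Spark's implementation; all names here are illustrative): a query cache that is not invalidated on insert serves stale rows, while one that refreshes after the insert does not.

```python
# Toy model of the bug this PR fixes: a table cache that is NOT refreshed
# after an insert serves stale data, versus one that is.

class Table:
    def __init__(self, rows):
        self.rows = list(rows)

class Catalog:
    def __init__(self):
        self.tables = {}
        self.cache = {}  # table name -> materialized (cached) rows

    def cache_table(self, name):
        # Materialize the table's current rows, like caching a Spark plan.
        self.cache[name] = list(self.tables[name].rows)

    def scan(self, name):
        # A cached table is answered from the cache, not the base data.
        if name in self.cache:
            return self.cache[name]
        return self.tables[name].rows

    def insert(self, name, rows, refresh_after_insert):
        self.tables[name].rows.extend(rows)
        # What this PR proposes: refresh cached data after the insert.
        if refresh_after_insert and name in self.cache:
            self.cache_table(name)

cat = Catalog()
cat.tables["t"] = Table([1, 2])
cat.cache_table("t")

cat.insert("t", [3], refresh_after_insert=False)
print(cat.scan("t"))   # stale: [1, 2]

cat.insert("t", [4], refresh_after_insert=True)
print(cat.scan("t"))   # refreshed: [1, 2, 3, 4]
```

Without the refresh, the cached result keeps answering queries even though the underlying data has changed — exactly the stale-read scenario the PR description and unit test target.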

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72408 has finished for PR 16809 at commit 8aaef3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72410 has finished for PR 16809 at commit bf2bc1d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 6, 2017

Test build #72419 has finished for PR 16809 at commit 15350f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

cc @cloud-fan @gatorsmile

@cloud-fan
Contributor

This is a behavior change rather than a bug fix, but I think this new behavior makes more sense. cc @gatorsmile to confirm.

@gatorsmile
Member

@cloud-fan The new behavior looks reasonable to me, unless users expect to keep the original cached data.

I went over the change history and found that @sameeragarwal did this on purpose in #13566, even though he had reported the issue in the initial PR (#13419).

@sameeragarwal @hvanhovell @davies what is the reason we did not call refreshByPath automatically after an insert?

@gatorsmile
Member

Found the design doc: https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#

An alternative is to support a new command REFRESH path that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path. This acts as an explicit hammer without modifying the default behavior. Given that it’s fairly late to make significant changes in 2.0, this option might be less intrusive to the default behavior.

Should we revisit what is the expected default behavior in 2.2?
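The `REFRESH path` semantics quoted from the design doc can be sketched in a toy model (not Spark's implementation; the class and method names below are illustrative): invalidate every cached entry whose data source path equals the given path or lies underneath it.

```python
# Toy sketch of path-based cache invalidation: refresh_by_path("/data/t1")
# drops every cached entry backed by /data/t1 or any path below it.
import posixpath

class PathCache:
    def __init__(self):
        self.entries = {}  # cache key -> set of data source paths

    def put(self, key, paths):
        self.entries[key] = set(paths)

    def refresh_by_path(self, path):
        norm = posixpath.normpath(path)
        invalidated = [
            key for key, paths in self.entries.items()
            # Match the path itself or any descendant; the trailing "/"
            # prevents "/data/t1" from matching "/data/t10".
            if any(p == norm or p.startswith(norm + "/")
                   for p in map(posixpath.normpath, paths))
        ]
        for key in invalidated:
            del self.entries[key]
        return invalidated

cache = PathCache()
cache.put("df1", ["/data/t1/part=1"])
cache.put("df2", ["/data/t2"])
print(cache.refresh_by_path("/data/t1"))   # ['df1']
```

This mirrors the "explicit hammer" framing: the command invalidates all cached dataframes containing the given path, without changing what any other operation does by default.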

@windpiger
Contributor Author

windpiger commented Feb 7, 2017

Thanks a lot! It seems the REFRESH command was added precisely to avoid modifying the default behavior: if users want to refresh, they call the command manually.

@gatorsmile @cloud-fan @sameeragarwal @hvanhovell @davies shall we rethink the default behavior? Is it reasonable to refresh automatically after an insert, or should users refresh manually?

@cloud-fan
Contributor

Where do we refresh the table for table insertions? Will we refresh twice (table and path)?

@windpiger
Contributor Author

I found that a table refresh tied to table insertion only happens in DataFrameWriter.saveAsTable with overwrite mode and in InsertIntoHiveTable. Does InsertIntoHadoopFsRelationCommand need to refresh the table as well?


sparkSession.catalog.refreshByPath(outputPath.toString)
Contributor


This is only useful when the fileIndex is None, right?

Contributor Author


Even when fileIndex is not None, we still need to refresh.

Contributor


Why? We call fileIndex.foreach(_.refresh()) at the end; what's the difference between that and refreshByPath?

Contributor Author

@windpiger commented Feb 27, 2017

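The distinction debated in the review thread above can be illustrated with a toy contrast (not Spark's code; all names are illustrative): a file-index refresh only re-lists the files backing a relation, while a refreshByPath-style call also drops cached data derived from those files.

```python
# Toy contrast: index.refresh() re-lists files but leaves cached rows stale;
# refresh_by_path() drops the cached rows AND refreshes the file listing.

class FileIndex:
    def __init__(self, filesystem, root):
        self.filesystem = filesystem  # dict: directory -> list of files
        self.root = root
        self.files = set(filesystem[root])

    def refresh(self):
        # Re-list the directory; does not touch any query cache.
        self.files = set(self.filesystem[self.root])

class CachedRelation:
    def __init__(self, index):
        self.index = index
        self.cached_rows = None

    def scan(self):
        # First scan materializes the cache; later scans reuse it.
        if self.cached_rows is None:
            self.cached_rows = sorted(self.index.files)
        return self.cached_rows

    def refresh_by_path(self):
        # Invalidate cached rows and refresh the file listing together.
        self.cached_rows = None
        self.index.refresh()

fs = {"/data/t": ["a.parquet"]}
rel = CachedRelation(FileIndex(fs, "/data/t"))
print(rel.scan())            # ['a.parquet']

fs["/data/t"].append("b.parquet")   # an insert writes a new file
rel.index.refresh()
print(rel.scan())            # still ['a.parquet']: cached rows are stale

rel.refresh_by_path()
print(rel.scan())            # ['a.parquet', 'b.parquet']
```

In this model, refreshing only the file index is not enough for a cached relation: the materialized rows must also be invalidated, which is why both operations are discussed in the thread.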

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73271 has finished for PR 16809 at commit 12b68a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2017

Test build #73516 has finished for PR 16809 at commit f8ccc2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
