[SPARK-19463][SQL] Refresh cache after the InsertIntoHadoopFsRelationCommand #16809
windpiger wants to merge 6 commits into apache:master
Conversation
Test build #72408 has finished for PR 16809 at commit
Test build #72410 has finished for PR 16809 at commit
Test build #72419 has finished for PR 16809 at commit
This is a behavior change rather than a bug fix, but I think this new behavior makes more sense. cc @gatorsmile to confirm.
@cloud-fan The new behavior looks reasonable to me, unless users expect to keep the original cached data. I went over the change history and found that @sameeragarwal did this on purpose in #13566, even though he reported the issue in the initial PR (#13419). @sameeragarwal @hvanhovell @davies What is the reason we did not automatically call the
Found the design doc: https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#
Should we revisit what the expected default behavior should be in 2.2?
Thanks a lot! It seems the REFRESH command was added so that the default behavior would not change: if users want to refresh, they call the command manually. @gatorsmile @cloud-fan @sameeragarwal @hvanhovell @davies Shall we rethink the default behavior? Is it reasonable to refresh automatically after an insert, or should users refresh manually?
Where do we refresh the table for table insertion? Will we refresh twice (table and path)?
I only found a table refresh related to table insertion when
}
}

sparkSession.catalog.refreshByPath(outputPath.toString)
This is only useful when the fileIndex is None, right?
We also need to refresh when fileIndex is not None.
Why? We will do fileIndex.foreach(_.refresh()) at the end. What's the difference between this and refreshByPath?
If we cache the table, refreshByPath will unpersist it:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L176
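
The distinction raised in this thread can be illustrated with a small toy model. This is only a sketch: `ToyFileIndex`, `ToyCacheManager`, and the path `/data/t` are hypothetical stand-ins invented for illustration, not Spark's `FileIndex` or `CacheManager`, and the toy re-caches eagerly where a real implementation may behave differently. It shows why refreshing a file listing alone would leave already-cached data stale, while a path-based cache refresh replaces the cached entry.

```scala
import scala.collection.mutable

// Toy stand-in for a file listing; NOT Spark's FileIndex.
final class ToyFileIndex(storage: mutable.Map[String, Vector[Int]], path: String) {
  private var listing: Vector[Int] = storage.getOrElse(path, Vector.empty)

  // Re-read the listing from "storage"; cached data elsewhere is untouched.
  def refresh(): Unit = {
    listing = storage.getOrElse(path, Vector.empty)
  }

  def rows: Vector[Int] = listing
}

// Toy stand-in for a cache of materialized data keyed by path; NOT Spark's CacheManager.
final class ToyCacheManager {
  private val cached = mutable.Map.empty[String, Vector[Int]]

  def cache(path: String, index: ToyFileIndex): Unit = {
    cached(path) = index.rows
  }

  def lookup(path: String): Option[Vector[Int]] = cached.get(path)

  // Drop the stale entry for this path and rebuild it from a refreshed listing.
  def refreshByPath(path: String, index: ToyFileIndex): Unit = {
    if (cached.contains(path)) {
      cached.remove(path)
      index.refresh()
      cached(path) = index.rows
    }
  }
}

object RefreshSketch {
  // Returns (cached rows after only the index refresh, cached rows after refreshByPath).
  def demo(): (Vector[Int], Vector[Int]) = {
    val path = "/data/t" // hypothetical path
    val storage = mutable.Map(path -> Vector(1))
    val index = new ToyFileIndex(storage, path)
    val cacheManager = new ToyCacheManager
    cacheManager.cache(path, index)

    storage(path) = Vector(1, 2) // simulated insert of one new row

    index.refresh() // the listing is fresh, but the cached copy is not
    val afterIndexRefresh = cacheManager.lookup(path).get

    cacheManager.refreshByPath(path, index) // the cached copy is rebuilt
    val afterRefreshByPath = cacheManager.lookup(path).get

    (afterIndexRefresh, afterRefreshByPath)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

In this toy, the cached entry still holds the pre-insert rows after `index.refresh()` and only picks up the new row once `refreshByPath` runs, which mirrors the argument that both calls are needed when the table is cached.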
Test build #73271 has finished for PR 16809 at commit
Test build #73516 has finished for PR 16809 at commit
Thanks, merging to master!
What changes were proposed in this pull request?
If we first cache a DataSource table and then insert data into it, the cached data should be refreshed after the insert command.
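
The scenario can be reproduced with a short sketch like the one below. It is an illustration, not the unit test from this PR: the table name `t_cache_demo`, the object name, and the local session settings are assumptions, and it relies only on the public `SparkSession` and `Catalog` APIs (`cacheTable`, `table`, `sql`). It needs Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object CacheAfterInsertSketch {
  // Returns the row counts seen through the cached table before and after an insert.
  def countsAroundInsert(spark: SparkSession): (Long, Long) = {
    // Hypothetical table name, used only for this illustration.
    spark.sql("CREATE TABLE t_cache_demo (id INT) USING parquet")
    try {
      spark.sql("INSERT INTO t_cache_demo VALUES (1)")

      // Cache the table and run an action so the cache is materialized.
      spark.catalog.cacheTable("t_cache_demo")
      val before = spark.table("t_cache_demo").count()

      // Insert another row, then read through the (possibly cached) table again.
      spark.sql("INSERT INTO t_cache_demo VALUES (2)")
      val after = spark.table("t_cache_demo").count()

      (before, after)
    } finally {
      spark.sql("DROP TABLE IF EXISTS t_cache_demo")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-after-insert-sketch")
      .getOrCreate()
    try {
      val (before, after) = countsAroundInsert(spark)
      println(s"rows before insert: $before, rows after insert: $after")
    } finally {
      spark.stop()
    }
  }
}
```

If the cache is refreshed after the insert, the second count should reflect the newly inserted row; if it is not, the second count would still show the pre-insert data until the user refreshes manually.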
How was this patch tested?
Unit test added.