Spark3: Disable catalog cache-enabled. #2659
Conversation
| Property      | Default | Description                              |
|---------------|---------|------------------------------------------|
| uri           | null    | a URI string, such as Hive metastore URI |
| clients       | 2       | client pool size                         |
| cache-enabled | false   | cache catalog, only works for Spark      |
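For context, these properties are passed to Spark as catalog options. Below is a minimal sketch of wiring them up; the catalog name `hive_prod` and the metastore URI are placeholders, not values from this PR.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: "hive_prod" and the metastore URI are placeholders.
// Each spark.sql.catalog.<name>.* entry maps to a row in the table above.
val spark = SparkSession.builder()
  .appName("iceberg-catalog-example")
  .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hive_prod.type", "hive")
  .config("spark.sql.catalog.hive_prod.uri", "thrift://metastore-host:9083")
  .config("spark.sql.catalog.hive_prod.clients", "2")
  .config("spark.sql.catalog.hive_prod.cache-enabled", "false")
  .getOrCreate()
```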
There is also a cache-enabled option in FlinkCatalogFactory. I have created a PR about this: #2648
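For comparison, a hedged sketch of how that Flink-side option could be set when creating an Iceberg catalog from Flink's Table API; the catalog name and metastore URI are placeholders, and this assumes the `cache-enabled` option accepted by FlinkCatalogFactory.

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

// Sketch only: "hive_catalog" and the metastore URI are placeholders.
val tableEnv = TableEnvironment.create(
  EnvironmentSettings.newInstance().inStreamingMode().build())

tableEnv.executeSql(
  """CREATE CATALOG hive_catalog WITH (
    |  'type' = 'iceberg',
    |  'catalog-type' = 'hive',
    |  'uri' = 'thrift://metastore-host:9083',
    |  'cache-enabled' = 'false'
    |)""".stripMargin)
```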
I don't agree with changing the default here. I think the solution instead is to update the table cache and invalidate entries after some period of time, like 1 minute. But it isn't correct to simply turn off caching by default. You should also be able to run

@aokolnychyi and @rymurr, I think that we turned off timer-based cache invalidation because it was causing tables referenced by cached queries to be out of sync with tables that are freshly loaded. Should we rethink that? If I remember correctly, we turned off invalidation before we fixed Spark in many cases. Now that we have Spark handling caching correctly, can we go back to caching a table only for a minute?
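To make the time-based idea concrete, here is a minimal sketch of an expiring table cache using Caffeine (the library Iceberg's CachingCatalog is built on). The one-minute TTL, the key type, and the loader are illustrative assumptions, not the actual CachingCatalog implementation.

```scala
import java.time.Duration
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Sketch only: a name-keyed cache whose entries expire after one minute,
// so a later lookup reloads the table and picks up fresh metadata.
object ExpiringTableCache {
  type Table = AnyRef // stand-in for org.apache.iceberg.Table

  private val cache: Cache[String, Table] = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofMinutes(1))
    .build[String, Table]()

  def load(name: String, loader: String => Table): Table =
    cache.get(name, (n: String) => loader(n)) // compute on miss or after expiry
}
```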
FYI @RussellSpitzer also.
This is a tough one for me since it is basically our number one user question for multi-user tables. It definitely seems counterintuitive, and if there were a way to cache just for all references in the same query and drop it for everything else, I would be in favor of that. I feel less confident about changing the default to simply disable caching, but that does seem less surprising for the majority of users than the current caching behavior.
I think it was also related to the table being dropped from the cache during a long-running operation. Is this the only place in Iceberg that a table would be cached, or is it cached by Spark as well? I would be in favour of having as little caching as possible handled by Iceberg directly and relying on engine-level caching for tables. I guess adding the timer back to this cache is a way of caching multiple calls to the table before the Spark cache has been notified of the Spark Table?
The table will be referenced by Spark plans as well. I think the problem was that those plans weren't being invalidated when you ran

That seems like a Spark problem and not a catalog problem to me, which is why I think we should revisit this decision. Shouldn't Spark invalidate cached plans that reference a table when

We may also want to purposely separate a table when it is in a cached plan. @aokolnychyi, what did we decide was the "correct" behavior when a query is cached?
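As a user-visible illustration of the engine-side refresh being discussed, Spark exposes an explicit refresh for a table. The table name below is a placeholder, and `spark` is assumed to be the active SparkSession.

```scala
// Sketch only: "hive_prod.db.events" is a placeholder table name.
// Either form asks Spark to drop its cached metadata for the table,
// so subsequent plans are built against the current state.
spark.sql("REFRESH TABLE hive_prod.db.events")
spark.catalog.refreshTable("hive_prod.db.events")
```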
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
When I use Spark SQL to query an Iceberg table, I found that it can't read the latest table metadata. When `cache-enabled` is set to true, it will not refresh the table. So I think we should disable `cache-enabled` by default. In this patch, I put the setting into the catalog properties.