
Conversation

@coolderli
Contributor

When I use Spark SQL to query an Iceberg table, I found that it can't read the latest table metadata. When `cache-enabled` is set to `true`, the table is never refreshed. So I think we should disable `cache-enabled` by default.

In this patch, I put the setting into the catalog properties.
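For context, here is a minimal sketch of how such a property reaches the catalog in Spark, assuming the usual `spark.sql.catalog.<name>.*` configuration path; the catalog name `hive_cat` and the table `hive_cat.db.events` are hypothetical, not taken from this PR.

```java
import org.apache.spark.sql.SparkSession;

public class CacheDisabledExample {
  public static void main(String[] args) {
    // Illustrative only: "hive_cat" is a made-up catalog name. Every
    // spark.sql.catalog.<name>.* entry is passed to the Iceberg catalog
    // as a catalog property, which is where this PR moves cache-enabled.
    SparkSession spark = SparkSession.builder()
        .master("local[*]") // local run for illustration
        .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.hive_cat.type", "hive")
        .config("spark.sql.catalog.hive_cat.cache-enabled", "false") // the default this PR proposes
        .getOrCreate();

    // Hypothetical table; with caching disabled, each load sees fresh metadata
    spark.sql("SELECT * FROM hive_cat.db.events").show();
  }
}
```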

@coolderli coolderli marked this pull request as draft June 2, 2021 06:32
@coolderli coolderli marked this pull request as ready for review June 2, 2021 06:34
| Property | Default | Description |
| ------------- | ------- | ---------------------------------------- |
| uri | null | a URI string, such as Hive metastore URI |
| clients | 2 | client pool size |
| cache-enabled | false | cache catalog, only works for Spark |
Contributor


There is also a `cache-enabled` option in `FlinkCatalogFactory`. I have created a PR about this: #2648

@rdblue
Contributor

rdblue commented Jun 12, 2021

I don't agree with changing the default here. I think the solution instead is to update the table cache and invalidate entries after some period of time, like 1 minute. But it isn't correct to simply turn off caching by default. You should also be able to run `REFRESH TABLE` to update the table.

@aokolnychyi and @rymurr, I think that we turned off timer-based cache invalidation because it was causing tables referenced by cached queries to be out of sync with tables that are freshly loaded. Should we rethink that? If I remember correctly, we turned off invalidation before we fixed Spark in many cases. Now that we have Spark handling caching correctly, can we go back to caching a table only for a minute?
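A minimal sketch of what timer-based invalidation could look like, assuming a Caffeine-style cache; the class, field names, and the TTL value below are illustrative, not the actual `CachingCatalog` code.

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical sketch of the "cache for a minute" idea: entries expire on
// their own, so a stale Table is re-loaded at most one TTL after it was
// cached, while repeated loads inside that window still hit the cache.
class ExpiringTableCache {
  private final Catalog catalog;
  private final Cache<TableIdentifier, Table> cache;

  ExpiringTableCache(Catalog catalog, Duration ttl) {
    this.catalog = catalog;
    this.cache = Caffeine.newBuilder()
        .expireAfterWrite(ttl) // e.g. Duration.ofMinutes(1), as suggested above
        .build();
  }

  Table load(TableIdentifier ident) {
    // Re-loads from the underlying catalog once the entry has expired
    return cache.get(ident, catalog::loadTable);
  }
}
```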

@rdblue
Contributor

rdblue commented Jun 12, 2021

FYI @RussellSpitzer also.

@RussellSpitzer
Member

This is a tough one for me, since it is basically our number one user question for multi-user tables. The current behavior definitely seems counterintuitive, and if there were a way to cache a table just for all references within the same query and drop it for everything else, I would be in favor of that. I feel less confident about changing the default to disable caching entirely, but that does seem less surprising for the majority of users than the current caching behavior.

@rymurr
Contributor

rymurr commented Jun 14, 2021

> @aokolnychyi and @rymurr, I think that we turned off timer-based cache invalidation because it was causing tables referenced by cached queries to be out of sync with tables that are freshly loaded.

I think it was also related to the table being dropped from the cache during a long running operation.

Is this the only place in Iceberg that a table would be cached, or is it cached by Spark as well? I would be in favour of having as little caching as possible handled by Iceberg directly, relying on engine-level caching for tables instead. I guess adding the timer back to this cache is a way of caching multiple calls to the table before the Spark cache has been notified of the Spark table?

@rdblue
Contributor

rdblue commented Jun 15, 2021

> Is this the only place in Iceberg that a table would be cached?

The table will be referenced by Spark plans as well. I think the problem was that those plans weren't being invalidated when you ran `REFRESH TABLE t`, because the catalog's `invalidateTable` method calls `refresh` on the table reference that it loads. So if a table was cleared from the cache, then the existing references in Spark would no longer be updated by the catalog's `invalidateTable` call.

That seems like a Spark problem and not a catalog problem to me, which is why I think we should revisit this decision. Shouldn't Spark invalidate cached plans that reference a table when `REFRESH TABLE` runs, rather than assuming that the catalog can do it?

We may also want to purposely separate a table when it is in a cached plan. @aokolnychyi, what did we decide was the "correct" behavior when a query is cached?
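To make the refresh-versus-evict distinction concrete, here is a hedged sketch of the behavior described above, using a plain map-backed cache; the class and method bodies are illustrative, not the actual `SparkCatalog` or `CachingCatalog` code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

// Illustrative only. The point: refreshing the cached instance in place
// updates the object that existing Spark plans still hold, while evicting
// it leaves those plans with a reference the catalog can no longer touch.
class RefreshingTableCache {
  private final ConcurrentMap<TableIdentifier, Table> cache = new ConcurrentHashMap<>();

  // What "invalidate" does here: refresh in place, so cached plans that
  // hold the same Table object see the new metadata.
  void invalidateTable(TableIdentifier ident) {
    Table cached = cache.get(ident);
    if (cached != null) {
      cached.refresh();
    }
  }

  // The problematic alternative: once the entry is dropped, references
  // already captured by cached Spark plans can no longer be updated here.
  void evictTable(TableIdentifier ident) {
    cache.remove(ident);
  }
}
```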

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 17, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 24, 2024