
Conversation

@coolderli
Contributor

When I use Spark SQL to query an Iceberg table, I found that it can't read the latest table metadata. When `cache-enabled` is set to `true`, the table is never refreshed. So I think we should disable `cache-enabled` by default.

In this patch, I put the setting into the catalog properties.
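For context, here is a minimal sketch of how such a property reaches the catalog in Spark, assuming the usual `spark.sql.catalog.<name>.*` configuration path; the catalog name `hive_cat` and the table `hive_cat.db.events` are hypothetical, not taken from this PR.

```java
import org.apache.spark.sql.SparkSession;

public class CacheDisabledExample {
  public static void main(String[] args) {
    // Illustrative only: "hive_cat" is a made-up catalog name. Every
    // spark.sql.catalog.<name>.* entry is passed to the Iceberg catalog
    // as a catalog property, which is where this PR moves cache-enabled.
    SparkSession spark = SparkSession.builder()
        .master("local[*]") // local run for illustration
        .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.hive_cat.type", "hive")
        .config("spark.sql.catalog.hive_cat.cache-enabled", "false") // the default this PR proposes
        .getOrCreate();

    // Hypothetical table; with caching disabled, each load sees fresh metadata
    spark.sql("SELECT * FROM hive_cat.db.events").show();
  }
}
```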

@coolderli coolderli marked this pull request as draft June 2, 2021 06:32
@coolderli coolderli marked this pull request as ready for review June 2, 2021 06:34
| Property | Default | Description |
| ------------- | ------- | ---------------------------------------- |
| uri | null | a URI string, such as Hive metastore URI |
| clients | 2 | client pool size |
| cache-enabled | false | cache catalog, only works for Spark |
Contributor


There is also a `cache-enabled` option in `FlinkCatalogFactory`. I have created a PR about this: #2648

@rdblue
Contributor

rdblue commented Jun 12, 2021

I don't agree with changing the default here. I think the solution instead is to update the table cache and invalidate entries after some period of time, like 1 minute. But it isn't correct to simply turn off caching by default. You should also be able to run `REFRESH TABLE` to update the table.

@aokolnychyi and @rymurr, I think that we turned off timer-based cache invalidation because it was causing tables referenced by cached queries to be out of sync with tables that are freshly loaded. Should we rethink that? If I remember correctly, we turned off invalidation before we fixed Spark in many cases. Now that we have Spark handling caching correctly, can we go back to caching a table only for a minute?
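A minimal sketch of what timer-based invalidation could look like, assuming a Caffeine-style cache; the class, field names, and the TTL value below are illustrative, not the actual `CachingCatalog` code.

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical sketch of the "cache for a minute" idea: entries expire on
// their own, so a stale Table is re-loaded at most one TTL after it was
// cached, while repeated loads inside that window still hit the cache.
class ExpiringTableCache {
  private final Catalog catalog;
  private final Cache<TableIdentifier, Table> cache;

  ExpiringTableCache(Catalog catalog, Duration ttl) {
    this.catalog = catalog;
    this.cache = Caffeine.newBuilder()
        .expireAfterWrite(ttl) // e.g. Duration.ofMinutes(1), as suggested above
        .build();
  }

  Table load(TableIdentifier ident) {
    // Re-loads from the underlying catalog once the entry has expired
    return cache.get(ident, catalog::loadTable);
  }
}
```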

@rdblue
Contributor

rdblue commented Jun 12, 2021

FYI @RussellSpitzer also.

@RussellSpitzer
Member

This is a tough one for me, since it is basically our number one user question for multi-user tables. The current behavior definitely seems counterintuitive, and if there were a way to cache a table just for all references within the same query and drop it for everything else, I would be in favor of that. I feel less confident about changing the default to disable caching entirely, but that does seem less surprising for the majority of users than the current caching behavior.

@rymurr
Contributor

rymurr commented Jun 14, 2021

> @aokolnychyi and @rymurr, I think that we turned off timer-based cache invalidation because it was causing tables referenced by cached queries to be out of sync with tables that are freshly loaded.

I think it was also related to the table being dropped from the cache during a long running operation.

Is this the only place in Iceberg that a table would be cached, or is it cached by Spark as well? I would be in favour of having as little caching as possible handled by Iceberg directly, relying on engine-level caching for tables instead. I guess adding the timer back to this cache is a way of caching multiple calls to the table before the Spark cache has been notified of the Spark table?

@rdblue
Contributor

rdblue commented Jun 15, 2021

> Is this the only place in Iceberg that a table would be cached?

The table will be referenced by Spark plans as well. I think the problem was that those plans weren't being invalidated when you ran `REFRESH TABLE t`, because the catalog's `invalidateTable` method calls `refresh` on the table reference that it loads. So if a table was cleared from the cache, then the existing references in Spark would no longer be updated by the catalog's `invalidateTable` call.

That seems like a Spark problem and not a catalog problem to me, which is why I think we should revisit this decision. Shouldn't Spark invalidate cached plans that reference a table when `REFRESH TABLE` runs, rather than assuming that the catalog can do it?

We may also want to purposely separate a table when it is in a cached plan. @aokolnychyi, what did we decide was the "correct" behavior when a query is cached?
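To make the refresh-versus-evict distinction concrete, here is a hedged sketch of the behavior described above, using a plain map-backed cache; the class and method bodies are illustrative, not the actual `SparkCatalog` or `CachingCatalog` code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

// Illustrative only. The point: refreshing the cached instance in place
// updates the object that existing Spark plans still hold, while evicting
// it leaves those plans with a reference the catalog can no longer touch.
class RefreshingTableCache {
  private final ConcurrentMap<TableIdentifier, Table> cache = new ConcurrentHashMap<>();

  // What "invalidate" does here: refresh in place, so cached plans that
  // hold the same Table object see the new metadata.
  void invalidateTable(TableIdentifier ident) {
    Table cached = cache.get(ident);
    if (cached != null) {
      cached.refresh();
    }
  }

  // The problematic alternative: once the entry is dropped, references
  // already captured by cached Spark plans can no longer be updated here.
  void evictTable(TableIdentifier ident) {
    cache.remove(ident);
  }
}
```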

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 17, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 24, 2024