
Conversation

@okumin (Contributor) commented Sep 21, 2025

Allows TableMetadataParser to cache the content of metadata.json.

The Iceberg REST Catalog is expected to serve the metadata of the same table to many clients concurrently, so caching metadata.json content would improve average latency and reduce S3 cost. Here I assume each metadata.json file is completely immutable once written; I may be overlooking something, or this may not be worth including in iceberg-core.
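
For illustration, a minimal sketch of the idea (not necessarily the exact change in this PR); the class name, the use of Caffeine, and the default size are assumptions:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.io.FileIO;

// Illustrative sketch: a process-wide cache of parsed TableMetadata keyed by
// metadata location. Keying by location alone relies on the assumption above
// that a committed metadata.json is never rewritten in place.
public class CachingMetadataReader {
  private static final Cache<String, TableMetadata> METADATA_CACHE =
      Caffeine.newBuilder()
          .maximumSize(1_000) // hypothetical default
          .build();

  public static TableMetadata read(FileIO io, String metadataLocation) {
    // On a hit this skips the S3 GET and the JSON parse entirely.
    return METADATA_CACHE.get(
        metadataLocation, location -> TableMetadataParser.read(io, location));
  }
}
```

Under that immutability assumption no explicit invalidation is needed; size- or TTL-based eviction is enough.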

Related PR

apache/hive#6022

@github-actions github-actions bot added the core label Sep 21, 2025
@pvary (Contributor) commented Sep 22, 2025

CC: @gaborkaszab

@gaborkaszab (Collaborator) commented:

Thanks for the PR @okumin! Also, the sequence diagram in the linked Hive PR was very useful for understanding the use case. If I'm not mistaken, the motivation is to introduce a server-side Table/TableMetadata cache within the HMS implementation of the REST catalog, and the original approach didn't work out because there is no catalog API that exposes the metadata location without loading the whole table. Is my assumption correct?

As an initial step I'd recommend checking whether there is community support for such a broad change by asking on dev@. I think this is needed because the PR affects essentially all the catalogs that load table metadata from storage using a metadata location. Also, some of my previous experience showed that metadata.json files can grow pretty big, and I'm wondering if you have studied what sensible configurations would be for the cache size and for the maximum metadata.json size; I'd be worried that in a real-world scenario only the small tables would fit into the cache anyway. You cache the content of the metadata.json, not the compressed gzip version that is stored on storage, right?
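
To make the sizing question concrete, here is a hedged sketch (all limits and names are made up) of bounding such a cache by total decompressed bytes and skipping oversized files, instead of bounding it by entry count:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Hypothetical sizing knobs for a metadata.json content cache.
public class MetadataJsonCache {
  private static final long MAX_TOTAL_BYTES = 128L * 1024 * 1024;   // overall budget
  private static final long MAX_CONTENT_LENGTH = 8L * 1024 * 1024;  // per-file cutoff

  // Weigh entries by the decompressed JSON size so a handful of huge
  // metadata.json files cannot quietly dominate the cache.
  private final Cache<String, String> jsonByLocation =
      Caffeine.newBuilder()
          .maximumWeight(MAX_TOTAL_BYTES)
          .weigher((String location, String json) -> json.length())
          .build();

  public void maybeCache(String location, String decompressedJson) {
    // Tables with oversized metadata simply bypass the cache.
    if (decompressedJson.length() <= MAX_CONTENT_LENGTH) {
      jsonByLocation.put(location, decompressedJson);
    }
  }
}
```

Whether a cutoff like 8 MB is reasonable for real-world metadata.json sizes is exactly the kind of data the question above is asking for.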

=======
Some thinking and a side conversation

I was thinking about the architecture of the HMS-based REST catalog and what might be the root cause of the struggle to implement a server-side cache there. Let me know if I'm missing something. As I see it, the architecture for a table load is the following (a rough code sketch follows the list):

  1. Call loadTable API on HMS-based REST catalog
  2. Internally call HiveCatalog's loadTable
  3. This calls HMS's loadTable, which returns the metadata location
  4. The internal HiveCatalog uses the metadata location from HMS to load the TableMetadata from storage
  5. The internal HiveCatalog returns the Table object
  6. LoadTableResponse is constructed and returned from HMS-based REST catalog
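
Roughly, in code (the handler class and wiring below are illustrative, not the actual HMS-based REST catalog implementation):

```java
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.rest.responses.LoadTableResponse;

// Steps 2-6 above: the REST handler delegates to an embedded HiveCatalog,
// which asks HMS for the metadata location and then reads metadata.json
// from storage on every single request.
public class LoadTableHandler {
  private final HiveCatalog hiveCatalog;

  public LoadTableHandler(HiveCatalog hiveCatalog) {
    this.hiveCatalog = hiveCatalog;
  }

  public LoadTableResponse loadTable(TableIdentifier identifier) {
    Table table = hiveCatalog.loadTable(identifier); // HMS call + storage read
    return LoadTableResponse.builder()
        .withTableMetadata(((BaseTable) table).operations().current())
        .build();
  }
}
```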

What I don't quite see is why the internal HiveCatalog is needed, other than as a convenient way to connect to HMS (but we are already in HMS, right? Maybe in a different process, though). Alternatively, the sequence could be the following, eliminating the need for an API to get the metadata location and also for the TableMetadataParser cache (a sketch of this appears after the list):

  1. Call loadTable API on HMS-based REST catalog
  2. Call HMS's loadTable directly to get the metadata location
  3. Have a cache in the HMS-based REST catalog (storing metadata locations, ETags, Table objects, etc.) and use it to check whether the table has changed
  4. a) If the table has changed, do a full table load through HiveCatalog, or alternatively through TableMetadataParser using the metadata location
     b) If the table hasn't changed, answer the request from the cache, or send a 304 Not Modified depending on the use case
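
A hedged sketch of that alternative; the class, the cache layout, and passing the HMS-provided metadata location in as a parameter are illustrative choices, not a concrete proposal:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.rest.responses.LoadTableResponse;

// Illustrative server-side cache for the HMS-based REST catalog: HMS is only
// asked for the current metadata location (step 2); the expensive storage
// read happens only when that location differs from the cached one.
public class CachedTableLoader {
  private record Entry(String metadataLocation, LoadTableResponse response) {}

  private final Map<TableIdentifier, Entry> cache = new ConcurrentHashMap<>();
  private final FileIO io;

  public CachedTableLoader(FileIO io) {
    this.io = io;
  }

  public LoadTableResponse loadTable(TableIdentifier identifier, String currentLocation) {
    Entry cached = cache.get(identifier);
    if (cached != null && cached.metadataLocation().equals(currentLocation)) {
      // Step 4b: unchanged, answer from the cache (or translate into a 304).
      return cached.response();
    }

    // Step 4a: changed or not cached yet, do the full load from storage.
    TableMetadata metadata = TableMetadataParser.read(io, currentLocation);
    LoadTableResponse response =
        LoadTableResponse.builder().withTableMetadata(metadata).build();
    cache.put(identifier, new Entry(currentLocation, response));
    return response;
  }
}
```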

Would this make sense?

@okumin (Contributor, Author) commented Sep 23, 2025

@gaborkaszab
As you say, the original motivation is exactly REST on HMS, but I'd slightly revise the statement that "the original approach didn't work": all options are still open, both where we add the caching layer (I'd say the HMS-specific approach is still valid) and the detailed configuration, such as the recommended cache size.

I'd say we potentially have the following three possibilities:

(A) Caching table metadata is legitimate and practical for REST or other use cases. In this case, reusing the same logic as manifest caching might make sense; users would get a consistent experience across table metadata and manifests (and potentially manifest list files in the future), and it could be reused by other systems, e.g., a REST Catalog over the JDBC Catalog. I created this PR to demonstrate this direction.
(B) Caching table metadata is legitimate and practical, but the caching layer should not be part of iceberg-core. In this case, I would close this PR and we would implement caching only in REST over HMS.
(C) Caching table metadata is not legitimate (e.g., it violates ACID semantics) or not practical (e.g., table metadata is typically too large to cache). In this case, we should reconsider whether we really need HIVE-29035.

I currently assume that caching table metadata is practical and that (A) is preferable to (B), because the logic is not specific to Hive Metastore. If we can agree on this point, I will send an email to discuss whether iceberg-core should be able to cache table metadata and manifest lists.
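
To make the "consistent experience" in (A) concrete: the first two properties below are the existing manifest-cache settings documented for iceberg-core (as far as I know), while the io.metadata.* names are purely hypothetical analogues a metadata.json cache could expose.

```java
import java.util.Map;

public class CachePropertiesExample {
  public static void main(String[] args) {
    Map<String, String> catalogProperties = Map.of(
        // Existing manifest caching properties.
        "io.manifest.cache-enabled", "true",
        "io.manifest.cache.max-total-bytes", "104857600",
        // Hypothetical metadata.json caching properties for option (A).
        "io.metadata.cache-enabled", "true",
        "io.metadata.cache.max-total-bytes", "104857600");

    catalogProperties.forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```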

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 24, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 31, 2025