[PoC] Core: Support Table Metadata caching #14137
Conversation
CC: @gaborkaszab
Thanks for the PR @okumin ! Also, the sequence diagram in the linked Hive PR was very useful to understand the use-case. If I'm not mistaken the motivation for this is to introduce a server-side Table/TableMetadata cache within the HMS implementation of the REST catalog and the original approach didn't work out because there is no catalog API to expose metadata location without loading the whole table. Is my assumption correct? As an initial step I'd recommend to check if there is community support for such a broad change via asking on dev@. The reason I think this is needed, because this PR seems to affect broadly all the catalogs that load the table metadata from storage using a metadata location. Also, some of my previous experiences showed that the size of the metadata.jsons could grow pretty big, and I'm wondering if there is any study on your side, what are the optional configurations for the size of the cache and the max size of the metadata.jsons. I'd be worried that in a real world scenario only the small tables would fit into the cache anyway. Do you cache the content of the metadata.json and not the compressed gzip version that is stored on storage, right? =======
@gaborkaszab I'd say we potentially have the following three possibilities. (A) Let's say caching table metadata is legal and practical for REST or other use cases. In that case, reusing the same logic as the manifest caching might make sense; users would get a consistent experience across table metadata and manifests (and potentially manifest list files in the future), and it could be reused by other setups, e.g., a REST Catalog over a JDBC Catalog. I created this PR to demonstrate this direction. I currently assume that caching table metadata is practical and that (A) can outperform (B) because the logic is not specific to Hive Metastore. If we can agree on this point, I will send an email to discuss whether iceberg-core should be able to cache table metadata and the manifest list.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Allows TableMetadataParser to cache the content of metadata.json. The Iceberg REST Catalog is expected to serve table metadata of the same table to many clients concurrently, so caching it would improve average latency and reduce S3 cost. Here, I assume each metadata.json file is totally immutable; I may be overlooking something, or this might not be worth including in iceberg-core.
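A minimal sketch of the caching idea described above, relying on the immutability assumption: memoize the content of a metadata.json file by its location. The class and function names (`MetadataJsonCache`, `readFully`) are illustrative and are not the actual code in this PR.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Minimal sketch only, not the code in this PR: memoize the content of a metadata.json
 * file by its location. The readFully function stands in for whatever FileIO-based read
 * the parser performs.
 */
public class MetadataJsonCache {
  private final ConcurrentMap<String, String> contentByLocation = new ConcurrentHashMap<>();
  private final Function<String, String> readFully;

  public MetadataJsonCache(Function<String, String> readFully) {
    this.readFully = readFully;
  }

  public String read(String metadataLocation) {
    // memoization is only safe if a metadata.json location always maps to immutable content
    return contentByLocation.computeIfAbsent(metadataLocation, readFully);
  }
}
```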
Related PR
apache/hive#6022