Skip to content

Conversation

@gaborkaszab
Copy link
Collaborator

For observability purposes clients could use the stats of the manifest file content cache to see for instance the cache hit/miss ratio so that users can fine tune the configuration of the cache.

@github-actions github-actions bot added the core label Jul 15, 2025
@gaborkaszab
Copy link
Collaborator Author

As described in the commit message, the purpose here is to expose the stats of the manifest file content cache. This would be useful to see the cache hot/miss ratio and to have some observability on the performance of this cache so that admins can tweak the configs of the cache.
This is probably the simplest solution I could think of where we expose the Caffeine.CacheStats through a public method. There might be more complex solutions where we introduce an API for providing stats but I found that an overkill.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 15, 2025
@gaborkaszab
Copy link
Collaborator Author

I plan to work on this. Commenting to remove the stale label

@gaborkaszab gaborkaszab removed the stale label Aug 18, 2025
@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 18, 2025
@gaborkaszab
Copy link
Collaborator Author

not stale

@gaborkaszab gaborkaszab removed the stale label Sep 19, 2025
@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@gaborkaszab
Copy link
Collaborator Author

still on my TODO list, just couldn't get to it

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 21, 2025
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Nov 28, 2025
@gaborkaszab
Copy link
Collaborator Author

reopen

@gaborkaszab gaborkaszab reopened this Nov 28, 2025
@github-actions github-actions bot removed the stale label Nov 29, 2025
@gaborkaszab gaborkaszab force-pushed the expose_manifestFileContentCacheStats branch 2 times, most recently from 3176c00 to 31f39aa Compare December 3, 2025 13:24
@gaborkaszab gaborkaszab requested a review from pvary December 3, 2025 13:26
@gaborkaszab gaborkaszab force-pushed the expose_manifestFileContentCacheStats branch from 31f39aa to 68d04c6 Compare December 3, 2025 13:29
import org.immutables.value.Value;

@Value.Immutable
public abstract class CacheMetrics {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think later on we might want to expose catalog-level table stats to the clients, like the stats of CachingCatalog or the stats of the table cache used in RESTSessionCatalog for freshness-aware loading. To do that this class could be reused as a container of the stats, but then we might want to put this into api/ instead of core/


public abstract long evictionCount();

public abstract long evictionWeight();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean?
Is this general enough to have a meaning if we happen to change the cache implementation from caffeine to something else?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, for different cache implementations probably not all these fields are available. I checked EhCache stats and apparently maybe 3 of the above metrics are included there: hit count, miss count, eviction count.

I see two way from here:

  1. Make this CacheMetrics class as narrow as it could serve the common fields from different cache implementations. I think this way the offered list of stats would be very narrow, maybe hit and miss count only.
  2. We could make the metrics objects instead of primitives (e.g. Long) and then if the cache implementation doesn't support some of the metrics we could leave it null. This way we have the flexibility to support more cache implementations and not to be too narrow. WDYT @pvary ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would build on the exact use-case here. What is needed, and why?
Based on that we could decide, but I would prefer a limited API. We can always add more, if needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially we are fine with the following list of metrics: hitCount, missCount, totalLoadTime and evictionCount. These are the ones that would add actual observability value to fine tune the manifest content cache IMO.
Note, by narrowing the list we still don't solve the issue of different cache implementations have different stats, so maybe other implementations won't have totalLoadTime for instance. Would it still make sense to make the member non-primitive to have the ability of leaving them null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Gabor and Peter,
Thank you for looking into this. I believe minimal cache metrics exposure is still better than nothing. In Apache Impala usecase, we usually display metrics like this in Coordinator / Catalog daemon WebUI and let metrics monitoring service scraping from it. Having ability to obeserve the cache metrics will help user tune the best config for their Iceberg usecase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I look again at #4518 and saw many reference into the initial implementation of manifest caching in Iceberg java library. Having a standard API in Iceberg lib related to caching will be great. I'd be happy to contribute / review further improvement of manifest caching to help benefit the broader community Iceberg community.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reduced the list of exposed metrics to be as general as possible based on the comments from @pvary .
About introducing a new API to provide the manifest cache stats: I'm not entirely convinced (I saw the interest in Presto to use stats) but not entirely against either. @pvary Would you mind sharing your view?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to hear input from other engines as well. Maybe get someone from Presto involved.

For observability purposes clients could use the stats of the
manifest file content cache to see for instance the cache
hit/miss ratio so that users can fine tune the configuration
of the cache.
@gaborkaszab gaborkaszab force-pushed the expose_manifestFileContentCacheStats branch from 68d04c6 to 78ac11a Compare December 9, 2025 10:05
@pvary pvary merged commit c68f041 into apache:main Dec 12, 2025
44 checks passed
@pvary
Copy link
Contributor

pvary commented Dec 12, 2025

Merged to main.
Thanks for the PR @gaborkaszab and @rizaon for the review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants