Skip to content

[Iceberg] Add StatisticsFileCache#23177

Merged
ZacBlanco merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-iceberg-statistics-file-cache
Sep 9, 2024
Merged

[Iceberg] Add StatisticsFileCache#23177
ZacBlanco merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-iceberg-statistics-file-cache

Conversation

@ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Jul 10, 2024

Description

This change adds a connector-wide cache for StatisticsFiles. This prevents loading the same statisticsFile multiple times into the connector. It improves query planning times and reduces memory usage by having only one set of statistics available from each file.

The cache is designed to be connector-wide to reduce memory consumption. Since separate queries can reference the same tables, we prevent memory utilization by sharing the cache across queries. The files are immutable and unique to a table and snapshot (hashed and equated on properties of file path, footer size, blob metadata, etc), so caching between queries/users/sessions shouldn't be an issue. Cache entries should never become stale

Motivation and Context

I noticed after experimenting with histograms in the Iceberg connector that planning was quite slow for two reasons

  1. We were always hitting the FS for StatisticsFiles
  2. Every time we load the blobs from a StatisticFile, even if using the same file as a previous query, we would allocate a new buffer to represent the statistics. These statistics are not evacuated from memory until the query metadata for an execution expires from query.max-age. For smaller statistics, this is not really a problem. However, histograms can be large (KBs to MBs in size). Having the cache can prevent making copies of redundant data, leading to lower memory pressure and smaller chance of choking up the coordinator.

The cache inside of the Iceberg connector does not fully alleviate the issue. If a user has thousands of tables/columns, memory pressure can still occur when there is large amount of historical query metadata in the heap. The only way to prevent the pressure is to eventually store some query metadata (or maybe just statistics?) in a cache which is freed once the coordinator reaches high memory pressure.

Impact

Test Plan

  • Added an E2E test to verify cache is hit when table statistics are retrieved.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Add a configurable-sized cache for Iceberg table puffin files to improve query planning time controlled by the  `iceberg.max-statistics-file-cache-size` configuration property. :pr:`23177`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch 5 times, most recently from 4e555a2 to 8ff387f Compare July 16, 2024 05:25
@ZacBlanco ZacBlanco changed the title [iceberg] Add StatisticsFileCache [Iceberg] Add StatisticsFileCache Jul 16, 2024
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch 4 times, most recently from d207127 to 716b991 Compare July 19, 2024 19:35
@ZacBlanco ZacBlanco marked this pull request as ready for review July 19, 2024 20:54
@ZacBlanco ZacBlanco requested a review from presto-oss July 19, 2024 20:54
steveburnett
steveburnett previously approved these changes Jul 19, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local docs build, reviewed compiled page in local build. Looks good, thanks!

Copy link
Contributor

@vivek-bharathan vivek-bharathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall

Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Bring up one question for discussing, as the Guava cache is a light weight LRU implementation, it may not adapt well to some scenarios. So do you think it's useful to provide a way to configure not to use statistics file cache in some situations, or have a way to extend and configure to use other enhanced cache implementations, for example the ones based on Caffeine or Redis?

@ZacBlanco ZacBlanco dismissed stale reviews from vivek-bharathan and steveburnett via ed1f3e2 July 22, 2024 18:46
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch from 716b991 to ed1f3e2 Compare July 22, 2024 18:46
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, lgtm.

hantangwangd
hantangwangd previously approved these changes Jul 22, 2024
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch from ed1f3e2 to 8f6d011 Compare July 26, 2024 18:24
aaneja
aaneja previously approved these changes Aug 6, 2024
Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % one question

hantangwangd
hantangwangd previously approved these changes Aug 6, 2024
@ZacBlanco ZacBlanco dismissed stale reviews from hantangwangd and aaneja via d660623 August 6, 2024 19:09
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch from e1c3b8b to d660623 Compare August 6, 2024 19:09
aaneja
aaneja previously approved these changes Aug 8, 2024
hantangwangd
hantangwangd previously approved these changes Aug 8, 2024
@ZacBlanco ZacBlanco requested a review from tdcmeehan August 8, 2024 15:45
@ZacBlanco ZacBlanco dismissed stale reviews from hantangwangd and aaneja via e628262 August 27, 2024 19:04
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch 2 times, most recently from e628262 to 79b5263 Compare August 29, 2024 20:17
@ZacBlanco
Copy link
Contributor Author

@tdcmeehan could you take a look and give a stamp?

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits. Main thing is it would be nice to have a test that breaks if the memory estimates change for some reason.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch from 79b5263 to 55bb3b5 Compare September 6, 2024 21:39
tdcmeehan
tdcmeehan previously approved these changes Sep 6, 2024
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One little thing, otherwise lgtm.

Adds a new connector-wide cache for statistics files.
This prevents additional memory consumption and improves
query planning performance by avoiding hits to the file
system when generating table statistics.
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-statistics-file-cache branch from 55bb3b5 to 41a0d78 Compare September 9, 2024 15:23
@ZacBlanco ZacBlanco merged commit 07bf13a into prestodb:master Sep 9, 2024
@jaystarshot jaystarshot mentioned this pull request Nov 1, 2024
25 tasks
@tdcmeehan tdcmeehan added the from:IBM PR from IBM label Dec 13, 2024
@prestodb-ci prestodb-ci requested review from a team, Dilli-Babu-Godari and ShahimSharafudeen and removed request for a team December 13, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

Comments