[Iceberg] Add StatisticsFileCache #23177
steveburnett
left a comment
LGTM! (docs)
Pull branch, local docs build, reviewed compiled page in local build. Looks good, thanks!
hantangwangd
left a comment
Overall looks good to me. One question for discussion: since the Guava cache is a lightweight LRU implementation, it may not adapt well to some scenarios. Do you think it would be useful to provide a way to disable the statistics file cache in some situations, or a way to extend and configure it to use other enhanced cache implementations, for example ones based on Caffeine or Redis?
hantangwangd
left a comment
Thanks for the fix, lgtm.
@tdcmeehan could you take a look and give a stamp?
tdcmeehan
left a comment
Just some nits. Main thing is it would be nice to have a test that breaks if the memory estimates change for some reason.
hantangwangd
left a comment
One little thing, otherwise lgtm.
Adds a new connector-wide cache for statistics files. This prevents additional memory consumption and improves query planning performance by avoiding hits to the file system when generating table statistics.
Description
This change adds a connector-wide cache for StatisticsFiles. This prevents loading the same StatisticsFile into the connector multiple times. It improves query planning times and reduces memory usage by keeping only one set of statistics in memory for each file.
The cache is connector-wide to reduce memory consumption. Since separate queries can reference the same tables, sharing the cache across queries avoids holding duplicate copies of the same statistics. The files are immutable and unique to a table and snapshot (hashed and compared on properties such as file path, footer size, and blob metadata), so sharing entries between queries, users, and sessions shouldn't be an issue. Cache entries should never become stale.
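The sharing behavior described above can be sketched in a few lines. This is a minimal illustration, not the PR's code: it swaps the Guava cache for a plain `ConcurrentHashMap` with no eviction, and `FileKey`, `SharedCache`, and the three key fields are hypothetical simplifications of the real key (which the description says also covers blob metadata).

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class SharedStatisticsCacheSketch {
    /** Hypothetical, simplified key identifying one immutable statistics file. */
    static final class FileKey {
        private final String path;
        private final long snapshotId;
        private final long footerSize;

        FileKey(String path, long snapshotId, long footerSize) {
            this.path = Objects.requireNonNull(path, "path is null");
            this.snapshotId = snapshotId;
            this.footerSize = footerSize;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof FileKey)) {
                return false;
            }
            FileKey other = (FileKey) o;
            return snapshotId == other.snapshotId
                    && footerSize == other.footerSize
                    && path.equals(other.path);
        }

        @Override
        public int hashCode() {
            return Objects.hash(path, snapshotId, footerSize);
        }
    }

    /** One instance is shared by every query; there is no invalidation because files are immutable. */
    static final class SharedCache<V> {
        private final Map<FileKey, V> entries = new ConcurrentHashMap<>();

        /** Returns the cached value, invoking the loader only on the first request for a key. */
        V get(FileKey key, Function<FileKey, V> loader) {
            return entries.computeIfAbsent(key, loader);
        }

        int size() {
            return entries.size();
        }
    }

    public static void main(String[] args) {
        SharedCache<String> cache = new SharedCache<>();
        AtomicInteger loads = new AtomicInteger();
        Function<FileKey, String> loader = key -> {
            loads.incrementAndGet(); // stands in for reading the file from storage
            return "statistics for snapshot " + key.snapshotId;
        };

        // Two "queries" ask for the same file through equal but distinct key objects.
        cache.get(new FileKey("/warehouse/t/stats-1.puffin", 1L, 128L), loader);
        cache.get(new FileKey("/warehouse/t/stats-1.puffin", 1L, 128L), loader);

        System.out.println("loads = " + loads.get()); // prints "loads = 1"
    }
}
```

Because the key is built only from properties that never change for a given file, two queries that construct equal keys independently share a single loaded value, which is the property the description relies on.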
Motivation and Context
I noticed after experimenting with histograms in the Iceberg connector that planning was quite slow for two reasons:
StatisticsFiles ... query.max-age. For smaller statistics, this is not really a problem. However, histograms can be large (KBs to MBs in size). The cache prevents redundant copies of the same data, which lowers memory pressure and reduces the chance of choking the coordinator.
The cache inside the Iceberg connector does not fully alleviate the issue. If a user has thousands of tables or columns, memory pressure can still occur when a large amount of historical query metadata is on the heap. The only way to prevent that pressure is to eventually store some query metadata (or maybe just statistics?) in a cache that is freed once the coordinator reaches high memory pressure.
Impact
Changes the ConnectorHistogram SPI interface to add a getMemoryUtilization method. This will be useful in the future when implementing real histograms in [Iceberg] Add Histogram Statistic Support #22365.
Test Plan
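A reviewer asked for a test that breaks if the memory estimates change. A minimal sketch of such a check follows, together with a cache bounded by total estimated bytes instead of entry count. Everything here is an assumption made for illustration: `SizedHistogram` is a hypothetical stand-in for the SPI interface (only the method name is taken from the PR text, the signature is guessed), `ToyHistogram` and its 16-byte overhead are invented, and the LRU is a plain `LinkedHashMap` rather than the Guava cache the PR uses.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class WeightBoundedCacheSketch {
    /** Hypothetical stand-in for a histogram that reports its own estimated footprint in bytes. */
    interface SizedHistogram {
        long getMemoryUtilization();
    }

    /** Toy histogram: the estimate is an assumed fixed overhead plus 8 bytes per bucket boundary. */
    static final class ToyHistogram implements SizedHistogram {
        static final long FIXED_OVERHEAD_BYTES = 16; // invented constant, for illustration only
        private final double[] boundaries;

        ToyHistogram(int buckets) {
            this.boundaries = new double[buckets];
        }

        @Override
        public long getMemoryUtilization() {
            return FIXED_OVERHEAD_BYTES + (long) Double.BYTES * boundaries.length;
        }
    }

    /** LRU cache that evicts least-recently-used entries once the summed estimates exceed maxBytes. */
    static final class WeightBoundedCache<K> {
        private final long maxBytes;
        private long currentBytes;
        // accessOrder = true keeps iteration order from least to most recently used
        private final LinkedHashMap<K, SizedHistogram> entries = new LinkedHashMap<>(16, 0.75f, true);

        WeightBoundedCache(long maxBytes) {
            this.maxBytes = maxBytes;
        }

        synchronized void put(K key, SizedHistogram value) {
            SizedHistogram previous = entries.put(key, value);
            if (previous != null) {
                currentBytes -= previous.getMemoryUtilization();
            }
            currentBytes += value.getMemoryUtilization();

            // Evict from the least recently used end until the budget is respected.
            Iterator<Map.Entry<K, SizedHistogram>> it = entries.entrySet().iterator();
            while (currentBytes > maxBytes && it.hasNext()) {
                Map.Entry<K, SizedHistogram> eldest = it.next();
                currentBytes -= eldest.getValue().getMemoryUtilization();
                it.remove();
            }
        }

        synchronized SizedHistogram get(K key) {
            return entries.get(key);
        }

        synchronized long currentBytes() {
            return currentBytes;
        }
    }

    public static void main(String[] args) {
        // Pin the estimate: this check fails if the formula in ToyHistogram changes.
        long estimate = new ToyHistogram(10).getMemoryUtilization();
        if (estimate != 96) {
            throw new AssertionError("memory estimate changed: " + estimate);
        }

        WeightBoundedCache<String> cache = new WeightBoundedCache<>(200);
        cache.put("a", new ToyHistogram(10));
        cache.put("b", new ToyHistogram(10));
        cache.get("a");                       // "a" becomes most recently used
        cache.put("c", new ToyHistogram(10)); // 288 bytes > 200, so "b" is evicted
        System.out.println("bytes held = " + cache.currentBytes()); // prints "bytes held = 192"
    }
}
```

Pinning the estimate to a literal value is deliberate: any change to the accounting formula then has to be acknowledged by updating the test, which also prompts a second look at the configured cache budget.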
Contributor checklist
Release Notes