core: Provide mechanism to cache manifest file content #4518
Conversation
485e2ab Switch ManifestCache class to use Guava Cache instead. This is easier to do in iceberg-api since we can simply include the Guava cache classes into the shadow jar of iceberg-bundled-guava.
Switching to use more Guava classes is probably not a good idea. @rizaon, do you have use cases where this has helped? If so, what were they? This adds quite a bit of complexity and memory overhead that we purposely avoided up to now, so I want to make sure it is worth the change. Have you considered adding a caching FileIO instance? That could be used for more use cases than just manifest files and job planning. For example, a FileIO read-through cache could cache delete files that might be reused for tasks. This could be configured by file path, making it easy to determine what to cache. Plus, we could use more options than just in-memory, like a distributed cache or local disk.
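For illustration, here is a minimal sketch of the read-through caching FileIO idea described above. Everything in it is hypothetical (the wrapper class, the metadata-path check, the size threshold, and the in-memory InputFile helper are not from this PR); only FileIO, InputFile, OutputFile, and SeekableInputStream are existing Iceberg interfaces.

```java
// Hypothetical sketch of a read-through caching FileIO wrapper; the caching policy here
// (path contains "/metadata/" and size below a threshold) is only an example.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.SeekableInputStream;

public class ReadThroughCachingFileIO implements FileIO {
  private final FileIO delegate;
  private final long maxCachedFileSize;
  private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();

  public ReadThroughCachingFileIO(FileIO delegate, long maxCachedFileSize) {
    this.delegate = delegate;
    this.maxCachedFileSize = maxCachedFileSize;
  }

  @Override
  public InputFile newInputFile(String path) {
    InputFile source = delegate.newInputFile(path);
    // The cache decision is made from the file path and size alone: only small files under
    // the table's metadata/ directory (manifests, manifest lists) are kept in memory.
    if (!path.contains("/metadata/") || source.getLength() > maxCachedFileSize) {
      return source;
    }
    byte[] content = cache.computeIfAbsent(path, p -> readFully(source));
    return new ByteArrayInputFile(path, content);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return delegate.newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    cache.remove(path);
    delegate.deleteFile(path);
  }

  private static byte[] readFully(InputFile file) {
    byte[] buffer = new byte[(int) file.getLength()];
    try (SeekableInputStream in = file.newStream()) {
      int offset = 0;
      while (offset < buffer.length) {
        int read = in.read(buffer, offset, buffer.length - offset);
        if (read < 0) {
          break;
        }
        offset += read;
      }
      return buffer;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Minimal in-memory InputFile over a byte array, defined here only for the sketch. */
  private static class ByteArrayInputFile implements InputFile {
    private final String location;
    private final byte[] content;

    ByteArrayInputFile(String location, byte[] content) {
      this.location = location;
      this.content = content;
    }

    @Override public long getLength() { return content.length; }
    @Override public String location() { return location; }
    @Override public boolean exists() { return true; }

    @Override
    public SeekableInputStream newStream() {
      return new SeekableInputStream() {
        private int pos = 0;
        @Override public long getPos() { return pos; }
        @Override public void seek(long newPos) { this.pos = (int) newPos; }
        @Override public int read() { return pos < content.length ? (content[pos++] & 0xFF) : -1; }
      };
    }
  }
}
```

A planner could wrap a table's FileIO with something like this so that repeated planFiles calls for the same table read manifests from memory instead of S3.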
Hi @rdblue, thank you for your feedback. We found a slow query compilation issue against Iceberg tables in our recent Apache Impala build. Impala uses Iceberg's HiveCatalog and a HadoopFileIO instance with an S3A input stream to access data from S3. We did a full 10 TB TPC-DS benchmark and found that query compilation can take several seconds, while it used to be less than a second with native Hive tables. This slowness in single query compilation is due to the need to call planFiles several times, even for scan nodes targeting the same table. We also see several socket read operations that take hundreds of milliseconds during planFiles, presumably due to S3A HTTP HEAD request overhead and backward seek overhead (issue #4508). This especially hurts fast-running queries.

We tried this caching solution and it made Impala query compilation on Iceberg tables almost 5x faster compared to not caching. Our original solution, however, was to put a Caffeine cache as a singleton in AvroIO.java; I thought it would be better to supply the cache from outside. I have not considered the solution of adding a caching FileIO instance. I'm pretty new to the Iceberg codebase but interested in following up on that if it can yield a better integration. Will it require a new Catalog/Table class as well, or can we improve on the existing HiveCatalog and HadoopFileIO?
A relevant Impala JIRA is here:
@rizaon, caching in the FileIO layer would be much more general. You could do things like detect that the file size is less than some threshold and cache it in memory, or detect file names under
Force-pushed from e019c6a to c032b7a
c032b7a implements caching as a new FileIO class, CachingHadoopFileIO. A new Tables class, CachingHadoopTables, is also added to assist with testing. We tried to avoid
core/src/main/java/org/apache/iceberg/hadoop/HadoopInputFile.java
Hello @rdblue. This PR is ready for review. Please let me know if there is any new feedback or request. I'm happy to follow up.
core/src/main/java/org/apache/iceberg/hadoop/CachingHadoopFileIO.java
core/src/main/java/org/apache/iceberg/hadoop/CachingHadoopTables.java
core/src/main/java/org/apache/iceberg/hadoop/ConfigProperties.java
 */
public class CachingFileIO implements FileIO, HadoopConfigurable {
  private static final Logger LOG = LoggerFactory.getLogger(CachingFileIO.class);
  private static ContentCache sharedCache;
Is there a way to pass in the cache rather than making it static?
@jackye1995, do you have any ideas here?
This remains static in 47b8008. Please let me know if there is a better way to access the cache from the Catalog object or from outside.
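For reference, a minimal sketch of the non-static alternative under discussion: building the cache per FileIO instance from catalog properties passed through initialize(), instead of a static sharedCache field. The subclass, the property keys, and the default values are assumptions for illustration, not this PR's API.

```java
// Hypothetical: configure the manifest cache per FileIO instance from properties.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.hadoop.HadoopFileIO;

public class PropertyConfiguredCachingFileIO extends HadoopFileIO {
  private Cache<String, ByteBuffer> manifestCache; // instance-level, not static

  @Override
  public void initialize(Map<String, String> properties) {
    super.initialize(properties);
    // Property keys and defaults are illustrative only.
    boolean enabled =
        Boolean.parseBoolean(properties.getOrDefault("io.manifest.cache-enabled", "false"));
    long maxTotalBytes =
        Long.parseLong(properties.getOrDefault("io.manifest.cache.max-total-bytes", "104857600"));
    long expirationMs =
        Long.parseLong(properties.getOrDefault("io.manifest.cache.expiration-interval-ms", "60000"));
    if (enabled) {
      this.manifestCache = Caffeine.newBuilder()
          .maximumWeight(maxTotalBytes)
          .weigher((String path, ByteBuffer buf) -> buf.remaining())
          .expireAfterAccess(expirationMs, TimeUnit.MILLISECONDS)
          .build();
    }
  }

  Cache<String, ByteBuffer> manifestCache() {
    return manifestCache;
  }
}
```

With this shape, a catalog that forwards its properties to the FileIO can turn caching on or off per table environment rather than per JVM.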
@rizaon, there are a couple of PRs that should help you with this. #4608 adds
Got it, thank you. Will rebase and update this PR once they are merged.
Force-pushed from 4175090 to ba7329d
In the last GitHub check runs, there was a checkstyle issue and an issue with exception handling. I have rebased this PR and added ba7329d to fix these issues.
@rizaon Thanks for your contribution here and keeping this updated! I'm excited to see how this works out.
@danielcweeks thank you for accepting this PR!
Description: Currently, HadoopTables.load() doesn't support passing custom properties when loading tables. While HiveCatalog and HadoopCatalog support manifest caching through their initialize() method (as implemented in apache#4518), HadoopTables lacks this capability. This enhancement adds property support to HadoopTables.load() to enable manifest caching and other configurations.

Problem:
- HadoopTables lacks the ability to configure manifest caching during table loading
- Unlike HiveCatalog and HadoopCatalog, which can enable manifest caching through initialize(), HadoopTables has no mechanism to pass these settings
- This creates inconsistency in how manifest caching can be configured across different catalog implementations
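Below is a hypothetical sketch of the shape such an enhancement could take. HadoopTables has no property-accepting load() today, so this helper simply threads properties through to the loaded table's FileIO; it illustrates the proposed API shape rather than a working manifest-cache configuration.

```java
// Hypothetical helper; the class and method names are illustrative only.
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public final class HadoopTablesUtil {
  private HadoopTablesUtil() {
  }

  /** Loads a path-based table and applies the given properties to its FileIO. */
  public static Table loadWithProperties(
      Configuration conf, String location, Map<String, String> properties) {
    Table table = new HadoopTables(conf).load(location);
    table.io().initialize(properties); // hand the settings, e.g. cache options, to the table's FileIO
    return table;
  }
}
```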
Add comprehensive configuration support for the ObjectCache:
- Add Iceberg Java-compatible property names:
  - iceberg.io.manifest.cache-enabled
  - iceberg.io.manifest.cache.max-total-bytes
  - iceberg.io.manifest.cache.expiration-interval-ms
- Add ObjectCacheConfig builder with from_properties() method
- Add CacheStats for monitoring cache utilization
- Integrate with TableBuilder using property precedence: builder overrides > table metadata properties > defaults
- Add Table::cache_stats() method for runtime monitoring
- Add comprehensive tests for configuration and precedence
- Add module documentation with usage examples

Inspired by apache/iceberg#4518 (Java implementation).
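As a generic illustration of the precedence rule listed above (builder overrides > table metadata properties > defaults), a small resolver might look like the sketch below; the class and method names are not from either codebase, and only the property keys come from the list above.

```java
// Hypothetical illustration of property precedence: overrides > table properties > defaults.
import java.util.HashMap;
import java.util.Map;

public final class CacheConfigResolver {
  static final String CACHE_ENABLED = "iceberg.io.manifest.cache-enabled";
  static final String CACHE_MAX_TOTAL_BYTES = "iceberg.io.manifest.cache.max-total-bytes";
  static final String CACHE_EXPIRATION_MS = "iceberg.io.manifest.cache.expiration-interval-ms";

  private CacheConfigResolver() {
  }

  static Map<String, String> resolve(
      Map<String, String> defaults,
      Map<String, String> tableProperties,
      Map<String, String> overrides) {
    Map<String, String> resolved = new HashMap<>(defaults);
    resolved.putAll(tableProperties); // table metadata properties override defaults
    resolved.putAll(overrides);       // explicit builder overrides win over everything
    return resolved;
  }
}
```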
This is a draft PR for ManifestCache implementation.
The ManifestCache interface closely follows the com.github.benmanes.caffeine.cache.Cache interface, with the addition of specifying maxContentLength(). If the stream length is longer than maxContentLength(), AvroIO will skip caching it to avoid memory pressure. An example implementation, CaffeineManifestCache, can be seen in TestAvroFileSplit.java. I will tidy this up and add documentation as we go. Looking forward to any feedback.
Closes #4508
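For readers following along, here is a minimal sketch of the ManifestCache idea the draft describes, assuming Caffeine as the backing cache; the method names approximate the description above rather than the final merged API.

```java
// Approximation of the draft's ManifestCache: a Caffeine-style cache plus a maxContentLength()
// limit so oversized streams are never cached. All names here are illustrative.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.nio.ByteBuffer;
import java.util.function.Function;

interface ManifestCache {
  long maxContentLength();

  ByteBuffer get(String path, Function<String, ByteBuffer> loader);

  void invalidate(String path);
}

/** Example implementation backed by Caffeine, weighted by the number of cached bytes. */
class CaffeineManifestCache implements ManifestCache {
  private final Cache<String, ByteBuffer> cache;
  private final long maxContentLength;

  CaffeineManifestCache(long maxTotalBytes, long maxContentLength) {
    this.maxContentLength = maxContentLength;
    this.cache = Caffeine.newBuilder()
        .maximumWeight(maxTotalBytes)
        .weigher((String path, ByteBuffer buf) -> buf.remaining())
        .build();
  }

  @Override
  public long maxContentLength() {
    return maxContentLength;
  }

  @Override
  public ByteBuffer get(String path, Function<String, ByteBuffer> loader) {
    return cache.get(path, loader);
  }

  @Override
  public void invalidate(String path) {
    cache.invalidate(path);
  }
}
```

AvroIO-style callers would compare a stream's length against maxContentLength() and only consult the cache for streams below that limit, which is the skip behavior described above.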