Manifest file caching support for Iceberg Native Catalogs #21399
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -269,6 +269,40 @@ and a file system location of ``s3://test_bucket/test_schema/test_table``: | |
| location = 's3://test_bucket/test_schema/test_table') | ||
| ) | ||
|
|
||
| Caching Support | ||
| ---------------- | ||
|
|
||
| Manifest File Caching | ||
| ^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| As of Iceberg version 1.1.0, Apache Iceberg provides a mechanism to cache the contents of Iceberg manifest files in memory. This feature helps | ||
| to reduce repeated reads of small Iceberg manifest files from remote storage. | ||
|
|
||
| .. note:: | ||
|
|
||
| Currently, manifest file caching is supported for Hadoop and Nessie catalogs in the Presto Iceberg connector. | ||
|
|
||
| The following configuration properties are available: | ||
|
|
||
| ==================================================== ============================================================= ============ | ||
| Property Name Description Default | ||
| ==================================================== ============================================================= ============ | ||
| ``iceberg.io.manifest.cache-enabled`` Enable or disable the manifest caching feature. This feature ``false`` | ||
| is only available if ``iceberg.catalog.type`` is ``hadoop`` | ||
| or ``nessie``. | ||
|
|
||
| ``iceberg.io-impl`` Custom FileIO implementation to use in a catalog. It must ``org.apache.iceberg.hadoop.HadoopFileIO`` | ||
|
Contributor
Can you elaborate on why this needs to be set for manifest caching?
Member
Author
@nastra As I checked in Iceberg core: when we query the Hadoop catalog from Presto, HadoopCatalog.initialize() does not set the catalog configuration via HadoopFileIO.initialize(). But if the fileIOImpl config is set explicitly, it creates the HadoopFileIO object via CatalogUtil.loadFileIO(), which is also responsible for initializing the catalog configuration (including the properties passed from Presto).
Contributor
Hm, I see. I was not aware of that limitation in the
Member
Author
Exactly, NessieCatalog takes care of initializing it. So it's not an issue there. Thanks for confirming. Sure, I will raise a PR for this.
Contributor
@agrawalreetika With apache/iceberg#9283 merged, do we still need to set
Member
Author
@yingsu00 No. After apache/iceberg#9283, and once we upgrade Iceberg in Presto, we don't have to explicitly set
Contributor
Is this waiting on a response from @ChunxuTang on this question? #21399 (comment)
Contributor
I believe not. @agrawalreetika Can you please confirm?
Member
Author
No, this is good to go for native catalog support. |
||
| be set to enable manifest caching. | ||
|
|
||
| ``iceberg.io.manifest.cache.max-total-bytes`` Maximum size of the cache in bytes. ``104857600`` | ||
|
|
||
| ``iceberg.io.manifest.cache.expiration-interval-ms`` Maximum time duration in milliseconds for which an entry ``60000`` | ||
| stays in the manifest cache. | ||
|
|
||
| ``iceberg.io.manifest.cache.max-content-length`` Maximum length of a manifest file to be considered for ``8388608`` | ||
| caching in bytes. Manifest files with a length exceeding | ||
| this size will not be cached. | ||
| ==================================================== ============================================================= ============ | ||
|
|
||
| Extra Hidden Metadata Tables | ||
| ---------------------------- | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -17,7 +17,6 @@ | |||||||
| import com.facebook.presto.iceberg.nessie.NessieConfig; | ||||||||
| import com.facebook.presto.spi.ConnectorSession; | ||||||||
| import com.facebook.presto.spi.PrestoException; | ||||||||
| import com.facebook.presto.spi.security.ConnectorIdentity; | ||||||||
| import com.google.common.cache.Cache; | ||||||||
| import com.google.common.cache.CacheBuilder; | ||||||||
| import com.google.common.util.concurrent.UncheckedExecutionException; | ||||||||
|
|
@@ -37,12 +36,14 @@ | |||||||
| import static com.facebook.presto.iceberg.CatalogType.NESSIE; | ||||||||
| import static com.facebook.presto.iceberg.IcebergSessionProperties.getNessieReferenceHash; | ||||||||
| import static com.facebook.presto.iceberg.IcebergSessionProperties.getNessieReferenceName; | ||||||||
| import static com.facebook.presto.iceberg.IcebergUtil.loadCachingProperties; | ||||||||
| import static com.facebook.presto.iceberg.nessie.AuthenticationType.BASIC; | ||||||||
| import static com.facebook.presto.iceberg.nessie.AuthenticationType.BEARER; | ||||||||
| import static com.facebook.presto.spi.StandardErrorCode.NOT_SUPPORTED; | ||||||||
| import static com.google.common.base.Throwables.throwIfInstanceOf; | ||||||||
| import static com.google.common.base.Throwables.throwIfUnchecked; | ||||||||
| import static java.util.Objects.requireNonNull; | ||||||||
| import static org.apache.iceberg.CatalogProperties.FILE_IO_IMPL; | ||||||||
| import static org.apache.iceberg.CatalogProperties.WAREHOUSE_LOCATION; | ||||||||
|
|
||||||||
| /** | ||||||||
|
|
@@ -59,11 +60,13 @@ public class IcebergResourceFactory | |||||||
| private final NessieConfig nessieConfig; | ||||||||
| private final S3ConfigurationUpdater s3ConfigurationUpdater; | ||||||||
|
|
||||||||
| private final IcebergConfig icebergConfig; | ||||||||
|
|
||||||||
| @Inject | ||||||||
| public IcebergResourceFactory(IcebergConfig config, IcebergCatalogName catalogName, NessieConfig nessieConfig, S3ConfigurationUpdater s3ConfigurationUpdater) | ||||||||
| { | ||||||||
| this.catalogName = requireNonNull(catalogName, "catalogName is null").getCatalogName(); | ||||||||
| requireNonNull(config, "config is null"); | ||||||||
| this.icebergConfig = requireNonNull(config, "config is null"); | ||||||||
| this.catalogType = config.getCatalogType(); | ||||||||
| this.catalogWarehouse = config.getCatalogWarehouse(); | ||||||||
| this.hadoopConfigResources = config.getHadoopConfigResources(); | ||||||||
|
|
@@ -99,25 +102,7 @@ public SupportsNamespaces getNamespaces(ConnectorSession session) | |||||||
| private String getCatalogCacheKey(ConnectorSession session) | ||||||||
| { | ||||||||
| StringBuilder sb = new StringBuilder(); | ||||||||
| ConnectorIdentity identity = session.getIdentity(); | ||||||||
| sb.append("User:"); | ||||||||
| sb.append(identity.getUser()); | ||||||||
| if (identity.getPrincipal().isPresent()) { | ||||||||
| sb.append(",Principle:"); | ||||||||
| sb.append(identity.getPrincipal()); | ||||||||
| } | ||||||||
| if (identity.getRole().isPresent()) { | ||||||||
| sb.append(",Role:"); | ||||||||
| sb.append(identity.getRole()); | ||||||||
| } | ||||||||
| if (identity.getExtraCredentials() != null) { | ||||||||
| identity.getExtraCredentials().forEach((key, value) -> { | ||||||||
| sb.append(","); | ||||||||
| sb.append(key); | ||||||||
| sb.append(":"); | ||||||||
| sb.append(value); | ||||||||
| }); | ||||||||
| } | ||||||||
| sb.append(catalogName); | ||||||||
|
Contributor
This comment is for line 109
Member
Author
So the user can still change it, although per the Nessie docs it's an optional field. If we take it out of the cache key, then it will load the cache value with the default
Member
Is there a case that the nessie reference name itself is not unique and we have to use the reference hash to differentiate them? Or say, is it feasible to expect that reference names are unique in practical use cases?
Member
Author
Not really sure if we really have to expose
Contributor
@ChunxuTang The ref hash is the hash of the reference name. If the name is not unique, the hash won't be either. Also, the hash is optional according to https://projectnessie.org/tools/client_config/. I actually don't understand why the "nessie_reference_hash" Iceberg session property was added in the first place. It does not have a corresponding config property in NessieConfig, and it is not settable and defaults to null. No hashing is applied to the reference name to set the "nessie_reference_hash" property, nor is there any verification that it's a valid hash value. It seems it would always be null. @nastra @ChunxuTang was there any reason it was done this way? I think we should remove it from IcebergSessionProperties.
Contributor
Do we absolutely need to remove the ref hash as part of this PR? To me it seems that introducing manifest file caching and the Nessie changes are intertwined.
Contributor
Hi, But as pointed out above, if it is not exposed to the user or fully integrated, we can remove it for the time being and add it back later when we want to support it fully. But as @nastra suggested, it is better to keep the current PR scope as small as possible.
Contributor
@ajantha-bhat thanks for your reply. But according to https://projectnessie.org/tools/client_config/ this ref hash is the hash of the ref name.
What is the ref name used for? If ref name is already part of the cache key, do we still need the ref hash?
Contributor
Nessie supports catalog-level branching and tagging. The ref name is a branch or tag name. The catalog-level hash configuration is mainly used for time travel queries. Once we support that, we will need this config exposed. |
||||||||
|
|
||||||||
| if (catalogType == NESSIE) { | ||||||||
| sb.append(getNessieReferenceName(session)); | ||||||||
|
|
@@ -151,6 +136,12 @@ private Configuration getHadoopConfiguration() | |||||||
| public Map<String, String> getCatalogProperties(ConnectorSession session) | ||||||||
| { | ||||||||
| Map<String, String> properties = new HashMap<>(); | ||||||||
| if (icebergConfig.getManifestCachingEnabled()) { | ||||||||
| loadCachingProperties(properties, icebergConfig); | ||||||||
| } | ||||||||
| if (icebergConfig.getFileIOImpl() != null) { | ||||||||
| properties.put(FILE_IO_IMPL, icebergConfig.getFileIOImpl()); | ||||||||
| } | ||||||||
| if (catalogWarehouse != null) { | ||||||||
| properties.put(WAREHOUSE_LOCATION, catalogWarehouse); | ||||||||
| } | ||||||||
|
|
||||||||
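For illustration only (not part of the diff): a minimal, self-contained sketch of how `getCatalogProperties` might assemble the Iceberg catalog properties once caching is enabled. The body of `IcebergUtil.loadCachingProperties` is not shown in this PR excerpt, so everything here is an assumption: `CachingConfig` is a hypothetical stand-in for the relevant `IcebergConfig` getters, and the string keys are assumed to be the Iceberg `CatalogProperties` names, which mirror the Presto property names without the `iceberg.` prefix.

```java
import java.util.HashMap;
import java.util.Map;

public class ManifestCachingSketch
{
    // Hypothetical stand-in for the caching-related getters on IcebergConfig.
    public static final class CachingConfig
    {
        public final boolean enabled;
        public final long maxTotalBytes;
        public final long expirationIntervalMs;
        public final long maxContentLength;
        public final String fileIOImpl; // may be null when not configured

        public CachingConfig(boolean enabled, long maxTotalBytes, long expirationIntervalMs,
                long maxContentLength, String fileIOImpl)
        {
            this.enabled = enabled;
            this.maxTotalBytes = maxTotalBytes;
            this.expirationIntervalMs = expirationIntervalMs;
            this.maxContentLength = maxContentLength;
            this.fileIOImpl = fileIOImpl;
        }
    }

    // Builds the property map handed to the Iceberg catalog.
    // Key names are assumptions, not taken from the PR.
    public static Map<String, String> catalogProperties(CachingConfig config, String warehouse)
    {
        Map<String, String> properties = new HashMap<>();
        if (config.enabled) {
            // Cache settings are only forwarded when caching is switched on.
            properties.put("io.manifest.cache-enabled", "true");
            properties.put("io.manifest.cache.max-total-bytes", Long.toString(config.maxTotalBytes));
            properties.put("io.manifest.cache.expiration-interval-ms", Long.toString(config.expirationIntervalMs));
            properties.put("io.manifest.cache.max-content-length", Long.toString(config.maxContentLength));
        }
        if (config.fileIOImpl != null) {
            properties.put("io-impl", config.fileIOImpl);
        }
        if (warehouse != null) {
            properties.put("warehouse", warehouse);
        }
        return properties;
    }

    public static void main(String[] args)
    {
        CachingConfig config = new CachingConfig(
                true, 104857600L, 60000L, 8388608L, "org.apache.iceberg.hadoop.HadoopFileIO");
        catalogProperties(config, "s3://test_bucket/warehouse")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Forwarding the cache settings only when the feature is enabled keeps the property map unchanged for catalogs that do not use caching, which matches the guard shown in the diff.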
Just a quick question (probably beyond the scope of this PR): besides native catalogs (e.g. Hadoop and Nessie), could other catalogs support manifest file caching?
@ChunxuTang As of now, not from Presto, since manifest file caching is supported with HadoopFileIO in Iceberg, whereas for the Hive catalog Presto uses a custom HdfsFileIO. Do we know why Presto uses this custom HdfsFileIO for the Iceberg Hive catalog?