[Iceberg]Support setting warehouse data directory for Hadoop catalog #24397
steveburnett
left a comment
Thanks for the doc! Looks good, only a few nits of punctuation and capitalization.
steveburnett
left a comment
LGTM! (docs)
Pulled updated branch, new local doc build, looks good. Thanks!
Thanks @steveburnett for your suggestion, fixed!
agrawalreetika
left a comment
Thanks for the change. LGTM
imjalpreet
left a comment
@hantangwangd Thanks for the PR! I took a first pass and had some minor comments.
I just have a few high-level comments. At the core, I understand what we're trying to solve, but I'm not sure yet if this is the right solution. What happens when a table already exists at the warehouse directory but has a data directory that doesn't align with the new datadir property? Should we respect the property or the table in the case of an insert? This could be confusing for users.
And if the table already exists and doesn't have a metadata folder in a "safe" filesystem, should we error, or warn the user? Do we even have a way of knowing that a filesystem is "safe" (supports atomic renames)?
Correct me if I'm wrong, but in theory we could have already supported this use case within the Iceberg connector by using the SetTableProperty procedure and just setting "write.data.path" on the individual tables, right? From my understanding, all this change does is provide a default for new tables and make it a viewable table property. I'm wondering if it might be better to provide this as a schema-level property that users can set, similar to how the Hive connector has a schema-level "location" property. Then we can set defaults for the data path on schema creation, but override it at the table level if we prefer.
However, I don't believe the metadata path could be set that way, since the HadoopCatalog relies on listing the warehouse metadata directories to get the schemas and tables.
Just some thoughts for discussion. I think I just want to refine our approach and make sure there isn't any ambiguous behavior from the user perspective.
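To make the precedence question above concrete, here is a minimal Java sketch of one possible resolution order: a table-level "write.data.path" wins, then a catalog-level data directory, then Iceberg's default of a data folder under the table location. This is not the code in this PR. The class name, the method resolveDataLocation, and the catalogDataDir/schema/table layout are assumptions made purely for illustration; only the property name "write.data.path" comes from the discussion.

```java
import java.util.Map;
import java.util.Optional;

public class DataPathSketch {
    // Iceberg table property that controls where data files are written.
    static final String WRITE_DATA_PATH = "write.data.path";

    /**
     * Hypothetical helper: picks the directory that new data files would go to.
     * Order: table property, then catalog-level data dir, then <tableLocation>/data.
     */
    static String resolveDataLocation(
            Map<String, String> tableProperties,
            Optional<String> catalogDataDir,
            String schemaName,
            String tableName,
            String tableLocation) {
        // 1. An explicit table-level setting always takes priority.
        String tableLevel = tableProperties.get(WRITE_DATA_PATH);
        if (tableLevel != null) {
            return tableLevel;
        }
        // 2. Otherwise fall back to the catalog-wide data directory, if configured.
        //    The <dataDir>/<schema>/<table> layout is an assumption of this sketch.
        if (catalogDataDir.isPresent()) {
            return stripTrailingSlash(catalogDataDir.get()) + "/" + schemaName + "/" + tableName;
        }
        // 3. Otherwise use the data folder under the table location.
        return stripTrailingSlash(tableLocation) + "/data";
    }

    static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }

    public static void main(String[] args) {
        // Example paths are placeholders, not real locations.
        System.out.println(resolveDataLocation(
                Map.of(WRITE_DATA_PATH, "s3://bucket-a/custom"),
                Optional.of("s3://bucket-b/data"),
                "s1", "t1", "hdfs://nn/warehouse/s1/t1"));
    }
}
```

Under this ordering, an existing table whose "write.data.path" differs from the catalog-level setting keeps writing to its own path, which is one way to avoid the ambiguity described above.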
@ZacBlanco Thanks for your feedback, very glad to be able to clarify and refine the approach through discussion. Overall, I think the core point of this change is to separate the data write path and metadata path of Iceberg tables in the Hadoop catalog, and to provide a simpler and more convenient way for users to specify the data write path of tables. I will reply item by item below. Please let me know if you have any comments.
Undoubtedly, when inserting data, the iceberg table's table property
Do you think the above behaviors make sense?
I think we could specify the requirements for the file system where the metadata is located in the documentation. But I don't think the Presto engine is responsible for identifying and distinguishing specific file systems and reminding users. This is the user's own responsibility, because they have a better understanding of their own usage scenarios. For example, a considerable number of users do not need to consider transaction consistency issues caused by cross-engine interoperability.
Firstly, for Hadoop catalog, we cannot arbitrarily set the value of Therefore, if necessary, we can add this schema-level data dir property in the future. I think it can work well with catalog-level data dir. What's your opinion?
New release note guidelines as of last week: PR #24354 automatically adds links to this PR to the release notes. Please remove the manual PR link in the following format from the release note entries for this PR. I have updated the Release Notes Guidelines to remove the examples of manually adding the PR link.
Thanks for your response. I did some more thinking around this and concluded that we don't really have a consistent way of specifying location for each connector. But I think we should strive to do so. Some examples:
My thoughts are that
Hi @ZacBlanco @imjalpreet @agrawalreetika @steveburnett, the comments have been addressed, please take a look when convenient, thanks a lot!
steveburnett
left a comment
LGTM! (docs)
Pulled updated branch, new local doc build, looks good. Thanks!
Description
This PR enables the Presto Iceberg Hadoop catalog to specify an independent warehouse data directory for storing table data files. This way, metadata files can be managed on HDFS while data files are stored on object stores in a production environment.
See issue: #24383
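As a rough illustration of the setup described above, a catalog properties file might look like the following sketch. The two warehouse property names are the ones named in this PR; the HDFS and S3 paths are placeholders, not values taken from the PR.

```properties
# Hypothetical etc/catalog/iceberg.properties; all paths are placeholders.
connector.name=iceberg
iceberg.catalog.type=hadoop

# Warehouse root: table metadata stays here (for example on HDFS).
iceberg.catalog.warehouse=hdfs://namenode:8020/warehouse/iceberg

# Separate root for table data files (for example on an object store).
iceberg.catalog.hadoop.warehouse.datadir=s3://example-bucket/warehouse/iceberg-data
```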
Motivation and Context
Enable Presto Iceberg to leverage the capabilities of object storage.
Impact
The Hadoop catalog can store table data files on object stores.
Test Plan
Set iceberg.catalog.warehouse to a local file path and iceberg.catalog.hadoop.warehouse.datadir to an S3 path, then fully run IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries in the CI tests.
Contributor checklist
Release Notes