Iceberg: use LocationProvider instead of hardcoded path #8573
losipiuk merged 2 commits into trinodb:master
Conversation
We do not use Java serialization. In particular, it can lead to security issues.
Sure, there are multiple ways we can go with this. We could also get the location provider through dependency injection of the table operations, since there is general agreement that HiveTableOperations will be used across the board.
Thought about it more: it would be inefficient to initialize a table operation, read the table metadata, and then get the location provider. The alternative is to pass in the table properties. Please see if the new version works, thanks!
(force-pushed 6b7d35f to 8408314)
The code changes seem reasonable; just one broader thought. A table could specify any custom LocationProvider implementation in the properties. Is that location guaranteed to be compatible with org.apache.hadoop.fs.Path?
Alternatively, we could scope this down to just supporting DefaultLocationProvider and ObjectStoreLocationProvider by introducing a config or table property option and initializing the LocationProvider directly rather than through LocationProviders#locationsFor.
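The scoping idea above can be sketched as follows. This is a hypothetical, self-contained stand-in for the dispatch that Iceberg's LocationProviders#locationsFor performs, not the actual Iceberg or Trino code; the property names are the Iceberg table properties assumed throughout this discussion:

```java
import java.util.Map;

public class LocationProviderChoice {
    // Property names as defined by Iceberg's TableProperties (assumed here)
    static final String LOCATION_PROVIDER_IMPL = "write.location-provider.impl";
    static final String OBJECT_STORAGE_ENABLED = "write.object-storage.enabled";

    enum Provider { DEFAULT, OBJECT_STORE, CUSTOM }

    // Hypothetical dispatch: a custom impl wins, then object storage
    // mode, then the default Hive-style layout.
    static Provider choose(Map<String, String> tableProperties) {
        if (tableProperties.containsKey(LOCATION_PROVIDER_IMPL)) {
            return Provider.CUSTOM; // arbitrary user class; Path compatibility not guaranteed
        }
        if (Boolean.parseBoolean(tableProperties.getOrDefault(OBJECT_STORAGE_ENABLED, "false"))) {
            return Provider.OBJECT_STORE;
        }
        return Provider.DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(choose(Map.of()));                                        // DEFAULT
        System.out.println(choose(Map.of(OBJECT_STORAGE_ENABLED, "true")));          // OBJECT_STORE
        System.out.println(choose(Map.of(LOCATION_PROVIDER_IMPL, "com.example.X"))); // CUSTOM
    }
}
```

Only the first branch admits an arbitrary class, which is where the Path-compatibility question arises.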
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
@alexjo2144 thanks for the feedback.
Sure, I can introduce another static util method instead of using the Iceberg one.
This is related to the multi-catalog discussion; the conclusion there is that:
So yes, it's guaranteed to be compatible.
@alexjo2144 I think this is the best we can do on the Trino side. The two location providers are not public classes, so we can only block table creation if the location provider impl table property is set.
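That guard could look roughly like the following. This is a hypothetical helper, not the actual Trino change; the property name is the Iceberg table property assumed in this thread:

```java
import java.util.Map;

public class LocationProviderGuard {
    // Assumed Iceberg property name for a custom location provider class
    static final String LOCATION_PROVIDER_IMPL = "write.location-provider.impl";

    // Hypothetical guard: reject tables that configure a custom location
    // provider, since Trino cannot verify where such a provider writes.
    static void validateNoCustomLocationProvider(Map<String, String> tableProperties) {
        String impl = tableProperties.get(LOCATION_PROVIDER_IMPL);
        if (impl != null) {
            throw new IllegalArgumentException("Unsupported location provider: " + impl);
        }
    }

    public static void main(String[] args) {
        // Object storage mode alone is fine; only a custom class is blocked
        validateNoCustomLocationProvider(Map.of("write.object-storage.enabled", "true"));
        try {
            validateNoCustomLocationProvider(Map.of(LOCATION_PROVIDER_IMPL, "com.example.MyProvider"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```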
doesn't seem to belong to "pass in table properties to initialize location provider in sink prov…" commit
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
@findepi did you have an opinion on allowing custom LocationProviders vs restricting? @jackye1995 this looks good to me, can you just add a test case to
(force-pushed 9bab127 to 0ec481a)
@findepi thanks for the comments. I am still getting familiar with the standards here, so sorry for the bad commit message; I was used to committing with random messages and squash-merging. I have force-pushed everything into a single commit, please let me know if this works. @alexjo2144 test added, should be good for another look
It's important to do an INSERT from Trino here too. That's what runs the changes you made in the IcebergPageSink
Yeah, actually this is a good point; inserting using Spark does not test anything here, so I changed it to be executed onTrino directly.
If we want to test that Trino can read Spark data written with the object storage location provider, I think it should go in the testPrestoReadingSparkData test. But LocationProvider is a write-side feature; on the read side it just uses the file paths in the manifest, so I think there is little value in adding that test case.
(force-pushed fe90e49 to 34eafea)
alexjo2144 left a comment
Looks good to me. The option to create tables using ObjectStoreLocationProvider from Trino would be nice, but in a separate issue/PR.
@alexjo2144 @findepi any updates on this? Please let me know if there is any change needed, otherwise could you merge the PR? Thanks!
electrum left a comment
If I'm understanding this correctly, the location provider should be specific to the catalog implementation, correct? If so, then we should add this to the upcoming catalog API. Or is this something that the user should configure, for example depending on whether they are using HDFS or S3?
Either way, this shouldn't be passed in as a table property. We can discuss this on Slack if that's easier.
This parameter is for a filename, so an empty filename would seem to be invalid, and thus a location provider might reject it. We could use a default name like data-file instead.
@electrum thanks for the comment, I will reply here for visibility. We discussed the use of table properties in Trino versus Iceberg last time, and what I found is that they are not really the same concept after all. The fundamental difference is that on the Trino side, a table property is only used at table creation time, and it is up to the connector to decide what to do with each property. For Iceberg's case, we have:
On the Iceberg side, table properties are stored directly as part of the table metadata JSON file. With this understanding, we can now look at the location provider configuration, which is related to the following Iceberg table properties:
What this PR is trying to support is the case in which:
In the case described above, Trino should respect those configurations and write to the configured location instead of the hard-coded table location. This has nothing to do with the user configuring any Trino table property. I think what you are concerned about is the case where a user can create a table in object storage mode or not, based on a Trino table property or some other table feature such as the storage provider. I agree that behavior should be controlled. I am trying to keep this PR small; we can introduce a boolean table property like
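To make the object storage mode concrete, here is a self-contained sketch of the idea behind ObjectStoreLocationProvider: inject a hash of the filename into the data file path so files spread across many object-store key prefixes instead of piling up under one "directory". Iceberg's real implementation uses a Murmur3 hash and its own path layout; the CRC32 hash and layout below are simplified stand-ins.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ObjectStorePathSketch {
    // Simplified stand-in for ObjectStoreLocationProvider#newDataLocation:
    // a hash segment between the data root and the filename shards the
    // object-store keys, avoiding request-rate hot spots on S3.
    static String newDataLocation(String dataRoot, String filename) {
        CRC32 crc = new CRC32();
        crc.update(filename.getBytes(StandardCharsets.UTF_8));
        return dataRoot + "/" + Long.toHexString(crc.getValue()) + "/" + filename;
    }

    public static void main(String[] args) {
        // Files of the same table land under different key prefixes
        System.out.println(newDataLocation("s3://bucket/tbl/data", "part-00000.parquet"));
        System.out.println(newDataLocation("s3://bucket/tbl/data", "part-00001.parquet"));
    }
}
```

This also illustrates why the read side is unaffected: whatever the provider produces is recorded verbatim in the manifests, and readers just follow those paths.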
perhaps we can narrow down the scope of the change to
Such a change shouldn't require significant code changes, should it? Actually, isn't the following already implementing the aforementioned behavior?
Yes, this PR is exactly trying to achieve the two goals you listed.
Yes, but to use it in the page sink provider, we need to construct a new
(force-pushed 34eafea to fbebd44)
We should run this test for all supported formats, since the page sink can have format-dependent behavior (currently it does not). Requires #8751
...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java
...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java
is using OPTIONS different from using TBLPROPERTIES?
it's the same. I can change to TBLPROPERTIES for consistency.
inline the sparkTableDefinition variable; it's redundant
(i know it's a copy of existing test code; i am doing this cleanup for existing tests in 566d41a)
in case of failure, this assertion will produce a useless error message
use Assertions.assertThat.
...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java
(force-pushed fbebd44 to 34a878f)
@findepi thanks for the comments.
It looks like we just need to rebase based on which PR gets in first? Apart from that, hopefully I have addressed all your comments except for the table deletion one.
This is actually an interesting topic that I would also like to get fixed, because the current behavior is broken even for Spark Iceberg tables that are not using the object storage location provider. In Iceberg there is another config. The correct behavior is to call Iceberg's
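The drop-table concern above could be sketched like this. This is a hypothetical helper, not the actual fix; gc.enabled is the Iceberg property this discussion refers to, while the path-override property names are assumptions for illustration:

```java
import java.util.Map;
import java.util.Set;

public class DropTablePurgeCheck {
    // Iceberg property controlling whether DROP TABLE may delete data
    static final String GC_ENABLED = "gc.enabled";
    // Properties that may redirect data outside the table location
    // (names assumed for illustration)
    static final Set<String> PATH_OVERRIDES = Set.of(
            "write.object-storage.enabled",
            "write.location-provider.impl",
            "write.folder-storage.path");

    // Hypothetical check: only purge data files on DROP TABLE when garbage
    // collection is enabled and no write-path override could have placed
    // files outside the table location.
    static boolean mayPurgeData(Map<String, String> tableProperties) {
        boolean gcEnabled = Boolean.parseBoolean(tableProperties.getOrDefault(GC_ENABLED, "true"));
        boolean hasOverride = PATH_OVERRIDES.stream().anyMatch(tableProperties::containsKey);
        return gcEnabled && !hasOverride;
    }

    public static void main(String[] args) {
        System.out.println(mayPurgeData(Map.of()));                                       // true
        System.out.println(mayPurgeData(Map.of(GC_ENABLED, "false")));                    // false
        System.out.println(mayPurgeData(Map.of("write.object-storage.enabled", "true"))); // false
    }
}
```

Skipping the purge in the override cases avoids the "dropped table but left files behind" failure mode described above, at the cost of never purging such tables from Trino.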
(force-pushed 34a878f to a1daef7)
@findepi I think this is the best we can do for now for all the Iceberg path overrides. We know they come from external systems like Spark, so we will not drop the table to avoid leaving files not deleted during a drop table operation. I can work on the solution to this after this PR is merged, when we start to enable creating Trino Iceberg table with those overrides.
(force-pushed a1daef7 to c9cb32d)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
(force-pushed c9cb32d to 81c0fd9)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/HiveTableOperations.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergWritableTableHandle.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
(force-pushed 81c0fd9 to 91f38c8)
(force-pushed 91f38c8 to de16428)
Use Iceberg's LocationProvider instead of hard-coding file paths. With the current hard-coded Hive-style file paths for writing data files, users on cloud storage cannot enjoy the benefits of Iceberg's ObjectStoreLocationProvider. This PR fixes this issue.