Add hive iceberg support based on Trino created iceberg table. #10779
Imbruced wants to merge 7 commits into trinodb:master
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.
That's definitely a draft. I need to understand what the approach for integration tests is. IMHO the integration should be tested for Iceberg + Trino + Hive.
findepi
left a comment
(haven't reviewed the test yet)
false))
.add(booleanProperty(
        HIVE_ENABLED,
        "Enable hive support",
I'd call this "hive compatibility enabled".
However, to the best of my knowledge, the only downside to setting this flag is that it blows up completely unless HMS has the Iceberg jars added. Thus I don't see a need to make this a table property. Make it a config toggle: iceberg.hive-compatibility.enabled (boolean).
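Such a toggle would presumably live in the Iceberg catalog properties file; a sketch (the property name here is only the reviewer's proposal, not an existing Trino config):

```properties
# etc/catalog/iceberg.properties (hypothetical)
connector.name=iceberg
iceberg.hive-compatibility.enabled=true
```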
for the record, Iceberg has this on per-table basis
https://github.com/apache/iceberg/blob/7fcc71da65a47ca3c9f6eb6e862a238389b8bdc5/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L526-L547
@pvary @rdblue would you want to comment on the rationale?
is it because HiveIcebergMetaHook is used as a vehicle for propagating the flag to the Catalog?
The preference is for setting this as an environment property -- your Hive install should have the Iceberg Jar or not. But you may want to set it at the table level for some tables because you can add the Iceberg runtime Jar on the client side. I would probably not include this in Trino unless you need to later.
What exactly do you mean by "I would probably not include this in Trino unless you need to later"? Keep the table property as it is, or change it to a config variable? I tried to do the same thing as is done in the Spark Iceberg jars: when creating a table compatible with Hive, the appropriate table property is assigned.
Or do you mean that the change shouldn't be included in Trino at all?
I wouldn't expose this in Trino unless it is necessary. People can set this in the HiveConf used by the catalog, so hopefully you can just ignore these settings and make that the way to control it. Hopefully Hive will have a better story around this soon as well, so we don't need to use fake InputFormat and Serde classes to avoid it breaking. Normally, I wouldn't expect needing to do so much in other engines to avoid Hive's user experience from breaking.
Do you have any idea how to achieve that at the table level? I mean, at the moment the only way is to create the table using Spark or Hive to be able to use it on those 3 engines: Hive, Trino, Spark.
Iceberg has the same logic in the code, I mean by using a table property.
I wouldn't expose this in Trino unless it is necessary. People can set this in the HiveConf used by the catalog,
@rdblue you mean hive-site.xml files (Trino's hive.config.resources config)?
Or do you mean some HMS configuration change?
thanks for confirming we don't need this to be a table property.
Yeah, Iceberg will also check for that property to be configured. I'd leave it in Hive config rather than using session config or table config.
ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builderWithExpectedSize(2);
FileFormat fileFormat = IcebergTableProperties.getFileFormat(tableMetadata.getProperties());
propertiesBuilder.put(DEFAULT_FILE_FORMAT, fileFormat.toString());
propertiesBuilder.put(HIVE_ENABLED, isHiveEnabled.toString());
Why store this information in the table properties?
It seems that what matters are the serde and input/output format classes we use when registering the table in the metastore?
That's right, but what's the other approach you are thinking of? I didn't find a better solution to make it happen at the table level.
That's a similar approach to what is done on the Spark side when the Iceberg jar is added.
Thats right, but whats the other approach you are thinking of ?
Add a boolean entry in some Iceberg config class.
This should be a new config class bound in IcebergHiveMetastoreCatalogModule, because it's going to be HMS-specific. See IcebergConfig & TestIcebergConfig for how to create a config class.
See #10845 for upcoming Glue support, which won't need this.
Then, pass the config value to HiveMetastoreTableOperationsProvider > HiveMetastoreTableOperations > AbstractMetastoreTableOperations and use there.
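Following the IcebergConfig pattern mentioned above, the HMS-specific config class might look roughly like this (a sketch; the class name is hypothetical, and the `@Config` annotation from io.airlift.configuration is shown as a comment to keep the snippet dependency-free — in Trino it would be a real annotation, and the class would be bound via configBinder in IcebergHiveMetastoreCatalogModule):

```java
// Hypothetical HMS-specific config class, sketched after Trino's
// airlift-style config classes (e.g. IcebergConfig).
class IcebergHiveCompatibilityConfig
{
    // Defaults to false: hive compatibility is opt-in, since HMS blows up
    // without the Iceberg jars on its classpath.
    private boolean hiveCompatibilityEnabled;

    public boolean isHiveCompatibilityEnabled()
    {
        return hiveCompatibilityEnabled;
    }

    // @Config("iceberg.hive-compatibility.enabled")
    public IcebergHiveCompatibilityConfig setHiveCompatibilityEnabled(boolean hiveCompatibilityEnabled)
    {
        this.hiveCompatibilityEnabled = hiveCompatibilityEnabled;
        return this;
    }
}
```

The value would then flow through HiveMetastoreTableOperationsProvider down to AbstractMetastoreTableOperations, where the storage format for the registered table is chosen.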
public static final String METADATA_LOCATION = "metadata_location";
public static final String PREVIOUS_METADATA_LOCATION = "previous_metadata_location";
protected static final String METADATA_FOLDER_NAME = "metadata";
protected static final String HIVE_ENABLED_FLAG = "engine.hive.enabled";
we should directly reuse org.apache.iceberg.TableProperties#ENGINE_HIVE_ENABLED.
FileInputFormat.class.getName(),
FileOutputFormat.class.getName());

protected static final StorageFormat HIVE_ICEBERG_STORAGE_FORMAT = StorageFormat.create(
just ICEBERG_STORAGE_FORMAT
and rename the above one: STORAGE_FORMAT -> DUMMY_STORAGE_FORMAT
also, add a code comment explaining why we cannot use ICEBERG_STORAGE_FORMAT unconditionally.
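A sketch of the distinction that comment should capture (plain String constants here, since Trino's StorageFormat is connector-internal; the Iceberg class names are the ones shipped in iceberg-hive-runtime):

```java
// Hypothetical constants illustrating the two storage formats being discussed.
class IcebergStorageFormats
{
    // "Dummy" format: registered when hive compatibility is off, so that plain
    // Hive fails fast on these abstract classes instead of misreading Iceberg
    // data files as if they were ordinary Hive files.
    static final String DUMMY_INPUT_FORMAT = "org.apache.hadoop.mapred.FileInputFormat";
    static final String DUMMY_OUTPUT_FORMAT = "org.apache.hadoop.mapred.FileOutputFormat";

    // Real Iceberg Hive classes. These cannot be registered unconditionally:
    // an HMS/Hive deployment without the iceberg-hive-runtime jar on its
    // classpath would fail to load them, breaking the table for Hive entirely.
    static final String ICEBERG_SERDE = "org.apache.iceberg.mr.hive.HiveIcebergSerDe";
    static final String ICEBERG_INPUT_FORMAT = "org.apache.iceberg.mr.hive.HiveIcebergInputFormat";
    static final String ICEBERG_OUTPUT_FORMAT = "org.apache.iceberg.mr.hive.HiveIcebergOutputFormat";
}
```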
assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT)");
assertThat(query("SELECT * FROM \"test_iceberg_get_table_props$properties\""))
        .matches(format("VALUES (VARCHAR 'write.format.default', VARCHAR '%s')", format.name()));
dropTable("test_iceberg_get_table_props");
not sure why this test is removed?
By mistake; I will bring it back.
hive:
  # Make hive server configuration invalid to make sure the hms_only tests do not accidentally depend on HS2.
  jdbc_url: jdbc:hive2://${databases.hive.host}:12345
  jdbc_url: jdbc:hive2://${databases.hive.host}:10000
This directly contradicts the comment above. Why do you want the change here?
I forgot to change that back; the jars that were added to the Hive runtime are by default pushed to Hive on port 10000. Do you have an idea where I should change that so it applies in the singlenode-spark-iceberg env only, or even at my test class level?
tempto-configuration-for-hms-only shouldn't be used in singlenode-spark-iceberg if we're going to use HS2 there.
(and I don't think it is used there)
echo "Applying hive-site configuration overrides for Spark"

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive-runtime/0.12.1/iceberg-hive-runtime-0.12.1.jar
cp iceberg-hive-runtime-0.12.1.jar /usr/hdp/current/hive-client/auxlib
the echo above documents the apply-site-xml-override below; let them stay together.
Add the new code above the echo statement, with its own documentation-like echo.
(see also #10879)
import static io.trino.tests.product.utils.QueryExecutors.onTrino;
import static java.lang.String.format;

public class TestHiveReadingTheTables
TestIcebergHiveCompatibility
and add javadoc describing what it does, to discern from TestIcebergHiveTablesCompatibility
(see also #10880)
👋 @Imbruced - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

Closing this one out due to inactivity, but please reopen if you would like to pick this back up.
This PR aims to fix the issue with reading a Trino-created Iceberg table registered in HMS. The code changes are small, but I need more sophisticated tests; especially, tests in the product tests section should be created. As far as I know you don't have a Docker file for Iceberg + Spark + Hive, and that integration should be verified. Should I add these changes here: https://github.com/trinodb/docker-images/blob/master/testing/spark3.0-iceberg/Dockerfile ? Or is it better to create a new one and, based on that image, create additional e2e and integration tests?
We need tests like (pseudo code)
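The pseudocode itself is missing from the thread; a hedged sketch of what such a cross-engine product test might look like (names and helpers beyond onTrino/onHive are hypothetical):

```
// TestIcebergHiveCompatibility (hypothetical)
test hiveCanReadTrinoCreatedIcebergTable:
    onTrino: CREATE TABLE iceberg.default.test_compat (x BIGINT)
    onTrino: INSERT INTO iceberg.default.test_compat VALUES (1), (2)
    onHive:  SELECT * FROM test_compat    -- expect rows 1, 2
    onSpark: SELECT * FROM test_compat    -- expect rows 1, 2
    onTrino: DROP TABLE iceberg.default.test_compat
```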
@findepi Thanks for your support.