
Add hive iceberg support based on Trino created iceberg table.#10779

Closed
Imbruced wants to merge 7 commits into trinodb:master from Imbruced:add-hive-iceberg-trino-spark-integration

Conversation

@Imbruced

This PR aims to fix an issue with reading Trino-created Iceberg tables registered in HMS. The code changes are small, but I need more sophisticated tests. In particular, tests in the product-tests section should be created. As far as I know, you don't have a Dockerfile for Iceberg + Spark + Hive, and that integration should be verified. Should I add these changes here: https://github.com/trinodb/docker-images/blob/master/testing/spark3.0-iceberg/Dockerfile ? Or is it better to create a new one and, based on that image, create additional e2e and integration tests?

We need tests like (pseudocode):

OnTrino(CREATE TABLE table_a ...)
OnTrino(INSERT INTO table_a VALUES ...)
OnHive(SELECT * FROM table_a ...)
assert cnt_trino == cnt_hive
assert trino_elements == hive_elements

@findepi Hoping for your support.

@cla-bot

cla-bot bot commented Jan 25, 2022

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.

@Imbruced
Author

That's definitely a draft; I need to understand the approach for integration tests. IMHO the integration should be tested for Iceberg + Trino + Hive.


Member

@findepi left a comment


(haven't reviewed the test yet)

false))
.add(booleanProperty(
HIVE_ENABLED,
"Enable hive support",
Member

I'd call this "hive compatibility enabled".
However, to the best of my knowledge, the only downside of setting this flag is that it blows up completely unless HMS has the Iceberg jars added. Thus I don't see a need to make this a table property. Make it a config toggle: iceberg.hive-compatibility.enabled (boolean).

For the record, Iceberg has this on a per-table basis:
https://github.com/apache/iceberg/blob/7fcc71da65a47ca3c9f6eb6e862a238389b8bdc5/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L526-L547
@pvary @rdblue would you want to comment on the rationale?
Is it because HiveIcebergMetaHook is used as a vehicle for propagating the flag to the Catalog?

Contributor

The preference is for setting this as an environment property -- your Hive install should have the Iceberg Jar or not. But you may want to set it at the table level for some tables because you can add the Iceberg runtime Jar on the client side. I would probably not include this in Trino unless you need to later.

Author

What exactly do you mean by "I would probably not include this in Trino unless you need to later"? Keep the table property as it is, or change it to a config variable? I tried to do the same thing as the Spark Iceberg jars do: when creating a Hive-compatible table, the appropriate table property is assigned.
Or do you mean the change shouldn't be included in Trino at all?

Contributor

I wouldn't expose this in Trino unless it is necessary. People can set this in the HiveConf used by the catalog, so hopefully you can just ignore these settings and make that the way to control it. Hopefully Hive will have a better story around this soon as well, so we don't need to use fake InputFormat and Serde classes to avoid it breaking. Normally, I wouldn't expect other engines to need to do so much to keep Hive's user experience from breaking.

Author

Do you have any idea how to achieve that at the table level? At the moment the only way is to create the table using Spark or Hive to be able to use it across those three engines: Hive, Trino, Spark.

Author

Iceberg has the same logic in its code; I mean, it uses a table property.

Member

I wouldn't expose this in Trino unless it is necessary. People can set this in the HiveConf used by the catalog,

@rdblue do you mean hive-site.xml files (Trino's hive.config.resources config)?
Or do you mean some HMS configuration change?

Thanks for confirming we don't need this to be a table property.

Contributor

Yeah, Iceberg will also check for that property to be configured. I'd leave it in Hive config rather than using session config or table config.
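For context, the catalog-level switch discussed here is, as far as I know, Iceberg's iceberg.engine.hive.enabled HiveConf property (checked alongside the per-table engine.hive.enabled property); a hive-site.xml fragment along these lines, which Trino can pick up via hive.config.resources, would enable it globally:

```xml
<!-- Sketch of a hive-site.xml override; the property name comes from
     Iceberg's ConfigProperties, assumed here rather than taken from this PR. -->
<property>
  <name>iceberg.engine.hive.enabled</name>
  <value>true</value>
</property>
```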

ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builderWithExpectedSize(2);
FileFormat fileFormat = IcebergTableProperties.getFileFormat(tableMetadata.getProperties());
propertiesBuilder.put(DEFAULT_FILE_FORMAT, fileFormat.toString());
propertiesBuilder.put(HIVE_ENABLED, isHiveEnabled.toString());
Member

Why store this information in the table properties?
It seems that what matters is the serde and the input/output format classes we use when registering the table in the metastore.

Author

That's right, but what's the other approach you are thinking of? I didn't find a better solution to make it happen at the table level.

Author

That's a similar approach to what is done on the Spark side when the Iceberg jar is added.

Member

That's right, but what's the other approach you are thinking of?

Add a boolean entry in some Iceberg config class.
This should be a new config class bound in IcebergHiveMetastoreCatalogModule, because it's going to be HMS-specific. See IcebergConfig & TestIcebergConfig for how to create a config class.
See #10845 for the upcoming Glue support, which won't need this.

Then pass the config value through HiveMetastoreTableOperationsProvider > HiveMetastoreTableOperations > AbstractMetastoreTableOperations and use it there.
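A minimal sketch of what such a config class might look like; the class name is hypothetical, the property name iceberg.hive-compatibility.enabled is the one proposed earlier in this review, and a real Trino config class would carry io.airlift.configuration's @Config annotation (shown here only as a comment) and be bound in IcebergHiveMetastoreCatalogModule:

```java
// Hypothetical HMS-specific config class, following the IcebergConfig pattern
// mentioned in the review; not code from this PR.
public class IcebergHiveMetastoreConfig
{
    private boolean hiveCompatibilityEnabled; // defaults to false

    public boolean isHiveCompatibilityEnabled()
    {
        return hiveCompatibilityEnabled;
    }

    // In the real class this setter would be annotated with
    // @Config("iceberg.hive-compatibility.enabled")
    public IcebergHiveMetastoreConfig setHiveCompatibilityEnabled(boolean hiveCompatibilityEnabled)
    {
        this.hiveCompatibilityEnabled = hiveCompatibilityEnabled;
        return this;
    }
}
```

Keeping the default false preserves today's behavior for deployments whose HMS lacks the Iceberg jars.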

public static final String METADATA_LOCATION = "metadata_location";
public static final String PREVIOUS_METADATA_LOCATION = "previous_metadata_location";
protected static final String METADATA_FOLDER_NAME = "metadata";
protected static final String HIVE_ENABLED_FLAG = "engine.hive.enabled";
Member

We should directly reuse org.apache.iceberg.TableProperties#ENGINE_HIVE_ENABLED.
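For illustration, a self-contained sketch of building the metastore table parameters around that constant; TablePropertiesSketch and buildTableProperties are hypothetical names, a plain Map stands in for Guava's ImmutableMap.Builder from the diff, and the string literal merely documents the constant's value:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TablePropertiesSketch
{
    // Same value as org.apache.iceberg.TableProperties#ENGINE_HIVE_ENABLED
    // (see the HIVE_ENABLED_FLAG constant in the diff above); production code
    // should reference the Iceberg constant instead of repeating the literal.
    static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";

    // Illustrative stand-in for the propertiesBuilder logic in the diff.
    static Map<String, String> buildTableProperties(String defaultFileFormat, boolean hiveEnabled)
    {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("write.format.default", defaultFileFormat);
        properties.put(ENGINE_HIVE_ENABLED, Boolean.toString(hiveEnabled));
        return properties;
    }
}
```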

FileInputFormat.class.getName(),
FileOutputFormat.class.getName());

protected static final StorageFormat HIVE_ICEBERG_STORAGE_FORMAT = StorageFormat.create(
Member

Just ICEBERG_STORAGE_FORMAT,

and rename the above one: STORAGE_FORMAT -> DUMMY_STORAGE_FORMAT.

Also, add a code comment explaining why we cannot use ICEBERG_STORAGE_FORMAT unconditionally.

assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT)");
assertThat(query("SELECT * FROM \"test_iceberg_get_table_props$properties\""))
.matches(format("VALUES (VARCHAR 'write.format.default', VARCHAR '%s')", format.name()));
dropTable("test_iceberg_get_table_props");
Member

not sure why this test is removed?

Author

By mistake; I will bring it back.

hive:
# Make hive server configuration invalid to make sure the hms_only tests do not accidentally depend on HS2.
jdbc_url: jdbc:hive2://${databases.hive.host}:12345
jdbc_url: jdbc:hive2://${databases.hive.host}:10000
Member

This directly contradicts the comment above. Why do you want the change here?

Author

Forgot to change that back to the default value; the jars added to the Hive runtime are by default served by Hive on port 10000. Do you have an idea where I should change that so it applies only in the singlenode-spark-iceberg env, or even at my test class level?

Member

tempto-configuration-for-hms-only shouldn't be used in singlenode-spark-iceberg if we're going to use HS2 there.
(And I don't think it is used there.)

echo "Applying hive-site configuration overrides for Spark"

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive-runtime/0.12.1/iceberg-hive-runtime-0.12.1.jar
cp iceberg-hive-runtime-0.12.1.jar /usr/hdp/current/hive-client/auxlib
Member

The echo above documents the apply-site-xml-override below; let them stay together.
Add the code above the echo statement, with its own documentation-like echo.

(see also #10879)

import static io.trino.tests.product.utils.QueryExecutors.onTrino;
import static java.lang.String.format;

public class TestHiveReadingTheTables
Member

TestIcebergHiveCompatibility

and add a javadoc describing what it does, to distinguish it from TestIcebergHiveTablesCompatibility

(see also #10880)

@mosabua
Member

mosabua commented Nov 3, 2022

👋 @Imbruced - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@colebow
Member

colebow commented Mar 30, 2023

Closing this one out due to inactivity, but please reopen if you would like to pick this back up.

@colebow closed this Mar 30, 2023