
Add register_table procedure support for iceberg table#14375

Merged
findepi merged 2 commits into trinodb:master from krvikash:iceberg-support-register-table-procedure
Nov 9, 2022

Conversation

@krvikash
Contributor

@krvikash krvikash commented Sep 29, 2022

Description

Fixes #13552

Adds a register_table procedure to the Iceberg connector to register a table using existing metadata.

  • The Iceberg table is created using the provided table location
  • Look for the latest metadata file (ending with .metadata.json) inside the metadata folder and use it to create the table
  • If the provided location is invalid or does not exist, an exception is thrown
  • If no metadata file exists inside the provided table location, an exception is thrown
  • If more than one metadata file exists with the latest sequence number, an exception is thrown
  • schema_name, table_name, and table_location must be non-null and valid; otherwise, an exception is thrown
  • The user can optionally provide a valid metadata_file_location (see the usage below)

Valid usages:

  • register_table(schema_name => ..., table_name => ..., table_location => ...)
  • register_table(schema_name => ..., table_name => ..., table_location => ..., metadata_file_location => ...)

Sample Queries:
CALL iceberg.system.register_table('default', 'src_22', 'hdfs://hadoop-master:9000/user/hive/warehouse/orders_5-581fad8517934af6be1857a903559d44');

CALL iceberg.system.register_table('default', 'src_22', 'hdfs://hadoop-master:9000/user/hive/warehouse/orders_5-581fad8517934af6be1857a903559d44', null);

CALL iceberg.system.register_table('default', 'src_22', 'hdfs://hadoop-master:9000/user/hive/warehouse/orders_5-581fad8517934af6be1857a903559d44', '00003-409702ba-4735-4645-8f14-09537cc0b2c8.metadata.json');

Test cases are added for success and failure scenarios in the following classes.

  • Flat File -> TestIcebergRegisterTableProcedure
  • AWS Glue, Minio, GCP -> BaseIcebergConnectorSmokeTest
  • HDFS (Spark) -> TestIcebergSparkCompatibility

Configuration:
By default, the register_table procedure is disabled. Enable it by setting iceberg.register-table-procedure.enabled to true in the catalog configuration.
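For reference, a minimal sketch of the catalog configuration. Only the iceberg.register-table-procedure.enabled property comes from this PR; the file name and the other properties are illustrative assumptions.

```properties
# Assumed catalog file: etc/catalog/iceberg.properties (name is an assumption)
connector.name=iceberg
# Metastore URI below is a placeholder for illustration
hive.metastore.uri=thrift://example:9083
# Property introduced by this PR; the procedure is disabled by default
iceberg.register-table-procedure.enabled=true
```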

Non-technical explanation

NA

Release notes

( ) This is not user-visible or docs only and no release notes are required.
(X) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Sep 29, 2022
@krvikash krvikash changed the title Iceberg support register table procedure Iceberg support registertable procedure Sep 29, 2022
@krvikash krvikash changed the title Iceberg support registertable procedure Add register_table procedure support for iceberg table Sep 29, 2022
@krvikash krvikash force-pushed the iceberg-support-register-table-procedure branch from 6e09f17 to e464f17 Compare September 29, 2022 14:34
@krvikash krvikash self-assigned this Sep 29, 2022
@alexjo2144
Member

The approach generally looks good to me. Thanks for putting this together

@krvikash krvikash marked this pull request as ready for review September 30, 2022 08:50
@krvikash krvikash force-pushed the iceberg-support-register-table-procedure branch 6 times, most recently from 57a7092 to 9ddb33d Compare October 1, 2022 15:23
Contributor

Should we maybe integrate the tests from this class into BaseIcebergConnectorTest ?

@krvikash krvikash force-pushed the iceberg-support-register-table-procedure branch 2 times, most recently from 0308ec8 to 034538f Compare October 4, 2022 08:09
Contributor

    @Inject
    private HdfsClient hdfsClient;

and

private void hdfsDeleteAll(String directory)
{
    if (!hdfsClient.exist(directory)) {
        return;
    }
    for (String file : hdfsClient.listDirectory(directory)) {
        hdfsClient.delete(directory + "/" + file);
    }
}

Contributor

A bit of context behind hdfsClient

hdfs:
  username: hive
webhdfs:
  uri: http://${databases.hive.host}:50070

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html

@findepi
Member

findepi commented Nov 3, 2022

@anusudarsan @dain can you please review this with respect to locations & security?

Comment on lines 380 to 382
Member

  • Do we have a test for this? (against Glue catalog)
  • If we take out these lines, would it still pass?
    • if we do one call to Glue with proper err handling, we won't have any concurrency issues

Contributor Author

Yes, we have a test against the Glue catalog as well.

Member

In #14375 (comment) I meant to use the existing IcebergFileFormat class (io.trino.plugin.iceberg.IcebergFileFormat). Why do you want to have a copy?

Contributor Author

The enum constant returns its value in upper case (ORC, AVRO, PARQUET), which was causing failures in some test cases where we access the table comment and column comment. My intention was to make the format name lowercase while constructing the tableName.

Now, I have removed the local copy of IcebergFileFormat and used io.trino.plugin.iceberg.IcebergFileFormat and then converted to lowercase wherever required.
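As an illustrative sketch of the approach described above (the class and enum here are stand-ins, not the PR's actual code), lower-casing the enum name when building a table name looks like this:

```java
import java.util.Locale;

public class FormatName
{
    // Stand-in mirroring the constants of io.trino.plugin.iceberg.IcebergFileFormat
    enum IcebergFileFormat { ORC, PARQUET, AVRO }

    // Enum constants render in upper case (ORC, AVRO, PARQUET); lower-case the
    // name when constructing a table name so it stays consistent on
    // case-sensitive filesystems such as the CI runners
    static String tableNameFor(IcebergFileFormat format)
    {
        return "test_register_table_" + format.name().toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args)
    {
        System.out.println(tableNameFor(IcebergFileFormat.ORC));
    }
}
```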

Contributor Author

I could not understand this: locally, tableName seems to be case-insensitive, whereas in CI it is case-sensitive. Test cases are failing in the CI build.

Member

On Mac, the local filesystem is case-insensitive (by default).
On the CI it's case sensitive.

Comment on lines 275 to 277
Member

Same as for Glue case -- if we remove this and have proper err handling, there is no question that the code is concurrently correct.

For HMS this is already covered by io.trino.plugin.iceberg.TestIcebergRegisterTableProcedure#testRegisterTableWithExistingTable; you would just need to update the expected message, as it will be slightly different:

java.lang.AssertionError: 
Expecting message:
  <"Table already exists: 'tpch.test_register_table_with_existing_table_rwlb2koaqo'">
to match regex:
  <".*Table '(.*)' already exists.*">
but did not.
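As a quick standalone illustration (this is a sketch, not test code from the PR), the new message shape indeed fails the old regex:

```java
import java.util.regex.Pattern;

public class MessageMatch
{
    public static void main(String[] args)
    {
        // Message shape thrown once the explicit existence check is removed
        String actual = "Table already exists: 'tpch.test_register_table_with_existing_table_rwlb2koaqo'";
        // Regex the test previously expected
        Pattern expected = Pattern.compile(".*Table '(.*)' already exists.*");
        // Prints false: the table name now follows the colon rather than
        // sitting between "Table '" and "' already exists", so the test's
        // expected message has to be updated
        System.out.println(expected.matcher(actual).matches());
    }
}
```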

Contributor Author

Removed the check and modified the test case of Glue and HMS.

Member

By looking at the method signature, I don't understand what "withURI" means.
Why do we want a change here?

Contributor Author

Some test cases were already using getTableLocation, where the caller gets the tableLocation without hdfs://hadoop-master:9000 (hdfsURI). For me, I wanted a fully qualified tableLocation to call registerTable.

Here withURI means, return the table location which includes hdfs://hadoop-master:9000

That is why I have overloaded the method to not disturb the existing caller.

Contributor Author

For now, I have reverted my change and always return the table location with hdfs://hadoop-master:9000. My Mac has an issue with product tests, so I am relying on the CI build to surface any errors.


Contributor Author

@krvikash krvikash Nov 4, 2022

The test cases are failing.
It turns out that hdfsClient requires the table location without hdfs://hadoop-master:9000 to perform file operations, while the register_table procedure call requires the table location with hdfs://hadoop-master:9000.
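The two location forms can be sketched as follows (the helper name and the use of java.net.URI are illustrative, not code from the PR):

```java
import java.net.URI;

public class LocationForms
{
    // register_table wants the fully qualified location; the product-test
    // hdfsClient wants only the path, so one form can be derived from the other
    static String pathOnly(String fullyQualifiedLocation)
    {
        return URI.create(fullyQualifiedLocation).getPath();
    }

    public static void main(String[] args)
    {
        // Prints /user/hive/warehouse/orders_5
        System.out.println(pathOnly("hdfs://hadoop-master:9000/user/hive/warehouse/orders_5"));
    }
}
```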

Contributor Author

Does withNamenodeURI make sense?

@findepi
Member

findepi commented Nov 4, 2022

1 cancelled, 43 successful, 2 skipped, and 3 failing checks

ci / maven-checks (17) (pull_request) Cancelled after 45m
ci / pt (default, suite-delta-lake-databricks91, ) (pull_request) Failing after 41m
ci / pt (default, suite-delta-lake-databricks104, ) (pull_request) Failing after 47m
ci / pt (default, suite-7-non-generic, ) (pull_request) Failing after 76m

@krvikash
Contributor Author

krvikash commented Nov 4, 2022

Moved smoke tests into BaseIcebergConnectorSmokeTest

@krvikash
Contributor Author

krvikash commented Nov 5, 2022

When metadata_file_location points to an existing file that is not a valid metadata file (like an .avro file in the metadata folder), this still registers the table.

However, a subsequent select query fails with java.lang.IllegalArgumentException: hdfs://hadoop-master:9000/user/hive/warehouse/test-1dff2cfc382a43c88017a87abaacd4c5/metadata/01fc4fb9-2f1c-49d5-8500-7928c8cd92f0-m0.avro is not a valid metadata file.

Should we validate the metadata content before registering the table?

CC: @alexjo2144 | @findinpath | @findepi

Bumping up the comment.

Member

@alexjo2144 alexjo2144 left a comment

Just some documentation wording nit picks 👍

@mosabua
Member

mosabua commented Nov 10, 2022

Congratulations @krvikash .. well done on persisting through and working with all the reviewers towards a successful merge. Great collaboration everyone!

@krvikash
Contributor Author

Congratulations @krvikash .. well done on persisting through and working with all the reviewers towards a successful merge. Great collaboration everyone!

Thanks @mosabua :)


Development

Successfully merging this pull request may close these issues.

Add support for creating an Iceberg table from existing table content

7 participants