Add support for Nessie Catalog in Iceberg Connector #17719

Merged
beinan merged 1 commit into prestodb:master from nastra:add-nessie-catalog-support
May 24, 2022

@nastra
Contributor

@nastra nastra commented May 3, 2022

This PR integrates the Nessie catalog functionality (https://github.com/apache/iceberg/tree/master/nessie/src/main/java/org/apache/iceberg/nessie) into the Iceberg connector.
Roughly, it adds the following:

  • adds a new NESSIE catalog type that uses NessieCatalog
  • renames IcebergHadoopMetadata to IcebergNativeMetadata and adds the particular CatalogType. This allows reusing that class for both HADOOP + NESSIE.
  • adds 2 new IcebergSessionProperties to control which referenceName / referenceHash to use with Nessie
  • makes some minor adjustments in IcebergResourceFactory to make sure the correct properties are passed to the NessieCatalog
  • adds NessieConfig, which controls different Nessie settings and passes those to the underlying NessieCatalog
  • adds NessieContainer, which allows bringing up a Nessie server for testing
  • additional details about Nessie itself can be found at https://projectnessie.org/
== RELEASE NOTES ==

Iceberg Changes
* Add support for NessieCatalog

@nastra nastra requested a review from a team as a code owner May 3, 2022 12:46
@nastra nastra force-pushed the add-nessie-catalog-support branch 2 times, most recently from 1a0b454 to ec00cc4 Compare May 3, 2022 16:03
@nastra
Contributor Author

nastra commented May 4, 2022

@ChunxuTang could you review this please?

@ChunxuTang
Member

@nastra Thanks for your work! Sure, happy to help review it soon this week.

Contributor Author

this requires apache/iceberg#4700 to work as expected

Contributor Author

this and the one below will be changed to assertThat(computeActual(sessionOne, "SHOW SCHEMAS FROM iceberg LIKE 'namespace_one'").getMaterializedRows()).hasSize(1); once Iceberg 0.14.0 is out, since it requires apache/iceberg#4700 to properly work

@highker highker requested a review from kewang1024 May 8, 2022 08:06
Member

@ChunxuTang ChunxuTang left a comment

@nastra Thanks so much for your nice contribution! This is a very helpful feature!
Just left some minor suggestions. Feel free to take a look.

Member

Just curious: is iceberg.catalog.warehouse required for the Nessie catalog?

Contributor Author

@nastra nastra May 12, 2022

yes this is in fact required, see also https://iceberg.apache.org/docs/latest/nessie/#nessie-catalog

@nastra nastra force-pushed the add-nessie-catalog-support branch from ec00cc4 to 1def224 Compare May 12, 2022 04:49
Member

@ChunxuTang ChunxuTang left a comment

LGTM! @nastra Thanks again for your nice work!

Hi @beinan and @kewang1024, would you also like to take a review of this PR? Thanks!

Member

@beinan beinan left a comment

lgtm except a couple of very minor issues.

@nastra nastra force-pushed the add-nessie-catalog-support branch from 1def224 to c899aff Compare May 16, 2022 06:28
@nastra nastra requested a review from beinan May 16, 2022 07:15
@kewang1024
Collaborator

Hey sorry for the delay, will finish review by the end of today (5/17)

Collaborator

@kewang1024 kewang1024 left a comment

For all the four tests files, they share the same init/teardown/createQueryRunner function, can we extract those out to a common place so we don't need to create repeated logic

Collaborator

Why don't we have config for NESSIE_REFERENCE_HASH?

Contributor Author

because this is generally something a user would only specify at runtime when actually reading data at a particular hash. Having a config option for that would just be confusing to users because they would have to know the hash upfront.

Collaborator

So does that mean different users would usually use different hashes and don't usually share them, and thus we don't need to set a default value at the cluster level?
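
The split discussed in this thread (a configured default reference name, with the hash supplied only at query time) can be sketched roughly as follows. This is an illustration only, not code from the PR: the class `NessieReferenceSketch`, its methods, the `name@hash` display format, and the idea that a session-level name overrides the configured default are all assumptions made for the example.

```java
import java.util.Optional;

// Hypothetical sketch: none of these names come from the connector itself.
final class NessieReferenceSketch
{
    // Cluster-level default, e.g. the value configured via iceberg.nessie.ref.
    private final String defaultReferenceName;

    NessieReferenceSketch(String defaultReferenceName)
    {
        this.defaultReferenceName = defaultReferenceName;
    }

    // Assumption: a session-level reference name, if set, wins over the configured default.
    String resolveName(Optional<String> sessionReferenceName)
    {
        return sessionReferenceName.orElse(defaultReferenceName);
    }

    // The hash has no configured default: it is only supplied at query time, if at all.
    String describe(Optional<String> sessionReferenceName, Optional<String> sessionReferenceHash)
    {
        String name = resolveName(sessionReferenceName);
        return sessionReferenceHash
                .map(hash -> name + "@" + hash)
                .orElse(name);
    }

    public static void main(String[] args)
    {
        NessieReferenceSketch sketch = new NessieReferenceSketch("main");
        System.out.println(sketch.describe(Optional.empty(), Optional.empty()));        // prints "main"
        System.out.println(sketch.describe(Optional.of("dev"), Optional.of("abc123"))); // prints "dev@abc123"
    }
}
```

With this shape, a cluster operator only has to pick a default branch or tag, while a user who wants to read data at a specific commit passes the hash for that one session.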

Collaborator

Should we have isNullOrEmpty check here?

Contributor Author

I don't see this kind of validation being done in other Config classes, so I was assuming that the @NotNull annotation on getDefaultReferenceName() would take care of this.

Collaborator

I think we do have null check examples, StaticCatalogStoreConfig

Contributor Author

added a null/empty check + added a test for it

Contributor Author

Just for completeness: I went over the config code to understand when things are being validated. I used the test below and temporarily removed the null/empty check in setDefaultReferenceName. Note that we provide an empty string for iceberg.nessie.ref in this example.

@Test
public void testNessieMetastore()
{
    getOnlyElement(new IcebergPlugin().getConnectorFactories()).create(
                    "test",
                    ImmutableMap.of(
                            "hive.metastore.uri", "thrift://foo:1234",
                            "iceberg.catalog.type", "nessie",
                            "iceberg.nessie.ref", ""),
                    new TestingConnectorContext())
            .shutdown();
}

The validation of the config basically happens when the plugin is being bootstrapped, and this is when the validation annotations (@NotNull / @NotEmpty / ...) are checked. The startup of the plugin then correctly failed with the following exception:

com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Error: Invalid configuration property iceberg.nessie.ref: must not be null or empty (for class com.facebook.presto.iceberg.nessie.NessieConfig.defaultReferenceName)

1 error

	at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:543)
	at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:159)
	at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
	at com.google.inject.Guice.createInjector(Guice.java:87)
	at com.facebook.airlift.bootstrap.Bootstrap.initialize(Bootstrap.java:257)
	at com.facebook.presto.iceberg.InternalIcebergConnectorFactory.createConnector(InternalIcebergConnectorFactory.java:86)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.facebook.presto.iceberg.IcebergConnectorFactory.create(IcebergConnectorFactory.java:49)
	at com.facebook.presto.iceberg.TestIcebergPlugin.testNessieMetastore(TestIcebergPlugin.java:42)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
	at org.testng.TestRunner.privateRun(TestRunner.java:756)
	at org.testng.TestRunner.run(TestRunner.java:610)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:387)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:382)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:340)
	at org.testng.SuiteRunner.run(SuiteRunner.java:289)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)

This would explain why the config classes don't perform validation checks in the setters but completely rely on the Validation annotations on the getters.
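
A minimal sketch of the pattern described above, where the setter stores the value unchecked and a constraint on the getter is evaluated once at bootstrap. It deliberately avoids the real validation library so that it runs on its own: the annotation `RequiredNonEmpty`, the class `SketchNessieConfig`, and the `validate` helper are hypothetical stand-ins, not the connector's or the framework's actual API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a validation constraint such as @NotEmpty.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface RequiredNonEmpty {}

// Hypothetical config class: the setter does no checking, the getter carries the constraint.
class SketchNessieConfig
{
    private String defaultReferenceName = "main";

    @RequiredNonEmpty
    public String getDefaultReferenceName()
    {
        return defaultReferenceName;
    }

    public SketchNessieConfig setDefaultReferenceName(String name)
    {
        this.defaultReferenceName = name; // intentionally unchecked
        return this;
    }
}

final class BootstrapValidationSketch
{
    // Mimics the bootstrap step: invoke every annotated getter and collect violations.
    static List<String> validate(Object config)
    {
        List<String> errors = new ArrayList<>();
        for (Method method : config.getClass().getMethods()) {
            if (!method.isAnnotationPresent(RequiredNonEmpty.class)) {
                continue;
            }
            try {
                Object value = method.invoke(config);
                if (value == null || value.toString().isEmpty()) {
                    errors.add(method.getName() + ": must not be null or empty");
                }
            }
            catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        return errors;
    }

    public static void main(String[] args)
    {
        // An empty value is accepted by the setter and only rejected at validation time.
        System.out.println(validate(new SketchNessieConfig().setDefaultReferenceName("")));
        System.out.println(validate(new SketchNessieConfig().setDefaultReferenceName("dev")));
    }
}
```

Because the check lives in one place, an extra null/empty check in the setter would only duplicate it, which is the argument made in this thread for removing it.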

Collaborator

Thanks for the fact check on the config validation!! Then looks like the null check could actually be safely removed since the getter would perform the null check?

Contributor Author

yep I removed the null check now

@nastra
Contributor Author

nastra commented May 18, 2022

> For all the four tests files, they share the same init/teardown/createQueryRunner function, can we extract those out to a common place so we don't need to create repeated logic

I moved out the required connector properties for Nessie into a util method but I'm not sure it's worth extracting the init/teardown to a common place, since we're essentially really only calling start/stop on the container. There's currently no other complex init/teardown required for Nessie so imo it seems (at least for now) to be better to leave the start/stop of the container in the respective test classes.

@nastra nastra force-pushed the add-nessie-catalog-support branch from c899aff to 2df6cef Compare May 18, 2022 07:18
@nastra
Contributor Author

nastra commented May 18, 2022

Thanks for your reviews @kewang1024 @beinan @ChunxuTang. I rebased the PR and I hope I addressed everything that's relevant. Please take another look if you can.

@kewang1024
Collaborator

> > For all the four tests files, they share the same init/teardown/createQueryRunner function, can we extract those out to a common place so we don't need to create repeated logic
>
> I moved out the required connector properties for Nessie into a util method but I'm not sure it's worth extracting the init/teardown to a common place, since we're essentially really only calling start/stop on the container. There's currently no other complex init/teardown required for Nessie so imo it seems (at least for now) to be better to leave the start/stop of the container in the respective test classes.

So essentially what I meant was creating a parent class that has init/teardown/createQueryRunner function for nessie and then all your four tests files can only focus on its own test logic

Member

@beinan beinan left a comment

lgtm, just one non-blocking minor issue. Could you please address all the comments from @kewang1024? Then we can merge your PR. Thanks a lot!

Member

Nit: to be more consistent, we could put all the cases in alphabetical order

Contributor Author

that's what I had initially, but IntelliJ would then complain about duplicate case branches. So I figured that it's better to merge HADOOP + NESSIE together

Contributor Author

@beinan let me know if you would like me to still change this and introduce duplicate case branches
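
The trade-off in this thread (alphabetical case order versus merged branches) can be illustrated with a small sketch. It is not the PR's code: the enum `SketchCatalogType`, the class `MetadataFactorySketch`, and the returned strings are hypothetical, and the HIVE branch is assumed only to give the switch a second outcome.

```java
// Hypothetical sketch; names and return values are illustrative only.
enum SketchCatalogType { HADOOP, HIVE, NESSIE }

final class MetadataFactorySketch
{
    private MetadataFactorySketch() {}

    static String metadataFor(SketchCatalogType type)
    {
        switch (type) {
            // HADOOP and NESSIE fall through to one shared branch. Strict alphabetical
            // order (HADOOP, HIVE, NESSIE) would need two identical branches instead.
            case HADOOP:
            case NESSIE:
                return "IcebergNativeMetadata(" + type + ")";
            case HIVE:
                return "hive-backed metadata";
            default:
                throw new IllegalArgumentException("Unsupported catalog type: " + type);
        }
    }

    public static void main(String[] args)
    {
        for (SketchCatalogType type : SketchCatalogType.values()) {
            System.out.println(type + " -> " + metadataFor(type));
        }
    }
}
```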

@nastra
Contributor Author

nastra commented May 19, 2022

> > > For all the four tests files, they share the same init/teardown/createQueryRunner function, can we extract those out to a common place so we don't need to create repeated logic
>
> > I moved out the required connector properties for Nessie into a util method but I'm not sure it's worth extracting the init/teardown to a common place, since we're essentially really only calling start/stop on the container. There's currently no other complex init/teardown required for Nessie so imo it seems (at least for now) to be better to leave the start/stop of the container in the respective test classes.
>
> So essentially what I meant was creating a parent class that has init/teardown/createQueryRunner function for nessie and then all your four tests files can only focus on its own test logic

@kewang1024 that would be ideal if we could have a base class, but all of those 4 test classes are already extending different other test classes to test different things so I don't see how we could have a base class for this unfortunately.
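
Since Java allows only one superclass, the approach described above (a shared util method instead of a shared base class) might look roughly like this. Everything here is a hypothetical sketch: the class names, the parent classes, and the URI and warehouse values are placeholders, and the property key `iceberg.nessie.uri` is an assumption. Only `iceberg.catalog.type` and `iceberg.catalog.warehouse` appear in this conversation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper holding the connector properties that every Nessie test needs.
final class NessieTestPropertiesSketch
{
    private NessieTestPropertiesSketch() {}

    static Map<String, String> nessieConnectorProperties(String nessieUri, String warehouseDir)
    {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("iceberg.catalog.type", "nessie");
        properties.put("iceberg.catalog.warehouse", warehouseDir);
        properties.put("iceberg.nessie.uri", nessieUri); // assumed property name
        return properties;
    }
}

// Placeholder parents standing in for the different existing test base classes.
abstract class DistributedQueriesBaseSketch {}

abstract class SmokeTestBaseSketch {}

// Each test class already has a parent, so it cannot also extend a common Nessie base class,
// but it can still call the shared static helper.
class NessieDistributedQueriesSketch extends DistributedQueriesBaseSketch
{
    Map<String, String> properties()
    {
        return NessieTestPropertiesSketch.nessieConnectorProperties("http://localhost:19120/api/v1", "/tmp/warehouse");
    }
}

class NessieSmokeTestSketch extends SmokeTestBaseSketch
{
    Map<String, String> properties()
    {
        return NessieTestPropertiesSketch.nessieConnectorProperties("http://localhost:19120/api/v1", "/tmp/warehouse");
    }

    public static void main(String[] args)
    {
        System.out.println(new NessieSmokeTestSketch().properties());
    }
}
```

The container start/stop calls would still be repeated per test class in this shape, which matches the compromise reached in the thread.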

@nastra nastra force-pushed the add-nessie-catalog-support branch from 2df6cef to 341a322 Compare May 19, 2022 06:56
@nastra nastra requested review from beinan and kewang1024 May 19, 2022 06:56
@nastra nastra force-pushed the add-nessie-catalog-support branch from 341a322 to 4c294cc Compare May 19, 2022 07:39
@nastra nastra force-pushed the add-nessie-catalog-support branch from 4c294cc to c06d08e Compare May 20, 2022 05:53
@nastra
Contributor Author

nastra commented May 23, 2022

@beinan @ChunxuTang is this good to be merged?

Member

@beinan beinan left a comment

lgtm

@beinan beinan merged commit 66c7001 into prestodb:master May 24, 2022
@nastra nastra deleted the add-nessie-catalog-support branch May 24, 2022 07:41
@highker highker mentioned this pull request Jul 6, 2022
7 tasks