
Conversation

@rymurr rymurr commented Oct 12, 2020

As per the mailing list announcement, we would like to contribute integration between Iceberg and Nessie to the Iceberg project.

This PR does the following:

  • adds a NessieCatalog for core Iceberg ACID operations
  • adds Nessie support to the catalog and source interfaces for Spark 2 and Spark 3
  • makes Nessie branches and tags addressable for Iceberg operations

Please have a look at Iceberg Spark for a more complete description of Nessie's capabilities with Iceberg, and Nessie Features for a broader introduction to Nessie.

Note: this is currently a draft until a Gradle plugin required for testing Nessie has been published.

}

private String getWarehouseLocation() {
String nessieWarehouseDir = config.get("nessie.warehouse.dir");
Contributor Author:

Not sure if this is the best way to get hold of a directory to write tables into. Does anyone have suggestions?

Contributor:

In general, I would discourage depending so heavily on Hadoop Configuration. Spark and Flink have a way to pass catalog-specific options, which is the best way to configure catalogs.

There is some discussion about this in #1640. I think that catalogs should primarily depend on config passed in a string map, and should only use Hadoop Configuration when dependencies (like HadoopFileIO or HiveClient) require it.
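As a rough illustration of the string-map pattern being suggested here, a catalog might read its settings like this. All names and keys below are illustrative, not the actual Iceberg Catalog API:

```java
import java.util.Map;

// Sketch: catalog-specific settings come from a plain string map supplied by
// the engine (Spark/Flink); Hadoop Configuration would be reserved for I/O
// dependencies that genuinely require it.
class MapConfiguredCatalog {
  private String uri;
  private String ref;
  private String warehouse;

  void initialize(String name, Map<String, String> options) {
    this.uri = options.get("uri");
    this.ref = options.getOrDefault("ref", "main");
    this.warehouse = options.get("warehouse");
    if (warehouse == null) {
      throw new IllegalArgumentException("Cannot initialize catalog " + name + ": warehouse not set");
    }
  }

  String ref() {
    return ref;
  }

  String warehouse() {
    return warehouse;
  }
}
```

The point of the pattern is that the catalog is fully configurable from engine options alone, which keeps it usable from Spark and Flink without threading a Hadoop Configuration through.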

Contributor Author:

I have cleaned this up a bit and tried to follow the pattern you suggested in #1640.

private NessieClient client;
private String branch;

private Configuration getConfig() throws IOException {
Contributor Author:

The Nessie-specific tests all modify Spark settings and reset them at the end, to interfere as little as possible with the 'normal' Iceberg path.

if (path.get().contains("/")) {
HadoopTables tables = new HadoopTables(conf);
return tables.load(path.get());
if (nessie(options.asMap(), conf)) {
Contributor Author:

We identify Nessie as the core catalog/source when specific parameters are available on the classpath or in the Hadoop config. The idea here is to be fully backwards compatible with the Hive and Hadoop catalogs.

Contributor:

This is probably an area to revisit. Right now, this is written to have minimal changes between 2.4.x and 3.0.x, but I think we will probably want to route all loading from here through a catalog. That will allow us to delegate all of this to Nessie or Hive the same way.

this.client = new NessieClient(NessieClient.AuthType.NONE, path, null, null);
try {
try {
this.client.getTreeApi().createEmptyBranch(branch);
Contributor Author:

All Nessie tests are run in their own branch so they do not interfere with parallel test execution.
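The per-suite isolation described above can be sketched as a tiny helper that generates a collision-free branch name for each suite. The helper and its naming scheme are assumptions for illustration; the createEmptyBranch call from the PR would then receive the generated name:

```java
import java.util.UUID;

// Sketch: derive a unique Nessie branch name per test suite so parallel
// suites never collide on one branch.
class TestBranchNames {
  static String uniqueBranch(String testName) {
    // A random suffix keeps concurrent runs of the same suite apart too.
    return testName + "_" + UUID.randomUUID().toString().replace("-", "");
  }
}
```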

@Test
public void testCreateNamespace() {
// Nessie namespaces are implicit and do not need to be explicitly managed
Assume.assumeFalse(catalogName.endsWith("testnessie"));
Contributor Author:

Namespaces are implicit in Nessie and are therefore not managed through the normal SupportsNamespaces interface. We skip tests of this interface when the catalog is a NessieCatalog.

Contributor:

There are a lot of tests that need this. Should we separate the test cases into different suites?

Contributor Author:

Sure, the Hadoop catalog is also skipped for most of these. It makes sense to have separate tests.

/**
* Nessie implementation of Iceberg Catalog.
*/
public class NessieCatalog extends BaseMetastoreCatalog implements AutoCloseable {
Contributor Author:

We do not extend SupportsNamespaces, as a Nessie object store supports the concept of namespaces implicitly. A Nessie namespace can be arbitrarily deep but is not explicitly created or stored, similar to empty folders in git.

Contributor:

Should be fine, but I think the trade-off is that you won't be able to list namespaces in a namespace. It will be harder to find the namespaces themselves.

Contributor Author:

I will take another pass at this today. I can see totally valid reasons to support listing namespaces if they have tables in them. The problem, as I see it, comes from creating or deleting namespaces, and from storing namespace metadata.

  • create/delete: in Nessie (similar to git) a namespace would be created implicitly with the first table in that namespace tree and deleted with the last table in that namespace tree. Separate create/delete operations in Nessie are either no-ops or require a dummy object to be placed in that namespace, both of which are odd operations. E.g. if it is a no-op, then creating namespace foo.bar and asking if foo.bar exists will return false.

  • namespace metadata: what is the use case envisioned for those operations? I think for Nessie we would start with the same behaviour as the Hadoop catalog, but I am curious to know the benefit of supporting those APIs.

Contributor Author:

Having another look, we could add valid implementations for namespaceExists and listNamespaces and make the others no-ops or throw. Then clients can still navigate namespaces. Thoughts?
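A sketch of what such implementations could look like, assuming namespaces are derived purely from the stored table keys. The dotted strings stand in for Nessie contents keys, and none of this is the actual NessieCatalog code:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch: with implicit namespaces, namespaceExists and listNamespaces can
// be answered entirely from the table identifiers that exist on the branch.
class ImplicitNamespaces {
  // A namespace exists exactly when some table key sits beneath it.
  static boolean namespaceExists(List<String> tableKeys, String namespace) {
    String prefix = namespace + ".";
    return tableKeys.stream().anyMatch(k -> k.startsWith(prefix));
  }

  // Namespaces are recovered from the parent portion of each table key.
  static Set<String> listNamespaces(List<String> tableKeys) {
    Set<String> namespaces = new TreeSet<>();
    for (String key : tableKeys) {
      int lastDot = key.lastIndexOf('.');
      if (lastDot > 0) {
        namespaces.add(key.substring(0, lastDot));
      }
    }
    return namespaces;
  }
}
```

Under this scheme a namespace "appears" with its first table and "disappears" with its last one, matching the git-like semantics described in the thread.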

@rymurr rymurr marked this pull request as ready for review October 21, 2020 18:56
@rymurr rymurr changed the title DRAFT: Nessie support for core and Spark 2/3 Nessie support for core and Spark 2/3 Oct 21, 2020

rymurr commented Oct 21, 2020

Opening this up as a reviewable PR to get early feedback.

}

project(':iceberg-nessie') {
apply plugin: 'org.projectnessie'
Contributor:

What's happening in the Nessie plugin?

Contributor Author:

It uses quarkusAppRunnerConfig dependencies to discover the Nessie Quarkus server and its dependencies, then uses that to start a server. Some of the operations to discover all runtime dependencies require a full Gradle dependency graph, which is why this is non-trivial to do in a test suite. I believe the primary reason for all this is to facilitate easily building GraalVM native images.

See https://github.com/projectnessie/nessie/tree/main/tools/apprunner-gradle-plugin for the actual code

build.gradle
maxHeapSize '2500m'
}
// start and stop quarkus for nessie tests
tasks.test.dependsOn("quarkus-start").finalizedBy("quarkus-stop")
Contributor:

From the comments in the Iceberg sync, it sounds like this is running a stand-alone Nessie server? Is that something we could handle like the current Hive MetaStore tests, where each test suite creates a new metastore and tears it down after the suite runs?

Contributor Author:

Quarkus (the underlying HTTP framework) is slightly counterintuitive in that it doesn't offer a way to start Nessie the way you can start the Hive metastore. Hence we start it once per module, and test suites are responsible for cleanup.

.stream()
.filter(namespacePredicate(namespace))
.map(NessieCatalog::toIdentifier)
.collect(Collectors.toList());
Contributor:

Looks like this will return all tables underneath the given namespace, even if they are nested in other namespaces?

I haven't tested this in Spark; does it work as expected?

Contributor Author:

You are correct, it will return everything in and below the namespace. What is the contract supposed to be? Only tables directly in this namespace?

Contributor Author:

Just checked, and the contract is "Return all the identifiers under this namespace". I took this to mean everything under this and all sub-namespaces. If that was not the intention of the method, I will fix the predicate.
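The two candidate contracts can be made concrete with a pair of predicates. The dotted string keys are illustrative stand-ins for Nessie contents keys:

```java
// Sketch of the two possible listTables contracts discussed above.
class NamespaceFilters {
  // Matches tables at any depth under the namespace (the PR's behaviour).
  static boolean underNamespace(String tableKey, String namespace) {
    return tableKey.startsWith(namespace + ".");
  }

  // Matches only tables whose immediate parent is the namespace.
  static boolean directChildOf(String tableKey, String namespace) {
    String prefix = namespace + ".";
    return tableKey.startsWith(prefix) && tableKey.indexOf('.', prefix.length()) < 0;
  }
}
```

Swapping one predicate for the other in the stream filter is the only change needed to switch between the two readings of the contract.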

.map(NessieCatalog::toIdentifier)
.collect(Collectors.toList());
} catch (NessieNotFoundException ex) {
throw new RuntimeException("Unable to list tables due to missing ref.", ex);
Contributor (@rdblue), Oct 23, 2020:

Probably shouldn't use RuntimeException here. How about NoSuchNamespaceException?

Contributor Author:

👍

try {
Contents contents = client.getContentsApi().getContents(key, reference.getHash());
this.table = contents.unwrap(IcebergTable.class)
.orElseThrow(() -> new IllegalStateException("Nessie points to a non-Iceberg object for that path."));
Contributor:

Style: Most Iceberg error messages use the form Cannot <some action>: <reason> (<workaround>). Consistency here tends to make at least Iceberg errors more readable and easy to consume.

Contributor Author:

Fixed.

.orElseThrow(() -> new IllegalStateException("Nessie points to a non-Iceberg object for that path."));
metadataLocation = table.getMetadataLocation();
} catch (NessieNotFoundException ex) {
this.table = null;
Contributor:

I think this should throw NoSuchTableException if the existing metadata is not null because the table was deleted under the reference. You'll probably want to follow the same behavior as the Hive catalog.

client.getContentsApi().setContents(key,
reference.getAsBranch().getName(),
reference.getHash(),
String.format("iceberg commit%s", applicationId()),
Contributor:

Doesn't look like the format here is quite correct. Missing a space?

Contributor Author:

Good eye: the first character of the applicationId is a newline. I've put no space between 'commit' and %s to avoid extra trailing whitespace in the message.

Also note that the handling of commit messages in Nessie is still fairly primitive. This should get replaced by a structured object in the near future.
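A minimal sketch of the formatting point above, assuming an applicationId that begins with a newline (the id value is made up):

```java
// Sketch: no space after "commit" because the id begins with '\n', so a
// space would leave trailing whitespace on the first line of the message.
class CommitMessage {
  static String render(String applicationId) {
    return String.format("iceberg commit%s", applicationId);
  }
}
```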

reference.getHash(),
String.format("iceberg commit%s", applicationId()),
newTable);
} catch (NessieNotFoundException | NessieConflictException ex) {
Contributor:

Is this right for NotFoundException? Iceberg will retry failed commits.

Contributor Author:

Good eye. I cleaned up the exception message and improved how exceptions are thrown.

sparkEnvMethod = sparkEnvClazz.getMethod("get");
Class sparkConfClazz = Class.forName("org.apache.spark.SparkConf");
sparkConfMethod = sparkEnvClazz.getMethod("conf");
appIdMethod = sparkConfClazz.getMethod("getAppId");
Contributor:

You can use the DynFields helpers to do this a bit more easily.

Contributor Author:

👍

ParsedTableIdentifier.getParsedTableIdentifier(path, new HashMap<>());
}

@Test(expected = IllegalArgumentException.class)
Contributor:

We prefer using AssertHelpers.assertThrows so that state after the exception was thrown can be validated. For example, testing catalog.createTable(invalid) would not only check ValidationException but also verify that the table was not created.
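The pattern being suggested can be sketched in plain Java. Here parseRef and its validation rule are invented for illustration; AssertHelpers.assertThrows plays this role in the real tests:

```java
// Sketch: catching the expected exception explicitly lets the test also
// assert on state afterwards, instead of ending at the throw as
// @Test(expected = ...) does.
class ThrowsAndState {
  // A toy validation that rejects malformed input.
  static String parseRef(String path) {
    if (path.contains("@@")) {
      throw new IllegalArgumentException("Cannot parse ref: repeated '@' in " + path);
    }
    return path;
  }

  static boolean throwsIllegalArgument(Runnable action) {
    try {
      action.run();
      return false; // no exception: the assertion should fail
    } catch (IllegalArgumentException e) {
      return true;
    }
  }
}
```

After asserting the throw, the test can go on to verify that no side effects occurred, e.g. that an invalid createTable left no table behind.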

Contributor Author:

Fixed.

case "nessie":
String defaultBranch = options.getOrDefault("nessie_ref", "main");
String nessieUrl = options.get("nessie_url");
return new NessieCatalog(name, conf, defaultBranch, nessieUrl);
Contributor:

Please have a look at #1640, I'd like to standardize how we do this. I do like using type = nessie, so we may want to have a lookup that points to the NessieCatalog implementation.

@After
public void removeTables() {
sql("DROP TABLE IF EXISTS %s", tableName);
sql("DROP TABLE IF EXISTS %s", sourceName);
Contributor:

Why was this needed?

Contributor Author:

The way I was running the test, the table got deleted on the backend Nessie server but not in the cached Spark context. I will clean this up as part of the Spark rework.


rdblue commented Oct 23, 2020

Thanks, @rymurr! This looks like a great start. I commented in a few places where I noticed some things. Overall, you're going in the right direction.

How do you want to start getting this in? I think it would be good to break it up a bit into smaller commits with a few tests. That way, we can iterate more quickly and we reduce the amount of scope that reviewers need to keep track of. Would it be possible to add just the Nessie module with a few tests and then move on to updating Spark modules?


rymurr commented Oct 26, 2020

Thanks a lot for the feedback, @rdblue. I will rework this PR to be just the Nessie module and will open another for Spark. I will follow the pattern from #1640 for the Spark PR.


rymurr commented Oct 26, 2020

Hey @rdblue, I have addressed the bulk of your comments above. Left to do:

  1. make a decision on namespace support for the Nessie catalog
  2. settle the meaning of listTables in Catalog
  3. revise the NessieCatalog constructor along the lines of your comment in #1640 (Allow loading custom Catalog implementation in Spark and Flink)

We should be publishing 0.2.0 of Nessie in the next day or two. Once that is pushed, I will update this PR with the new versions and we should have a green build.


rdblue commented Nov 19, 2020

@rymurr, I did another thorough review with more time looking through the tests. It's looking close, but I found a few things.


rymurr commented Nov 20, 2020

Thanks again for the thorough review, @rdblue. I have updated with your suggestions and rebased. Hope I didn't miss anything!

* </p>
*/
public class NessieCatalog extends BaseMetastoreCatalog implements AutoCloseable, SupportsNamespaces, Configurable {
private static final Logger logger = LoggerFactory.getLogger(NessieCatalog.class);
Contributor:

Nit: static final constants should use upper case names, like LOGGER. I'm not sure why style checks didn't catch this.

(Not a blocker)

Contributor Author:

Fixed. I was just arguing with @jacques-n on this point on Friday ;-) He sided with you.

.stopRetryOn(NessieNotFoundException.class)
.throwFailureWhenFinished()
.run(this::dropTableInner, BaseNessieClientServerException.class);
threw = false;
Contributor:

Nit: threw is no longer needed so this could be simply return true. That simplifies the logic at the end of the method to just return false.

Up to you whether to change this or not. I know some people strongly prefer only one exit point from a method.

Contributor Author:

Fixed. I like your way better too; the flag was just a hangover from the refactor.

if (warehouseLocation == null) {
throw new IllegalStateException("Parameter warehouse not set, nessie can't store data.");
}
final String requestedRef = options.get(removePrefix.apply(NessieClient.CONF_NESSIE_REF));
Contributor:

Did you intend to change this to "ref"? Your reply seemed to imply that: #1587 (comment)

Contributor Author:

It is just ref now; the removePrefix method strips the "nessie." prefix from the constant in the NessieClient class. I didn't want to duplicate the constants already in NessieClient.
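A sketch of the removePrefix behaviour described here. The helper is illustrative; the real code derives option keys from NessieClient's constants:

```java
// Sketch: strip the "nessie." prefix from client config constants so options
// can be passed as plain keys like "ref" instead of "nessie.ref".
class PrefixStripper {
  static String removePrefix(String key) {
    return key.startsWith("nessie.") ? key.substring("nessie.".length()) : key;
  }
}
```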

Contributor:

I missed the removePrefix call. Thanks!

@rdblue rdblue merged commit 87143d5 into apache:master Nov 23, 2020

rdblue commented Nov 23, 2020

@rymurr, thanks for all of the test changes, it is now much easier to understand! I don't see any blockers, although it looks like you may have intended to change the nessie.ref config to just ref. I'm going to go ahead and merge this since it is ready and we can clean that up later if you want to change it.

Thanks for all your hard work getting this ready! I actually quite like the way the Nessie reference works and simplifies assumptions in the catalog and table operations.


rymurr commented Nov 23, 2020

Thanks for the merge, @rdblue!! Super pumped to have this in. I have the last round of changes ready and will post them with the PR to support timestamps in the table name ASAP.

@rymurr rymurr deleted the nessie-support branch November 24, 2020 09:38
anuragmantri added a commit to anuragmantri/iceberg that referenced this pull request Jul 25, 2025