Core : Catalog Tables Migration API #5297

Mehul2500 · 2022-07-18T13:14:00Z

This PR introduces migrateTables() in CatalogUtil, which helps migrate Iceberg tables from any source catalog to any target catalog. It builds on PR #5037 for the registerTable() functionality in BaseMetastoreCatalog.

The test case migrates tables from a Hadoop catalog to a Hive catalog.

snazy · 2022-07-18T15:30:21Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+          "Cannot initialize Target Catalog implementation %s: %s", targetCatalogProperties.get("catalogImpl"),
+          e.getMessage()), e);
+    }
+    List<TableIdentifier> allIdentifiers = tableIdentifiers;


I think this code should probably live in Catalog: a new function such as Catalog.registerTableFromCatalog() that "moves" a single table to the current catalog. The HadoopCatalog could then do its special handling in its own implementation.
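A minimal sketch of what such a per-table method might look like. Everything here is hypothetical: `SketchCatalog` and `InMemorySketchCatalog` are simplified stand-ins, not the Iceberg `Catalog` interface, and table names are plain strings instead of `TableIdentifier`. The sketch only registers the table in the current catalog; removing it from the source catalog is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a catalog; NOT the Iceberg Catalog interface.
interface SketchCatalog {
  // Returns the current metadata file location of a table, or null if it is unknown.
  String metadataLocation(String tableName);

  // Registers an existing metadata file under the given table name.
  void registerTable(String tableName, String metadataLocation);

  // The method proposed in the review comment: pull one table from another
  // catalog into this one. A catalog with special needs could override it.
  default void registerTableFromCatalog(SketchCatalog source, String tableName) {
    String location = source.metadataLocation(tableName);
    if (location == null) {
      throw new IllegalArgumentException("Table not found in source catalog: " + tableName);
    }
    registerTable(tableName, location);
  }
}

// Simple in-memory implementation, used only to exercise the sketch.
class InMemorySketchCatalog implements SketchCatalog {
  private final Map<String, String> tables = new HashMap<>();

  @Override
  public String metadataLocation(String tableName) {
    return tables.get(tableName);
  }

  @Override
  public void registerTable(String tableName, String metadataLocation) {
    if (tables.putIfAbsent(tableName, metadataLocation) != null) {
      throw new IllegalStateException("Table already exists: " + tableName);
    }
  }
}
```

Because the default method lives on the interface, a caller that already holds both catalog instances would only need `target.registerTableFromCatalog(source, name)`, and an implementation could override the default where it needs different behavior.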

snazy · 2022-07-18T15:32:59Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+          sourceCatalog.listTables(ns).stream()).collect(Collectors.toList());
+    }
+    List<TableIdentifier> migratedTableIdentifiers = new ArrayList<TableIdentifier>();
+    allIdentifiers.forEach(tableIdentifier -> {


I suspect this will run for a very long time when there are a lot of tables.
If things fail in the meantime, it's hard to resume after the failed table.
I.e., error handling here is tricky.

Not sure whether it is actually possible to properly handle the case when registerTable worked, but dropTable failed - in such a case you'd have the same table in two catalogs.
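One possible way to make partial failures visible is to record an outcome per table instead of aborting the whole loop. The following is only a sketch under stated assumptions: `SimpleCatalog`, `InMemoryCatalog`, and the `Outcome` values are hypothetical names invented for illustration, not Iceberg API, and both catalogs are passed in already constructed. It does not solve the register-succeeded-but-drop-failed case; it only reports it so a caller can retry or clean up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MigrationSketch {
  // Per-table outcome so a caller can resume or clean up after a partial failure.
  enum Outcome { MIGRATED, REGISTER_FAILED, REGISTERED_BUT_NOT_DROPPED }

  // Hypothetical stand-in for a catalog; NOT the Iceberg Catalog interface.
  interface SimpleCatalog {
    String metadataLocation(String tableName);
    void registerTable(String tableName, String metadataLocation);
    void dropTable(String tableName);
  }

  // Migrates each table independently; one failure does not abort the rest.
  static Map<String, Outcome> migrate(
      SimpleCatalog source, SimpleCatalog target, List<String> tableNames) {
    Map<String, Outcome> outcomes = new LinkedHashMap<>();
    for (String name : tableNames) {
      try {
        target.registerTable(name, source.metadataLocation(name));
      } catch (RuntimeException e) {
        outcomes.put(name, Outcome.REGISTER_FAILED);
        continue;
      }
      try {
        source.dropTable(name);
        outcomes.put(name, Outcome.MIGRATED);
      } catch (RuntimeException e) {
        // The table now exists in both catalogs; report it instead of hiding it.
        outcomes.put(name, Outcome.REGISTERED_BUT_NOT_DROPPED);
      }
    }
    return outcomes;
  }

  // In-memory catalog; dropTable fails for names in 'undroppable' to simulate an error.
  static class InMemoryCatalog implements SimpleCatalog {
    final Map<String, String> tables = new HashMap<>();
    final List<String> undroppable = new ArrayList<>();

    @Override
    public String metadataLocation(String tableName) {
      return tables.get(tableName);
    }

    @Override
    public void registerTable(String tableName, String metadataLocation) {
      if (metadataLocation == null) {
        throw new IllegalArgumentException("No metadata location for " + tableName);
      }
      if (tables.putIfAbsent(tableName, metadataLocation) != null) {
        throw new IllegalStateException("Table already exists: " + tableName);
      }
    }

    @Override
    public void dropTable(String tableName) {
      if (undroppable.contains(tableName)) {
        throw new IllegalStateException("Simulated drop failure: " + tableName);
      }
      tables.remove(tableName);
    }
  }
}
```

A caller could rerun the migration with only the tables whose outcome is not `MIGRATED`, and handle `REGISTERED_BUT_NOT_DROPPED` entries separately, since those tables are present in both catalogs.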

snazy · 2022-07-18T15:33:15Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+    if (tableIdentifiers == null || tableIdentifiers.isEmpty()) {
+      List<Namespace> namespaces = (sourceCatalog instanceof SupportsNamespaces) ?
+          ((SupportsNamespaces) sourceCatalog).listNamespaces() : ImmutableList.of(Namespace.empty());
+      allIdentifiers = namespaces.stream().flatMap(ns ->


I suspect this will run for a very long time when there are a lot of tables.

snazy · 2022-07-18T15:35:58Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+  public static List<TableIdentifier> migrateTables(List<TableIdentifier> tableIdentifiers,
+      Map<String, String> sourceCatalogProperties, Map<String, String> targetCatalogProperties,
+      Object sourceHadoopConfig, Object targetHadoopConfig) {
+    if (tableIdentifiers != null) {


Let's leave out the catalog instantiation and configuration here completely. I suspect that users have at least one of these catalogs already handy - and setting up "the same" catalog twice is superfluous.

snazy · 2022-07-18T15:39:14Z

nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java

    }

-    String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1);
+    String newMetadataLocation = (base == null) && (metadata.metadataFileLocation() != null) ?


Why is this necessary?
It's a new commit, not sure whether it is good that it re-uses an existing metadata location that is (potentially) "owned" by another catalog.

snazy · 2022-07-18T15:41:45Z

nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java

+    Assertions.assertThat(newTable).isNotNull();
+    TableOperations ops = ((HasTableOperations) newTable).operations();
+    String metadataLocation = ((NessieTableOperations) ops).currentMetadataLocation();
+    Assertions.assertThat("file:" + metadataVersionFiles).isEqualTo(metadataLocation);


Hint: Your assertions are often the "wrong" way around.
It's always assertThat(<current state>)...., followed by the expectations.

snazy · 2022-07-18T15:42:36Z

nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java

+  @Test
+  public void testRegisterTableWithGivenBranch() {
+    List<String> metadataVersionFiles = metadataVersionFiles(TABLE_NAME);
+    Assertions.assertThat(1).isEqualTo(metadataVersionFiles.size());


Hint: assertThat(metadataVersionFiles).hasSize(1)

snazy · 2022-07-18T15:43:43Z

nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java

+    List<String> metadataVersionFiles = metadataVersionFiles(TABLE_NAME);
+    Assertions.assertThat(1).isEqualTo(metadataVersionFiles.size());
+    ImmutableTableReference tableReference =
+        ImmutableTableReference.builder().reference("main").name(TABLE_NAME).build();


Please use a different branch here.
Using the default branch is not that great - and the test says ...Branch implying it's not the default branch.

snazy · 2022-07-18T15:54:19Z

core/src/test/java/org/apache/iceberg/jdbc/TestJdbcCatalog.java

    Assert.assertEquals(ns, JdbcUtil.stringToNamespace(nsString));
  }

+  @Test


Would these tests be better placed in CatalogTests?

snazy · 2022-07-18T15:55:53Z

aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbCatalog.java

  }

+  @Test
+  public void testRegisterTable() {


This pair of tests is repeated (in a very similar way) across multiple catalogs. Can those be centralized somewhere? CatalogTests maybe?

ajantha-bhat · 2022-07-22T09:06:56Z

@Mehul2500: Please rebase this PR.

ajantha-bhat · 2022-08-11T05:19:02Z

As there has been no activity on this PR, I have opened #5492.

github-actions · 2024-08-17T00:13:08Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-08-26T00:14:01Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot added AWS core DELL hive NESSIE labels Jul 18, 2022

snazy reviewed Jul 18, 2022

Mehul2500 added 2 commits July 19, 2022 13:49

implemented changes in BaseMetastoreCatalog

23a5f06

catalog migration api implementation

88a050f

Mehul2500 force-pushed the catalogMigration branch from 9f6e3f8 to 88a050f Compare July 19, 2022 08:25

Mehul2500 mentioned this pull request Jul 21, 2022

Iceberg Catalog Migration Tool projectnessie/nessie#1124

Closed

github-actions bot added the stale label Aug 17, 2024

github-actions bot closed this Aug 26, 2024
