
Conversation

@mansehajsingh
Contributor

@mansehajsingh commented Apr 8, 2025

Adds an in-development tool in the polaris-synchronizer/ directory to migrate entities between two Polaris instances.

  • Changed file headers
  • Ported over code

Roadmap: The tool is in development, and a multitude of enhancements are planned to generalize it across a wider array of use cases. Before these enhancements are carried out, it is important that the tool becomes generally available to the community, so the wider Polaris group can pitch in ideas and contributions and guide the tool to align with new features planned on Polaris's roadmap. Some planned enhancements include:

  • Make the tool's build system conform with the build items introduced in Initial commit for Iceberg catalog migrator #1
  • Generalize tool auth flow to be less dependent on the deprecated client credentials flow for the catalog and add token refresh for long running jobs
  • Improve robustness with enhancements like retries w/ exponential backoff, timeouts, etc.
  • Determine how Policy migration might look once the Policy APIs are complete in Polaris.

Design Specification: https://docs.google.com/document/d/1AXKmzp3JaTuUS_FMNnxr_pHsBTs86rWRMborMi3deCw/edit?usp=sharing

}

public static PolarisService newPolarisService(String baseUrl, String accessToken) {
validatePolarisInstanceProperties(baseUrl, accessToken, null, null, null, null);

nit: often nice when there's a bunch of nulls to indicate what they are via comments:
validatePolarisInstanceProperties(baseUrl, accessToken, null /* oauth2ServerUri */, null /* clientId */, null /* clientSecret */, null /* scope */);

Contributor Author


Added!


for (Catalog catalog : catalogSyncPlan.entitiesToOverwrite()) {
try {
setupOmnipotentCatalogRoleIfNotExistsTarget(catalog.getName());


Is there a reason this is only needed on overwrite and remove and not on create?

Contributor Author


This is because the overwrite of the catalog requires us to perform a cascading drop of the catalog. We only need to set up this omnipotent principal when we are initializing an Iceberg REST client. On a createCatalog, we don't need an omnipotent principal until the time of syncing namespaces and tables. On overwrite and remove, we need to do it beforehand so we can drop catalog internals like namespaces and tables.
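To make the ordering concrete, here is a minimal hypothetical sketch of the reasoning above. The interface and method names (`TargetPolaris`, `dropCatalogCascade`, etc.) are assumptions for illustration, not the tool's actual API:

```java
/** Hypothetical sketch: principal setup must precede the cascading drop on overwrite. */
interface TargetPolaris {
  void setupOmnipotentCatalogRoleIfNotExists(String catalogName);
  void dropCatalogCascade(String catalogName);
  void createCatalog(String catalogName);
}

class CatalogSyncSketch {
  static void overwrite(TargetPolaris target, String catalogName) {
    // The cascading drop needs an Iceberg REST client, which in turn needs the
    // omnipotent principal, so it is set up before touching catalog internals.
    target.setupOmnipotentCatalogRoleIfNotExists(catalogName);
    target.dropCatalogCascade(catalogName);
    target.createCatalog(catalogName);
  }

  static void create(TargetPolaris target, String catalogName) {
    // A plain create drops nothing, so the principal is not needed until
    // namespace/table sync begins later.
    target.createCatalog(catalogName);
  }
}
```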

}

/**
* Setup an omnipotent principal for the provided catalog on the source Polaris instance.


nit: I think the comment means to say 'target' Polaris instance

Contributor Author


Fixed!

import org.apache.iceberg.io.LocationProvider;

/**
* Wrapper table operationw class that just allows fetching a provided table metadata. Used to build


nit: spelling operationw -> operation

Contributor Author


Fixed!

@mansehajsingh
Contributor Author

cc: @jbonofre It would be awesome to get your expertise on licensing and dependencies for this project.

@jbonofre
Member

jbonofre commented Apr 9, 2025

@mansehajsingh sure thing! I will!

@dimas-b

dimas-b commented Apr 9, 2025

This looks like a major addition to Polaris tools. While it looks very useful at first glance, I guess it would be helpful to have a dev ML summary of how it works and how it is meant to be used. WDYT? If I missed it, my apologies :) The README in this PR looks good for people who want to actually run the tool, but that text still seems to be too high level to help with understanding the tool's design (short of reading all the code).

+ plan.entitiesToRemove().size();
}

/** Sync principal roles from source to target. */
Contributor


What does sync mean here? It looks like this not only copies entities from source to target, but it will also drop things in the target which are not present in the source? Why?

Contributor Author


Two part response here:

  • Why enable removal from the target? The idea is that if someone is running this periodically or multiple times, in the time since they last ran it, certain access control states may change. For example, BOB leaves the company, so all of BOB's principal roles are revoked. This kind of access control change should certainly be reflected in the target.
  • The usage of the word sync. To be honest, I'm not a fan of this word myself. I used the term because the tool was initially built off of the iceberg-catalog-migrator introduced in Initial commit for Iceberg catalog migrator #1. For that tool, the term migrate means removal of all the entities from the source instance after they have been created in the target. So, not wanting to give the impression that this tool performed that kind of migration, I named it a "Synchronization". Now that the tools are separate, I can propose that we rename all occurrences of "Sync" and "Synchronization" to "Migration". I think that would better represent the tool to users and be more applicable from a code perspective.

Contributor


sync to me implies bidirectionality... after reading more, I see that this is almost like a backup. I agree that migrate also has some implications. Maybe replicator? I don't have a strong opinion here.

Contributor Author


If we don't have strong opinions here, I think migrate might be fine. Migrate invites the fewest assumptions about what this tool is supposed to do. I agree synchronization is pretty ambiguous. While I like replicator, replication to me is less applicable to how most people will use this tool, which is to perform a bulk migration from Polaris 1 to Polaris 2.

principalRolesSource = source.listPrincipalRoles();
clientLogger.info("Listed {} principal-roles from source.", principalRolesSource.size());
} catch (Exception e) {
clientLogger.error("Failed to list principal-roles from source.", e);
Contributor


In failure cases like this one, do we want the tool to actually continue?

Contributor Author


The idea here was that in the event of failure, you could triage the reasons behind the failure from the logs (e.g. if a catalog was being migrated, not having credentials set up on the target to access storage), and just re-run the tool idempotently to migrate whatever failed last time, while leaving everything that moved over successfully. Here, maybe the principal roles fail for some reason, like a network failure on the list call, but we can still migrate everything else. This would be especially useful for cron scenarios, because we don't want to tank the whole run because this one part failed. To avoid one-off failures, the roadmap for the tool includes introducing retries, etc.
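The skip-and-continue behavior described above can be sketched as follows. This is a hedged illustration; `listOrEmpty` is a hypothetical helper, not the tool's actual code:

```java
import java.util.List;
import java.util.function.Supplier;

/** Hedged sketch of the fail-soft pattern: log, skip, and let a re-run pick up the rest. */
class FailSoftSketch {
  static <T> List<T> listOrEmpty(Supplier<List<T>> call, String what) {
    try {
      return call.get();
    } catch (Exception e) {
      // A transient failure (e.g. a network blip on a list call) should not
      // abort the whole run; the next idempotent run retries whatever was skipped.
      System.err.println("Failed to list " + what + " (continuing): " + e.getMessage());
      return List.of();
    }
  }
}
```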

Contributor


Could we add a config to optionally fail loudly?

Contributor Author


I have added an option --halt-on-failure
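A minimal sketch of how such a flag might toggle between fail-soft and fail-loud handling. The class and method names below are hypothetical, not the option's actual wiring in the tool:

```java
/** Hypothetical sketch: --halt-on-failure flips error handling from log-and-continue to abort. */
class HaltOnFailureSketch {
  final boolean haltOnFailure;

  HaltOnFailureSketch(boolean haltOnFailure) {
    this.haltOnFailure = haltOnFailure;
  }

  void onFailure(String message, Exception e) {
    if (haltOnFailure) {
      // Fail loudly: surface the error and stop the run.
      throw new RuntimeException(message, e);
    }
    // Default fail-soft behavior: log and keep migrating other entities.
    System.err.println(message + " (continuing): " + e.getMessage());
  }
}
```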

import org.apache.iceberg.TableOperations;
import org.apache.iceberg.metrics.MetricsReporter;

/** Wrapper around {@link BaseTable} that contains the latest ETag for the table. */
Contributor


Can we add a TODO that it's not needed once the Polaris dependency is upgraded with a version of Polaris that has ETag in the normal response type(s)

Contributor Author


Added! Also added to a bunch of other ETag related classes.

* Generic interface to provide and store ETags for tables within catalogs. This allows the storage
* of the ETag to be completely independent from the tool.
*/
public interface ETagService {
Contributor


Rather than service this is maybe a manager? From what I understand the normal implementation will just be a local hashmap or something

Contributor Author


I have renamed it.

import org.apache.iceberg.catalog.TableIdentifier;

/** Implementation that returns nothing and stores no ETags. */
public class NoOpETagService implements ETagService {
Contributor


Is this the only implementation provided? Seems like a HashMap could get us pretty far.

Contributor


Or a Caffeine cache

Contributor Author


The reason an in-memory implementation doesn't quite make sense here is that you'd never be able to reload that context on subsequent runs of the tool. Since the ETag is only retrieved/used once per table per migration, storing it ephemerally isn't useful.
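The persistence point can be illustrated with a hedged sketch: a file-backed store survives across runs, whereas a plain HashMap would start empty every time the tool launches. The class below is illustrative only, not the tool's CsvETagService:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative file-backed ETag store: state survives across tool runs. */
class FileETagStoreSketch {
  private final Path file;
  private final Map<String, String> etags = new HashMap<>();

  FileETagStoreSketch(Path file) {
    this.file = file;
    try {
      if (Files.exists(file)) {
        // Reload ETags persisted by a previous run (simple "key,value" lines).
        for (String line : Files.readAllLines(file)) {
          int comma = line.indexOf(',');
          if (comma > 0) etags.put(line.substring(0, comma), line.substring(comma + 1));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  String get(String tableId) {
    return etags.get(tableId);
  }

  void put(String tableId, String etag) {
    etags.put(tableId, etag);
    List<String> lines = new ArrayList<>();
    etags.forEach((k, v) -> lines.add(k + "," + v));
    try {
      Files.write(file, lines); // persist immediately so the next run can see it
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```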


import org.apache.iceberg.catalog.TableIdentifier;

public class NotModifiedException extends RuntimeException {
Contributor


When is this meant to be thrown?

If we keep this can we get a Javadoc?

Contributor Author


This is meant to be thrown when we get a 304 NOT MODIFIED from the server, indicating that table metadata is current for an ETag we provide. I have renamed it for clarity and given it a Javadoc.
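The 304 flow described above could look roughly like the sketch below. The class and method names are hypothetical stand-ins, not the renamed exception in the PR:

```java
/** Sketch: an HTTP 304 means the metadata for the supplied ETag is current, so skip the table. */
class NotModifiedSketch {
  static class TableNotModifiedException extends RuntimeException {
    TableNotModifiedException(String tableId) {
      super("Table " + tableId + " not modified for the provided ETag");
    }
  }

  static String loadMetadata(int httpStatus, String tableId, String responseBody) {
    if (httpStatus == 304) {
      // The server confirmed our cached ETag is current: nothing to migrate.
      throw new TableNotModifiedException(tableId);
    }
    return responseBody; // stand-in for parsing a real metadata response
  }
}
```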


if (etagFilePath != null) {
File etagFile = new File(etagFilePath);
etagService = new CsvETagService(etagFile);
Contributor


Oh I see, we also have this implementation as the default one. I think that's fine, but how are you meant to reconfigure this if it's hard-coded?

Contributor Author


The point of making this configurable wasn't so much a concern for the CLI; it's for someone who wants to use the PolarisSynchronizer outside the CLI and persist ETags to a different persistence store than a file.

Contributor


Can we initialize the ETagService based on the config here?

Contributor Author


Ok, what I've done is make a custom implementation configurable via the CLI. Now you can specify an ETagManager type with the option --etag-storage-type (currently NONE, FILE, and CUSTOM) and pass a fluid set of properties via --etag-storage-properties. Is this sort of what you're looking for? I still have doubts as to whether someone would want to use anything besides a file from the CLI; I just made it configurable so that if the code itself is used elsewhere, someone can easily write and pass in a custom implementation.
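Dispatching on a storage-type value like this might look roughly as follows. This is a hedged sketch; the property names (`file`, `impl`) and class names are assumptions for illustration, not the tool's actual factory:

```java
import java.util.Map;

/** Hedged sketch of dispatching on an --etag-storage-type value (NONE, FILE, CUSTOM). */
class ETagManagerFactorySketch {
  interface ETagManager {}

  static class NoOpManager implements ETagManager {}

  static class FileManager implements ETagManager {
    final String path;
    FileManager(String path) { this.path = path; }
  }

  static ETagManager create(String type, Map<String, String> props) {
    switch (type) {
      case "NONE":
        return new NoOpManager();
      case "FILE":
        // Property name "file" is an assumption for illustration.
        return new FileManager(props.getOrDefault("file", "etags.csv"));
      case "CUSTOM":
        try {
          // A custom implementation is loaded reflectively from a class-name property.
          return (ETagManager) Class.forName(props.get("impl"))
              .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
          throw new IllegalArgumentException("Cannot load custom ETagManager", e);
        }
      default:
        throw new IllegalArgumentException("Unknown etag-storage-type: " + type);
    }
  }
}
```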


@CommandLine.Option(
names = {"--etag-file"},
description = "The file path of the file to retrieve and store table ETags from.")
Contributor


We include this arg here which looks specific to one implementation of the ETagService; is the intent to have ETagService be pluggable or to always rely on a file?

Contributor Author


The intent is to have it be pluggable by a client using the PolarisSynchronizer outside of the CLI implementation. When we started building this tool, there was interest in plugging the tool's logic into use cases outside of the CLI, e.g. internally within a background service, where we could pick a different persistence store than a file.

Addressed comments

Add minimal build configuration
@mansehajsingh force-pushed the polaris-migrator-only branch from eb30310 to f2cd430 on April 11, 2025 18:29
@mansehajsingh
Contributor Author

Updates:

I have added migration of principals and their assignments to principal roles. I have gated these behind a flag, --sync-principals, so that users have to opt in to migrating these entities. The reason for this is that the client credentials for these principals get reset on the target instance. Since credentials are only available at creation time, I've made it so that the tool logs the new target credentials to stdout.

eg.

WARN  - Principal migration will reset credentials on the target Polaris instance. Principal migration will log the new target Principal credentials to stdout.

...

INFO  - Overwrote principal principal-2 on target. Target credentials: <client-id>:<client-secret> - 1/1

I have added tests for this functionality, improved the Javadocs, and made some minor changes. I have updated the design spec with these changes as well.

@mansehajsingh
Contributor Author

I have also added base-level build configs so that the tool can build a runnable shadow JAR. This will need to be iterated on to add CI, code-style checks, etc., now that the build configuration in #1 is not common to the entire repository.

@mansehajsingh
Contributor Author

@dimas-b I have updated the PR description with a link to the design spec!


@travis-bowen left a comment


Looked over the latest changes.


/**
* Drop a namespace by first dropping all nested namespaces and tables underneath the namespace
* hierarchy. The empty namespace will not be dropped.


For 'the empty namespace will not be dropped', it might help to say 'Namespace.empty()', which represents the root namespace - to make clear that this isn't the same thing as a namespace that happens to be empty.

Contributor Author


I have updated this comment
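The depth-first drop that the docstring describes can be sketched as below. The `Catalog` interface and the use of `""` to stand in for Namespace.empty() are illustrative assumptions:

```java
import java.util.List;

/** Hedged sketch of the cascading drop: depth-first, and the root namespace survives. */
class NamespaceDropSketch {
  /** Minimal stand-in for a catalog client; names here are illustrative assumptions. */
  interface Catalog {
    List<String> childNamespaces(String ns);
    List<String> tables(String ns);
    void dropTable(String table); // purge is deliberately not exposed
    void dropNamespace(String ns);
  }

  static void dropRecursively(Catalog catalog, String ns) {
    // Drop children before the namespace itself so the drop never fails on
    // a non-empty namespace.
    for (String child : catalog.childNamespaces(ns)) {
      dropRecursively(catalog, child);
    }
    for (String table : catalog.tables(ns)) {
      catalog.dropTable(table);
    }
    if (!ns.isEmpty()) { // "" stands in for Namespace.empty(), the root namespace
      catalog.dropNamespace(ns);
    }
  }
}
```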

List<TableIdentifier> listTables(Namespace namespace);
Table loadTable(TableIdentifier tableIdentifier);
void registerTable(TableIdentifier tableIdentifier, String metadataFileLocation);
void dropTableWithoutPurge(TableIdentifier tableIdentifier);


Feels a little strange to have dropTableWithoutPurge when I believe the Iceberg API is just dropTable, where you specify whether purge is requested or not.

not a blocker though so if it's not a quick method name update feel free to skip.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named this explicitly to ensure that an implementor never enables purge on a table drop. Unfortunately, while I think this is super scary, the default dropTable on the Catalog interface within Iceberg is described as "Drop a table and delete all data and metadata files.". Explicitly passing purge=false is the correct thing to do here because we do not want to clean anything up; we need to register that same metadata file to the target catalog.
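A minimal sketch of the wrapper's intent, using an illustrative stand-in interface rather than the real Iceberg Catalog type:

```java
/** Sketch: the wrapper pins purge=false so data/metadata files survive for re-registration. */
class DropWithoutPurgeSketch {
  /** Illustrative stand-in for the Iceberg dropTable(identifier, purge) shape. */
  interface IcebergCatalog {
    boolean dropTable(String tableId, boolean purge);
  }

  static boolean dropTableWithoutPurge(IcebergCatalog catalog, String tableId) {
    // The same metadata file must be registered on the target catalog,
    // so data and metadata files are never cleaned up here.
    return catalog.dropTable(tableId, /* purge= */ false);
  }
}
```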

catalogProperties.put("warehouse", catalogName);

String clientId = migratorPrincipal.getCredentials().getClientId();
String clientSecret = migratorPrincipal.getCredentials().getClientSecret();


Does this need any other properties provided to the CLI? For example - oauth endpoint if it's not standard? Or other forms of access (like just an auth token - if that's supported)

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new property omnipotent-principal-oauth2-server-uri that defaults to the v1/oauth/tokens endpoint, but now is configurable from the CLI. The current omnipotent principal workflow outputs client credentials at the moment, so I'm not sure an auth token is necessary right now. We should explore this with external OAuth.

AccessControlService accessControlService = new AccessControlService(polaris);
polarisApiConnectionProperties.putIfAbsent("iceberg-write-access", String.valueOf(withWriteAccess));

PolarisService polaris = PolarisServiceFactory.createPolarisService(


It feels a little strange to convert withWriteAccess to implicitly imply target or source - it would be kind of nicer if it's possible for the variable to just maintain the withWriteAccess through the object creation, but if there's good reason we can always start with this.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you're right, this is a bit clunky. I've made it so that the individual commands just pipe the property down; it is no longer implicitly implied by being the source or target.

Contributor

@eric-maynard left a comment


LGTM! There's lots of followup work I can imagine, like import/export from files or parallelization, but this will be really useful for anyone migrating from one Polaris instance to another where the metastores are not the same. Thanks for all your work on this!

@eric-maynard merged commit bdda19f into apache:main Apr 18, 2025