
Conversation

@mansehajsingh
Contributor

@mansehajsingh commented Apr 8, 2025

Adds an in-development tool in the polaris-synchronizer/ directory to migrate entities between two Polaris instances.

  • Changed file headers
  • Ported over code

Roadmap: The tool is in development, and a multitude of enhancements are planned to generalize it across a wider array of use cases. Before these enhancements are carried out, it is important that the tool becomes generally available to the community, so the wider Polaris group can pitch in ideas and contributions and guide the tool to align with new features planned on Polaris's roadmap. Some planned enhancements include:

  • Make the tool's build system conform with the build items introduced in Initial commit for Iceberg catalog migrator #1
  • Generalize tool auth flow to be less dependent on the deprecated client credentials flow for the catalog and add token refresh for long running jobs
  • Improve robustness with enhancements like retries w/ exponential backoff, timeouts, etc.
  • Determine how Policy migration might look once the Policy APIs are complete in Polaris.

Design Specification: https://docs.google.com/document/d/1AXKmzp3JaTuUS_FMNnxr_pHsBTs86rWRMborMi3deCw/edit?usp=sharing

}

public static PolarisService newPolarisService(String baseUrl, String accessToken) {
validatePolarisInstanceProperties(baseUrl, accessToken, null, null, null, null);

nit: often nice when there's a bunch of nulls to indicate what they are via comments:
validatePolarisInstanceProperties(baseUrl, accessToken, null /* oauth2ServerUri */, null /* clientId */, null /* clientSecret */, null /* scope */);

Contributor Author


Added!


for (Catalog catalog : catalogSyncPlan.entitiesToOverwrite()) {
try {
setupOmnipotentCatalogRoleIfNotExistsTarget(catalog.getName());


Is there a reason this is only needed on overwrite and remove and not on create?

Contributor Author


This is because the overwrite of the catalog requires us to perform a cascading drop of the catalog. We only need to set up this omnipotent principal when we are initializing an Iceberg REST client. On a createCatalog, we don't need an omnipotent principal until the time of syncing namespaces and tables. On overwrite and remove, we need to do it beforehand so we can drop catalog internals like namespaces and tables.
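To make the ordering concrete, here is a minimal hypothetical sketch of the reasoning above. The interface and method names (`TargetPolaris`, `dropCatalogCascade`, etc.) are assumptions for illustration, not the tool's actual API:

```java
/** Hypothetical sketch: principal setup must precede the cascading drop on overwrite. */
interface TargetPolaris {
  void setupOmnipotentCatalogRoleIfNotExists(String catalogName);
  void dropCatalogCascade(String catalogName);
  void createCatalog(String catalogName);
}

class CatalogSyncSketch {
  static void overwrite(TargetPolaris target, String catalogName) {
    // The cascading drop needs an Iceberg REST client, which in turn needs the
    // omnipotent principal, so it is set up before touching catalog internals.
    target.setupOmnipotentCatalogRoleIfNotExists(catalogName);
    target.dropCatalogCascade(catalogName);
    target.createCatalog(catalogName);
  }

  static void create(TargetPolaris target, String catalogName) {
    // A plain create drops nothing, so the principal is not needed until
    // namespace/table sync begins later.
    target.createCatalog(catalogName);
  }
}
```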

}

/**
* Setup an omnipotent principal for the provided catalog on the source Polaris instance.


nit: I think the comment means to say 'target' Polaris instance

Contributor Author


Fixed!

import org.apache.iceberg.io.LocationProvider;

/**
* Wrapper table operationw class that just allows fetching a provided table metadata. Used to build


nit: spelling operationw -> operation

Contributor Author


Fixed!

@mansehajsingh
Contributor Author

cc: @jbonofre It would be awesome to get your expertise on licensing and dependencies for this project.

@jbonofre
Member

jbonofre commented Apr 9, 2025

@mansehajsingh sure thing! I will!

@dimas-b

dimas-b commented Apr 9, 2025

This looks like a major addition to Polaris tools. While it looks very useful at first glance, I guess it would be helpful to have a dev ML summary of how it works and how it is meant to be used. WDYT? If I missed it, my apologies :) The README in this PR looks good for people who want to actually run the tool, but that text still seems to be too high level to help with understanding the tool's design (short of reading all the code).

+ plan.entitiesToRemove().size();
}

/** Sync principal roles from source to target. */
Contributor


What does sync mean here? It looks like this not only copies entities from source to target, but it will also drop things in the target which are not present in the source? Why?

Contributor Author


Two part response here:

  • Why enable removal from the target? The idea is that if someone is running this periodically or multiple times, in the time since they last ran it, certain access control states may change. For example, BOB leaves the company, so all of BOB's principal roles are revoked. This kind of access control change should certainly be reflected in the target.
  • The usage of the word sync. To be honest, I'm not a fan of this word myself. I used the term because the tool was initially built off of the iceberg-catalog-migrator introduced in Initial commit for Iceberg catalog migrator #1. For that tool, the term migrate means removal of all the entities from the source instance after they have been created in the target. So, not wanting to give the impression that this tool performed that kind of migration, I named it a "Synchronization". Now that the tools are separate, I can propose that we rename all occurrences of "Sync" and "Synchronization" to "Migration". I think that would better represent the tool to users and be more applicable from a code perspective.

Contributor


sync to me implies bidirectionality... after reading more, I see that this is almost like a backup. I agree that migrate also has some implications. Maybe replicator? I don't have a strong opinion here.

Contributor Author


If we don't have strong opinions here, I think migrate might be fine. Migrate invites the fewest assumptions about what this tool is supposed to do. I agree synchronization is pretty ambiguous. While I like replicator, replication to me is less applicable to how most people will use this tool, which is to perform a bulk migration from Polaris 1 to Polaris 2.

principalRolesSource = source.listPrincipalRoles();
clientLogger.info("Listed {} principal-roles from source.", principalRolesSource.size());
} catch (Exception e) {
clientLogger.error("Failed to list principal-roles from source.", e);
Contributor


In failure cases like this one, do we want the tool to actually continue?

Contributor Author


The idea here was that in the event of failure, you could triage the reasons behind the failure from the logs (e.g. if a catalog was being migrated, not having credentials set up on the target to access storage), and just re-run the tool idempotently to migrate whatever failed last time, while leaving everything that moved over successfully. Here, maybe the principal roles fail for some reason, like a network failure on the list call, but we can still migrate everything else. This would be especially useful for cron scenarios, because we don't want to tank the whole run because this one part failed. To avoid one-off failures, the roadmap for the tool includes introducing retries, etc.
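The skip-and-continue behavior described above can be sketched as follows. This is a hedged illustration; `listOrEmpty` is a hypothetical helper, not the tool's actual code:

```java
import java.util.List;
import java.util.function.Supplier;

/** Hedged sketch of the fail-soft pattern: log, skip, and let a re-run pick up the rest. */
class FailSoftSketch {
  static <T> List<T> listOrEmpty(Supplier<List<T>> call, String what) {
    try {
      return call.get();
    } catch (Exception e) {
      // A transient failure (e.g. a network blip on a list call) should not
      // abort the whole run; the next idempotent run retries whatever was skipped.
      System.err.println("Failed to list " + what + " (continuing): " + e.getMessage());
      return List.of();
    }
  }
}
```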

Contributor


Could we add a config to optionally fail loudly?

Contributor Author


I have added an option --halt-on-failure
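A minimal sketch of how such a flag might toggle between fail-soft and fail-loud handling. The class and method names below are hypothetical, not the option's actual wiring in the tool:

```java
/** Hypothetical sketch: --halt-on-failure flips error handling from log-and-continue to abort. */
class HaltOnFailureSketch {
  final boolean haltOnFailure;

  HaltOnFailureSketch(boolean haltOnFailure) {
    this.haltOnFailure = haltOnFailure;
  }

  void onFailure(String message, Exception e) {
    if (haltOnFailure) {
      // Fail loudly: surface the error and stop the run.
      throw new RuntimeException(message, e);
    }
    // Default fail-soft behavior: log and keep migrating other entities.
    System.err.println(message + " (continuing): " + e.getMessage());
  }
}
```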

import org.apache.iceberg.TableOperations;
import org.apache.iceberg.metrics.MetricsReporter;

/** Wrapper around {@link BaseTable} that contains the latest ETag for the table. */
Contributor


Can we add a TODO that it's not needed once the Polaris dependency is upgraded with a version of Polaris that has ETag in the normal response type(s)

Contributor Author


Added! Also added to a bunch of other ETag related classes.

* Generic interface to provide and store ETags for tables within catalogs. This allows the storage
* of the ETag to be completely independent from the tool.
*/
public interface ETagService {
Contributor


Rather than service this is maybe a manager? From what I understand the normal implementation will just be a local hashmap or something

Contributor Author


I have renamed it.

import org.apache.iceberg.catalog.TableIdentifier;

/** Implementation that returns nothing and stores no ETags. */
public class NoOpETagService implements ETagService {
Contributor


Is this the only implementation provided? Seems like a HashMap could get us pretty far.

Contributor


Or a Caffeine cache

Contributor Author


The reason an in-memory implementation doesn't quite make sense here is that you'd never be able to reload that context on subsequent runs of the tool. Since the ETag is only retrieved/used once per table per migration, storing it ephemerally isn't useful.
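The persistence point can be illustrated with a hedged sketch: a file-backed store survives across runs, whereas a plain HashMap would start empty every time the tool launches. The class below is illustrative only, not the tool's CsvETagService:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative file-backed ETag store: state survives across tool runs. */
class FileETagStoreSketch {
  private final Path file;
  private final Map<String, String> etags = new HashMap<>();

  FileETagStoreSketch(Path file) {
    this.file = file;
    try {
      if (Files.exists(file)) {
        // Reload ETags persisted by a previous run (simple "key,value" lines).
        for (String line : Files.readAllLines(file)) {
          int comma = line.indexOf(',');
          if (comma > 0) etags.put(line.substring(0, comma), line.substring(comma + 1));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  String get(String tableId) {
    return etags.get(tableId);
  }

  void put(String tableId, String etag) {
    etags.put(tableId, etag);
    List<String> lines = new ArrayList<>();
    etags.forEach((k, v) -> lines.add(k + "," + v));
    try {
      Files.write(file, lines); // persist immediately so the next run can see it
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```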


import org.apache.iceberg.catalog.TableIdentifier;

public class NotModifiedException extends RuntimeException {
Contributor


When is this meant to be thrown?

If we keep this can we get a Javadoc?

Contributor Author


This is meant to be thrown when we get a 304 NOT MODIFIED from the server, indicating that table metadata is current for an ETag we provide. I have renamed it for clarity and given it a Javadoc.
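The 304 flow described above could look roughly like the sketch below. The class and method names are hypothetical stand-ins, not the renamed exception in the PR:

```java
/** Sketch: an HTTP 304 means the metadata for the supplied ETag is current, so skip the table. */
class NotModifiedSketch {
  static class TableNotModifiedException extends RuntimeException {
    TableNotModifiedException(String tableId) {
      super("Table " + tableId + " not modified for the provided ETag");
    }
  }

  static String loadMetadata(int httpStatus, String tableId, String responseBody) {
    if (httpStatus == 304) {
      // The server confirmed our cached ETag is current: nothing to migrate.
      throw new TableNotModifiedException(tableId);
    }
    return responseBody; // stand-in for parsing a real metadata response
  }
}
```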


if (etagFilePath != null) {
File etagFile = new File(etagFilePath);
etagService = new CsvETagService(etagFile);
Contributor


Oh I see, we also have this implementation as the default one. I think that's fine, but how are you meant to reconfigure this if it's hard-coded?

Contributor Author


The point of making this configurable wasn't so much a concern for the CLI; it's for someone who wants to use the PolarisSynchronizer outside the CLI and persist ETags to a different persistence store than a file.

Contributor


Can we initialize the ETagService based on the config here?

Contributor Author


Ok, what I've done is make a custom implementation configurable via the CLI. Now you can specify an ETagManager type with the option --etag-storage-type (currently NONE, FILE, and CUSTOM) and pass a fluid set of properties via --etag-storage-properties. Is this sort of what you're looking for? I still have doubts as to whether someone would want to use anything besides a file from the CLI; I just made it configurable so that if the code itself is used elsewhere, someone can easily write and pass in a custom implementation.
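Dispatching on a storage-type value like this might look roughly as follows. This is a hedged sketch; the property names (`file`, `impl`) and class names are assumptions for illustration, not the tool's actual factory:

```java
import java.util.Map;

/** Hedged sketch of dispatching on an --etag-storage-type value (NONE, FILE, CUSTOM). */
class ETagManagerFactorySketch {
  interface ETagManager {}

  static class NoOpManager implements ETagManager {}

  static class FileManager implements ETagManager {
    final String path;
    FileManager(String path) { this.path = path; }
  }

  static ETagManager create(String type, Map<String, String> props) {
    switch (type) {
      case "NONE":
        return new NoOpManager();
      case "FILE":
        // Property name "file" is an assumption for illustration.
        return new FileManager(props.getOrDefault("file", "etags.csv"));
      case "CUSTOM":
        try {
          // A custom implementation is loaded reflectively from a class-name property.
          return (ETagManager) Class.forName(props.get("impl"))
              .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
          throw new IllegalArgumentException("Cannot load custom ETagManager", e);
        }
      default:
        throw new IllegalArgumentException("Unknown etag-storage-type: " + type);
    }
  }
}
```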


@CommandLine.Option(
names = {"--etag-file"},
description = "The file path of the file to retrieve and store table ETags from.")
Contributor


We include this arg here which looks specific to one implementation of the ETagService; is the intent to have ETagService be pluggable or to always rely on a file?

Contributor Author


The intent is to have it be pluggable by a client using the PolarisSynchronizer outside of the CLI implementation. When we started building this tool, there was interest in plugging the tool's logic into use cases outside of the CLI, e.g. internally within a background service, where we could pick a different persistence store than a file.

Addressed comments

Add minimal build configuration
@mansehajsingh force-pushed the polaris-migrator-only branch from eb30310 to f2cd430 on April 11, 2025 18:29
@mansehajsingh
Contributor Author

Updates:

I have added migration of principals and their assignments to principal roles. I have gated these behind a flag, --sync-principals, so that users have to opt in to migrating these entities. The reason for this is that the client credentials for these principals get reset on the target instance. Since credentials are only available at creation time, I've made it so that the tool logs the new target credentials to stdout.

eg.

WARN  - Principal migration will reset credentials on the target Polaris instance. Principal migration will log the new target Principal credentials to stdout.

...

INFO  - Overwrote principal principal-2 on target. Target credentials: <client-id>:<client-secret> - 1/1

I have added tests for this functionality, improved the Javadocs, and made some minor changes. I have updated the design spec with these changes as well.

@mansehajsingh
Contributor Author

I have also added base-level build configs so that the tool can build a runnable shadow JAR. This will need to be iterated on to add CI, code-style checks, etc., now that the build configuration in #1 is not common to the entire repository.

@mansehajsingh
Contributor Author

@dimas-b I have updated the PR description with a link to the design spec!


@travis-bowen left a comment


Looked over the latest changes.


/**
* Drop a namespace by first dropping all nested namespaces and tables underneath the namespace
* hierarchy. The empty namespace will not be dropped.


For 'the empty namespace will not be dropped', it might help to say 'Namespace.empty()', which represents the root namespace - to make clear that this isn't the same thing as a namespace that happens to be empty.

Contributor Author


I have updated this comment
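The depth-first drop that the docstring describes can be sketched as below. The `Catalog` interface and the use of `""` to stand in for Namespace.empty() are illustrative assumptions:

```java
import java.util.List;

/** Hedged sketch of the cascading drop: depth-first, and the root namespace survives. */
class NamespaceDropSketch {
  /** Minimal stand-in for a catalog client; names here are illustrative assumptions. */
  interface Catalog {
    List<String> childNamespaces(String ns);
    List<String> tables(String ns);
    void dropTable(String table); // purge is deliberately not exposed
    void dropNamespace(String ns);
  }

  static void dropRecursively(Catalog catalog, String ns) {
    // Drop children before the namespace itself so the drop never fails on
    // a non-empty namespace.
    for (String child : catalog.childNamespaces(ns)) {
      dropRecursively(catalog, child);
    }
    for (String table : catalog.tables(ns)) {
      catalog.dropTable(table);
    }
    if (!ns.isEmpty()) { // "" stands in for Namespace.empty(), the root namespace
      catalog.dropNamespace(ns);
    }
  }
}
```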

List<TableIdentifier> listTables(Namespace namespace);
Table loadTable(TableIdentifier tableIdentifier);
void registerTable(TableIdentifier tableIdentifier, String metadataFileLocation);
void dropTableWithoutPurge(TableIdentifier tableIdentifier);


Feels a little strange to have dropTableWithoutPurge when I believe the Iceberg API is just dropTable, where you specify whether purge is requested or not.

not a blocker though so if it's not a quick method name update feel free to skip.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named this explicitly to ensure that an implementor never enables purge on a table drop. Unfortunately, while I think this is super scary, the default dropTable on the Catalog interface within Iceberg is described as "Drop a table and delete all data and metadata files.". Explicitly passing purge=false is the correct thing to do here because we do not want to clean anything up; we need to register that same metadata file to the target catalog.
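A minimal sketch of the wrapper's intent, using an illustrative stand-in interface rather than the real Iceberg Catalog type:

```java
/** Sketch: the wrapper pins purge=false so data/metadata files survive for re-registration. */
class DropWithoutPurgeSketch {
  /** Illustrative stand-in for the Iceberg dropTable(identifier, purge) shape. */
  interface IcebergCatalog {
    boolean dropTable(String tableId, boolean purge);
  }

  static boolean dropTableWithoutPurge(IcebergCatalog catalog, String tableId) {
    // The same metadata file must be registered on the target catalog,
    // so data and metadata files are never cleaned up here.
    return catalog.dropTable(tableId, /* purge= */ false);
  }
}
```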

catalogProperties.put("warehouse", catalogName);

String clientId = migratorPrincipal.getCredentials().getClientId();
String clientSecret = migratorPrincipal.getCredentials().getClientSecret();


Does this need any other properties provided to the CLI? For example - oauth endpoint if it's not standard? Or other forms of access (like just an auth token - if that's supported)

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new property omnipotent-principal-oauth2-server-uri that defaults to the v1/oauth/tokens endpoint, but now is configurable from the CLI. The current omnipotent principal workflow outputs client credentials at the moment, so I'm not sure an auth token is necessary right now. We should explore this with external OAuth.

AccessControlService accessControlService = new AccessControlService(polaris);
polarisApiConnectionProperties.putIfAbsent("iceberg-write-access", String.valueOf(withWriteAccess));

PolarisService polaris = PolarisServiceFactory.createPolarisService(


It feels a little strange to convert withWriteAccess to implicitly imply target or source - it would be kind of nicer if it's possible for the variable to just maintain the withWriteAccess through the object creation, but if there's good reason we can always start with this.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you're right, this is a bit clunky. I've made it so that the individual commands just pipe the property down; it is no longer implicitly implied by being the source or target.

Contributor

@eric-maynard left a comment


LGTM! There's lots of followup work I can imagine, like import/export from files or parallelization, but this will be really useful for anyone migrating from one Polaris instance to another where the metastores are not the same. Thanks for all your work on this!

@eric-maynard merged commit bdda19f into apache:main Apr 18, 2025