Add Polaris synchronization and migration tool to polaris-tools. #4
Conversation
    }

    public static PolarisService newPolarisService(String baseUrl, String accessToken) {
      validatePolarisInstanceProperties(baseUrl, accessToken, null, null, null, null);
nit: often nice when there's a bunch of nulls to indicate what they are via comments.
    validatePolarisInstanceProperties(baseUrl, accessToken, null /* oauth2ServerUri */, null /* clientId */, null /* clientSecret */, null /* scope */);
Added!
    for (Catalog catalog : catalogSyncPlan.entitiesToOverwrite()) {
      try {
        setupOmnipotentCatalogRoleIfNotExistsTarget(catalog.getName());
Is there a reason this is only needed on overwrite and remove and not on create?
This is because the overwrite of the catalog requires us to perform a cascading drop of the catalog. We only need to set up this omnipotent principal when we are initializing an Iceberg REST client. On a createCatalog, we don't need an omnipotent principal until the time of syncing namespaces and tables. On overwrite and remove we need to do it beforehand so we can drop catalog internals like namespaces and tables.
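To make that ordering concrete, here is a minimal self-contained sketch; the method and class names are illustrative, not the tool's real API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: on overwrite, the omnipotent catalog role must be set up
// BEFORE the cascading drop, because dropping namespaces/tables goes through an
// Iceberg REST client authenticated as that role. On create, the role is not
// needed until namespaces and tables are synced later.
public class OverwriteOrdering {
  private final List<String> steps = new ArrayList<>();

  void setupOmnipotentCatalogRoleIfNotExists(String catalog) {
    steps.add("setup-role:" + catalog);
  }

  void dropCatalogInternals(String catalog) {
    // requires the role created in the previous step
    steps.add("drop-internals:" + catalog);
  }

  void recreateCatalog(String catalog) {
    steps.add("recreate:" + catalog);
  }

  public List<String> overwrite(String catalog) {
    setupOmnipotentCatalogRoleIfNotExists(catalog); // must come first
    dropCatalogInternals(catalog);
    recreateCatalog(catalog);
    return steps;
  }
}
```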
    }

    /**
     * Setup an omnipotent principal for the provided catalog on the source Polaris instance.
nit: I think the comment means to say 'target' Polaris instance
Fixed!
    import org.apache.iceberg.io.LocationProvider;

    /**
     * Wrapper table operationw class that just allows fetching a provided table metadata. Used to build
nit: spelling operationw -> operation
Fixed!
cc: @jbonofre It would be awesome to get your expertise on licensing and dependencies for this project.

@mansehajsingh sure thing ! I will !

This looks like a major addition to Polaris tools. While it looks very useful at first glance, I guess it would be helpful to have a …
          + plan.entitiesToRemove().size();
    }

    /** Sync principal roles from source to target. */
What does sync mean here? It looks like this not only copies entities from source to target, but it will also drop things in the target which are not present in the source? Why?
Two part response here:
- Why enable removal from the target? The idea is that if someone is running this periodically or multiple times, in the time since they last ran it, certain access control states may change. For example, BOB leaves the company, so all of BOB's principal roles are revoked. This kind of access control change should certainly be reflected in the target.
- The usage of the word "sync". To be honest I'm not a fan of this word myself. The reason I used this term was because the tool initially started being built off of the iceberg-catalog-migrator introduced in Initial commit for Iceberg catalog migrator #1. For that tool, the term `migrate` means removal of all the entities from the source instance after they have been created in the target. So, not wanting to give the impression that this tool performed a `migration`, I named it a "Synchronization". Now that the tools are separate, I can offer that we rename all occurrences of "Sync" and "Synchronization" to "Migration"? I think that would better represent the tool to users and be more applicable from a code perspective.
sync to me implies bidirectionality... after reading more, I see that this is almost like a backup. I agree that migrate also has some implications. Maybe replicator? I don't have a strong opinion here.
If we don't have strong opinions here, I think migrate might be fine. I think migrate opens up the least amount of assumptions of what this tool is supposed to do. I agree synchronization is pretty ambiguous. While I like replicator, replication to me is less applicable to how most people will be using this tool, which is to perform a bulk migration from Polaris 1 to Polaris 2.
      principalRolesSource = source.listPrincipalRoles();
      clientLogger.info("Listed {} principal-roles from source.", principalRolesSource.size());
    } catch (Exception e) {
      clientLogger.error("Failed to list principal-roles from source.", e);
In failure cases like this one, do we want the tool to actually continue?
The idea here was that in the event of failure, you could triage the reasons behind why something was failing from the logs (e.g. if a catalog was being migrated, not having credentials set up on the target to access storage), and just re-run the tool "idempotently" to migrate whatever failed to migrate last time, but leave everything that moved over successfully. Here, maybe we fail the principal roles for some reason, like the list call just had a network failure, but we can still migrate over everything else. This would be especially useful for cron scenarios, because we don't want to tank the whole run because this part failed. To avoid one-off failures, the roadmap for the tool includes introducing retries, etc.
Could we add a config to optionally fail loudly?
I have added an option --halt-on-failure
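The semantics could be sketched roughly like this (the class and method names are illustrative, not the actual implementation): each sync step runs in a guard that either logs and continues with an empty result, or rethrows when halt-on-failure is enabled.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch of log-and-continue vs. --halt-on-failure semantics.
public class FailurePolicy {
  private final boolean haltOnFailure;

  public FailurePolicy(boolean haltOnFailure) {
    this.haltOnFailure = haltOnFailure;
  }

  /** Run one sync step; on failure either rethrow or skip it with an empty result. */
  public <T> List<T> runStep(String stepName, Supplier<List<T>> step) {
    try {
      return step.get();
    } catch (RuntimeException e) {
      if (haltOnFailure) {
        throw new IllegalStateException("Halting: " + stepName + " failed", e);
      }
      // default mode: log and move on so other entity types can still sync
      return Collections.emptyList();
    }
  }
}
```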
    import org.apache.iceberg.TableOperations;
    import org.apache.iceberg.metrics.MetricsReporter;

    /** Wrapper around {@link BaseTable} that contains the latest ETag for the table. */
Can we add a TODO noting that this won't be needed once the Polaris dependency is upgraded to a version of Polaris that has the ETag in the normal response type(s)?
Added! Also added to a bunch of other ETag related classes.
     * Generic interface to provide and store ETags for tables within catalogs. This allows the storage
     * of the ETag to be completely independent from the tool.
     */
    public interface ETagService {
Rather than a service, this is maybe a manager? From what I understand the normal implementation will just be a local hashmap or something.
I have renamed it.
    import org.apache.iceberg.catalog.TableIdentifier;

    /** Implementation that returns nothing and stores no ETags. */
    public class NoOpETagService implements ETagService {
Is this the only implementation provided? Seems like a HashMap could get us pretty far.
Or a Caffeine cache
The reason why an in-memory implementation doesn't quite make sense for this is that you'd never really be able to reload this context on following runs of the tool. Since you're only ever retrieving/using the ETag once per table per migration, it is not useful to store it ephemerally.
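To illustrate why persistence between runs matters, here is a minimal file-backed store sketched with `java.util.Properties`; the actual tool uses a CSV file, and the class and method names here are invented for illustration:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

// Illustrative file-backed ETag store: ETags are only consulted once per table
// per migration, so they are useless unless they survive to the NEXT run.
public class FileETagStore {
  private final File file;
  private final Properties etags = new Properties();

  public FileETagStore(File file) throws IOException {
    this.file = file;
    if (file.exists()) {
      try (InputStream in = new FileInputStream(file)) {
        etags.load(in);
      }
    }
  }

  public String get(String tableIdentifier) {
    return etags.getProperty(tableIdentifier);
  }

  public void put(String tableIdentifier, String etag) throws IOException {
    etags.setProperty(tableIdentifier, etag);
    // rewrite the whole file on each update; fine at this scale
    try (OutputStream out = new FileOutputStream(file)) {
      etags.store(out, "table ETags");
    }
  }
}
```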
    import org.apache.iceberg.catalog.TableIdentifier;

    public class NotModifiedException extends RuntimeException {
When is this meant to be thrown?
If we keep this can we get a Javadoc?
This is meant to be thrown when we get a 304 NOT MODIFIED from the server indicating that table metadata is current for an ETag we provide. I have renamed it for clarity and given it a javadoc.
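The renamed exception might look roughly like this; the name, message, and accessor are illustrative of the idea, not copied from the PR:

```java
/**
 * Illustrative sketch: thrown when the server answers 304 NOT MODIFIED for a
 * loadTable request that carried a cached ETag, meaning the table metadata is
 * unchanged and the table can be skipped during the sync.
 */
public class TableNotModifiedException extends RuntimeException {
  private final String etag;

  public TableNotModifiedException(String tableName, String etag) {
    super("Table " + tableName + " not modified for ETag " + etag + " (HTTP 304)");
    this.etag = etag;
  }

  /** The ETag for which the server reported no modification. */
  public String etag() {
    return etag;
  }
}
```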
    if (etagFilePath != null) {
      File etagFile = new File(etagFilePath);
      etagService = new CsvETagService(etagFile);
Oh I see, we also have this implementation as the default one. That seems good, I think that's fine, but how are you meant to reconfigure this if it's hard-coded?
The point of making this configurable wasn't so much a concern for the CLI, but if someone wanted to use the PolarisSynchronizer from somewhere outside the CLI and have it persist ETags to a different persistence store than a file.
Can we initialize the ETagService based on the config here?
Ok, what I've done is made it so that a custom implementation is configurable via the CLI. Now you can specify an ETagManager type with an option --etag-storage-type (currently NONE, FILE, and CUSTOM) and pass a fluid set of properties via --etag-storage-properties. Is this sort of what you're looking for? I still have doubts as to whether someone would necessarily want to use anything beside a file from the CLI. I just made it configurable so if the code itself was used anywhere someone could easily write and pass in a custom implementation.
    @CommandLine.Option(
        names = {"--etag-file"},
        description = "The file path of the file to retrieve and store table ETags from.")
We include this arg here which looks specific to one implementation of the ETagService; is the intent to have ETagService be pluggable or to always rely on a file?
The intent is to have it be pluggable by a client using the PolarisSynchronizer outside of the CLI implementation. When we started building this tool, we had interest for being able to plug the tool's logic into use cases outside of the CLI, eg. internally within a background service, where we could pick a different persistence store than a file.
Addressed comments
Add minimal build configuration
eb30310 to f2cd430
Updates: I have added migration of principals and their assignments to principal roles to the tool. I have gated these behind a flag (e.g. …). I have added tests for this functionality, improved the javadocs, and made some minor changes. I have updated the design spec with these changes as well.

I have also added base-level build configs so that the tool can build a runnable shadowjar. This will need to be iterated on to add CI, running checks for codestyle, etc., now that the build configuration in #1 is not common to the entire repository.

@dimas-b I have updated the PR description with a link to the design spec!
Generalize source and target Polaris behind interface.
travis-bowen left a comment
Looked over the latest changes.
    /**
     * Drop a namespace by first dropping all nested namespaces and tables underneath the namespace
     * hierarchy. The empty namespace will not be dropped.
For 'the empty namespace will not be dropped', it might help to say 'Namespace.empty()', which represents the root namespace, to clarify that this isn't the same thing as a namespace that happens to be empty.
I have updated this comment
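The recursive drop that spares the root can be sketched like this; the interface is a simplified stand-in rather than the tool's real API, and the empty string plays the role of Iceberg's `Namespace.empty()`:

```java
import java.util.List;

// Simplified stand-in for the cascading drop: recursively drop child namespaces
// and their tables, but never drop the root namespace itself.
public class CascadingDrop {
  public interface NamespaceCatalog {
    List<String> listChildNamespaces(String namespace);
    List<String> listTables(String namespace);
    void dropTable(String table);
    void dropNamespace(String namespace);
  }

  /** Drop everything under {@code namespace}; drop the namespace itself unless it is the root. */
  public static void dropRecursively(NamespaceCatalog catalog, String namespace) {
    for (String child : catalog.listChildNamespaces(namespace)) {
      dropRecursively(catalog, child);
    }
    for (String table : catalog.listTables(namespace)) {
      catalog.dropTable(table);
    }
    if (!namespace.isEmpty()) { // "" stands in for Namespace.empty(), the root
      catalog.dropNamespace(namespace);
    }
  }
}
```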
    List<TableIdentifier> listTables(Namespace namespace);
    Table loadTable(TableIdentifier tableIdentifier);
    void registerTable(TableIdentifier tableIdentifier, String metadataFileLocation);
    void dropTableWithoutPurge(TableIdentifier tableIdentifier);
Feels a little strange to have dropTableWithoutPurge when I believe the Iceberg API is just dropTable, where you specify whether purge is requested or not.
Not a blocker though, so if it's not a quick method name update feel free to skip.
I named this explicitly to ensure that an implementor never enables purge on a drop of a table. Unfortunately, while I think this is super scary, the default dropTable on the Catalog interface within Iceberg is described as "Drop a table and delete all data and metadata files.". Explicitly passing purge=false is the correct thing to do here because we do not want to clean anything up; we need to register that same metadata file to the target catalog.
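In Iceberg's Catalog interface, the full signature is `dropTable(TableIdentifier, boolean purge)`; the wrapper pins the flag, roughly like this (the `Catalog` interface below is a minimal stand-in so the sketch compiles on its own):

```java
// Stand-in for org.apache.iceberg.catalog.Catalog#dropTable(TableIdentifier, boolean),
// kept minimal so the sketch is self-contained.
public class SafeDrop {
  public interface Catalog {
    boolean dropTable(String identifier, boolean purge);
  }

  /**
   * The only drop the tool exposes: the metadata pointer is removed but data and
   * metadata files are left untouched, so the same metadata file can later be
   * registered with the target catalog.
   */
  public static boolean dropTableWithoutPurge(Catalog catalog, String identifier) {
    return catalog.dropTable(identifier, false /* purge */);
  }
}
```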
    catalogProperties.put("warehouse", catalogName);

    String clientId = migratorPrincipal.getCredentials().getClientId();
    String clientSecret = migratorPrincipal.getCredentials().getClientSecret();
Does this need any other properties provided to the CLI? For example - oauth endpoint if it's not standard? Or other forms of access (like just an auth token - if that's supported)
I have added a new property omnipotent-principal-oauth2-server-uri that defaults to the v1/oauth/tokens endpoint, but now is configurable from the CLI. The current omnipotent principal workflow outputs client credentials at the moment, so I'm not sure an auth token is necessary right now. We should explore this with external OAuth.
    AccessControlService accessControlService = new AccessControlService(polaris);
    polarisApiConnectionProperties.putIfAbsent("iceberg-write-access", String.valueOf(withWriteAccess));

    PolarisService polaris = PolarisServiceFactory.createPolarisService(
It feels a little strange to have withWriteAccess implicitly imply target or source. It would be nicer if the variable could just carry withWriteAccess through the object creation, but if there's a good reason we can always start with this.
No, you're right, this is a bit clunky. I've made it so that the individual commands just pipe the property down, and it is no longer implicitly implied by being the source or target.
eric-maynard left a comment
LGTM! There's lots of followup work I can imagine, like import/export from files or parallelization, but this will be really useful for anyone migrating from one Polaris instance to another where the metastores are not the same. Thanks for all your work on this!
Adds an in-development tool in the polaris-synchronizer/ directory to migrate between two Polaris instances.
Roadmap: The tool is in development and there are a multitude of enhancements planned to generalize it across a wider array of use cases. Before these enhancements are carried out, it is important that the tool becomes generally available to the community so the wider Polaris group can pitch in ideas and contributions, guiding the tool to align with new features planned on Polaris's roadmap. Some planned enhancements include:
Design Specification: https://docs.google.com/document/d/1AXKmzp3JaTuUS_FMNnxr_pHsBTs86rWRMborMi3deCw/edit?usp=sharing