-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark: Refactor snapshot and migrate actions #2437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark: Refactor snapshot and migrate actions #2437
Conversation
spark3/src/main/java/org/apache/iceberg/spark/actions/BaseMigrateTableSparkAction.java
Outdated
Show resolved
Hide resolved
spark3/src/main/java/org/apache/iceberg/spark/actions/BaseSnapshotTableSparkAction.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| @Override | ||
| public SnapshotTable tableLocation(String location) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will have to be refined separately.
spark3/src/main/java/org/apache/iceberg/spark/actions/BaseTableMigrationSparkAction.java
Outdated
Show resolved
Hide resolved
| public CreateAction withProperties(Map<String, String> properties) { | ||
| this.additionalProperties.putAll(properties); | ||
| return this; | ||
| protected void setDestCatalogAndIdent(CatalogPlugin catalog, Identifier ident) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Called from as.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We let you use "as" with Migration? Maybe I'm forgetting how this worked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was needed before since Snapshot and Migrate descended from CreateTable Actions, now I don't think you need dest catalog at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still the parent class where we need to reference the dest catalog. However, I think your idea makes sense. We can probably remove the dest fields from the parent class as they are initialized differently now. Then we don't have to do that weird if statement in the constructor.
I've pushed an update. Let me know if you prefer the old approach.
spark3/src/main/java/org/apache/iceberg/spark/actions/BaseTableMigrationSparkAction.java
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public long importedDataFilesCount() { | ||
| return importedDataFilesCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this pr, but now that I think about it we should probably also let the user know how many metadata files were created as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Now that we have the result interface, we can evolve it. Could you create an issue for this, @RussellSpitzer?
@flyrain, would you be interested to pick it up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
I see more and more Tez-related failures on recent PRs. |
Could you please provide a link for any of the failed test runs? I would like to get the logs for the failed tests. Thanks, |
I see this at the end of the logs: Was it done manually, or it was cancelled automatically because of the failures? I am asking because some time ago (in #1789) I have introduced a feature to collect the test logs in exactly the same situations that happen here (flaky test failures on CI). This created a If the run was cancelled automatically then I have to check what changed around the build process. OTOH if the run was cancelled manually then I need to find a non-cancelled run with the same failures. Thanks, |
|
Hmm, I think you analysis is right, @pvary. It looks like we are hitting timeouts, though. The latest CI job on this PR, for example, failed with the timeout of 360 minutes. |
|
I'll try to see what causes the timeout tomorrow. |
|
Looks like the job I pointed to also failed with a timeout. |
Maybe this is related to resource problems when trying to create the Tez session? When we are creating a new TezAM we ask a new container from YARN. Maybe the issue is that we do not get the new Yarn container (because of the missing resources) and we waiting until the timeout is reached. It might be a good idea to add |
f5ec56f to
54ec4bd
Compare
54ec4bd to
90aacfb
Compare
|
Rebased this one. @RussellSpitzer, could you take a look? |
|
|
||
| Spark3CreateAction(SparkSession spark, CatalogPlugin sourceCatalog, Identifier sourceTableIdent, | ||
| CatalogPlugin destCatalog, Identifier destTableIdent) { | ||
| BaseTableMigrationSparkAction(SparkSession spark, CatalogPlugin sourceCatalog, Identifier sourceTableIdent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked having this be a non "migrate - snapshot" name here because I didn't want any confusion if there was an error during "snapshots" that made it look like it was doing a migrate because of the trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense to me, My only real issue is I think the namings are a little confusing,
We have
BaseTableMigrationSparkAction
| - BaseMigrateTableSparkAction
| - BaseSparkTableSnapshot Action
I think the base class here should be "Create" or something more visually distinct because currently we have TableMigration and MigrateTable, and as I noted I think we should keep our Snapshot traces clear of any references to migrating.
That said I wouldn't hold this back based on naming, so feel free to ignore.
I still think BaseTableCreationSparkAction is fine :)
|
Agree, @RussellSpitzer. Let me fix it. |
|
Thanks for reviewing, @RussellSpitzer! |
This PR moves our snapshot and migrate actions to use the new API.