-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark 3.4: Add RewritePositionDeleteFilesProcedure #7572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark 3.4: Add RewritePositionDeleteFilesProcedure #7572
Conversation
...k/src/main/java/org/apache/iceberg/spark/procedures/RewritePositionDeleteFilesProcedure.java
Outdated
Show resolved
Hide resolved
...k/src/main/java/org/apache/iceberg/spark/procedures/RewritePositionDeleteFilesProcedure.java
Outdated
Show resolved
Hide resolved
| mapBuilder.put("register_table", RegisterTableProcedure::builder); | ||
| mapBuilder.put("publish_changes", PublishChangesProcedure::builder); | ||
| mapBuilder.put("create_changelog_view", CreateChangelogViewProcedure::builder); | ||
| mapBuilder.put("rewrite_position_delete_files", RewritePositionDeleteFilesProcedure::builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe shorten the name by just calling it as rewrite_position_deletes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also add documentation here
https://github.com/apache/iceberg/blob/master/docs/spark-procedures.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does everyone think? I use RewritePositionDeletes and PositionDeletes in the code a lot for shortness, but was not sure as the procedure names all indicate files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally I'd go for shorter names, but I see here all the keys in the map already have files suffix so I think it's fine as it is currently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea thats the primary reason why i kept 'files' though i do think it is long, @aokolnychyi what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion. I also like shorter names but it seems we use files suffix in other procedures so I'll be inclined to keep the name as is.
| Long.valueOf(snapshotSummary.get(ADDED_FILE_SIZE_PROP)))), | ||
| output); | ||
|
|
||
| Assert.assertEquals(1, TestHelpers.deleteFiles(table).size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should we also validate the contents of the new delete file? (whether it really has all the rewritten files' contents?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted this test to be about the procedure code, as I already have added delete file content check in the test of of Action itself in TestRewritePositionDeleteFilesAction
| private Map<String, String> snapshotSummary() { | ||
| return validationCatalog.loadTable(tableIdent).currentSnapshot().summary(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also add a test case with dangling deletes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, added already tests of dangling deletes on : TestRewritePositionDeleteFilesAction, and was thinking this to be a quicker test just to validate the procedure code only
714c6dc to
f8158e7
Compare
| mapBuilder.put("register_table", RegisterTableProcedure::builder); | ||
| mapBuilder.put("publish_changes", PublishChangesProcedure::builder); | ||
| mapBuilder.put("create_changelog_view", CreateChangelogViewProcedure::builder); | ||
| mapBuilder.put("rewrite_position_delete_files", RewritePositionDeleteFilesProcedure::builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally I'd go for shorter names, but I see here all the keys in the map already have files suffix so I think it's fine as it is currently
...k/src/main/java/org/apache/iceberg/spark/procedures/RewritePositionDeleteFilesProcedure.java
Show resolved
Hide resolved
...k/src/main/java/org/apache/iceberg/spark/procedures/RewritePositionDeleteFilesProcedure.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/apache/iceberg/spark/extensions/TestRewritePositionDeleteFilesProcedure.java
Show resolved
Hide resolved
...k/src/main/java/org/apache/iceberg/spark/procedures/RewritePositionDeleteFilesProcedure.java
Show resolved
Hide resolved
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @szehon-ho !
|
Let me take a look as well. |
| }); | ||
| } | ||
|
|
||
| private InternalRow[] toOutputRows(Result result) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Another way to write it.
private InternalRow toOutputRow(Result result) {
return newInternalRow(
result.rewrittenDeleteFilesCount(),
result.rewrittenBytesCount(),
result.addedDeleteFilesCount(),
result.addedBytesCount());
}
...
return new InternalRow[] {toOutputRow(result)};
Up to you, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, figured out your argument is in the wrong order, but yep, done.
b22c46f to
12ba560
Compare
| return modifyIcebergTable( | ||
| tableIdent, | ||
| table -> { | ||
| RewritePositionDeleteFiles action = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed I can get remove this as well and just chain the action call
12ba560 to
662e534
Compare
|
Thanks everyone! |
This commit backports PR #7572 to Spark 3.3.
Adds a spark procedure: rewrite_position_delete_files for #7389