-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark-3.3: Handle statistics file clean up from expireSnapshots action/procedure #6091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc: @findepi, @rdblue, @szehon-ho |
b557b25 to
7743b5c
Compare
bc5575b to
69c1b96
Compare
69c1b96 to
7a5be5d
Compare
7a5be5d to
d3b5fa1
Compare
d3b5fa1 to
a9253d8
Compare
|
Now that the core functionality PR (#6090) is merged, I think this PR can be reviewed and merged now. cc: @aokolnychyi, @szehon-ho, @rdblue, @jackye1995 |
|
|
||
| List<Object[]> output = sql("CALL %s.system.expire_snapshots('%s')", catalogName, tableIdent); | ||
| assertEquals("Should not delete any files", ImmutableList.of(row(0L, 0L, 0L, 0L, 0L)), output); | ||
| assertEquals( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added one more entry for deletedStatisticsFilesCount for all the existing cases.
|
I added the core PR to 1.2 release milestone, not sure if we can get this in in time, but I can add it first so it gets more tractions. |
| * returns location of all the statistics files in a table. | ||
| * @return the location of statistics files | ||
| */ | ||
| public static List<String> statisticsFilesLocations(Table table, Set<Long> snapshotIds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we simplify the logic of the original method using stream()? Also for this method, I think we can provide a more generic util method that takes in a filter, instead of just adding a specific method to filter by snapshot ID set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated now.
Originally I wanted to keep the same style as manifestListLocations and others. So, I didn't used modified.
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this overall looks good to me, just a nit comment
22611ea to
a7413fe
Compare
a7413fe to
96081b3
Compare
f81301d to
1030155
Compare
1030155 to
6421f45
Compare
6421f45 to
c4fba5f
Compare
|
finally figured out the testcase failure 😄 . It was not failing locally as I didn't rebase it. After rebase it failed locally and the reason is newly merged #6682 was missing a case for stats files. |
core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
Show resolved
Hide resolved
| } else if (MANIFEST_LIST.equalsIgnoreCase(type)) { | ||
| manifestListsCount.addAndGet(numFiles); | ||
|
|
||
| } else if (STATISTICS_FILES.equalsIgnoreCase(type)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newly added #6682 was missing this case. Hence, test was failing after rebasing.
c4fba5f to
78b495a
Compare
|
@rdblue, @jackye1995 : PR is ready |
| summary.equalityDeleteFilesCount(), | ||
| summary.manifestsCount(), | ||
| summary.manifestListsCount()); | ||
| return ImmutableResult.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with this lib but won't the name conflict if we migrate more results to auto-generated beans as it does not take the outer class name into account?
import org.apache.iceberg.actions.ImmutableResult;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call it Result in all actions and use outer class for identification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. I will check if it is possible to alter the generated class name or include the outer class name.
If I don't find anything in a few hours, I will just use a new private Inner class in ExpireSnapshotsSparkAction.java.
Later It can be easily replaced (as it is private) with Immutables once we have better ways in a follow-up PR.
I don't want to block this PR (targeted for 1.2.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used @Value.Enclosing So that builder looks like ImmutableExpireSnapshots.Result.builder() now.
In a follow-up PR, I plan to deprecate all public result classes of actions and use immutables annotation for actions to build result objects from that.
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java
Show resolved
Hide resolved
| } | ||
|
|
||
| return toFileInfoDS( | ||
| ReachableFileUtil.statisticsFilesLocations(table, predicate), STATISTICS_FILES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we define an extra var with a relatively short name like statisticsFiles or locations so that this would fit on one line?
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java
Show resolved
Hide resolved
|
Thanks, @ajantha-bhat! Thanks for reviewing, @findepi @rdblue @jackye1995 @nastra! |
|
Thanks @aokolnychyi for merging this.
|
This change backports PR #6091 to Spark 3.2.
|
I merged the Spark 3.2 PR, thanks. We stopped active development on 2.4 and 3.1. We also consider dropping 3.1 soon. In my view, it is OK not to cherrypick this change there. Up to you, though. |
) This change backports PR apache#6091 to Spark 3.2.
This PR handles the spark action/procedure side of the statistics file clean-up for expire_snapshots.
It is a follow-up based on the core functionality of #6090