Skip to content

Conversation

@ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Nov 1, 2022

This PR handles the spark action/procedure side of the statistics file clean-up for expire_snapshots.
It is a follow-up based on the core functionality of #6090

@ajantha-bhat ajantha-bhat marked this pull request as draft November 1, 2022 07:30
@ajantha-bhat
Copy link
Member Author

cc: @findepi, @rdblue, @szehon-ho

@ajantha-bhat ajantha-bhat force-pushed the stats_expire_all branch 3 times, most recently from b557b25 to 7743b5c Compare November 1, 2022 18:05
@ajantha-bhat ajantha-bhat marked this pull request as ready for review November 2, 2022 01:59
@ajantha-bhat
Copy link
Member Author

Now that the core functionality PR (#6090) is merged, I think this PR can be reviewed and merged now.

cc: @aokolnychyi, @szehon-ho, @rdblue, @jackye1995


List<Object[]> output = sql("CALL %s.system.expire_snapshots('%s')", catalogName, tableIdent);
assertEquals("Should not delete any files", ImmutableList.of(row(0L, 0L, 0L, 0L, 0L)), output);
assertEquals(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added one more entry for deletedStatisticsFilesCount for all the existing cases.

@jackye1995 jackye1995 added this to the Iceberg 1.2.0 milestone Feb 23, 2023
@jackye1995
Copy link
Contributor

I added the core PR to 1.2 release milestone, not sure if we can get this in in time, but I can add it first so it gets more tractions.

@jackye1995 jackye1995 self-requested a review February 23, 2023 16:45
* returns location of all the statistics files in a table.
* @return the location of statistics files
*/
public static List<String> statisticsFilesLocations(Table table, Set<Long> snapshotIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we simplify the logic of the original method using stream()? Also for this method, I think we can provide a more generic util method that takes in a filter, instead of just adding a specific method to filter by snapshot ID set

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated now.
Originally I wanted to keep the same style as manifestListLocations and others. So, I didn't used modified.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this overall looks good to me, just a nit comment

@ajantha-bhat ajantha-bhat marked this pull request as draft March 9, 2023 18:00
@ajantha-bhat ajantha-bhat marked this pull request as ready for review March 9, 2023 18:12
@ajantha-bhat
Copy link
Member Author

finally figured out the testcase failure 😄 . It was not failing locally as I didn't rebase it. After rebase it failed locally and the reason is newly merged #6682 was missing a case for stats files.

} else if (MANIFEST_LIST.equalsIgnoreCase(type)) {
manifestListsCount.addAndGet(numFiles);

} else if (STATISTICS_FILES.equalsIgnoreCase(type)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newly added #6682 was missing this case. Hence, test was failing after rebasing.

@ajantha-bhat
Copy link
Member Author

@rdblue, @jackye1995 : PR is ready

summary.equalityDeleteFilesCount(),
summary.manifestsCount(),
summary.manifestListsCount());
return ImmutableResult.builder()
Copy link
Contributor

@aokolnychyi aokolnychyi Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with this lib but won't the name conflict if we migrate more results to auto-generated beans as it does not take the outer class name into account?

import org.apache.iceberg.actions.ImmutableResult;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call it Result in all actions and use outer class for identification.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I will check if it is possible to alter the generated class name or include the outer class name.
If I don't find anything in a few hours, I will just use a new private Inner class in ExpireSnapshotsSparkAction.java.

Later It can be easily replaced (as it is private) with Immutables once we have better ways in a follow-up PR.
I don't want to block this PR (targeted for 1.2.0).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used @Value.Enclosing So that builder looks like ImmutableExpireSnapshots.Result.builder() now.

In a follow-up PR, I plan to deprecate all public result classes of actions and use immutables annotation for actions to build result objects from that.

}

return toFileInfoDS(
ReachableFileUtil.statisticsFilesLocations(table, predicate), STATISTICS_FILES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we define an extra var with a relatively short name like statisticsFiles or locations so that this would fit on one line?

@aokolnychyi aokolnychyi merged commit 5921421 into apache:master Mar 10, 2023
@aokolnychyi
Copy link
Contributor

Thanks, @ajantha-bhat! Thanks for reviewing, @findepi @rdblue @jackye1995 @nastra!

@ajantha-bhat
Copy link
Member Author

Thanks @aokolnychyi for merging this.

aokolnychyi pushed a commit that referenced this pull request Mar 11, 2023
@aokolnychyi
Copy link
Contributor

I merged the Spark 3.2 PR, thanks. We stopped active development on 2.4 and 3.1. We also consider dropping 3.1 soon. In my view, it is OK not to cherrypick this change there. Up to you, though.

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants