Skip to content

Conversation

@szehon-ho
Copy link
Member

Currently both the ExpireSnapshot SparkAction and Procedure return 'deletedDataFileCount' but this includes delete files.

Now return deletedPositionalDeleteFiles and deletedEqualityDeleteFiles as separate values, and take them out of deletedDataFiles.

@dramaticlly
Copy link
Contributor

dramaticlly commented Apr 25, 2022

from what I can tell, all Spark share the similar interface BaseExpireSnapshotsActionResult, so looks like Spark3.2 is updated with new results but Spark 2.4 is still unhappy per CI failure

/home/runner/work/iceberg/iceberg/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java:274: error: constructor BaseExpireSnapshotsActionResult in class org.apache.iceberg.actions.BaseExpireSnapshotsActionResult cannot be applied to given types;

    return new BaseExpireSnapshotsActionResult(dataFileCount.get(), manifestCount.get(), manifestListCount.get());
           ^

@github-actions github-actions bot added the build label Apr 26, 2022
@szehon-ho szehon-ho force-pushed the expire_snapshots_result branch 2 times, most recently from b1e2bfa to 6b452f1 Compare April 26, 2022 22:51
public void testExpireOlderThanWithDeleteFile() {
table.updateProperties()
.set(TableProperties.FORMAT_VERSION, "2")
.set(TableProperties.MANIFEST_MERGE_ENABLED, "false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Any specific reason to disable the manifest merge?

Copy link
Member Author

@szehon-ho szehon-ho Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea as this test checks the exact list of files removed for ExpireSnapshots, I thought it's simpler to get this if we don't have to deal with manifests auto-merging after some time, as with the merge turned off, I just need to get manifests from the latest snapshot.

Though in this case, I tried and it seems not necessary as the auto-merge does not kick in (default is set high). I was thinking to keep it in case the algorithm changes and manifests start merging, failing this test, but I am open, if it's better to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense to me to keep the manifest merge as false here.

switch (manifest.content()) {
case DATA:
return CloseableIterator.transform(
ManifestFiles.read(manifest, table.getValue().io(), table.getValue().specs()).iterator(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not necessary, only path is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I got the comment, but this needs to get the FileContent of the file to get the total number in the end, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the other pr, it was readPath, did this method open the data file ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, we can't use readPaths as we need to fetch the content file type. That being said, I think we have to add a projection. Right now, we will read all columns whereas we read only paths before.

Will something like this work?

    public CloseableIterator<Tuple2<String, String>> entries(ManifestFileBean manifest) {
      FileIO io = table.getValue().io();
      Map<Integer, PartitionSpec> specs = table.getValue().specs();
      ImmutableList<String> projection = ImmutableList.of(DataFile.FILE_PATH.name(), DataFile.CONTENT.name());

      switch (manifest.content()) {
        case DATA:
          return CloseableIterator.transform(
              ManifestFiles.read(manifest, io, specs).select(projection).iterator(),
              ReadManifest::contentFileWithType);
        case DELETES:
          return CloseableIterator.transform(
              ManifestFiles.readDeleteManifest(manifest, io, specs).select(projection).iterator(),
              ReadManifest::contentFileWithType);
        default:
          throw new IllegalArgumentException("Unsupported manifest content type:" + manifest.content());
      }
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks for suggestion

switch (manifest.content()) {
case DATA:
return CloseableIterator.transform(
ManifestFiles.read(manifest, table.getValue().io(), table.getValue().specs()).iterator(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, we can't use readPaths as we need to fetch the content file type. That being said, I think we have to add a projection. Right now, we will read all columns whereas we read only paths before.

Will something like this work?

    public CloseableIterator<Tuple2<String, String>> entries(ManifestFileBean manifest) {
      FileIO io = table.getValue().io();
      Map<Integer, PartitionSpec> specs = table.getValue().specs();
      ImmutableList<String> projection = ImmutableList.of(DataFile.FILE_PATH.name(), DataFile.CONTENT.name());

      switch (manifest.content()) {
        case DATA:
          return CloseableIterator.transform(
              ManifestFiles.read(manifest, io, specs).select(projection).iterator(),
              ReadManifest::contentFileWithType);
        case DELETES:
          return CloseableIterator.transform(
              ManifestFiles.readDeleteManifest(manifest, io, specs).select(projection).iterator(),
              ReadManifest::contentFileWithType);
        default:
          throw new IllegalArgumentException("Unsupported manifest content type:" + manifest.content());
      }
    }

@szehon-ho
Copy link
Member Author

@aokolnychyi resolved all the comments, save the one about not having access to field ManifestFile 'content'. I mean this still works as-is (just the ManifestContent=DELETES path is never taken though still returning delete files), but I can wait until that one goes in, either way.

@szehon-ho szehon-ho force-pushed the expire_snapshots_result branch from 26f6725 to 7975d94 Compare May 18, 2022 00:48
@szehon-ho
Copy link
Member Author

szehon-ho commented May 18, 2022

@aokolnychyi when you have a chance, this now correctly utilizes 'content' field of Manifest table during the computation of valid files.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I had a few mostly optional comments.
Great work, @szehon-ho! I think we should fix other actions next.

- code: "java.method.addedToInterface"
new: "method ThisT org.apache.iceberg.SnapshotUpdate<ThisT>::scanManifestsWith(java.util.concurrent.ExecutorService)"
justification: "Accept all changes prior to introducing API compatibility checks"
- code: "java.method.addedToInterface"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with this given other changes, upcoming 1.0 and that it is mostly Iceberg itself that provides implementations of this interface. Even if someone has custom actions, they probably reuse the provided result implementation.

If other folks are concerned, we could default the new methods.


private void checkExpirationResults(long expectedDatafiles, long expectedManifestsDeleted,
long expectedManifestListsDeleted, ExpireSnapshots.Result results) {
private void checkExpirationResults(long expectedDatafiles, long expectedPosDeleteFiles, long expectedEqDeleteFiles,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the indentation!

@szehon-ho szehon-ho merged commit 71282b8 into apache:master May 18, 2022
@szehon-ho
Copy link
Member Author

Thanks @aokolnychyi , @zinking , @dramaticlly , @rajarshisarkar for detailed reviews

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants