Spark: Improve performance of expire snapshot by not double-scanning retained Snapshots #3457
Conversation
Force-pushed from 415c210 to f52ef9c
I'm going to try to do some simple benchmarks to validate it improves the perf, but putting the idea out here for any early feedback.
 */
public static List<String> manifestListLocations(Table table, Set<Long> snapshotIds) {
  Iterable<Snapshot> snapshots = table.snapshots();
  return StreamSupport.stream(snapshots.spliterator(), false)
Can you add a comment to clarify the false argument?
Done
return StreamSupport.stream(snapshots.spliterator(), false)
    .filter(s -> snapshotIds.contains(s.snapshotId()))
    .filter(s -> s.manifestListLocation() != null)
    .map(Snapshot::manifestListLocation)
Instead of calling manifestListLocation twice, why not map and then filter using Objects.nonNull?
Good suggestion, thanks
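For reference, a minimal sketch of the suggested ordering (map first, then filter nulls once), using the Iceberg Table/Snapshot API quoted in the diff above; the wrapping class is just for illustration:

```java
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class ManifestListLocationsSketch {
  // Map to the manifest list location first, then drop nulls once with Objects::nonNull.
  public static List<String> manifestListLocations(Table table, Set<Long> snapshotIds) {
    return StreamSupport.stream(table.snapshots().spliterator(), false) // false = sequential stream
        .filter(snapshot -> snapshotIds.contains(snapshot.snapshotId()))
        .map(Snapshot::manifestListLocation)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
  }
}
```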
expireSnapshots = expireSnapshots.retainLast(retainLastValue);
}

List<Snapshot> expired = expireSnapshots.apply();
There is no guarantee that apply followed by commit will produce the same result, so this is unsafe. It may work most of the time, but there will probably be leaks when expired here doesn't contain a snapshot that was actually removed. That's why we always refresh and compare against the version that was actually committed.
I see, changed.
};

private final Set<Long> expiredSnapshotIds = Sets.newHashSet();
private final Set<Long> snapshotIdsToExpire = Sets.newHashSet();
I think this is more correct, but I don't think it's worth making this change larger. I would just use a removedSnapshotIds variable later to avoid the name conflict.
OK, reverted
}

List<Snapshot> expired = expireSnapshots.apply();
Set<Long> expiredSnapshotIds = expired.stream().map(Snapshot::snapshotId).collect(Collectors.toSet());
I think this needs to be done by diffing the snapshot ID sets between metadata versions.
Done
I think this is actually promising; the performance gain is in line with expectations. Test: a table with 1000 small snapshots, expiring 1 at a time. The time and resources spent in expire-snapshot, and the input/shuffle sizes, fall by a little less than half, as most snapshots are not double-scanned. So I think it's worth continuing. @rdblue thanks for the early review, I will look at the comments and make the changes.
Patch should be ready for more review.
Force-pushed from e632307 to 5490d42
@szehon-ho sorry for not getting back to this sooner, let's finish this up when you are back online.
I'm not sure if people think these changes are too hacky. Another option I've thought of is to implement IncrementalScan (#4580) for the All_files table (to be added in #4694), which would allow snapshot filtering. Then rewrite ExpireSnapshotAction to use that table, with different filters, to avoid double-scanning all_files. It would be much cleaner that way, but a bigger refactor (and it would make #4674 much harder if we go that route). I wonder whether there was some reason initially to use the all_manifests table and flatMap the ReadManifest task, rather than relying on the all_files table? @rdblue @aokolnychyi for any thoughts/information.
Actually I think I get why: the all_files table does not parallelize the planning (reading each snapshot in a Spark task), so it is probably better to keep it this way (all_manifests table and the ReadManifest task). Will take a look to see if this can be cleaned up in another way.
Force-pushed from c27cac2 to 8232c8b
@szehon-ho, the reason why we use
@RussellSpitzer @aokolnychyi Rebased the PR; it's still using the manual way to filter out snapshots from the all_manifests table, if you guys have time to take a look. The idea I mentioned in the above few comments, to use the manifests table with snapshot filtering via time-travel (to make it cleaner), I tried to implement in #4736; I'd like to hear your thoughts on whether that is a better approach. The problem there is that I stopped when I realized manifest tables do not support time-travel, and it didn't look trivial to implement. Also FYI @ajantha-bhat as we were talking about this on #4674.
@rdblue yea thanks, I realized it after asking.
Force-pushed from 8d3c706 to 222423e
Ref: discussion on #4736. This implements the original idea to reduce the 'deleteCandidate' scan to just the files from deleted snapshots. There is another idea, to also remove current manifests during the delete-candidate scan, which is not done yet.
Force-pushed from bf44a0d to 343fa89
Set<Long> retainedSnapshots =
    updatedTable.snapshots().stream().map(Snapshot::snapshotId).collect(Collectors.toSet());
Dataset<Row> validFiles = buildValidFileDF(updatedTable);
Dataset<Row> deleteCandidateFiles =
Am I correct that we are trying to build the reachability set of expired snapshots? Would it be easier to write this logic in terms of expired snapshots instead of retained snapshots? Right now, we pass snapshots to ignore, which takes a bit of time to wrap your head around. Would it be easier if we passed snapshots we are looking for?
Sure, it makes sense to me, I can make the change.
I had an optimization at the end to pass the lesser of the two sets to Spark, with either 'in' or 'not in', but I can still do it the other way.
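As a rough sketch of that optimization (the helper and column-name parameter here are hypothetical; isInCollection is the same Spark Column API used elsewhere in this PR), the filter can be built from whichever ID set is smaller:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;

import java.util.Set;
import org.apache.spark.sql.Column;

public class SnapshotIdFilterSketch {
  // Build a filter on the referencing-snapshot-id column, pushing the smaller of the
  // two ID sets into the Spark plan: 'in' on expired IDs or 'not in' on retained IDs.
  static Column snapshotIdFilter(String refSnapshotIdColumn, Set<Long> expiredIds, Set<Long> retainedIds) {
    if (expiredIds.size() <= retainedIds.size()) {
      return col(refSnapshotIdColumn).isInCollection(expiredIds);
    } else {
      return not(col(refSnapshotIdColumn).isInCollection(retainedIds));
    }
  }
}
```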
private Dataset<Row> buildFilteredValidDataDF(
    TableMetadata metadata, Set<Long> snapshotsToExclude) {
  Table staticTable = newStaticTable(metadata, table.io());
  return buildValidContentFileWithTypeDF(staticTable, snapshotsToExclude)
Two questions.
- Will we scan ALL_MANIFESTS twice? Once for content files and once for manifests?
- Will we open extra manifests when building the reachability set for expired snapshots? I don't think the current implementation removes any manifests from the expired set that are currently in the table.
- Yes, I think there was a proposal for that by @ajantha-bhat in "Spark-3.2: Avoid duplicate computation of ALL_MANIFESTS metadata table for spark actions" #4674; we weren't sure that caching is worth the cost (it may involve an extra read/write).
- Yes, I gave it a try but decided to punt for now, as it always involves an extra Spark join (the all-manifests data already being in a dataset), so it is not beneficial in all cases. For example, in my test, where each manifest has a small number of data files, it is more expensive. But of course there are other cases, where a manifest has many data files, where it helps.

For these, maybe we can make it configurable? This change only includes the changes that I see will not hurt performance but only help it.
> Yes, I think there was a proposal for that by @ajantha-bhat #4674; we weren't sure that caching is worth the cost (it may involve an extra read/write)

I didn't find much performance difference when tested locally, hence I didn't proceed further on #4674. Maybe I need to test on S3 to really see the benefit of avoiding IO scans.
I was thinking about something like this:
- Cache/persist a dataset of unique manifests in still valid snapshots (via all_manifests)
- Cache/persist a dataset of unique manifests in expired snapshots (via all_manifests)
- Find expired manifests that are no longer referenced by still valid snapshots (anti-join)
- Read expired manifests to build a dataset of content files referenced by expired manifests
- Read still valid manifests to build a dataset of still valid content files
- Find expired content files (anti-join)
- Find expired manifest lists
- Union datasets for different expired results

If we do it this way, we read the all_manifests table only once and also skip still-live manifests when looking for expired content files. The extra shuffle is a valid concern but I think we can avoid it. Right now, we are doing round-robin partitioning when building unique manifests. Instead, we can assign a deterministic hash partitioning to both datasets to avoid a shuffle during anti-joins. Spark won't shuffle datasets if they are partitioned in the same way. We are shuffling those datasets anyway, as we deduplicate manifests before opening them up and then have an extra round-robin shuffle to distribute the load. I feel we can just use deterministic hash partitioning instead. We will have to set an explicit number of shuffle partitions to prevent AQE from coalescing tasks.
@karuppayya is also working on some benchmarks for actions, which would be really handy to evaluate the impact here.
Thoughts, @szehon-ho @RussellSpitzer @rdblue?
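A tiny sketch of the partitioning point above, assuming a hypothetical manifestDF with a "path" column: repartition(n) is round-robin, while repartition(n, col("path")) gives the deterministic hash partitioning that both sides of the anti-join could share.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ManifestPartitioningSketch {
  // Round-robin: spreads already-deduplicated manifests evenly, but the partitioning
  // cannot be reused by a later join, so the anti-join shuffles again.
  static Dataset<Row> roundRobin(Dataset<Row> manifestDF, int numShufflePartitions) {
    return manifestDF.repartition(numShufflePartitions);
  }

  // Deterministic hash partitioning on the join key: if both datasets are partitioned
  // like this, Spark can perform the anti-join without an extra shuffle.
  static Dataset<Row> hashPartitioned(Dataset<Row> manifestDF, int numShufflePartitions) {
    return manifestDF.repartition(numShufflePartitions, col("path"));
  }
}
```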
I've kinda been thinking we probably don't need the anti-joins distributed (mostly we just want distributed IO) for this operation, but I think all of your logic makes sense here. If I think this through, we have ALL_MANIFESTS broken into 3 subsets:

oldManifests: [manifests only used in expired snapshots]
commonManifests: [manifests used in both expired and unexpired snapshots]
newManifests: [manifests only used in unexpired snapshots]

To build this we read allManifests:

expiredManifests = allManifests where snapshot is in snapshots being expired
unexpiredManifests = allManifests where snapshot is not in snapshots being expired
oldManifests = expiredManifests -- unexpiredManifests
newManifests = unexpiredManifests -- expiredManifests

We read all the manifests in oldManifests and newManifests:

oldManifests -> flatMap read manifest and dedupe -> oldFiles
newManifests -> flatMap read manifest and dedupe -> newFiles
filesToDelete == (oldFiles -- newFiles) + oldManifests + oldManifestLists

Did I get that right?
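If I read it the same way, a sketch of the oldManifests/newManifests split in terms of Dataset set operations (the "path" column and the REF_SNAPSHOT_ID constant follow the code quoted in this thread; reading the manifests into oldFiles/newFiles is left out):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;

import java.util.Set;
import org.apache.iceberg.AllManifestsTable;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ManifestSplitSketch {
  // Split ALL_MANIFESTS by whether the referencing snapshot is being expired and
  // keep only the manifest paths that are exclusive to the expired side.
  static Dataset<Row> oldManifestPaths(Dataset<Row> allManifests, Set<Long> expiringSnapshotIds) {
    Column inExpiring = col(AllManifestsTable.REF_SNAPSHOT_ID.name()).isInCollection(expiringSnapshotIds);

    Dataset<Row> expiredManifests = allManifests.filter(inExpiring).select("path").distinct();
    Dataset<Row> unexpiredManifests = allManifests.filter(not(inExpiring)).select("path").distinct();

    // oldManifests = expiredManifests -- unexpiredManifests; newManifests is the mirror image.
    // filesToDelete would then be (oldFiles -- newFiles) + oldManifests + old manifest lists.
    return expiredManifests.except(unexpiredManifests);
  }
}
```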
Force-pushed from 51b3802 to 93f637b
 * @return the location of manifest Lists
 */
public static List<String> manifestListLocations(Table table, Set<Long> snapshots) {
  Stream<Snapshot> snapshotStream = StreamSupport.stream(table.snapshots().spliterator(), false);
Minor: rather than using Java's spliterator method and streams, we generally prefer to use Iterables.filter and Iterables.transform or a simple for loop.
Done, used Iterables.filter and restored the old for loop
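For reference, a minimal sketch of the plain-loop version that was restored (Iterables.filter/transform from the relocated Guava package would be equivalent):

```java
import java.util.List;
import java.util.Set;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

public class ManifestListLoopSketch {
  // Simple for loop over table.snapshots(), collecting non-null manifest list locations.
  public static List<String> manifestListLocations(Table table, Set<Long> snapshotIds) {
    List<String> locations = Lists.newArrayList();
    for (Snapshot snapshot : table.snapshots()) {
      if (snapshotIds.contains(snapshot.snapshotId()) && snapshot.manifestListLocation() != null) {
        locations.add(snapshot.manifestListLocation());
      }
    }
    return locations;
  }
}
```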
protected Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshots) {
  Broadcast<Table> tableBroadcast =
      sparkContext.broadcast(SerializableTableWithSize.copyOf(table));
Minor: this was broken out into two lines (L138-L139) before. It would be good to keep it the same way to reduce churn.
Reverted
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java (outdated conversation, resolved)
  }
}

protected Dataset<Row> filterAllManifests(Dataset<Row> allManifestDF, Set<Long> snapshots) {
Nit: snapshots would be more clear if it were named snapshotIds.
Refactored snapshots -> snapshotIds in these files
protected Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshots) {
  Dataset<Row> allManifests = loadMetadataTable(table, ALL_MANIFESTS);
  if (snapshots != null) {
    allManifests = filterAllManifests(allManifests, snapshots);
Nit: preserving the use of all makes this awkward after filtering.
Good point, renamed.
  testExpireFilesAreDeleted(2, 5);
}

public void testExpireFilesAreDeleted(int dataFilesExpired, int dataFilesRetained) {
I think this test is okay, but I think that it doesn't really test cases where the retained and deleted files are mixed together. That deserves a direct test, where the deleteCandidateFileDS actually contains files in the validFileDS.
You are right, added test below
}

public static Set<String> reachableManifestPaths(Table table) {
  return StreamSupport.stream(table.snapshots().spliterator(), false)
There's actually a method in ManifestFiles that more efficiently returns just the paths if you want to use it. @amogh-jahagirdar uses it in the non-Spark reachable file implementation: #5669 (comment)
Not sure if I'm missing something, but it looks like that is for the data files of a manifest, while this is for listing manifests?
Yeah it's for listing live data file paths for a given manifest.
rdblue left a comment
This looks good to me other than a few minor things. It would be great to get this in!
Force-pushed from 93f637b to 1109ae5
Force-pushed from 1109ae5 to aa2630a
szehon-ho left a comment
Thanks for looking at this! Addressed the comments
I'd love to take a quick look today too!
aokolnychyi left a comment
The change looks great. I left just a few suggestions.
 * @return the location of manifest Lists
 */
public static List<String> manifestListLocations(Table table, Set<Long> snapshotIds) {
  Iterable<Snapshot> tableSnapshots = table.snapshots();
nit: Is there a particular reason for calling it tableSnapshots instead of just snapshots?
Changed
 *
 * @param table table for which manifestList needs to be fetched
 * @param snapshotIds ids of snapshots for which manifest lists will be returned
 * @return the location of manifest Lists
nit: Lists -> lists?
It was a typo from the earlier method, changed on both.
}

protected Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds) {
  Broadcast<Table> tableBroadcast =
nit: I'd prefer to keep the temp var so that we don't have to split the statement on multiple lines.
Table serializableTable = SerializableTableWithSize.copyOf(table);
Broadcast<Table> tableBroadcast = sparkContext.broadcast(serializableTable);
Done, missed this when changing the Spark 3.3 version.
Dataset<ManifestFileBean> allManifests =
    loadMetadataTable(table, ALL_MANIFESTS)
Dataset<Row> allManifests = loadMetadataTable(table, ALL_MANIFESTS);
if (snapshotIds != null) {
I feel like we have this piece that loads ALL_MANIFESTS with optional filtering in quite a few places. What about refactoring the entire logic into a separate method? We already have manifestDS, so we can simply add manifestDF.

private Dataset<Row> manifestDF(Table table, Set<Long> snapshotIds) {
  Dataset<Row> manifestDF = loadMetadataTable(table, ALL_MANIFESTS);
  if (snapshotIds != null) {
    Column filterCond = col(AllManifestsTable.REF_SNAPSHOT_ID.name()).isInCollection(snapshotIds);
    return manifestDF.filter(filterCond);
  } else {
    return manifestDF;
  }
}

Then we don't need filterAllManifests and can simply replace loadMetadataTable calls with manifestDF.

protected Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshotIds) {
  return manifestDF(table, snapshotIds)
      .select(col("path"), lit(MANIFEST).as("type"))
      .as(FileInfo.ENCODER);
}

This should allow us to reduce the number of changes to a minimum.
Good point, made new method and changed two methods to use it.
Dataset<FileInfo> originalFileDS = validFileDS(ops.current());

// Save old metadata
TableMetadata originalTable = ops.current();
nit: What about calling it originalMetadata since it is TableMetadata, not Table?
Done
Dataset<FileInfo> validFileDS = validFileDS(ops.refresh());
// determine expired files
TableMetadata updatedTable = ops.refresh();
Set<Long> retainedSnapshots =
What about adding findExpiredSnapshotIds that would accept originalMetadata, updatedMetadata?
// fetch valid files after expiration
TableMetadata updatedMetadata = ops.refresh();
Dataset<FileInfo> validFileDS = fileDS(updatedMetadata);
// fetch files referenced by expired snapshots
Set<Long> expiredSnapshotIds = findExpiredSnapshotIds(originalMetadata, updatedMetadata);
Dataset<FileInfo> deleteCandidateFileDS = fileDS(originalMetadata, expiredSnapshotIds);
// determine expired files
this.expiredFileDS = deleteCandidateFileDS.except(validFileDS);
Done
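A possible body for that helper, assuming TableMetadata.snapshots() and the relocated Guava Sets utility: the expired IDs are simply the IDs present before the commit but absent afterwards.

```java
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;

public class ExpiredSnapshotIdsSketch {
  // Snapshot IDs that existed in the original metadata but not in the committed metadata.
  static Set<Long> findExpiredSnapshotIds(TableMetadata originalMetadata, TableMetadata updatedMetadata) {
    Set<Long> originalIds =
        originalMetadata.snapshots().stream().map(Snapshot::snapshotId).collect(Collectors.toSet());
    Set<Long> retainedIds =
        updatedMetadata.snapshots().stream().map(Snapshot::snapshotId).collect(Collectors.toSet());
    return Sets.difference(originalIds, retainedIds); // a view; copy it if it needs to outlive the inputs
  }
}
```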
// fetch metadata after expiration
Dataset<FileInfo> validFileDS = validFileDS(ops.refresh());
// determine expired files
nit: We have the same "determine expired files" comment on different blocks now.
Changed with your other code suggestion
// fetch metadata before expiration
Dataset<FileInfo> originalFileDS = validFileDS(ops.current());

// Save old metadata
nit: Is the comment change deliberate? The old comment seems OK.
Reverted
private Dataset<FileInfo> validFileDS(TableMetadata metadata) {
private Dataset<FileInfo> fileDS(TableMetadata metadata) {
  Table staticTable = newStaticTable(metadata, table.io());
Can we simply call fileDS(metadata, null)?
Good point, changed
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java (resolved)
Force-pushed from ae7f255 to b3d37b4
aokolnychyi left a comment
LGTM. I think we have an extra method left in BaseSparkAction in 3.3.
Otherwise, looks ready to go. Thanks, @szehon-ho!
  }
}

protected Dataset<Row> filterAllManifests(Dataset<Row> allManifestDF, Set<Long> snapshotIds) {
No longer needed?
Good catch, removed
Dataset<ManifestFileBean> allManifests =
    loadMetadataTable(table, ALL_MANIFESTS)
Dataset<Row> manifests = manifestDF(table, snapshotIds);
Dataset<ManifestFileBean> manifestsBean =
nit: Most other vars in this class use xxxDF for Dataset<Row> or xxxDS for Dataset of a specific type.
Dataset<ManifestFileBean> manifestBeanDS =
manifestDF(table, snapshotIds)
.select(...)
Changed var name and inlined the previous var as suggested.
Dataset<FileInfo> validFileDS = fileDS(updatedMetadata);

// fetch files referenced by expired snapshots
Set<Long> removedSnapshotIds = findExpiredSnapshotIds(originalMetadata, updatedMetadata);
Just noticed this variable name doesn't match Spark 3.2, will change it. (Initially it was 'expiredSnapshotIds' but that hides the member variable name, so I made a last-minute change.)
aokolnychyi left a comment
Looks great. Excited to see how it is going to perform. Thanks, @szehon-ho!
Expire snapshots can take a long time for large tables with millions of files and thousands of snapshots/manifests.

One cause is the calculation of files to be deleted. The current algorithm is:
- scan all files reachable from every snapshot before expiration (the delete candidates)
- scan all files reachable from the snapshots retained after expiration (the valid files)
- delete the candidates that are not in the valid set

But this explores every retained snapshot twice. Example: any periodic expire-snapshot job that expires 1 snapshot needs to explore all n-1 retained snapshots twice.

Proposal: scan only the files reachable from the expired snapshot(s) as delete candidates, so retained snapshots are explored just once for the valid-file set.

Implementation: for the expired-snapshot scan, change the original Spark query over the metadata tables to custom Spark jobs that explore only the expired snapshot(s).

Note: The new expired-snapshot scan duplicates the manifest-list scan logic to handle the "write.manifest-lists.enabled"="false" flag, but unfortunately that functionality seems broken even without this change, so it is not currently possible to test. Added a test for demonstration purposes.
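To tie the review together, a condensed, hypothetical sketch of the resulting flow; fileDS and findExpiredSnapshotIds stand in for the helpers discussed above, and Row is used instead of the FileInfo bean purely to keep the sketch self-contained:

```java
import java.util.Set;
import org.apache.iceberg.ExpireSnapshots;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableOperations;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ExpireFlowSketch {
  // Stand-ins for the action's helpers referenced in the review.
  interface FileHelpers {
    Dataset<Row> fileDS(TableMetadata metadata);                        // files reachable from all snapshots
    Dataset<Row> fileDS(TableMetadata metadata, Set<Long> snapshotIds); // files reachable via the given snapshots
    Set<Long> findExpiredSnapshotIds(TableMetadata original, TableMetadata updated);
  }

  static Dataset<Row> expiredFiles(TableOperations ops, ExpireSnapshots expireSnapshots, FileHelpers helpers) {
    TableMetadata originalMetadata = ops.current(); // metadata before expiration

    expireSnapshots.commit();                       // expire snapshots and commit
    TableMetadata updatedMetadata = ops.refresh();  // metadata that was actually committed

    // files still reachable from retained snapshots
    Dataset<Row> validFileDS = helpers.fileDS(updatedMetadata);

    // delete candidates: only files reachable through snapshots that were actually removed
    Set<Long> expiredSnapshotIds = helpers.findExpiredSnapshotIds(originalMetadata, updatedMetadata);
    Dataset<Row> deleteCandidateFileDS = helpers.fileDS(originalMetadata, expiredSnapshotIds);

    // candidates no longer referenced anywhere are safe to delete
    return deleteCandidateFileDS.except(validFileDS);
  }
}
```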