Spark-3.2: Avoid duplicate computation of ALL_MANIFESTS metadata table for spark actions #4674
Conversation
cc: @RussellSpitzer, @rdblue
```diff
-      .union(withFileType(buildManifestFileDF(staticTable), MANIFEST))
+    Dataset<Row> allManifests = loadMetadataTable(staticTable, ALL_MANIFESTS);
+    return withFileType(buildValidContentFileDF(staticTable, allManifests), CONTENT_FILE)
+        .union(withFileType(buildManifestFileDF(allManifests), MANIFEST))
```
How does this change the performance? Don't we have to compute all manifests in both locations here? Or does changing to an object let Spark cache the relation?
loadMetadataTable was called twice; now it is called only once for ALL_MANIFESTS.
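For context, a rough before/after sketch; the exact pre-change call shapes are an assumption reconstructed from the single removed line in the diff above, and the helper signatures are taken from that diff:

```java
// before (assumed shape): each helper resolved ALL_MANIFESTS from staticTable on its own,
// so loadMetadataTable ran once per branch
Dataset<Row> validFilesBefore =
    withFileType(buildValidContentFileDF(staticTable), CONTENT_FILE)
        .union(withFileType(buildManifestFileDF(staticTable), MANIFEST));

// after: resolve the metadata table once and hand the same Dataset to both branches
Dataset<Row> allManifests = loadMetadataTable(staticTable, ALL_MANIFESTS);
Dataset<Row> validFilesAfter =
    withFileType(buildValidContentFileDF(staticTable, allManifests), CONTENT_FILE)
        .union(withFileType(buildManifestFileDF(allManifests), MANIFEST));
```

Sharing the Dataset object removes the duplicate table resolution, but without a cache the ALL_MANIFESTS scan still appears twice in the executed plan, which is what the rest of this thread discusses.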
> Or does changing to an object let Spark cache the relation?
I thought the dataset would be computed during the first action and the results would be reused for both steps.
@RussellSpitzer: I see what you meant now.
I will fix it by adding persist(); it will cache the dataset and reuse it, so we avoid reading the manifest lists twice.
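A minimal sketch of that fix, reusing the helper names from the diff above (their exact signatures are assumptions); persist() is the standard Spark Dataset call:

```java
// load ALL_MANIFESTS once and mark it for caching (default level: MEMORY_AND_DISK)
Dataset<Row> allManifests = loadMetadataTable(staticTable, ALL_MANIFESTS);
allManifests.persist();

// the first action that touches this plan materializes the cache; the second
// branch of the union then reads cached blocks instead of re-reading every
// manifest list file
Dataset<Row> validFiles =
    withFileType(buildValidContentFileDF(staticTable, allManifests), CONTENT_FILE)
        .union(withFileType(buildManifestFileDF(allManifests), MANIFEST));
```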
Fixed, and I verified the scanning behavior by adding and checking logs.
Ah yes, I should have been clearer: I was referring to the fact that the dataset would be recomputed on both lines. The loadMetadataTable call itself should be very fast, but the actual planning is what's expensive, and avoiding that would require a cache of some kind.
I'm a little worried in general about persisting things, since I want to make sure we clean up our caches as soon as possible.
One other thing to worry about here is the additional cost of the persist operation itself. Running a persist is not free, and we need to check for sure that doing this cache is cost effective. In my experience, three uses of a persisted df are usually worth it, but two sometimes are not (it depends very much on the computation leading up to the cached df).
> Very much dependent on the computation leading up to the cached df
Because it involves IO, it will definitely help when there are hundreds or thousands of snapshots.
@RussellSpitzer, which is better: a cache, or manually calling dataset.collect() on that allManifests df, building a new dataset on top of the collected rows, and reusing it in both locations?
@ajantha-bhat The problem is that persisting does another round of IO/serialization of what should be basically the same amount of information, hopefully in a more readable form. Persist is by default a memory-and-disk based cache. You really need to test it out to be sure.
For cache vs. collect: cache is probably much better. The pieces would be stored on the executors and the IO would hopefully be mostly local. Doing a collect and building a DF from it would essentially bring all the data back to the driver, serialize it, then deserialize it and send everything back out. That is always worse than cache/persist.
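To make the comparison concrete, a hedged sketch of the two options (spark is an assumed SparkSession handle and allManifests an already-built Dataset<Row>):

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

// Option A: persist/cache -- blocks stay distributed on the executors,
// so later reads are (hopefully) mostly local
Dataset<Row> cached = allManifests.persist(StorageLevel.MEMORY_AND_DISK());
cached.count();  // the first action materializes the cache

// Option B: collect and rebuild -- every row is serialized to the driver,
// then deserialized and shipped back out when the rebuilt Dataset is used
List<Row> rows = allManifests.collectAsList();
Dataset<Row> rebuilt = spark.createDataFrame(rows, allManifests.schema());
```

Option B also caps the data size at what fits in driver memory, which is another reason cache/persist is usually preferred here.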
```diff
     Column joinCond = nameEqual.and(actualContains);
     List<String> orphanFiles = actualFileDF.join(validFileDF, joinCond, "leftanti")
         .as(Encoders.STRING())
+        .unpersist()
```
I do not think this will trigger the unpersist on "allManifests". It would only trigger it on the join result, this shouldn't cascade.
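In other words (a minimal illustration; variable names mirror the diff above):

```java
Dataset<Row> allManifests = loadMetadataTable(staticTable, ALL_MANIFESTS).persist();
Dataset<Row> orphanCandidates = actualFileDF.join(validFileDF, joinCond, "leftanti");

// unpersist() only affects the Dataset it is called on; `orphanCandidates` was
// never persisted, and the call does not walk back up to its inputs
orphanCandidates.unpersist();

// this is the call that actually releases the cached ALL_MANIFESTS blocks
allManifests.unpersist();
```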
Yeah, I figured that out, but I haven't figured out the final solution yet, so I changed the PR to a draft :)
But as you said, I guess I need to call it on the allManifests table itself.
```diff
     // determine expired files
-    this.expiredFiles = originalFiles.except(validFiles);
+    this.expiredFiles = originalFiles.except(validFiles).unpersist();
```
Again, I believe that if you want to get rid of the cache, you need to call that on the allManifests table.
@RussellSpitzer pointed me to this. I had a PR that is orthogonal to this one, to avoid duplicate computation of all reachable files: #3457. To me that was the bigger time consumer (exploring all reachable files), though maybe I need to redo that PR. I wasn't sure how much of a bottleneck getting all_manifests was. Anyway, I agree with @RussellSpitzer that maybe cache is a better option than persist? It would be great to see some numbers for tables with huge numbers of snapshots for these two options vs. today, if possible. I think that if we go with this approach, it should probably be 1) configurable and 2) able to be GC'ed sooner rather than later.
Yeah, scanning the all_manifests table twice was the major problem for me.
Sure, I will make it a configurable option to cache or not, and get a performance report locally with a large number of snapshots. I will work on this over the weekend.
Also FYI @aokolnychyi, if you wanted to take a look as well.
@RussellSpitzer, @szehon-ho, @aokolnychyi: the PR is ready for review.
a) I tested with the local file system and a local test case in the IDE, with 1000 manifest list files to read: without cache 22 ms, with cache 18 ms.
b) Also, in the base code I found we use caching for
c) I have some more plans for improving the
The idea is to add a new column "from_snapshot_id" while preparing the actual files, then filter out the rows for expired snapshot ids (NOT IN filter) from the persisted output without scanning again, and then use the same df.except() logic to find the expired files.
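A hedged sketch of idea (c); the from_snapshot_id column, the buildReachableFileDF helper, and the expiredSnapshotIds list are all hypothetical names used for illustration, not existing Iceberg code:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;

// while preparing the "before expiration" file listing, remember which snapshot
// each row was reached from (hypothetical column)
Dataset<Row> originalFiles = buildReachableFileDF(staticTableBefore)
    .withColumn("from_snapshot_id", col("snapshot_id"))
    .persist();

// after expireSnapshots() runs, drop rows coming from expired snapshots with a
// NOT IN filter over the persisted output instead of re-scanning the table
Dataset<Row> validFiles = originalFiles
    .filter(not(col("from_snapshot_id").isin(expiredSnapshotIds.toArray())));

// same df.except() logic as today to find the files that became unreachable
Dataset<Row> expiredFiles = originalFiles.except(validFiles);
```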
```java
      action.retainLast(retainLastNum);
    }

    if (maxConcurrentDeletes != null && maxConcurrentDeletes > 0) {
```
maxConcurrentDeletes > 0 is already checked in the preconditions above.
It's best to leave changes like this for a separate cleanup PR.
The problem, I think, is that there aren't many Iceberg utilities to project anything other than the partition filter for doing that filtering. I spent some time looking again and tried to use time travel, which is effectively snapshot filtering, in #4736, but unfortunately it didn't work because the manifests table does not support it. You can take a look and see whether it also makes sense to pursue that path (implementing time travel on the manifests table). Anyway, I look forward to working together on this.
@RussellSpitzer, @szehon-ho: What do you think about this PR?
```java
        .as(Encoders.STRING())
        .collectAsList();

    if (useCaching) {
```
This needs to be in a finally block; otherwise, if we hit an error in the collectAsList, there is a possibility that we never uncache. For services running these actions, that would be a problem.
The benchmarks really need to be on datasets that take minutes to run if we really want to see the difference. The key would be to set up a test with thousands or tens of thousands of manifests.
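A sketch of the finally-block shape being asked for, using the useCaching flag from this PR; the surrounding variable names are taken from the diff and their exact types are assumptions:

```java
boolean useCaching = PropertyUtil.propertyAsBoolean(options(), USE_CACHING, USE_CACHING_DEFAULT);
Dataset<Row> allManifests = loadMetadataTable(staticTable, ALL_MANIFESTS);
if (useCaching) {
  allManifests.persist();
}

try {
  // any failure inside the action (e.g. during collectAsList) still falls
  // through to the finally block below
  List<String> orphanFiles = actualFileDF.join(validFileDF, joinCond, "leftanti")
      .as(Encoders.STRING())
      .collectAsList();
  return deleteFiles(orphanFiles.iterator());
} finally {
  if (useCaching) {
    // always release the cached blocks, even on failure, so long-running
    // services do not accumulate stale caches
    allManifests.unpersist();
  }
}
```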
```java
      result = deleteFiles(reachableFileDF.collectAsList().iterator());
    }
    if (useCaching) {
      allManifests.unpersist();
```
Needs to be in a finally block
Yeah, if we go for this approach, can we do something like in https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java#L237 (withReusableDS)?
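A simplified sketch of that pattern; the real withReusableDS in BaseRewriteManifestsSparkAction also handles the non-caching path differently, so treat this as illustrative rather than a copy:

```java
import java.util.function.Function;
import org.apache.spark.sql.Dataset;

// run `func` against a dataset that is safe to consume more than once, and
// release the cache as soon as the result has been computed
private <T, U> U withReusableDS(Dataset<T> ds, Function<Dataset<T>, U> func) {
  boolean useCaching = PropertyUtil.propertyAsBoolean(options(), USE_CACHING, USE_CACHING_DEFAULT);
  Dataset<T> reusableDS = useCaching ? ds.cache() : ds;
  try {
    return func.apply(reusableDS);
  } finally {
    if (useCaching) {
      reusableDS.unpersist(false);
    }
  }
}
```

Usage would be to wrap both consumers of allManifests in a single lambda passed as `func`, so the cache lives exactly as long as the computation that needs it.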
```java
    allManifestsBefore = loadMetadataTable(staticTableBefore, ALL_MANIFESTS);
    useCaching = PropertyUtil.propertyAsBoolean(options(), USE_CACHING, USE_CACHING_DEFAULT);
    if (useCaching) {
      allManifestsBefore.persist();
```
I think we would want to do a memory-only cache here (and in all the other usages)... but I'm not sure.
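If memory-only caching turns out to be the right trade-off, the change would be small (a sketch; StorageLevel is the standard Spark class):

```java
import org.apache.spark.storage.StorageLevel;

if (useCaching) {
  // MEMORY_ONLY never spills cached blocks to local disk; partitions that do
  // not fit in memory are recomputed on reuse instead of being read back from disk
  allManifestsBefore.persist(StorageLevel.MEMORY_ONLY());
}
```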
Closing, as this is stale and we have improved the performance via other PRs.
ALL_MANIFESTS computation is a heavy operation when a lot of snapshots exist, as it involves reading the manifest list file for every snapshot.
We currently compute it multiple times in several places in Spark actions.
This PR aims to improve the performance of Spark actions and stored procedures by computing it once and reusing it in the other locations via dataset.persist().