
Conversation

@amogh-jahagirdar (Contributor) commented Oct 7, 2024:

This change consumes the updated fanout position delete writers from #11222 to maintain position deletes during writes (a form of minor compaction). The mapping of data files to file-scoped deletes is broadcast to executors, where the delete file writers merge the historical position deletes with the new position deletes. This behavior is behind a Spark conf, maintain-position-deletes, and can also be controlled via the write.delete.maintain-during-write table property. By default, this maintenance during write is enabled.

ToDo: UpdateProjectionBenchmark may need some changes to use file granularity for MoR cases so that we can look at the impact of this change before/after. The benchmark should probably be run after #11131 gets in.
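To make the flow concrete, here is a minimal, hypothetical sketch (not the actual PR code) of the driver-side broadcast and the executor-side lookup. The method names are illustrative; it assumes Iceberg's DeleteFileSet and Spark's broadcast API, and omits how the mapping itself is built from the table metadata.

```java
import java.util.Map;
import org.apache.iceberg.util.DeleteFileSet;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

class RewritableDeletesSketch {
  // Driver side: broadcast the mapping of data file location -> file-scoped delete files.
  static Broadcast<Map<String, DeleteFileSet>> broadcastRewritableDeletes(
      JavaSparkContext sparkContext, Map<String, DeleteFileSet> rewritableDeletes) {
    return sparkContext.broadcast(rewritableDeletes);
  }

  // Executor side: a delete writer looks up the historical deletes for the data file it is
  // producing new position deletes for, so they can be merged into the newly written delete file.
  static DeleteFileSet previousDeletesFor(
      Broadcast<Map<String, DeleteFileSet>> rewritableDeletes, String dataFileLocation) {
    return rewritableDeletes.value().get(dataFileLocation);
  }
}
```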

@github-actions github-actions bot added the spark label Oct 7, 2024
new BasePositionDeltaWriter<>(
    newDataWriter(table, writerFactory, dataFileFactory, context),
    newDeleteWriter(table, writerFactory, deleteFileFactory, context));
newDeleteWriter(
@amogh-jahagirdar (Contributor, Author) commented:
We may want to put this behind a Spark conf which defaults to loading the previous deletes and doing the merging during the write. Maybe it's better to handle such a conf when building the mapping (if the conf is false, we could return an empty map or not perform the broadcast at all).

That way, in case users hit some unforeseen issue with merging, they can disable the behavior dynamically.

I didn't want to introduce more configuration knobs initially, to keep it simple, but I can see an argument that it's worth it just to have a lever for unforeseen issues with performing minor compactions as part of writes.

CC @aokolnychyi @rdblue @nastra

@amogh-jahagirdar (Contributor, Author) commented Oct 7, 2024:

Some of the tests in TestRewritePositionDeleteFilesProcedure are failing as expected after this change: those tests assert on the expected delete files after performing a set of delete operations, but now that maintenance happens as part of the write, the number of delete files is reduced as expected (e.g.)

If we decide to add a conf, the tests can set it accordingly; otherwise I'll go ahead and update the tests. At the moment I'm still leaning towards adding a conf for the reasons mentioned earlier, but I'll get others' input.

@amogh-jahagirdar (Contributor, Author) commented:
I went ahead and added the configuration and table property, similar to how delete file granularity is configured.

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-sync-maintenance branch 2 times, most recently from 0af0344 to 61c0e1c Compare October 7, 2024 15:21
@github-actions github-actions bot added the core label Oct 7, 2024
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review October 7, 2024 17:40
@amogh-jahagirdar amogh-jahagirdar changed the title from "Spark: Synchronously merge new position deletes with old deletes" to "Spark: Merge new position deletes with old deletes during writing" Oct 7, 2024
@aokolnychyi (Contributor) left a comment:

Looks really promising!


public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16;

public static final String MAINTAIN_POSITION_DELETES_DURING_WRITE =
@aokolnychyi (Contributor) commented Oct 8, 2024:

I am not sure we need this configuration. It is pretty clear we want to always maintain position deletes, and we will not support this property in V3. Given that we want to switch to file-scoped position deletes in V2 tables by default, I think we should just always maintain deletes when the granularity is file.

Contributor:

In other words, switching write.delete.granularity to file should trigger maintenance.
If set to partition, skip it as we can't do it safely.
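For illustration, a minimal sketch of flipping that switch through the existing granularity property, assuming the standard Table API and the TableProperties.DELETE_GRANULARITY constant (the "file" value mirrors the write.delete.granularity setting discussed above):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

class EnableFileGranularity {
  // With file granularity, delete maintenance during writes would kick in per the discussion
  // above; with partition granularity it would be skipped, since it can't be done safely.
  static void enable(Table table) {
    table.updateProperties()
        .set(TableProperties.DELETE_GRANULARITY, "file")
        .commit();
  }
}
```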

Contributor:

> write.delete.granularity to file

[doubt] Are any writers other than Spark respecting this property? Are other writers also going to respect this property going forward, and if yes, how?

Contributor:

This is up to the engine in V2. We are discussing making it a requirement for V3 on the dev list right now.



}
}

protected Map<String, DeleteFileSet> dataToFileScopedDeletes() {
@singhpk234 (Contributor) commented Oct 9, 2024:

Can we put an estimate on the size of the HashMap? If it grows very large, it could fail the query; in that case, should we let the query fail?

Contributor:

This is a good point. We need to check what the actual limit on the object size is and how Spark would behave.
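As a rough, assumed back-of-the-envelope estimate (not a measurement from this PR), the driver-side footprint of the map scales roughly linearly with the number of data files that carry file-scoped deletes; all the numbers below are assumptions for illustration:

```java
class BroadcastSizeEstimate {
  public static void main(String[] args) {
    long dataFilesWithDeletes = 1_000_000L;  // assumed scale
    long avgPathBytes = 200L * 2;            // ~200-char paths; Java String chars are UTF-16 (2 bytes)
    long perEntryOverheadBytes = 300L;       // assumed HashMap entry + DeleteFileSet/DeleteFile overhead
    long roughBytes = dataFilesWithDeletes * (avgPathBytes + perEntryOverheadBytes);
    System.out.printf("~%.0f MB on the driver before serialization%n", roughBytes / 1e6);
  }
}
```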

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-sync-maintenance branch 5 times, most recently from ac6da06 to fc8b39f Compare October 14, 2024 05:13
@nastra (Contributor) left a comment:

Overall this LGTM, just left a few nits.

return location != null ? location.toString() : null;
}

public static boolean isFileScopedDelete(DeleteFile deleteFile) {
Contributor:

nit: might be worth updating `private boolean isFileScoped(DeleteFile deleteFile)` to use this method too.

String partitionStmt = "PARTITIONED BY (id)";
sql(
"CREATE TABLE %s (id int, data string) USING iceberg %s TBLPROPERTIES"
+ "('format-version'='2', 'write.delete.mode'='merge-on-read', 'write.delete.granularity'='file')",
Contributor:

nit: in checkDeleteFileGranularity() we're using TableProperties rather than plain strings, so I think it would be good to do the same here too

@aokolnychyi (Contributor) left a comment:

This seems almost ready to me.

Like @singhpk234 mentioned, we need to verify that Spark can handle this logic if the map gets large. The table state that we currently send to executors is larger than this map, but we need to look into how this particular broadcast is handled. If it is split into chunks by the torrent broadcast, that should probably be OK.

@amogh-jahagirdar, could you check that it is going to be safe?

@Override
public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
  Table table = tableBroadcast.value();
  Map<String, DeleteFileSet> rewritableDeletes = Maps.newHashMap();
Contributor:

What about a helper method and using it directly in statements below?

private Map<String, DeleteFileSet> rewritableDeletes() {
  return rewritableDeletesBroadcast != null ? rewritableDeletesBroadcast.value() : null;
}

@amogh-jahagirdar (Contributor, Author) commented:

Good suggestion, done! Note that since the rewritable deletes can now be null in the writer, I explicitly pass in a previous-delete loader that is the function path -> null. Previously I was passing in an empty map so the lookup would return null, but I think your suggestion is better because we can avoid having to do a lookup in the map entirely.
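A hedged sketch of that idea, assuming a loader of type CharSequence -> PositionDeleteIndex and a loadIndex helper; both are illustrative stand-ins, not necessarily the exact types the PR's writer uses:

```java
import java.util.Map;
import java.util.function.Function;
import org.apache.iceberg.deletes.PositionDeleteIndex;
import org.apache.iceberg.util.DeleteFileSet;

class PreviousDeleteLoaderSketch {
  static Function<CharSequence, PositionDeleteIndex> previousDeleteLoader(
      Map<String, DeleteFileSet> rewritableDeletes,
      Function<DeleteFileSet, PositionDeleteIndex> loadIndex) {  // hypothetical helper
    if (rewritableDeletes == null) {
      // Nothing was broadcast: return null for every path so the writer skips merging
      // without ever doing a map lookup.
      return path -> null;
    }
    return path -> {
      DeleteFileSet deletes = rewritableDeletes.get(path.toString());
      return deletes != null ? loadIndex.apply(deletes) : null;
    };
  }
}
```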

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-sync-maintenance branch 3 times, most recently from dd1aec3 to 7744ae5 Compare November 4, 2024 17:12
sql("SELECT * FROM %s ORDER BY dep ASC, id ASC", selectTarget()));
}

private void initTable(String partitionedBy, DeleteGranularity deleteGranularity) {
Contributor:

I understand we are refactoring common logic, but this makes the tests harder to read. For instance, the reader has no indication that we add 4 batches in the init method.

@aokolnychyi (Contributor) left a comment:

LGTM. I left some minor comments/suggestions. We will revisit and benchmark this prior to 1.8.

@singhpk234 (Contributor) left a comment:

LGTM as well, thanks @amogh-jahagirdar!

I am assuming that since we went ahead with the broadcast approach, it is sent chunk by chunk using torrent broadcast as @aokolnychyi mentioned, so OOM is not a problem? I was mostly coming from the fact that when doing a BHJ, Spark enforces an 8 GB limit and fails the query if anything larger than that is observed.

@amogh-jahagirdar (Contributor, Author) commented Nov 5, 2024:

> I am assuming that since we went ahead with the broadcast approach, it is sent chunk by chunk using torrent broadcast as @aokolnychyi mentioned, so OOM is not a problem?

I collected some data points on memory consumption of the broadcast here: https://docs.google.com/document/d/1yUObq45kBIwyofJYhurcQrpsdXQJC6NFIPBJugiZNWI/edit?tab=t.0

Torrent broadcast is performed in a chunked way, but that doesn't mean OOMs aren't possible. The TL;DR is that we would have to be at pretty large scale (multiple millions of data files) and have a very large ratio of deletes per data file for OOMs to be hit in most environments, and running position delete maintenance to shrink that ratio, plus increasing memory, should be a practical enough solution. As maintenance of position deletes runs, the ratio between data and delete files becomes closer to 1:1; in V3, this will actually be a requirement. I looked at more distributed approaches to compute this (changing the Spark APIs to pass historical deletes only for a particular executor), but there are limitations there. One thing I'm looking into further is how the Spark + Delta DV integration handles this, and we can perhaps take some inspiration from that, but I don't think there's really any need to wait for all of that.

There are relatively simple things we can do to limit the size of the global map. One is removing any unnecessary metadata that executors don't need, for example the referenced manifest locations per delete file (those are only needed in the driver for more efficient commits); another is relativizing the paths in memory. That should shrink the total memory used by the paths in the in-memory structure, and it has more of an impact the longer the file path is before the actual data/delete file name.
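As a small illustration of the relativizing idea (a hypothetical helper, not the PR's implementation):

```java
class PathRelativizer {
  // Store only the suffix of the data file location under the table location and
  // re-prefix it on lookup; fall back to the absolute path for files stored elsewhere.
  static String relativize(String tableLocation, String dataFileLocation) {
    return dataFileLocation.startsWith(tableLocation)
        ? dataFileLocation.substring(tableLocation.length())
        : dataFileLocation;
  }
}
```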

Edit: one other aspect I looked into is using Spark's BytesToBytesMap, which is an off-heap map; this requires a bit more work but is another possible path to shrinking memory usage. Java strings are UTF-16, while Spark stores strings as UTF-8, so we could theoretically cut memory usage roughly in half for the equivalent data file + delete file paths.

My plan is to work towards having the "simple things we can do" in the 1.8 release, so that we further reduce the chance of OOMs in large-scale cases. The long-term plan is to look at how Spark + Delta DV handles this and, if it makes sense for us, incorporate that strategy here.

> I was mostly coming from the fact that when doing a BHJ, Spark enforces an 8 GB limit and fails the query if anything larger than that is observed.

That 8 GB limit is specific to broadcast joins; there's no such limit enforced by Spark itself for arbitrary broadcasts (of course, there are system limitations that would eventually be hit).

@amogh-jahagirdar (Contributor, Author) commented:

I'll go ahead and merge, thanks for reviewing @singhpk234 @aokolnychyi. As discussed above, I will work towards shrinking the memory consumption of the broadcast before the 1.8 release!

@amogh-jahagirdar amogh-jahagirdar merged commit ad24d4b into apache:main Nov 5, 2024
49 checks passed
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 15, 2025
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 15, 2025