
Conversation

@amogh-jahagirdar (Contributor) commented Oct 7, 2024:

This change consumes the updated fanout position delete writers from #11222 to maintain position deletes during writes (a form of minor compaction). The mapping of data files to file-scoped deletes is broadcast to executors, where the delete file writers merge the historical position deletes with the new position deletes. This behavior is behind a Spark conf, maintain-position-deletes, and can also be controlled via the write.delete.maintain-during-write table property. By default, this maintenance during write is enabled.

ToDo: UpdateProjectionBenchmark may need some changes to use file granularity for MoR cases so that we can look at the impact of this change before/after. The benchmark should probably be run after #11131 gets in.
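To make the flow concrete, here is a minimal, hypothetical sketch (not the actual PR code) of the driver-side broadcast and the executor-side lookup. The method names are illustrative; it assumes Iceberg's DeleteFileSet and Spark's broadcast API, and omits how the mapping itself is built from the table metadata.

```java
import java.util.Map;
import org.apache.iceberg.util.DeleteFileSet;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

class RewritableDeletesSketch {
  // Driver side: broadcast the mapping of data file location -> file-scoped delete files.
  static Broadcast<Map<String, DeleteFileSet>> broadcastRewritableDeletes(
      JavaSparkContext sparkContext, Map<String, DeleteFileSet> rewritableDeletes) {
    return sparkContext.broadcast(rewritableDeletes);
  }

  // Executor side: a delete writer looks up the historical deletes for the data file it is
  // producing new position deletes for, so they can be merged into the newly written delete file.
  static DeleteFileSet previousDeletesFor(
      Broadcast<Map<String, DeleteFileSet>> rewritableDeletes, String dataFileLocation) {
    return rewritableDeletes.value().get(dataFileLocation);
  }
}
```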

@github-actions github-actions bot added the spark label Oct 7, 2024
new BasePositionDeltaWriter<>(
    newDataWriter(table, writerFactory, dataFileFactory, context),
    newDeleteWriter(table, writerFactory, deleteFileFactory, context));
newDeleteWriter(
@amogh-jahagirdar (Contributor, Author) commented:
We may want to put this behind a Spark conf which defaults to loading the previous deletes and doing the merging during the write. Maybe it's better to handle such a conf when building the mapping (if the conf is false, we could return an empty map or not perform the broadcast at all).

That way, in case users hit some unforeseen issue with merging, they can disable the behavior dynamically.

I didn't want to introduce more configuration knobs initially, to keep it simple, but I can see an argument that it's worth it just to have a lever for unforeseen issues with performing minor compactions as part of writes.

CC @aokolnychyi @rdblue @nastra

@amogh-jahagirdar (Contributor, Author) commented Oct 7, 2024:

Some of the tests in TestRewritePositionDeleteFilesProcedure are failing as expected after this change: those tests assert on the expected delete files after performing a set of delete operations, but now that maintenance happens as part of the write, the number of delete files is reduced as expected (e.g.)

If we decide to add a conf, the tests can set it accordingly; otherwise I'll go ahead and update the tests. At the moment I'm still leaning towards adding a conf for the reasons mentioned earlier, but I'll get others' input.

@amogh-jahagirdar (Contributor, Author) commented:
I went ahead and added the configuration and table property, similar to how delete file granularity is configured.

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-sync-maintenance branch 2 times, most recently from 0af0344 to 61c0e1c Compare October 7, 2024 15:21
@github-actions github-actions bot added the core label Oct 7, 2024
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review October 7, 2024 17:40
@amogh-jahagirdar amogh-jahagirdar changed the title from "Spark: Synchronously merge new position deletes with old deletes" to "Spark: Merge new position deletes with old deletes during writing" Oct 7, 2024
@aokolnychyi (Contributor) left a comment:

Looks really promising!


public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16;

public static final String MAINTAIN_POSITION_DELETES_DURING_WRITE =
@aokolnychyi (Contributor) commented Oct 8, 2024:

I am not sure we need this configuration. It is pretty clear we want to always maintain position deletes, and we will not support this property in V3. Given that we want to switch to file-scoped position deletes in V2 tables by default, I think we should just always maintain deletes when the granularity is file.

Contributor:

In other words, switching write.delete.granularity to file should trigger maintenance.
If set to partition, skip it as we can't do it safely.
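For illustration, a minimal sketch of flipping that switch through the existing granularity property, assuming the standard Table API and the TableProperties.DELETE_GRANULARITY constant (the "file" value mirrors the write.delete.granularity setting discussed above):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

class EnableFileGranularity {
  // With file granularity, delete maintenance during writes would kick in per the discussion
  // above; with partition granularity it would be skipped, since it can't be done safely.
  static void enable(Table table) {
    table.updateProperties()
        .set(TableProperties.DELETE_GRANULARITY, "file")
        .commit();
  }
}
```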

Contributor:

> write.delete.granularity to file

[doubt] Are any writers other than Spark respecting this property? Are other writers also going to respect this property going forward, and if yes, how?

Contributor:

This is up to the engine in V2. We are discussing making it a requirement for V3 on the dev list right now.



}
}

protected Map<String, DeleteFileSet> dataToFileScopedDeletes() {
@singhpk234 (Contributor) commented Oct 9, 2024:

Can we put an estimate on the size of the HashMap? If it grows very large, it could fail the query; in that case, should we let the query fail?

Contributor:

This is a good point. We need to check what the actual limit on the object size is and how Spark would behave.
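As a rough, assumed back-of-the-envelope estimate (not a measurement from this PR), the driver-side footprint of the map scales roughly linearly with the number of data files that carry file-scoped deletes; all the numbers below are assumptions for illustration:

```java
class BroadcastSizeEstimate {
  public static void main(String[] args) {
    long dataFilesWithDeletes = 1_000_000L;  // assumed scale
    long avgPathBytes = 200L * 2;            // ~200-char paths; Java String chars are UTF-16 (2 bytes)
    long perEntryOverheadBytes = 300L;       // assumed HashMap entry + DeleteFileSet/DeleteFile overhead
    long roughBytes = dataFilesWithDeletes * (avgPathBytes + perEntryOverheadBytes);
    System.out.printf("~%.0f MB on the driver before serialization%n", roughBytes / 1e6);
  }
}
```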

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-sync-maintenance branch 5 times, most recently from ac6da06 to fc8b39f Compare October 14, 2024 05:13
@nastra (Contributor) left a comment:

Overall this LGTM, just left a few nits.

return location != null ? location.toString() : null;
}

public static boolean isFileScopedDelete(DeleteFile deleteFile) {
Contributor:

nit: might be worth updating `private boolean isFileScoped(DeleteFile deleteFile)` to use this method too.

String partitionStmt = "PARTITIONED BY (id)";
sql(
"CREATE TABLE %s (id int, data string) USING iceberg %s TBLPROPERTIES"
+ "('format-version'='2', 'write.delete.mode'='merge-on-read', 'write.delete.granularity'='file')",
Contributor:

nit: in checkDeleteFileGranularity() we're using TableProperties rather than plain strings, so I think it would be good to do the same here too

@aokolnychyi (Contributor) left a comment:

This seems almost ready to me.

Like @singhpk234 mentioned, we need to verify that Spark can handle this logic if the map gets large. The table state that we currently send to executors is larger than this map, but we need to look into how this particular broadcast is handled. If it is split into chunks by the torrent broadcast, that should probably be OK.

@amogh-jahagirdar, could you check that it is going to be safe?

@Override
public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
  Table table = tableBroadcast.value();
  Map<String, DeleteFileSet> rewritableDeletes = Maps.newHashMap();
Contributor:

What about a helper method and using it directly in statements below?

private Map<String, DeleteFileSet> rewritableDeletes() {
  return rewritableDeletesBroadcast != null ? rewritableDeletesBroadcast.value() : null;
}

@amogh-jahagirdar (Contributor, Author) commented:

Good suggestion, done! Note that since the rewritable deletes can now be null in the writer, I explicitly pass in a previous-delete loader that is the function path -> null. Previously I was passing in an empty map so the lookup would return null, but I think your suggestion is better because we can avoid having to do a lookup in the map entirely.
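A hedged sketch of that idea, assuming a loader of type CharSequence -> PositionDeleteIndex and a loadIndex helper; both are illustrative stand-ins, not necessarily the exact types the PR's writer uses:

```java
import java.util.Map;
import java.util.function.Function;
import org.apache.iceberg.deletes.PositionDeleteIndex;
import org.apache.iceberg.util.DeleteFileSet;

class PreviousDeleteLoaderSketch {
  static Function<CharSequence, PositionDeleteIndex> previousDeleteLoader(
      Map<String, DeleteFileSet> rewritableDeletes,
      Function<DeleteFileSet, PositionDeleteIndex> loadIndex) {  // hypothetical helper
    if (rewritableDeletes == null) {
      // Nothing was broadcast: return null for every path so the writer skips merging
      // without ever doing a map lookup.
      return path -> null;
    }
    return path -> {
      DeleteFileSet deletes = rewritableDeletes.get(path.toString());
      return deletes != null ? loadIndex.apply(deletes) : null;
    };
  }
}
```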

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-sync-maintenance branch 3 times, most recently from dd1aec3 to 7744ae5 Compare November 4, 2024 17:12
sql("SELECT * FROM %s ORDER BY dep ASC, id ASC", selectTarget()));
}

private void initTable(String partitionedBy, DeleteGranularity deleteGranularity) {
Contributor:

I understand we are refactoring common logic, but this makes the tests harder to read. For instance, the reader has no indication that we add 4 batches in the init method.

@aokolnychyi (Contributor) left a comment:

LGTM. I left some minor comments/suggestions. We will revisit and benchmark this prior to 1.8.

@singhpk234 (Contributor) left a comment:

LGTM as well, thanks @amogh-jahagirdar!

I am assuming that since we went ahead with the broadcast approach, it is sent chunk by chunk using torrent broadcast as @aokolnychyi mentioned, so OOM is not a problem? I was mostly coming from the fact that when doing a BHJ, Spark enforces an 8 GB limit and fails the query if anything larger than that is observed.

@amogh-jahagirdar (Contributor, Author) commented Nov 5, 2024:

> I am assuming that since we went ahead with the broadcast approach, it is sent chunk by chunk using torrent broadcast as @aokolnychyi mentioned, so OOM is not a problem?

I collected some data points on memory consumption of the broadcast here: https://docs.google.com/document/d/1yUObq45kBIwyofJYhurcQrpsdXQJC6NFIPBJugiZNWI/edit?tab=t.0

Torrent broadcast is performed in a chunked way, but that doesn't mean OOMs aren't possible. The TL;DR is that we would have to be at pretty large scale (multiple millions of data files) and have a very large ratio of deletes per data file for OOMs to be hit in most environments, and running position delete maintenance to shrink that ratio, plus increasing memory, should be a practical enough solution. As maintenance of position deletes runs, the ratio between data and delete files becomes closer to 1:1; in V3, this will actually be a requirement. I looked at more distributed approaches to compute this (changing the Spark APIs to pass historical deletes only for a particular executor), but there are limitations there. One thing I'm looking into further is how the Spark + Delta DV integration handles this, and we can perhaps take some inspiration from that, but I don't think there's really any need to wait for all of that.

There are relatively simple things we can do to limit the size of the global map. One is removing any unnecessary metadata that executors don't need, for example the referenced manifest locations per delete file (those are only needed in the driver for more efficient commits); another is relativizing the paths in memory. That should shrink the total memory used by the paths in the in-memory structure, and it has more of an impact the longer the file path is before the actual data/delete file name.
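As a small illustration of the relativizing idea (a hypothetical helper, not the PR's implementation):

```java
class PathRelativizer {
  // Store only the suffix of the data file location under the table location and
  // re-prefix it on lookup; fall back to the absolute path for files stored elsewhere.
  static String relativize(String tableLocation, String dataFileLocation) {
    return dataFileLocation.startsWith(tableLocation)
        ? dataFileLocation.substring(tableLocation.length())
        : dataFileLocation;
  }
}
```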

Edit: one other aspect I looked into is using Spark's BytesToBytesMap, which is an off-heap map; this requires a bit more work but is another possible path to shrinking memory usage. Java strings are UTF-16, while Spark stores strings as UTF-8, so we could theoretically cut memory usage roughly in half for the equivalent data file + delete file paths.

My plan is to work towards having the "simple things we can do" in the 1.8 release, so that we further reduce the chance of OOMs in large-scale cases. The long-term plan is to look at how Spark + Delta DV handles this and, if it makes sense for us, incorporate that strategy here.

> I was mostly coming from the fact that when doing a BHJ, Spark enforces an 8 GB limit and fails the query if anything larger than that is observed.

That 8 GB limit is specific to broadcast joins; there's no such limit enforced by Spark itself for arbitrary broadcasts (of course, there are system limitations that would eventually be hit).

@amogh-jahagirdar (Contributor, Author) commented:

I'll go ahead and merge, thanks for reviewing @singhpk234 @aokolnychyi. As discussed above, I will work towards shrinking the memory consumption of the broadcast before the 1.8 release!

@amogh-jahagirdar amogh-jahagirdar merged commit ad24d4b into apache:main Nov 5, 2024
49 checks passed
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 15, 2025
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 15, 2025