Skip to content

Conversation

@szehon-ho
Copy link
Member

@szehon-ho szehon-ho commented Mar 8, 2022

#2925 adds isolation level support to the core Iceberg API ReplacePartitions, and also exposes it via Spark DataFrame.overwritePartitions() API.

This change is to extend isolation level support to the Spark DataFrame.overwrite(filter) API for symmetry. The underlying core Iceberg API (OverwriteFiles) already supported isolation level validation in this case, so the change is smaller.

One observation, DF.overwrite(filter) will be less aggressive than DF.overwritePartitions() in concurrent validation due to the two different API code paths. OverwriteFiles checks exactly for the file that will be re-written, so it will not throw an exception if another file was deleted in the same partition. This is unlike ReplacePartitions API which throws exception if any file was deleted in the same partition, as it does not keep track of files but rather whole partitions.

@github-actions github-actions bot added the spark label Mar 8, 2022
@szehon-ho szehon-ho requested a review from aokolnychyi March 8, 2022 19:46
@aokolnychyi
Copy link
Contributor

Let me take a look. Sorry for the delay.

@aokolnychyi
Copy link
Contributor

One observation, DF.overwrite(filter) will be less aggressive than DF.overwritePartitions() in concurrent validation due to the two different API code paths. OverwriteFiles checks exactly for the file that will be re-written, so it will not throw an exception if another file was deleted in the same partition. This is unlike ReplacePartitions API which throws exception if any file was deleted in the same partition, as it does not keep track of files but rather whole partitions.

That's a bug. Let me fix it.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks correct to me. I had a few nits.

I'll fix OverwriteFiles and it would be great to submit changes to the dynamic overwrites in a separate PR to limit this one to static overwrites.

@szehon-ho szehon-ho force-pushed the overwrite_filter_isolation branch from d4e6bae to 4f793e7 Compare March 10, 2022 01:43
@szehon-ho
Copy link
Member Author

@aokolnychyi thanks, will create another pr soon for the refactoring/cleanup of existing code to be consistent with this one.

Added another test for deleted data files after fix of #3581 , works now.

@szehon-ho
Copy link
Member Author

Restarting as it looks like a temporary build artifact download failure

@aokolnychyi aokolnychyi merged commit 5b9cd7a into apache:master Mar 17, 2022
@aokolnychyi
Copy link
Contributor

Thanks, @szehon-ho! Sorry it took so long to get back to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants