Core: ValidationException on metadata-only delete of certain delete files #4304
@RussellSpitzer thanks for the review, I un-nested the new case (though it seems to still break the "cyclomatic complexity" check, sadly). Also added the new test asserts for metadata delete.

Yeah, to fix that the code needs fewer "exits"; I just thought it reads more easily.

Thanks @RussellSpitzer for the review, I might look at reducing the method's cyclomatic complexity in a follow-up if it's possible.

Note: this fix does not address all the issues without #4370.
The presence of some V2 delete files leads to a ValidationException for a subsequent delete that tries to use a metadata-only operation.
Specifically, these are delete files that cannot be removed by file metrics alone, for example when all values of the column referenced in the delete statement's expression are null (as the test demonstrates).
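For illustration, a sequence in the style of the spark-extensions tests can trigger this; the `sql(...)` helper and `tableName` come from the test base class, and the table properties and values here are assumptions rather than the exact test added by this PR:

```java
// A merge-on-read v2 table where every value of `dep` is null, so neither the data
// files nor the delete files carry useful lower/upper bounds for that column.
sql("CREATE TABLE %s (id INT, dep STRING) USING iceberg " +
    "TBLPROPERTIES ('format-version'='2', 'write.delete.mode'='merge-on-read')", tableName);
sql("INSERT INTO %s VALUES (1, CAST(NULL AS STRING)), (2, CAST(NULL AS STRING))", tableName);

// A row-level delete writes a delete file whose metrics say nothing about `dep`.
sql("DELETE FROM %s WHERE id = 2", tableName);

// The data files fully match this predicate, so Spark picks a metadata-only delete,
// which previously failed with a ValidationException from ManifestFilterManager.
sql("DELETE FROM %s WHERE dep IS NULL", tableName);
```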
This happens because of a mismatch: the optimizer chooses a metadata-only delete if all DataFiles can be deleted using metadata, but it does not check whether all DeleteFiles can also be removed by metadata alone: https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L241
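A simplified, hypothetical condensation of that planner-side decision (not the actual SparkTable code) looks roughly like this; note that only each task's data file is evaluated, while `task.deletes()` is never consulted:

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.expressions.StrictMetricsEvaluator;
import org.apache.iceberg.io.CloseableIterable;

// Hypothetical sketch of the asymmetry: delete files with weak metrics do not
// prevent the planner from choosing the metadata-only path.
class MetadataDeleteDecisionSketch {
  static boolean canDeleteUsingMetadataOnly(
      CloseableIterable<FileScanTask> tasks, StrictMetricsEvaluator strictEvaluator) {
    for (FileScanTask task : tasks) {
      if (!strictEvaluator.eval(task.file())) {
        return false; // a data file might only partially match, fall back to a rewrite
      }
    }
    return true; // every data file fully matches; the delete files were never examined
  }
}
```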
However, the actual metadata-delete logic in ManifestFilterManager also tries to delete DeleteFiles, and it throws a ValidationException when it finds a DeleteFile that cannot be cleanly deleted because the metrics bounds do not match or are not available.
This change reconciles the two by having ManifestFilterManager skip DeleteFiles that do not have tight enough metrics to be safely removed. These will eventually be cleaned up by sequence-number expiration, or later by another action that more aggressively drops dangling DeleteFiles (there are already cases today, beyond this one, where they are left behind).
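A minimal sketch of that reconciliation, with hypothetical helper names (the real logic in ManifestFilterManager is considerably more involved):

```java
import org.apache.iceberg.ContentFile;
import org.apache.iceberg.FileContent;
import org.apache.iceberg.exceptions.ValidationException;
import org.apache.iceberg.expressions.InclusiveMetricsEvaluator;
import org.apache.iceberg.expressions.StrictMetricsEvaluator;

// Hypothetical helper illustrating the behavior this change aims for, not the actual patch.
class MetadataDeleteSketch {

  /**
   * Returns true if the file can be dropped by a metadata-only delete. A data file that
   * might only partially match still fails the operation, while a delete file whose
   * metrics are not tight enough is simply kept, to be cleaned up later by
   * sequence-number expiration.
   */
  static boolean canDropByMetadata(
      ContentFile<?> file, StrictMetricsEvaluator strict, InclusiveMetricsEvaluator inclusive) {
    if (strict.eval(file)) {
      return true; // metrics prove every row matches the delete expression
    }

    if (file.content() == FileContent.DATA && inclusive.eval(file)) {
      // a partial match on a data file means the metadata-only delete is not valid
      throw new ValidationException(
          "Cannot delete file where some, but not all, rows match filter: %s", file.path());
    }

    return false; // delete files (and non-matching data files) are left in place
  }
}
```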
The other options I considered were (1) having the optimizer also check DeleteFiles and fall back to a regular delete when any of them cannot be removed by metrics alone, and (2) never removing DeleteFiles during metadata-only deletes.
These may be cleaner, but the first would be a big performance hit, since it skips the metadata-only delete unnecessarily whenever such a delete file exists.
The second is also too conservative, as 90% of the time the delete files can be removed opportunistically by metadata-only deletes (certain tests actually assert that they are removed, so it would be an unexpected behavior change).
Hence this more specific check.