Core: Optimize MergingSnapshotProducer to use referenced manifests to determine if manifest needs to be rewritten #11131
base: main
Publishing a draft so I can test against the entire CI. I need to think more about a good way to benchmark this, and about whether there are more reasonable optimizations I can make here.
Another possible optimization to think through: if we have both the manifestLocation and the ordinal position of the content file in the manifest, and there are no delete expressions or deletions by pure paths, we can possibly just write the new manifest with the entries that are not at the referenced positions, without evaluating file paths or predicates against manifest entries. Every entry would be compared against the "deleted" position set rather than against file paths or partition values, which should be a bit more performant.
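A minimal sketch of the position-based filtering idea described above, under stated assumptions: `EntrySketch` and `PositionFilterSketch` are hypothetical stand-ins, not Iceberg classes, and the sketch only applies when there are no delete expressions or path-only deletes. It shows that the decision to keep an entry needs only a set lookup on the entry's ordinal position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a manifest entry; not Iceberg's ManifestEntry.
final class EntrySketch {
  final long pos; // ordinal position of the entry within its manifest
  final String path; // carried along, but never compared below

  EntrySketch(long pos, String path) {
    this.pos = pos;
    this.path = path;
  }
}

final class PositionFilterSketch {
  // Returns the entries to write to the new manifest, comparing positions only.
  static List<EntrySketch> retainLive(List<EntrySketch> entries, Set<Long> deletedPositions) {
    List<EntrySketch> live = new ArrayList<>();
    for (EntrySketch entry : entries) {
      // One hash lookup on a Long instead of a path or partition comparison.
      if (!deletedPositions.contains(entry.pos)) {
        live.add(entry);
      }
    }
    return live;
  }
}
```

Whether this is measurably faster than the path-based check is exactly what still needs benchmarking.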
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
ad1dd92 to b230f1e
@@ -421,7 +433,7 @@ private ManifestFile filterManifestWithDeletedFiles(
           entry -> {
             F file = entry.file();
             boolean markedForDelete =
-                deletePaths.contains(file.path())
+                fileIsDeleted(file, manifest)
Open to better names here. I was mostly creating a helper method to group the position-based and path-based checks, since there are a lot of conditions here.
Also just noticed that the position-based comparison may not be optimizing much, since the duplicate file detection case does another string comparison on line 457. So in the current implementation we only remove one path string comparison, not all of them.
47cda8a to 4099608
if (file.manifestLocation() != null) {
  deletedManifestPositions
      .computeIfAbsent(file.manifestLocation(), key -> Sets.newHashSet())
      .add(file.pos());
Note: file.pos() is defined in exactly the cases where file.manifestLocation() is defined. As long as the file is read from a manifest, both will be set, so I don't think we need an additional null check for pos.
We could optimize memory consumption by not populating deletePaths if the manifest location is defined. Basically, put everything from line 162 to the end of the method behind an else block?
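The suggestion above could look roughly like the following sketch. `DeleteTrackerSketch` and its `delete` method are hypothetical and take plain strings instead of a `ContentFile`; plain `java.util` collections stand in for `Sets.newHashSet()` to keep the example self-contained. It assumes that a file with a manifest location never also needs a path entry, which may not hold for every code path in the real class.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical tracker; the field names mirror the snippet under review.
final class DeleteTrackerSketch {
  final Map<String, Set<Long>> deletedManifestPositions = new HashMap<>();
  final Set<String> deletePaths = new HashSet<>();

  // manifestLocation and pos are both null when the file was not read from a manifest.
  void delete(String path, String manifestLocation, Long pos) {
    if (manifestLocation != null) {
      // Position-based tracking: no path string is retained for this file.
      deletedManifestPositions
          .computeIfAbsent(manifestLocation, key -> new HashSet<>())
          .add(pos);
    } else {
      // Fallback: only files without a known manifest populate deletePaths.
      deletePaths.add(path);
    }
  }
}
```

With this shape, the memory held by `deletePaths` grows only with the number of deletes that lack a manifest location.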
4099608 to bfed848
boolean hasDeletedFiles = manifestDeletedPositions.containsKey(manifest.path());
if (hasDeletedFiles) {
  return filterManifestWithDeletedFiles(evaluator, manifest, reader);
The !canContainDeletedFiles(manifest) check on line 310 already optimizes the case where there is nothing to rewrite.
I think we can make the check in canContainDeletedFiles a bit more efficient by checking if manifestDeletedPositions is not empty before doing any of the more expensive checks.
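One reading of this suggestion, as a hedged sketch: consult the cheap map lookup first and fall through to the costlier checks only when it misses. `CanContainDeletedFilesSketch` is hypothetical, and the `BooleanSupplier` merely stands in for whatever expensive checks the real method performs; it is not the method's actual signature.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.BooleanSupplier;

final class CanContainDeletedFilesSketch {
  // expensiveChecks is a placeholder for the remaining, costlier checks.
  static boolean canContainDeletedFiles(
      String manifestPath,
      Map<String, Set<Long>> manifestDeletedPositions,
      BooleanSupplier expensiveChecks) {
    // Cheap check first: a manifest referenced by a deleted file must be rewritten.
    if (manifestDeletedPositions.containsKey(manifestPath)) {
      return true;
    }
    // Evaluated only when the cheap check did not already decide.
    return expensiveChecks.getAsBoolean();
  }
}
```

Passing the expensive part as a supplier keeps it lazy, so it is never evaluated for manifests that are already known to need a rewrite.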
This change optimizes the MergingSnapshotProducer logic that determines whether a file has been deleted in a manifest when deleteFile(ContentFile file) is called. As a result, determining whether a manifest needs to be rewritten is also faster.
Previously, determining whether a manifest file needs to be rewritten had 2 high-level steps:
The optimization in this PR targets step 1 when deleteFile(ContentFile file) is called, by keeping track of the data/delete file's manifest and the position in that manifest that is being deleted. With this, the first pass over the manifest is no longer necessary, since the presence of a referenced manifest means that manifest must be rewritten.
Another (possible, still benchmarking) optimization: since we're tracking positions for the referenced manifests, when we write the new manifest we don't need to compare file paths to decide whether to write out a manifest entry. We can check whether the entry's position is deleted, which should be a lighter-weight set check than a string comparison that, in the worst case, compares character by character.
TODO: Benchmark results here.