Core: Optimize MergingSnapshotProducer to use referenced manifests to determine if manifest needs to be rewritten #11131

Draft
wants to merge 1 commit into
base: main


amogh-jahagirdar
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Sep 13, 2024

This change optimizes how the merging snapshot producer logic determines whether a file has been deleted from a manifest when deleteFile(ContentFile file) is called. As a result, deciding whether a manifest needs to be rewritten becomes cheaper.

Previously, determining whether a manifest file needs to be rewritten had two high-level steps:

  1. Open the manifest and iterate through its entries until one matches a deletion criterion (either a partition expression or a path-based deletion). Stop iterating as soon as one of these criteria is hit.
  2. If step 1 finds that the manifest needs to be rewritten, a new manifest is written with the same contents as the old manifest minus any deleted files. For delete manifests, any delete files older than the min sequence number are also dropped as part of the write.
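As a rough illustration of the two steps above, here is a minimal sketch that models a manifest as a plain list of file paths. The class and method names are hypothetical and are not Iceberg APIs; only the path-based criterion is shown, and partition expressions and sequence-number handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: a "manifest" is just a list of file paths.
final class TwoPassRewriteSketch {
  private TwoPassRewriteSketch() {}

  // Step 1: scan entries and stop at the first one that matches a deletion criterion.
  static boolean hasDeletedEntry(List<String> manifestPaths, Set<String> deletePaths) {
    for (String path : manifestPaths) {
      if (deletePaths.contains(path)) {
        return true; // early exit: one match is enough to require a rewrite
      }
    }
    return false;
  }

  // Step 2: copy every surviving entry into a new manifest.
  static List<String> rewrite(List<String> manifestPaths, Set<String> deletePaths) {
    List<String> kept = new ArrayList<>();
    for (String path : manifestPaths) {
      if (!deletePaths.contains(path)) {
        kept.add(path);
      }
    }
    return kept;
  }

  // Runs step 2 only when step 1 found a match; otherwise the manifest is reused as-is.
  static List<String> filter(List<String> manifestPaths, Set<String> deletePaths) {
    return hasDeletedEntry(manifestPaths, deletePaths)
        ? rewrite(manifestPaths, deletePaths)
        : manifestPaths;
  }
}
```

Note that a manifest that does need rewriting is read twice in this model: once to find the first match and once to write the survivors.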

The optimization in this PR speeds up step 1 when deleteFile(ContentFile file) is called, by keeping track of the manifest each deleted data/delete file came from and the file's position within that manifest. With this, the first pass over the manifest is no longer necessary, since the presence of a referenced manifest means that manifest must be rewritten.
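A minimal sketch of that bookkeeping, assuming a map keyed by manifest location. `ReferencedManifestTracker` and its methods are hypothetical names for illustration, not the classes touched by this PR.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical tracker; names are illustrative and not Iceberg API.
final class ReferencedManifestTracker {
  // manifest location -> ordinal positions of entries marked for deletion
  private final Map<String, Set<Long>> deletedPositions = new HashMap<>();

  // Called when a file that was read from a manifest is deleted.
  void recordDelete(String manifestLocation, long pos) {
    deletedPositions.computeIfAbsent(manifestLocation, key -> new HashSet<>()).add(pos);
  }

  // A referenced manifest must be rewritten; no scan of its entries is needed to decide.
  boolean mustRewrite(String manifestLocation) {
    return deletedPositions.containsKey(manifestLocation);
  }

  // Positions recorded for a manifest, or an empty set if it was never referenced.
  Set<Long> positionsFor(String manifestLocation) {
    return deletedPositions.getOrDefault(manifestLocation, Collections.<Long>emptySet());
  }
}
```

The rewrite decision becomes a single map lookup per manifest instead of a scan over its entries.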

Another (possible, still benchmarking) optimization: since we're tracking positions for the referenced manifests, when we write the new manifest we don't need to compare file paths to decide whether to write out a manifest entry. We can instead check whether the entry's position is in the deleted set, which should be a lighter-weight check than a string comparison that has to compare character by character in the worst case.
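To make the idea concrete, here is a small sketch in which an entry's position is simply its index in a list of paths. `PositionFilterSketch` is a hypothetical name, and whether this is actually faster in practice is exactly what the benchmark is meant to answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model: an entry's position is its index in the list.
final class PositionFilterSketch {
  private PositionFilterSketch() {}

  // Keeps every entry whose position is not in the deleted set; paths are never compared.
  static List<String> keepUndeleted(List<String> manifestPaths, Set<Long> deletedPositions) {
    List<String> kept = new ArrayList<>();
    for (int i = 0; i < manifestPaths.size(); i++) {
      if (!deletedPositions.contains(Long.valueOf(i))) {
        kept.add(manifestPaths.get(i));
      }
    }
    return kept;
  }
}
```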

TODO: add benchmark results here.

@github-actions github-actions bot added the core label Sep 13, 2024
@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Sep 13, 2024

Publishing a draft so I can test against the entire CI. I need to think more about a good way to benchmark this, and about whether there are more reasonable optimizations I can do here.

Another possible optimization to think through:

If we have both the manifestLocation and the ordinal position of the content file in the manifest AND there are no delete expressions or deletions by pure paths, we can possibly just write the new manifest with entries that are not part of the referenced positions without having to evaluate file paths or predicates against manifest entries.

We could just evaluate against the positions (every entry would be compared against the "deleted" pos set) as opposed to file paths/partition values, which should be a bit more performant.
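One way to picture the condition described here is a selector that picks the per-entry check once, before the rewrite loop. This is a sketch only: `DeleteCheckSelector` is a hypothetical name, and the `deleteExpression` predicate over a path is a stand-in for real partition-expression evaluation.

```java
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

// Hypothetical selector; deleteExpression stands in for partition-expression evaluation.
final class DeleteCheckSelector {
  private DeleteCheckSelector() {}

  // Returns a (position, path) -> markedForDelete check for one manifest.
  static BiPredicate<Long, String> select(
      Set<Long> deletedPositions, Set<String> deletePaths, Predicate<String> deleteExpression) {
    boolean positionsOnly = deletePaths.isEmpty() && deleteExpression == null;
    if (positionsOnly) {
      // Fast path: no path-only deletes and no expressions, so positions are sufficient.
      return (pos, path) -> deletedPositions.contains(pos);
    }
    // General path: any of the three criteria marks the entry as deleted.
    return (pos, path) ->
        deletedPositions.contains(pos)
            || deletePaths.contains(path)
            || (deleteExpression != null && deleteExpression.test(path));
  }
}
```

Choosing the check up front keeps the rewrite loop itself free of branching on which deletion modes are active.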

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 7 times, most recently from ad1dd92 to b230f1e Compare September 16, 2024 14:54
@@ -421,7 +433,7 @@ private ManifestFile filterManifestWithDeletedFiles(
               entry -> {
                 F file = entry.file();
                 boolean markedForDelete =
-                    deletePaths.contains(file.path())
+                    fileIsDeleted(file, manifest)
Contributor Author

Open to better names here. I was mostly creating a helper method to group the position-based and path-based checks, since there are a lot of conditions here.
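For readers without the full diff, a helper of this kind could look roughly like the sketch below. The flattened parameter list is an assumption made for illustration; the helper in the PR takes the file and manifest objects, and its exact body is not shown in this thread.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical, flattened version of a helper that groups both checks.
final class FileIsDeletedSketch {
  private FileIsDeletedSketch() {}

  static boolean fileIsDeleted(
      String filePath,
      Long pos,
      String manifestPath,
      Map<String, Set<Long>> deletedManifestPositions,
      Set<String> deletePaths) {
    // Position-based check: only applies if this manifest was referenced by a delete.
    Set<Long> positions = deletedManifestPositions.get(manifestPath);
    boolean deletedByPosition = positions != null && pos != null && positions.contains(pos);
    // Path-based check: fallback for deletes that carried no manifest location.
    return deletedByPosition || deletePaths.contains(filePath);
  }
}
```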

Contributor Author

Also just noticed that the position-based comparison may not be optimizing much, since the duplicate file detection case does another string comparison on line 457. So in the current implementation we only remove one path string comparison rather than all of them.

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 2 times, most recently from 47cda8a to 4099608 Compare September 16, 2024 15:48
    if (file.manifestLocation() != null) {
      deletedManifestPositions
          .computeIfAbsent(file.manifestLocation(), key -> Sets.newHashSet())
          .add(file.pos());
Contributor Author

Note: The case where file.pos() is defined is the same as where file.manifestLocation() is defined; as long as the file is read from a manifest both of these will be set, so I don't think we need an additional null check for pos.

Contributor Author

We could reduce memory consumption by not populating deletePaths when the manifest location is defined. Basically, put everything from line 162 to the end of the method behind an else block?
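A sketch of what the else-block variant might look like, using hypothetical names and plain collections. It assumes nothing else in the class needs the path of a file that already has a manifest location, which the thread does not settle.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical bookkeeping: each delete lands in exactly one of the two structures.
final class DeleteBookkeepingSketch {
  final Map<String, Set<Long>> deletedManifestPositions = new HashMap<>();
  final Set<String> deletePaths = new HashSet<>();

  void delete(String path, String manifestLocation, Long pos) {
    if (manifestLocation != null) {
      // File was read from a manifest: remember only (manifest, position).
      deletedManifestPositions
          .computeIfAbsent(manifestLocation, key -> new HashSet<>())
          .add(pos);
    } else {
      // No manifest reference: fall back to tracking the path string.
      deletePaths.add(path);
    }
  }
}
```

With this split, a path string is retained only for deletes that carry no manifest reference.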

Comment on lines +319 to +321
    boolean hasDeletedFiles = manifestDeletedPositions.containsKey(manifest.path());
    if (hasDeletedFiles) {
      return filterManifestWithDeletedFiles(evaluator, manifest, reader);
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Sep 16, 2024

The !canContainDeletedFiles(manifest) check on line 310 already optimizes the case where there is nothing to rewrite.

Contributor Author

I think we can make the check in canContainDeletedFiles a bit more efficient by checking if manifestDeletedPositions is not empty before doing any of the more expensive checks.
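One possible reading of this suggestion, sketched with hypothetical names: do the cheap map check first and fall through to the costlier checks only when it does not already answer the question. The `expensiveChecks` supplier is a stand-in for the existing checks, which are not shown in this thread, and the lookup by manifest path is an assumption about what the cheap check would be.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical ordering of checks; expensiveChecks stands in for the existing logic.
final class CanContainDeletedFilesSketch {
  private CanContainDeletedFilesSketch() {}

  static boolean canContainDeletedFiles(
      String manifestPath,
      Map<String, Set<Long>> manifestDeletedPositions,
      BooleanSupplier expensiveChecks) {
    // Cheap check first: a referenced manifest definitely contains deleted files.
    if (!manifestDeletedPositions.isEmpty()
        && manifestDeletedPositions.containsKey(manifestPath)) {
      return true;
    }
    // Otherwise defer to the more expensive checks.
    return expensiveChecks.getAsBoolean();
  }
}
```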
