@aokolnychyi
Contributor

This PR adds support for combining historical position deletes in writers, enabling sync maintenance.

}
}

public static <T extends StructLike> CharSequenceMap<PositionDeleteIndex> toPositionIndexes(
Contributor Author

@aokolnychyi Sep 27, 2024

Purely to avoid breaking the API.

private final Supplier<FileWriter<PositionDelete<T>, DeleteWriteResult>> writers;
private final DeleteGranularity granularity;
- private final CharSequenceMap<Roaring64Bitmap> positionsByPath;
+ private final CharSequenceMap<PositionDeleteIndex> positionsByPath;
Contributor

Is this an area where we'd want to explore using Map<String, PositionDeleteIndex> instead of CharSequenceMap? It doesn't need to be in this PR; I'm mostly just wondering.

Contributor

Actually, I think it'll probably make more sense to look at that when I do the update to use location instead of the deprecated path.

Contributor Author

We may want to keep this as CharSequenceMap as writers may use arbitrary CharSequence implementations and it is a bit different from DataFile/DeleteFile structs.
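
To illustrate the concern above: `CharSequence` does not define content-based `equals`/`hashCode`, so a map keyed directly by `CharSequence` misses lookups when the stored key and the probe are different implementations (for example `String` and `StringBuilder`). The sketch below uses only the JDK (Java 11+ for `CharSequence.compare`). `ContentKey` is a hypothetical wrapper invented for this example and is not how Iceberg's `CharSequenceMap` is necessarily implemented.

```java
import java.util.HashMap;
import java.util.Map;

public class CharSequenceKeyDemo {

  // Hypothetical wrapper (not an Iceberg class) that compares keys by character content.
  public static final class ContentKey {
    private final CharSequence value;

    public ContentKey(CharSequence value) {
      this.value = value;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof ContentKey
          && CharSequence.compare(value, ((ContentKey) other).value) == 0;
    }

    @Override
    public int hashCode() {
      // Hash over the characters so that String, StringBuilder, etc. agree.
      int hash = 0;
      for (int i = 0; i < value.length(); i++) {
        hash = 31 * hash + value.charAt(i);
      }
      return hash;
    }
  }

  // Lookup in a map keyed directly by CharSequence: relies on each implementation's own equals.
  public static boolean plainLookup(CharSequence stored, CharSequence probe) {
    Map<CharSequence, Integer> map = new HashMap<>();
    map.put(stored, 1);
    return map.containsKey(probe);
  }

  // Lookup through the content-comparing wrapper.
  public static boolean contentLookup(CharSequence stored, CharSequence probe) {
    Map<ContentKey, Integer> map = new HashMap<>();
    map.put(new ContentKey(stored), 1);
    return map.containsKey(new ContentKey(probe));
  }

  public static void main(String[] args) {
    CharSequence stored = "data/file-a.parquet";
    CharSequence probe = new StringBuilder("data/file-a.parquet");
    System.out.println(plainLookup(stored, probe));   // false: StringBuilder never equals a String
    System.out.println(contentLookup(stored, probe)); // true: same characters
  }
}
```

A `Map<String, ...>` would need every caller to convert its key with `toString()` first, which is the extra step a content-aware map avoids.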


try {
PositionDelete<T> positionDelete = PositionDelete.create();
for (CharSequence path : sort(paths)) {
Contributor

Another aspect I'm curious about: have we ever compared this with using a TreeMap instead of sorting? The time complexity would be the same in the end, but I'm interested in seeing whether there are any significant differences in practice.

Contributor Author

I'd say TreeMap starts to make sense if we access the collection in sorted order more than once. Otherwise, paying the extra cost during inserts may not be worth it.
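
A minimal sketch of the trade-off described above, using only JDK collections: hash-based inserts followed by one sort when the keys are read, versus a `TreeMap` that pays the ordering cost on every insert. The class and method names are hypothetical, and plain `String` keys stand in for the writer's `CharSequence` paths. No performance numbers are implied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortOnceVsTreeMap {

  // Option 1: expected O(1) inserts, then a single O(n log n) sort when the keys are read.
  public static List<String> sortOnce(List<String> paths) {
    Map<String, Integer> counts = new HashMap<>();
    for (String path : paths) {
      counts.merge(path, 1, Integer::sum);
    }
    List<String> sorted = new ArrayList<>(counts.keySet());
    Collections.sort(sorted);
    return sorted;
  }

  // Option 2: O(log n) per insert; iteration is always in sorted order.
  public static List<String> treeMap(List<String> paths) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String path : paths) {
      counts.merge(path, 1, Integer::sum);
    }
    return new ArrayList<>(counts.keySet());
  }

  public static void main(String[] args) {
    List<String> paths = List.of("c.parquet", "a.parquet", "b.parquet", "a.parquet");
    System.out.println(sortOnce(paths)); // [a.parquet, b.parquet, c.parquet]
    System.out.println(treeMap(paths));  // [a.parquet, b.parquet, c.parquet]
  }
}
```

Both shapes yield the same order; they differ only in where the comparison cost is paid. When the sorted view is needed once, at the end, the first shape keeps repeated inserts for the same path free of comparisons.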

@aokolnychyi force-pushed the existing-deletes-in-writers branch from 41fd3b0 to 19c1779 on September 30, 2024, 20:06
Contributor

@amogh-jahagirdar left a comment

Thanks @aokolnychyi! Had minor comments but I think this looks great overall.

}

private PositionDeleteIndex loadPreviousDeletes(CharSequence path) {
return loadPreviousDeletes != null ? loadPreviousDeletes.apply(path) : null;
Contributor

Nit: Would it make sense to default loadPreviousDeletes to a function implementation that just returns null (I think it'd be a one-line lambda, charSequence -> null)? Then I think we could remove this helper method and directly use loadPreviousDeletes.apply(path) on line 150.
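
The suggestion above might look roughly like the following. This is a self-contained sketch, not the PR's code: `DeleteLoaderSketch` and `countPreviousDeletes` are hypothetical names, and `Set<Long>` stands in for `PositionDeleteIndex`.

```java
import java.util.Set;
import java.util.function.Function;

public class DeleteLoaderSketch {

  // Set<Long> is a stand-in for PositionDeleteIndex (assumption made for illustration).
  private final Function<CharSequence, Set<Long>> loadPreviousDeletes;

  // Default: a no-op loader that returns null for every path,
  // so the field itself is never null and needs no guard.
  public DeleteLoaderSketch() {
    this(path -> null);
  }

  public DeleteLoaderSketch(Function<CharSequence, Set<Long>> loadPreviousDeletes) {
    this.loadPreviousDeletes = loadPreviousDeletes;
  }

  // Call site: apply the function directly. The result may still be null
  // when no previous deletes exist for the path.
  public int countPreviousDeletes(CharSequence path) {
    Set<Long> previous = loadPreviousDeletes.apply(path);
    return previous != null ? previous.size() : 0;
  }

  public static void main(String[] args) {
    DeleteLoaderSketch withoutLoader = new DeleteLoaderSketch();
    System.out.println(withoutLoader.countPreviousDeletes("data/file-a.parquet")); // 0

    DeleteLoaderSketch withLoader = new DeleteLoaderSketch(path -> Set.of(3L, 7L));
    System.out.println(withLoader.countPreviousDeletes("data/file-a.parquet")); // 2
  }
}
```

The null check moves from the function reference to its result, which the caller has to handle in either design.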

Contributor Author

I like that, let me update.

}

FileWriter<PositionDelete<T>, DeleteWriteResult> writer = writers.get();
List<DeleteFile> rewrittenDeleteFile = Lists.newArrayList();
Contributor

rewrittenDeleteFiles?

Contributor Author

Good catch.

@aokolnychyi merged commit c8fe01e into apache:main on Oct 1, 2024
@aokolnychyi
Contributor Author

Thanks, @amogh-jahagirdar!
