Skip to content

Conversation

@flyrain
Copy link
Contributor

@flyrain flyrain commented Nov 11, 2021

This PR replaces Set<Long> with Roaring64Bitmap to store the deleted positions for filtering. Bitmap use significant less memory than a set, here is a benchmark. With that, we don't have to use streaming filter in case of large amount of positions, we can keep all deleted positions in memory.
cc @aokolnychyi @rdblue @RussellSpitzer @karuppayya @szehon-ho

}

@Override
public boolean deleted(long position) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we didn't use isDeleted? That is the normal convention for boolean methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change in the new commit.

return positionDeleteIndex;
} catch (IOException e) {
throw new UncheckedIOException("Failed to close position delete source", e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method be named toPositionDeleteIndex? The way PositionDeleteIndex is modeled, it could be implemented by a set.

@rdblue
Copy link
Contributor

rdblue commented Nov 14, 2021

@flyrain, the data on #3287 supports the idea of using a bitmap instead of a set of long values. But this PR actually removes the streaming delete filter, which doesn't load all of the deletes into memory at the same time. Instead, it streams through the delete file to avoid keeping deletes in memory.

Can you provide data comparing the memory consumption and performance of streaming deletes?

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs further discussion. Although there is evidence that a bitmap is better for memory consumption than set of positions, the effect of this PR is actually to remove the more memory efficient streaming delete filter.

I think that this needs to either only replace set with bitmap (as in the summary) or we need more justification that removing the streaming version is also a good idea.

@flyrain
Copy link
Contributor Author

flyrain commented Dec 17, 2021

Thanks @rdblue for the review. Sorry for the delayed update. I agree we need more justification to remove streaming version. I will try to create another one with perf test for that. I'll update this PR to replace set with bitmap, and the refactor you mentioned for the class PositionDeleteIndex.

@flyrain flyrain force-pushed the remove-stream-delete-filter branch from 87bd108 to 9edb689 Compare December 17, 2021 01:33
@flyrain flyrain requested a review from rdblue December 17, 2021 01:37
@flyrain
Copy link
Contributor Author

flyrain commented Dec 17, 2021

It'd be nice to get into 0.13 release since it has the change of API PositionDeleteIndex. Otherwise, we need to go though the deprecated process for the API changes. @jackye1995

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly look good to me, while waiting for the performance test results.

There are quite a few interface changes for the public Deletes class, ideally I think we should just preserve the names or at least mark the old one as deprecated for one release, but I don't know if there is really any place that is depending on this. I feel it's public more for the purpose of the ease of use.


@Override
public boolean deleted(long position) {
public boolean isDeleted(long position) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why changing the name?

Copy link
Contributor Author

@flyrain flyrain Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is suggested by @rdblue. Make sense to me as well. It is less ambiguous.

@flyrain
Copy link
Contributor Author

flyrain commented Dec 17, 2021

Thanks @jackye1995 for the review. Made the change per your suggestion. To be clear, there is no need of perf test in this PR. I will open another PR for replacing the streaming part, which needs the perf test. I agreed with you about the interface. To add to it, the interface PositionDeleteIndex is introduced recently, and not in any existing release.

@flyrain
Copy link
Contributor Author

flyrain commented Jan 4, 2022

Hi @rdblue and @jackye1995, Happy new yea! Do you get a chance to take a look this PR?

@rdblue rdblue merged commit 8dab33c into apache:master Jan 9, 2022
@rdblue
Copy link
Contributor

rdblue commented Jan 9, 2022

Thanks, @flyrain!

@flyrain
Copy link
Contributor Author

flyrain commented Jan 10, 2022

Thanks @rdblue and @jackye1995 for the review!

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 10, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants