Make predicates of delete only initialize once #5195
  }

  public Predicate<T> eqDeletedRowFilter() {
    if (eqDeleteRows == null) {
Do we need to change this?
Yes. The null check keeps the delete rows in eqDeleteRows from being initialized more than once, the same as for eqDeletePredicates.
I just found out that we don't use this function right now. How about directly using the original function like this:
  private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
    Filter<T> remainingRowsFilter = new Filter<T>() {
      @Override
      protected boolean shouldKeep(T item) {
        return eqDeletedRowFilter().test(item);
      }
    };
    return remainingRowsFilter.filter(records);
  }
I don't think this needs to change. The unified predicate and the list of predicates can coexist.
I get you
DeleteFilter#filter is called once for each scan task, and the DeleteFilter is created as a new object each time. How do you call it multiple times with the same DeleteFilter object?
@chenjunjiedada In the Trino query engine, every "page" calls DeleteFilter#filter, so "DeleteFilter#filter is called once for each scan task" is not guaranteed.
chenjunjiedada
left a comment
LGTM
Note that the same can be done in Trino without any API change in Iceberg. See trinodb/trino#13112. Always caching delete filters might be detrimental to Spark, which does not require the filters to be cached.
@lhofhansl More for my understanding, could you elaborate on the cases where caching delete filters might be detrimental to Spark? Do you mean in terms of retaining the filters in memory?
Yep. We'd be losing out on the opportunity to not materialize the complete filter as a set in memory.
An alternative is to add the same API I added to TrinoDeleteFilter (which extends
Thanks, @shidayang! I think this is a good idea for cases where the delete filter is reused. In Spark, this shouldn't affect how long deletes are held in memory because both the filter iterator and the filter are held by the reader. Since there's no negative effect, but there's a very positive effect if this is reused, I think it's a good idea.
Merged. Thanks for catching this, @shidayang!
Thanks for your review, @rdblue |
…ache#5195) (cherry picked from commit 71aa529)
DeleteFilter initializes the delete rows again on every call to DeleteFilter#filter, so calling the method multiple times makes merge-on-read (MOR) performance very poor in Trino.
In my case, a query on one table took 8 minutes; after this optimization it takes only 20 seconds.
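For readers who want to see the idea in isolation, here is a minimal sketch of the initialize-once pattern discussed in this PR. It is not Iceberg's actual DeleteFilter: the class name CachedDeleteFilterSketch, the loader supplier, and the loadCount counter are all hypothetical, and plain Java lists stand in for CloseableIterable. The expensive load runs on the first filter call only, and later calls (for example, one per Trino page) reuse the cached predicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-in for a delete filter; NOT Iceberg's DeleteFilter.
class CachedDeleteFilterSketch<T> {
  // Assumed to be expensive, e.g. reading equality delete files.
  private final Supplier<List<Predicate<T>>> loader;
  // Stays null until the first call to filter().
  private List<Predicate<T>> isDeleted = null;
  // Illustration only: counts how often the loader actually ran.
  private int loadCount = 0;

  CachedDeleteFilterSketch(Supplier<List<Predicate<T>>> loader) {
    this.loader = loader;
  }

  // Lazily build the predicates once and reuse them afterwards.
  private List<Predicate<T>> deletePredicates() {
    if (isDeleted == null) {
      isDeleted = loader.get();
      loadCount++;
    }
    return isDeleted;
  }

  // Keep only the records that no delete predicate matches.
  List<T> filter(List<T> records) {
    List<Predicate<T>> predicates = deletePredicates();
    List<T> kept = new ArrayList<>();
    for (T record : records) {
      boolean deleted = false;
      for (Predicate<T> predicate : predicates) {
        if (predicate.test(record)) {
          deleted = true;
          break;
        }
      }
      if (!deleted) {
        kept.add(record);
      }
    }
    return kept;
  }

  int loadCount() {
    return loadCount;
  }
}
```

The sketch is not thread-safe; it assumes a single reader owns the filter. As noted in the discussion, the trade-off is that the cached predicates stay in memory for as long as the filter object is held.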