
Conversation

@shidayang
Contributor

DeleteFilter initializes the 'delete rows' multiple times when DeleteFilter#filter is called multiple times, which makes merge-on-read (MOR) performance very low in Trino.
In my case, a query on one table took 8 minutes; after this optimization it takes only 20 seconds.
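
(For illustration, a minimal sketch of the fix: initialize the predicate lazily and keep it, so repeated DeleteFilter#filter calls reuse it instead of re-reading the equality delete files. eqDeleteRows and eqDeletedRowFilter appear in the PR diff below; buildEqDeletePredicate is a hypothetical helper, not the actual Iceberg method.)

// Sketch only: lazy, one-time initialization of the equality-delete predicate.
private Predicate<T> eqDeleteRows = null;

public Predicate<T> eqDeletedRowFilter() {
  if (eqDeleteRows == null) {
    // Build the predicate once; later calls reuse the cached result
    // instead of re-reading the equality delete files.
    eqDeleteRows = buildEqDeletePredicate();  // hypothetical helper
  }
  return eqDeleteRows;
}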

@github-actions github-actions bot added the data label Jul 4, 2022
@shidayang shidayang changed the title from "Make predicates of delete only initiate once" to "Make predicates of delete only initialize once" Jul 4, 2022
}

public Predicate<T> eqDeletedRowFilter() {
  if (eqDeleteRows == null) {
Collaborator

Do we need to change this?

Contributor Author

Yes. The purpose is the same as for eqDeletePredicates: caching eqDeleteRows avoids initializing the 'delete rows' more than once.

Collaborator

I just found out that we don't use this function right now. How about calling the original function directly, like this:

private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
  // Keep only the rows that pass the (cached) equality-delete predicate
  Filter<T> remainingRowsFilter = new Filter<T>() {
    @Override
    protected boolean shouldKeep(T item) {
      return eqDeletedRowFilter().test(item);
    }
  };

  return remainingRowsFilter.filter(records);
}

Contributor

I don't think this needs to change. The unified predicate and the list of predicates can coexist.
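
(For illustration, a hedged sketch of how the two can coexist, each cached and initialized at most once. buildEqDeletePredicates is a hypothetical helper; the real DeleteFilter may differ.)

// Sketch only: the cached list and the cached unified predicate coexist.
private List<Predicate<T>> eqDeletePredicates = null;  // one predicate per equality delete set
private Predicate<T> eqDeleteRows = null;              // the unified row filter

private List<Predicate<T>> applyEqDeletes() {
  if (eqDeletePredicates == null) {
    eqDeletePredicates = buildEqDeletePredicates();  // hypothetical: loads the delete files once
  }
  return eqDeletePredicates;
}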

Contributor Author

Got it.

@chenjunjiedada
Collaborator

DeleteFilter#filter is called once for each scan task, and the DeleteFilter is created as a new object for each task. How do you call it multiple times with the same DeleteFilter object?

@shidayang
Contributor Author

@chenjunjiedada In the Trino query engine, every "page" calls DeleteFilter#filter, so "DeleteFilter#filter is called once for each scan task" is not guaranteed.
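
(To illustrate the call pattern being described: a sketch, not actual Trino code; createDeleteFilter, Page, toRows, and emit are hypothetical stand-ins.)

// One DeleteFilter per split, but filter() is invoked once per page.
DeleteFilter<TrinoRow> deleteFilter = createDeleteFilter(task);  // hypothetical factory
for (Page page : pages) {
  // Without caching, every call below re-initializes the delete rows;
  // with the cache, only the first call pays that cost.
  CloseableIterable<TrinoRow> live = deleteFilter.filter(toRows(page));  // toRows: hypothetical conversion
  emit(live);  // hypothetical downstream consumer
}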

@chenjunjiedada (Collaborator) left a comment

LGTM

@lhofhansl

Note that the same can be done in Trino without any API change in Iceberg. See trinodb/trino#13112.
(But I do not care where we fix it.)

Always caching delete filters might be detrimental to Spark, which does not require the filters to be cached.

@amogh-jahagirdar
Contributor

amogh-jahagirdar commented Jul 12, 2022

@lhofhansl More for my understanding, could you elaborate on the cases where caching delete filters might be detrimental to Spark? Do you mean in terms of retaining the filters in memory?

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jul 12, 2022
@shidayang shidayang requested a review from rdblue July 12, 2022 04:48
@lhofhansl

lhofhansl commented Jul 12, 2022

could you elaborate on the cases where caching delete filters might be detrimental to Spark? Do you mean in terms of retaining the filters in memory?

Yep. We'd lose the opportunity to avoid materializing the complete filter as a set in memory. See:

  Deletes.streamingFilter(records, this::pos, Deletes.deletePositions(filePath, deletes));

An alternative is to add the same API I added to TrinoDeleteFilter (which extends DeleteFilter<TrinoRow>) in the Trino PR, namely a filterCached method (could call it cacheAndFilter as well). That way we can leave the Spark code in place and Trino can use the caching method.
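
(A rough sketch of that alternative: the base filter() stays streaming, while a filterCached method opts into caching. The method name is from the Trino PR; the field and body here are illustrative.)

// Sketch only: engines that reuse the filter cache the predicate explicitly,
// leaving the default streaming filter() untouched for Spark.
private Predicate<TrinoRow> cachedRowFilter = null;  // illustrative field

public CloseableIterable<TrinoRow> filterCached(CloseableIterable<TrinoRow> records) {
  if (cachedRowFilter == null) {
    cachedRowFilter = eqDeletedRowFilter();  // materialize once, reuse for every page
  }
  return CloseableIterable.filter(records, cachedRowFilter);
}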

@rdblue
Contributor

rdblue commented Jul 12, 2022

Thanks, @shidayang!

I think this is a good idea for cases where the delete filter is reused. In Spark, this shouldn't affect how long deletes are held in memory because both the filter iterator and the filter are held by the reader. Since there's no negative effect, but there's a very positive effect if this is reused, I think it's a good idea.

@rdblue rdblue merged commit 71aa529 into apache:master Jul 12, 2022
@rdblue
Contributor

rdblue commented Jul 12, 2022

Merged. Thanks for catching this, @shidayang!

@shidayang shidayang deleted the DeleteFilter-optimize branch July 13, 2022 01:59
@shidayang
Contributor Author

Thanks for your review, @rdblue

@shidayang shidayang mentioned this pull request Jul 13, 2022
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025