Cache Iceberg equality and positional delete filters. #13112
lhofhansl wants to merge 2 commits into trinodb:master
Conversation
Looking at the failures. ... should be fixed now.
Force-pushed d53f3a6 to b1d01c5.
Review comment on plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/TrinoDeleteFilter.java (outdated, resolved).
Can we add tests similar to what is found in …? Maybe add a preparatory commit with the corresponding tests before your changes, so that the improvements coming in this PR can be seen in its main commit.
I am having a hard time understanding what is actually being cached here.
Can you please add a comment or rename the `xxxCached` methods to better match their purpose?
Yeah. For that you'll need to look at Iceberg's DeleteFilter. Let me think about how to best comment that.
Added some comments. Lemme know what you think.
Thank you for taking the time to point out the memoization technique for applying the delete filters.
I'm curious what the downsides of memoization are that keep org.apache.iceberg.data.DeleteFilter#filter from using this technique.
cc @rdblue
I assume this has to do with Spark's partition-at-a-time operation. In that case it can evaluate the filters in a streaming fashion.
So for many filtered rows this might be slower in Spark. In Trino it's a "disaster" either way, as it is doing a Page at a time.
Although I will say that when the set of deleted rows is large, one should try to formulate the delete as an equality delete anyway.
Note that the fix the Iceberg folks are proposing is to always memoize the filters.
Let me look at that test. Since this is a performance improvement with no correctness implications, it might be a bit tricky as to what we should test. Perhaps we can count the number of times the filter is re-read and parsed (update: I see that's exactly what TrackingFileIoProvider does). I'll think about it. (I'll be on the road the next few days, so this might go a bit more slowly.)
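The memoization discussed above can be illustrated with a minimal sketch: load the set of deleted positions once on first use and reuse it for every subsequent page, instead of re-reading and re-parsing the delete files each time. All names here are illustrative, not Trino's or Iceberg's actual API; the `loadCount` counter only exists to demonstrate that the expensive load happens once.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of per-split memoization of a delete filter.
final class CachedDeleteFilter
{
    private final Supplier<Set<Long>> deletedPositionsLoader;
    private Set<Long> deletedPositions; // loaded lazily, then reused for all pages
    private int loadCount; // demonstration only: how often the expensive load ran

    CachedDeleteFilter(Supplier<Set<Long>> deletedPositionsLoader)
    {
        this.deletedPositionsLoader = deletedPositionsLoader;
    }

    // Called for every row of every page; the load happens only on the first call
    boolean isDeleted(long position)
    {
        if (deletedPositions == null) {
            deletedPositions = deletedPositionsLoader.get();
            loadCount++;
        }
        return deletedPositions.contains(position);
    }

    int loadCount()
    {
        return loadCount;
    }
}
```

The trade-off raised in the thread is visible here: the whole position set stays in memory for the lifetime of the filter, whereas a streaming evaluation (as Spark can do per partition) would not need to materialize it.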
Force-pushed b1d01c5 to 07ba39c.
Added comments and fixed … and fixed …
Note that the …
@lhofhansl FYI, related work is also underway on Iceberg: apache/iceberg#5195
It seems to track only metadata operations (not data operations)...?
That would solve the problem in the exact same way. Should we wait for that instead? Looking at the Iceberg code... If we do not cache the filter we can stream the data rows through the filters without completely materializing them. So for Spark it seems the caching could be detrimental. I'm fine either way :)
Since Iceberg is considering the same strategy (cache the filters) I removed the WIP annotation.
When the Iceberg PR lands, it may still take a while until the new Iceberg version is released and integrated into Trino. Having a test in this PR would ensure that when we eventually switch to the Iceberg fix there will be no performance penalty.
Force-pushed 07ba39c to 33c46e4.
Rebased for another test run. Still looking into the test.
Force-pushed 33c46e4 to 0ce2711.
OK... Added a test. Took me a while to realize that (a) delete filters are not read via … Then I had to fix up …
@lhofhansl, I merged the Iceberg change to only load equality deletes once per …
@rdblue So the Iceberg change won't fix the Trino problem with positional deletes. Curious, why only equality deletes and not also positional deletes?
The PR only covered equality. Doing the same for positional deletes is next, although those aren't always held in memory. |
I see. Thanks. This is mostly about positional deletes.
That's the part that I think could be detrimental to Spark.
I think this is a good short-term improvement, but I think we should try to avoid keeping all the deleted row numbers in memory if possible.
If we can assume splits for the same file will always come in together, and in order, we could use a streaming/iterative read approach:
- First Split for a file arrives: initialize a DeleteFilter
- Second Split arrives for the same file: reuse the existing DeleteFilter
- First Split arrives for a different file: close the old DeleteFilter and open a new one for the new file
- repeat
But I'd want some validation from @findepi on whether we can rely on ordering like that.
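The split-ordering idea above could be sketched roughly as follows. This is a hypothetical illustration, not Trino code: it assumes splits for the same data file arrive contiguously (the very assumption the thread questions), and `SimpleDeleteFilter`, `DeleteFilterHolder`, and the factory callback are made-up names.

```java
import java.io.Closeable;
import java.util.function.Function;

// Hypothetical holder that keeps one open delete filter and reuses it while
// consecutive splits reference the same data file.
final class DeleteFilterHolder
{
    interface SimpleDeleteFilter extends Closeable
    {
        @Override
        void close(); // narrow Closeable so close() does not throw
    }

    private String currentFilePath;
    private SimpleDeleteFilter currentFilter;

    SimpleDeleteFilter filterForSplit(String filePath, Function<String, SimpleDeleteFilter> factory)
    {
        if (!filePath.equals(currentFilePath)) {
            if (currentFilter != null) {
                currentFilter.close(); // different file: release the old filter
            }
            currentFilter = factory.apply(filePath); // open a filter for the new file
            currentFilePath = filePath;
        }
        return currentFilter; // same file: reuse the existing filter
    }
}
```

If splits for the same file are not guaranteed to arrive contiguously, this degrades to reopening filters repeatedly, which is exactly the problem the PR caches around.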
A single delete filter file can affect any data file in the partition. So I think we need to load all filters for the partition.
(Unless we open them ahead of time, analyze them, and then determine which ones affect which files.)
They should be filtered using the min/max values of the path column, so only some of the partition's filters are fully opened.
This is for positional deletes specifically.
Thanks @alexjo2144. Are you saying that is already happening, or is it something we need to add?
(In any case, whatever it is doing, it is doing it for every page without this change. :) )
I suggest we merge this. As is, Iceberg delete filters are useless with Trino (tiny queries just time out.)
Review comment on plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergDeleteFilter.java (outdated, resolved).
Force-pushed 0ce2711 to aba2dd8.
Review comment on plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergDeleteFilter.java (outdated, resolved).
Review comment on plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergDeleteFilter.java (outdated, resolved).
Force-pushed 85f0349 to cdf0ad6.
nit: remove …
Another relevant PR from the Iceberg side: apache/iceberg#5264
It looks like the Iceberg community is considering a release soon anyway. Maybe we can just wait for that?
apache/iceberg#5264 won't fix the case where they decide not to materialize into a set (hardcoded to 100k rows). I am also not sure that this should be fixed in Iceberg itself; it should just provide the right API so we can implement it as we see fit. I prefer an explicit implementation in Trino, but that's just a preference. In the end I do not really care as long as it gets fixed. :)
Force-pushed cdf0ad6 to d2e73c0.
Another relevant issue I filed with the Iceberg community that would help this case: apache/iceberg#5272. Not to say we should not make this change; reopening the delete files for every page is definitely an issue.
Force-pushed d2e73c0 to 896a231.
Is this waiting for something from me? (Just making sure)
So, I was thinking about this more and I think we can do something better here without waiting for the next Iceberg release. The main thing I'm worried about is keeping all of the deleted rows in memory for each Split. The Iceberg cutoff at 100,000 records seems pretty reasonable, so I'm not sure we should circumvent that. Can we try refactoring the code a bit so that there's only one call to …? I think that, with the other improvements in the Iceberg codebase, this should help a lot. Does that make sense?
Yep, if you can make that work. How would you match up the new delete positions with the rows passed by the current page? Wanna open another PR and propose the change? And let's also make sure we do not allow the perfect to be the enemy of the good. As is, V2 deletes are dangerous in Trino and we should disable them or fail immediately until we have this fixed. Caching at least makes them usable. If there is a large set of deleted rows, yes, there's a risk of high memory usage; folks should use a predicate (i.e. equality deletes) instead of positional deletes anyway.
And how would you avoid that each page's worth of rows is now passed through the entire set of filters each time? We'd be trading more CPU for less memory.
I don't think keeping all the positions in memory for the split is a problem. RoaringBitmap is very efficient and most Iceberg data files won't have more than a few million rows. I'm working on a change to reimplement delete handling natively in Trino and didn't bother with a streaming approach.
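The bitmap representation mentioned above can be sketched as follows. RoaringBitmap is the library referenced in the comment; to keep this sketch dependency-free, `java.util.BitSet` stands in for it here (which limits positions to the int range, unlike Roaring's 64-bit variants). The class and method names are made up for illustration.

```java
import java.util.BitSet;

// Illustrative index of deleted row positions for a single data file.
// A real implementation would use RoaringBitmap for compressed storage.
final class PositionDeleteIndex
{
    private final BitSet deleted = new BitSet();

    void delete(int position)
    {
        deleted.set(position);
    }

    boolean isDeleted(int position)
    {
        return deleted.get(position);
    }

    int sizeInBits()
    {
        return deleted.size(); // roughly one bit per row up to the highest deleted position
    }
}
```

Even a plain bitset needs only about 1 bit per row, so a few million rows cost well under a megabyte, which supports the point that holding the positions per split is cheap.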
@electrum Should I close this one in favor of the coming native implementation?
@lhofhansl thanks for your work on this. I started my native implementation before I was aware of your fix. It should be ready now: #13219. If you are able to test it out on real data, that would be much appreciated.
Closing in favor of #13219.
Description
This follows the apparent design choice of Iceberg's delete filters operating on a partition at a time.
Before this change, Iceberg DeleteFilters were reloaded and reparsed for each page. This PR keeps the filters for the duration of the split, which is how Iceberg delete filters are designed.
This uses only existing API from Iceberg's DeleteFilter.
This speeds up some queries involving delete filters by a factor of 1000 or more. See #13092 for an explanation.
Performance Fix
Iceberg Connector
Deleting rows in Iceberg V2 leads to very slow read performance following the delete.
Related issues, pull requests, and links
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: