Bundle Iceberg reads into combined scan tasks #12359
alexjo2144 wants to merge 5 commits into trinodb:master from
Conversation
electrum left a comment
Doing this for UPDATE and DELETE makes the code complicated and hard to follow, and the amount of test coverage is not clear. After we have MERGE, we'll eventually replace the existing UPDATE and DELETE implementations to use it, so we won't need UpdatablePageSource anymore.
We can support this only for reads, as we do in Raptor, by packing only for reads. The table handle can have a flag indicating whether it's a write (it already does for UPDATE; it just needs one for DELETE). Then the split source can skip packing for writes. On the page source provider side, skip creating the combined page source if there is only one split.
I suspect almost all of this complexity goes away if we only support reads.
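The gating described above can be sketched as follows. This is a minimal illustration, assuming hypothetical flag names on the table handle; the real IcebergTableHandle fields differ.

```java
// Sketch of read-only gating: the split source consults write flags on the
// table handle and skips packing for writes, so UpdatablePageSource keeps
// its 1:1 split-to-file relationship. Flag names are assumptions.
record TableHandleSketch(boolean isUpdate, boolean isDelete) {
    boolean isWrite() { return isUpdate || isDelete; }
}

class SplitPackingSketch {
    // Returns how many splits the scheduler will see: writes keep one
    // split per file; reads bundle filesPerSplit files per split.
    static int packedSplitCount(TableHandleSketch handle, int fileCount, int filesPerSplit) {
        if (handle.isWrite()) {
            return fileCount;
        }
        return (fileCount + filesPerSplit - 1) / filesPerSplit; // ceiling division
    }
}
```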
This recomputes the values for every call. Instead, we can compute as we go, since the stats for completed page sources will not change. See io.trino.plugin.raptor.legacy.util.ConcatPageSource
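The accumulate-as-you-go approach the reviewer suggests can be sketched like this. The Source interface is an illustrative stand-in for ConnectorPageSource, not the actual API.

```java
// Sketch of incremental stats accumulation (in the spirit of Raptor's
// ConcatPageSource): totals of exhausted sources are folded into a running
// counter once, so getCompletedBytes() never re-reads finished sources.
interface Source {
    long completedBytes();
}

class ConcatStatsSketch {
    private long finishedBytes; // frozen total from exhausted sources
    private Source current;

    void setCurrent(Source source) {
        current = source;
    }

    // Call exactly once when the current source is exhausted.
    void finishCurrent() {
        if (current != null) {
            finishedBytes += current.completedBytes();
            current = null;
        }
    }

    long getCompletedBytes() {
        return finishedBytes + (current == null ? 0 : current.completedBytes());
    }
}
```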
Simply add them all to a com.google.common.io.Closer which handles this automatically.
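For illustration, the pattern Guava's Closer implements looks roughly like this stdlib-only sketch (Guava's version additionally handles rethrow semantics more carefully):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of the Closer pattern: register resources as they are
// created, then close them all in reverse order, suppressing secondary
// failures onto the first exception.
class SimpleCloser implements AutoCloseable {
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    <C extends AutoCloseable> C register(C resource) {
        resources.addFirst(resource); // reverse order on close
        return resource;
    }

    @Override
    public void close() {
        RuntimeException failure = null;
        while (!resources.isEmpty()) {
            try {
                resources.removeFirst().close();
            }
            catch (Exception e) {
                if (failure == null) {
                    failure = new RuntimeException("failed to close", e);
                }
                else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }
}
```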
No need for a builder class. This is basically just an ImmutableList.Builder.
With the recently added support for multiple handles, no need to make changes to IcebergSplit. You can add a new class CombinedIcebergSplit that contains List<IcebergSplit>. Then just do a simple instanceof check in IcebergPageSourceProvider and turn the existing createPageSource() method into a private helper method that you call in a loop.
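The suggested shape can be sketched as below. These types are illustrative stand-ins, not the real IcebergSplit or IcebergPageSourceProvider classes, and the page source is mocked as a string.

```java
import java.util.List;

// Hypothetical sketch: a wrapper split holding the packed splits,
// unpacked with an instanceof check in the page source provider.
record IcebergSplitSketch(String path) {}

record CombinedIcebergSplitSketch(List<IcebergSplitSketch> splits) {}

class PageSourceProviderSketch {
    // The existing createPageSource() becomes a private helper invoked
    // once per packed split.
    List<String> createPageSources(Object split) {
        if (split instanceof CombinedIcebergSplitSketch combined) {
            return combined.splits().stream()
                    .map(this::createSinglePageSource)
                    .toList();
        }
        return List.of(createSinglePageSource((IcebergSplitSketch) split));
    }

    private String createSinglePageSource(IcebergSplitSketch split) {
        return "pageSource:" + split.path(); // placeholder for the real page source
    }
}
```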
Yeah, pretty much all of it is just there to get the rows back to the PageSource that generated them. I can try going with the original split handling for UPDATE and DELETE. Alternatively, we could write one new data file and one delete file per batch, and not worry about delegating at all. The complexity there is just that the position delete file needs to be sorted by the data file path.
That sounds undesirable (more data to read at read time than with one deletion file per input data file). I like the idea of keeping this simple and applying it to read queries only. We can always revisit this later if needed.
@alexjo2144 after: So it's like 297s vs 30s, around 10x better.
So with packed splits there are fewer FileScanTasks? Can a single FileScanTask read multiple files? Why is FileScanTask a bottleneck? I suppose the same amount of IO needs to happen. Generally, we now have weighted splits support; it's used in Hive. Would it help here?
Did you try assigning split weights (as in Hive) instead?
How do weighted splits work? Do they allow splits to be combined to avoid scheduling overhead?
How do weighted splits work?
Check #9059. It's described there pretty well.
Do they allow splits to be combined to avoid scheduling overhead?
Split scheduling overhead itself is minimal. Split queues might be too short if there are a lot of splits (so workers could get starved), but that's what weighted split scheduling solves. Overall, I would use it instead of gluing splits together. The additional complexity of weighted splits is minimal, and splits can also be balanced better.
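For illustration, size-proportional split weighting in the spirit of Hive's approach can be sketched like this. The class and parameter names are illustrative, not Trino's actual SplitWeight API: the weight is the split's size relative to a target size, with a floor so tiny splits still count and a cap at the standard weight of 1.0.

```java
// Sketch of size-based split weighting: many small splits can then share
// one scheduling slot, because the scheduler sums weights rather than
// counting splits.
class SizeBasedWeightSketch {
    private final double minimumWeight;
    private final long targetSplitSizeBytes;

    SizeBasedWeightSketch(double minimumWeight, long targetSplitSizeBytes) {
        this.minimumWeight = minimumWeight;
        this.targetSplitSizeBytes = targetSplitSizeBytes;
    }

    double weightForSize(long splitSizeBytes) {
        double proportional = (double) splitSizeBytes / targetSplitSizeBytes;
        // clamp into [minimumWeight, 1.0]
        return Math.min(Math.max(proportional, minimumWeight), 1.0);
    }
}
```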
I like that provided this solves the problem.
@alexjo2144 can you please make another version of the improvement, based on weighted splits?
I took a stab at this here. Results are:
trino> select count(*) from iceberg.iceberg_tpch_sf1000_orc.lineitem_small_files;
_col0
------------
5999989709
(1 row)
Query 20220519_184353_00008_68mz2, FINISHED, 6 nodes
Splits: 237,965 total, 237,965 done (100.00%)
1:02 [6B rows, 151GB] [97.4M rows/s, 2.46GB/s]
So still faster than we have on master, but there seems to still be some extra overhead from the number of splits being so high.
I'm going to try a third branch with both the code here and SplitWeights.
Here's combined splits + split weights:
trino> select count(*) from iceberg.iceberg_tpch_sf1000_orc.lineitem_small_files;
_col0
------------
5999989709
(1 row)
Query 20220519_200949_00011_empm9, FINISHED, 6 nodes
Splits: 7,469 total, 7,469 done (100.00%)
24.89 [6B rows, 7.25TB] [241M rows/s, 298GB/s]
So still faster than we have on master, but there seems to still be some extra overhead from the number of splits being so high.
How do you assign split weights? Is it similar to the Hive connector approach?
but there seems to still be some extra overhead from the number of splits being so high.
That is an assumption. It could be either:
- some kind of overhead (not yet known what)
- workers still getting starved because the queue size is not sufficient even with weighted splits. You can try increasing the split queue size (node-scheduler.max-splits-per-node, node-scheduler.max-pending-splits-per-task) to see if it helps. Is Iceberg generating splits fast enough? Check query.schedule-split-batch-size.
I would like to learn where the overhead comes from since we already have a mechanism (weighted splits) that should address the issue of multiple small splits. It would be better to fix/tune it rather than add yet another mechanism.
If the conclusion is that the overhead is X and it can be only addressed in Iceberg with merging splits, that's fine. However, if we start packing splits together for other connectors we should know what overhead we address that cannot be addressed by weighted splits.
I would like to learn where the overhead comes from since we already have a mechanism (weighted splits) that should address the issue of multiple small splits. It would be better to fix/tune it rather than add yet another mechanism.
That's a good point
If the conclusion is that the overhead is X and it can be only addressed in Iceberg with merging splits, that's fine.
Or we could even think of making an engine-side equivalent of these changes, so connectors don't have to do this.
Or we could even think of making an engine-side equivalent of these changes, so connectors don't have to do this.
Sure, but then you probably don't need weighted split scheduling. Also, large splits are not necessarily good, because they provide less granularity of work, which leads to work skew (especially when there are more nodes).
cc @losipiuk
What are the semantics of the (undocumented) io.trino.spi.connector.ConnectorSplit#getAddresses?
Do we need to union or intersect these lists?
(concatenation is unlikely to be what we want)
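The two options under discussion can be sketched as follows, with plain strings standing in for io.trino.spi.HostAddress. Note an empty intersection would mean no host is local to every packed split, so a fallback (for example, the union) would likely be needed; this is an illustration, not a resolution of the question.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two merge strategies for packed splits' preferred hosts.
class AddressMergeSketch {
    static Set<String> union(List<Set<String>> perSplit) {
        Set<String> result = new HashSet<>();
        perSplit.forEach(result::addAll);
        return result;
    }

    static Set<String> intersection(List<Set<String>> perSplit) {
        if (perSplit.isEmpty()) {
            return Set.of();
        }
        Set<String> result = new HashSet<>(perSplit.get(0));
        perSplit.subList(1, perSplit.size()).forEach(result::retainAll);
        return result;
    }
}
```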
Initialize delegates lazily; they may allocate resources.
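Lazy initialization of the delegates can be sketched with a memoizing supplier; this is a plain single-threaded illustration, not concurrency-safe and not Trino's actual helper:

```java
import java.util.function.Supplier;

// Sketch: wrap each delegate page source in a memoizing supplier so the
// underlying resources are only allocated when the source is first read.
class LazySketch<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean initialized;

    LazySketch(Supplier<T> factory) {
        this.factory = factory;
    }

    @Override
    public T get() {
        if (!initialized) {
            value = factory.get(); // allocation deferred until first use
            initialized = true;
        }
        return value;
    }
}
```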
Would it make sense to unify this with io.trino.plugin.raptor.legacy.util.ConcatPageSource within the plugin toolkit?
For simplicity we could checkArgument(!delegatePageSources.isEmpty()) to avoid dead code.
Can you please extract some refactorings?
This shows up as a single huge change, but I guess most of the code is just being moved around.
Please make that apparent.
finish has this.splitIterator = CloseableIterator.empty();
let's decide whether splitIterator is nullable and simplify finish or isFinished accordingly
Sounds like something the engine could (should?) provide directly to ConnectorSplitManager.
cc @losipiuk
findepi left a comment
Pending discussion #12359 (comment)
Reuse the bin packing done by the Iceberg table scan planning by keeping small reads bundled together in a single Split.
Bundling did not end up showing a significant improvement over using SplitWeights. Replacing this PR with #12579.
Reuse the bin packing done by the Iceberg table scan planning by keeping small reads bundled together in a single Split.
Description
Related issues, pull requests, and links
Fixes: #12162
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: