Conversation
1830b51 to
d3fe71f
Compare
There was a problem hiding this comment.
drop comment (not super helpful) and rename method to offerAll.
Or maybe we can just drop method at all and do:
synchronized (this) {
for (T element : borrowResult.getElementsToInsert()) {
offer(element);
}
}
WDYT?
There was a problem hiding this comment.
Comment removed, method ranmed to offerAll.
We still need to check the finishing && borrowCount == 0 condition as well as the notEmptySignal notification criteria, so a private method seems warranted to me instead of inlining the logic. Additionall, ArrayDeque does specialize addAll to ensure that the capacity grows to accomodate the entire collection of new inserted elements which individual separate offer calls does not.
There was a problem hiding this comment.
is there a chance that split startin at 0 could be filtered out alltogether? While still another splits for same file are present? Can this happen with DF for example? cc: @raunaqmorarka ?
There was a problem hiding this comment.
It should not be possible, no- the first split generated for a file will always have start == 0. Now, it is possible that a file will get eliminated by dynamic filters after the first split is produced which could then eliminate the remainder of the file from being scanned, but that could happen even with the previous implementation. As far as I can tell, this logic is only necessary for "optimizing small files" and collecting the files scanned to delete them after the contents are rewritten- which should not have dynamic filters in the first place because it would be unsafe for this reason with or without this improvement.
d3fe71f to
65165ee
Compare
Adds a method to handle batch insertion of all elements to insert returned from AsyncQueue.BorrowResult. This simplifies the implementation and threads contending to synchronize on the queue for each individual element inserted.
Only records the Path for scanned files in HiveSplitSource when the split start point is at the beginning of the file. The only consumer of the scanned files list is only interested in unique paths scanned, so enqueuing the same path repeatedly for the same InternalHiveSplit is unnecessarily expensive.
65165ee to
e9bfb3d
Compare
|
Do we want release notes for this? Seems like a niche change, but it could be worth mentioning. |
Probably not, it’s a fairly specific internal implementation detail. I would vote “no”, but @losipiuk can weigh in if he disagrees |
Description
Addresses two minor inefficiencies in hive split generation path:
AsyncQueuelock when inserting new elements returned fromAsyncQueue.BorrowResultby acquiring the lock, enqueueing all elements to be inserted, and firing any required notifications once.HiveSplitscanned (when enabled) multiple times and instead only records it once for each file since all consumers only care about the unique paths scanned.Refactoring performance improvement
Refactoring of some core utilities in
trino-hivesplit generation.A non-technical user should not need these changes explained to them.
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: