Do not allow RLE or Dictionary to be nested in an RLE or Dictionary#14092
dain merged 7 commits into trinodb:master
core/trino-main/src/main/java/io/trino/operator/output/Int96PositionsAppender.java
core/trino-main/src/main/java/io/trino/operator/output/RleAwarePositionsAppender.java
core/trino-spi/src/main/java/io/trino/spi/block/RunLengthEncodedBlock.java
core/trino-spi/src/main/java/io/trino/spi/block/DictionaryBlock.java
What's the rationale or expected benefit of this change?
@findepi In most cases it is nonsensical to wrap a performance block in a performance block, because many of these wrappings are really no-ops. For example, an RLE nested in an RLE or in a dictionary is just an RLE, so the nesting only adds an extra level of indirection for no gain. The only extra compute work is the case where a dictionary is unwrapped, because you need to reindex. This case should be rare, since most critical places that create dictionaries are already dictionary-aware (and any that are not, we should make aware). I believe this is well worth the reduced indirection cost and the reduced developer complexity of dealing with deeply nested perf blocks.
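To make the "just an RLE" claim concrete, here is a minimal sketch using toy block types. `Block`, `Values`, `Rle`, `Dict`, `newRle` and `newDictionary` are hypothetical stand-ins invented for illustration; they are not the `io.trino.spi.block` classes and do not reflect how the PR implements the check.

```java
public class RleFlattenSketch {
    // Toy stand-ins for block types (assumption: not the real Trino SPI).
    interface Block {
        int size();
        long get(int position);
    }

    record Values(long[] data) implements Block {
        public int size() { return data.length; }
        public long get(int position) { return data[position]; }
    }

    // value is expected to hold exactly one position, which is repeated positionCount times.
    record Rle(Block value, int positionCount) implements Block {
        public int size() { return positionCount; }
        public long get(int position) { return value.get(0); }
    }

    record Dict(Block dictionary, int[] ids) implements Block {
        public int size() { return ids.length; }
        public long get(int position) { return dictionary.get(ids[position]); }
    }

    // An RLE whose value is itself an RLE carries no extra information: reuse the inner value.
    static Block newRle(Block value, int positionCount) {
        if (value instanceof Rle inner) {
            return new Rle(inner.value(), positionCount);
        }
        return new Rle(value, positionCount);
    }

    // A dictionary over an RLE selects the same value for every id, so the result is an RLE.
    static Block newDictionary(Block dictionary, int[] ids) {
        if (dictionary instanceof Rle inner) {
            return new Rle(inner.value(), ids.length);
        }
        return new Dict(dictionary, ids);
    }

    public static void main(String[] args) {
        Block single = new Values(new long[] {7});
        Block nested = newRle(newRle(single, 1), 4);
        Block overRle = newDictionary(newRle(single, 5), new int[] {4, 0, 2});
        System.out.println(nested);
        System.out.println(overRle);
    }
}
```

Neither flattening case needs to touch the ids, which is why the comment singles out dictionary-in-dictionary as the only case with real extra work.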
Bypass rle when 0 or 1 positions are used.
    return new RunLengthEncodedBlock(rle.getValue(), positionCount);
}

// unwrap dictionary in dictionary
This is not a correct unwrap, because you cannot preserve dictionarySourceId after unnesting.
Take a look at:
topDictBlock1          topDictBlock2
  sourceId: A            sourceId: A
  dictionary:            dictionary:
    nestedDictBlock1       nestedDictBlock2
      sourceId: 1            sourceId: 2
(such a situation happens in a join)
You cannot unwrap it to:
unwrappedDictBlock1    unwrappedDictBlock2
  sourceId: A            sourceId: A
because you cannot, for example, compact them in the same way (as in the compactRelatedBlocks method).
You should assign a new sourceId if unwrapping is done implicitly by the Dictionary constructor.
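The concern can be illustrated with a small sketch of the reindexing step. `composeIds` is a hypothetical helper written for this example, not a method from the PR; the id arrays are made-up data. Two top-level blocks share the same outer ids (hence the same sourceId), but their nested dictionaries have different ids, so the flattened ids no longer agree.

```java
import java.util.Arrays;

public class DictionaryUnwrapSketch {
    // Hypothetical helper: flattens dict(dict(values)) into ids that point
    // directly at the innermost values.
    static int[] composeIds(int[] outerIds, int[] innerIds) {
        int[] newIds = new int[outerIds.length];
        for (int i = 0; i < outerIds.length; i++) {
            newIds[i] = innerIds[outerIds[i]];
        }
        return newIds;
    }

    public static void main(String[] args) {
        int[] sharedOuterIds = {1, 1, 0}; // both top blocks use these ids (sourceId A)
        int[] nestedIds1 = {2, 0, 1};     // nestedDictBlock1 (sourceId 1)
        int[] nestedIds2 = {0, 1, 2};     // nestedDictBlock2 (sourceId 2)

        int[] unwrapped1 = composeIds(sharedOuterIds, nestedIds1);
        int[] unwrapped2 = composeIds(sharedOuterIds, nestedIds2);

        System.out.println(Arrays.toString(unwrapped1));
        System.out.println(Arrays.toString(unwrapped2));
        System.out.println("ids still identical: " + Arrays.equals(unwrapped1, unwrapped2));
    }
}
```

Because the unwrapped ids differ, labelling both results with sourceId A would tell downstream code that they can be treated as related blocks when they cannot.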
How can topDictBlock1 and topDictBlock2 have the same sourceId if the underlying dictionaries are different?
Quoting: "How can topDictBlock1 and topDictBlock2 have the same sourceId if the underlying dictionaries are different?"
The same sourceId implies the same ids, but it is stronger than that: two DictionaryBlocks might have the same ids coincidentally and still have different sourceIds.
If you have columns of
page=[
dictA(source: S, ids: X, dict: nestedA),
dictB(source: S, ids: X, dict: nestedB),
dictC(source: S, ids: X, dict: nestedC)]
then you essentially process it like:
page=MultiChannelDict(
source:S,
ids:X
dict=[nestedA, nestedB, nestedC])
I hope this analogy makes it clearer
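A rough sketch of the analogy follows, assuming a toy `DictColumn` record and a hypothetical `compactTogether` helper (loosely inspired by the compactRelatedBlocks method mentioned earlier, but not its actual code). Since a shared sourceId guarantees identical ids, the id remapping is computed once and then applied to every column's dictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedSourceSketch {
    // Toy column; sourceId is a plain label here, not Trino's DictionaryId.
    record DictColumn(String sourceId, int[] ids, String[] dictionary) {}

    static List<DictColumn> compactTogether(List<DictColumn> columns, String newSourceId) {
        DictColumn first = columns.get(0);
        for (DictColumn column : columns) {
            if (!column.sourceId().equals(first.sourceId())) {
                throw new IllegalArgumentException("columns do not share a sourceId");
            }
        }

        // Computed once for the whole page: old id -> dense new id, in first-use order.
        Map<Integer, Integer> remap = new HashMap<>();
        int[] newIds = new int[first.ids().length];
        for (int i = 0; i < newIds.length; i++) {
            Integer mapped = remap.get(first.ids()[i]);
            if (mapped == null) {
                mapped = remap.size();
                remap.put(first.ids()[i], mapped);
            }
            newIds[i] = mapped;
        }

        // Applied per column: only the dictionary differs, the ids array is shared.
        List<DictColumn> result = new ArrayList<>();
        for (DictColumn column : columns) {
            String[] compacted = new String[remap.size()];
            for (Map.Entry<Integer, Integer> entry : remap.entrySet()) {
                compacted[entry.getValue()] = column.dictionary()[entry.getKey()];
            }
            result.add(new DictColumn(newSourceId, newIds, compacted));
        }
        return result;
    }
}
```

The helper never looks at the ids of the second or third column, which is only safe because the sourceId promises they are the same.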
 * This should not only be used when creating a projection of another dictionary block.
 */
public DictionaryBlock(int positionCount, Block dictionary, int[] ids, DictionaryId dictionarySourceId)
public static Block createProjectedDictionaryBlock(int positionCount, Block dictionary, int[] ids, DictionaryId dictionarySourceId)
This method is very similar to DictionaryBlock#getPositions when it unwraps dictionaries. However, getPositions has more optimizations, such as taking compactness into account and evaluating uniqueIds.
These optimizations improve serialization, for example.

// unwrap dictionary in dictionary
if (dictionary instanceof DictionaryBlock dictionaryBlock) {
    int[] newIds = new int[positionCount];
Unnesting is not necessarily beneficial without looking at the context. For example, consider a join:
left_col1 | left_col2 | right_col1
===================================
row42 | row42 | rightRow1
row42 | row42 | rightRow2
...
row42 | row42 | rightRowN
row42 from the probe side is repeated N times. Right now in the join we use a dictionary (getPosition) to avoid copying row42 N times. This means that the dictionaryId for blocks left_col1 and left_col2 can be the same.
If there is a dictionary-aware operator on left_col1 and left_col2, then because both have the same dictionaryId we can process the dictionary once rather than N times.
However, if you always unnest dictionaries, then you have to drop the common dictionaryId for left_col1 and left_col2 in this method (see #14092 (comment)).
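As a sketch of what "process it once rather than N times" could look like for the join example above: the `Dict` record and `combine` helper below are hypothetical, and the real dictionary-aware operators in Trino are structured differently. When both inputs carry the same source id, the function is evaluated per dictionary entry; otherwise it falls back to one evaluation per position.

```java
import java.util.function.LongBinaryOperator;

public class DictionaryAwareSketch {
    // Toy dictionary column; sourceId is an illustrative label for the origin of ids.
    record Dict(String sourceId, long[] dictionary, int[] ids) {
        long get(int position) { return dictionary[ids[position]]; }
    }

    static Dict combine(Dict left, Dict right, LongBinaryOperator function) {
        if (left.sourceId().equals(right.sourceId())) {
            // Same source id: ids are identical, so evaluate once per dictionary entry.
            long[] out = new long[left.dictionary().length];
            for (int i = 0; i < out.length; i++) {
                out[i] = function.applyAsLong(left.dictionary()[i], right.dictionary()[i]);
            }
            return new Dict(left.sourceId(), out, left.ids());
        }
        // Unrelated inputs: evaluate once per position.
        int positions = left.ids().length;
        long[] out = new long[positions];
        int[] ids = new int[positions];
        for (int i = 0; i < positions; i++) {
            out[i] = function.applyAsLong(left.get(i), right.get(i));
            ids[i] = i;
        }
        return new Dict("fresh", out, ids);
    }
}
```

With a one-entry dictionary repeated N times, the shared-source path calls the function once, while the fallback path calls it N times. That difference is the cost the reviewer is pointing at if unnesting forces a fresh dictionaryId per column.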
This seems like a very rare scenario compared to the benefits of reduced complexity and of being able to avoid megamorphic calls in certain places (all part of the effort tracked under #14237).
Description
Simplify RLE and Dictionary blocks by not allowing the nested block to be an RLE or Dictionary block.
When an RLE or Dictionary block has zero or one positions, return getRegion over the nested block instead.
Release notes
( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: