Skip to content

Do not allow RLE or Dictionary to be nested in an RLE or Dictionary#14092

Merged
dain merged 7 commits intotrinodb:masterfrom
dain:block-cleanup
Sep 18, 2022
Merged

Do not allow RLE or Dictionary to be nested in an RLE or Dictionary#14092
dain merged 7 commits intotrinodb:masterfrom
dain:block-cleanup

Conversation

@dain
Copy link
Copy Markdown
Member

@dain dain commented Sep 11, 2022

Description

Simplify RLE and Dictionary blocks by not allowing the nested block to be an RLE or Dictionary block.
When an RLE or Dictionary block is zero or one positions, return getRegion over the nested block instead.

Release notes

( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# SPI
* Replace DictionaryBlock constructors with factory method. ({issue}`14092`)
* Replace RunLengthEncodedBlock constructors with factory method. ({issue}`14092`)

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 17, 2022

What's the rationale or expected benefit of this change?

@dain
Copy link
Copy Markdown
Member Author

dain commented Sep 17, 2022

@findepi in most cases it is nonsensical to wrap a performance block in a performance block, because many of these are really noops. For example, and RLE in and RLE or a dictionary, is just an RLE, so this just adds an extra level of indirection for no gain. The only compute extra work here is the case where a dictionary is unwrapped because you need to reindex. This case should be rare since most critical places that create dictionaries are already dictionary aware (and any not, we should make aware), and I believe this is well worth the reduced indirection cost, and the developer complexity of dealing with deep nested perf blocks.

@dain dain merged commit 52eb874 into trinodb:master Sep 18, 2022
@github-actions github-actions bot added this to the 397 milestone Sep 18, 2022
@dain dain deleted the block-cleanup branch September 18, 2022 16:35
return new RunLengthEncodedBlock(rle.getValue(), positionCount);
}

// unwrap dictionary in dictionary
Copy link
Copy Markdown
Member

@sopel39 sopel39 Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a correct unwrap as you cannot preserve dictionarySourceId after unnest.

Take a look at:

topDictBlock1        topDictBlock2
  sourceId:A             sourceId:A
  dictionary:            dictionary:
    nestedDictBlock1       nestedDictBlock2
      sourceId:1             sourceId:2

(such situation happens at join)

You cannot unwrap it to:

unwrappedDictBlock1 unwrappedDictBlock2
  sourceId:A          sourceId:A

as you cannot for example compact them in same way (as in compactRelatedBlocks method)

You should assign a new sourceId if unwrapping is done implicitly by Dictionary constructor

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can topDictBlock1 and topDictBlock2 have the same sourceId if the underlying dictionaries are different?

Copy link
Copy Markdown
Member

@sopel39 sopel39 Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can topDictBlock1 and topDictBlock2 have the same sourceId if the underlying dictionaries are different?

same sourceId implies same ids, but it’s stronger than that. Two DictionaryBlocks might have same ids coincidentally, but different sourceIds
If you have columns of

page=[
  dictA(source: S, ids: X, dict: nestedA),
  dictB(source: S, ids: X, dict: nestedB),
  dictC(source: S, ids: X, dict: nestedC)]

then you essentially process it like:

page=MultiChannelDict(
  source:S,
  ids:X
  dict=[nestedA, nestedB, nestedC])

I hope this analogy makes it clearer

* This should not only be used when creating a projection of another dictionary block.
*/
public DictionaryBlock(int positionCount, Block dictionary, int[] ids, DictionaryId dictionarySourceId)
public static Block createProjectedDictionaryBlock(int positionCount, Block dictionary, int[] ids, DictionaryId dictionarySourceId)
Copy link
Copy Markdown
Member

@sopel39 sopel39 Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is very similar to DictionaryBlock#getPositions when it unwraps dictionaries. Yet getPositions has more optimizations like taking compactness into account or evaluating uniqueIds.
These optimizations improve serialization for example


// unwrap dictionary in dictionary
if (dictionary instanceof DictionaryBlock dictionaryBlock) {
int[] newIds = new int[positionCount];
Copy link
Copy Markdown
Member

@sopel39 sopel39 Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnesting is not neccecerly always beneficial without looking at context, e.g: consider join:

left_col1 | left_col2 | right_col1
===================================
    row42 |     row42 |     rightRow1
    row42 |     row42 |     rightRow2
    ...
    row42 |     row42 |     rightRowN

row42 from probe is repeated N times. Right now in join we will use dictionary (getPosition) to avoid copying row42 N times. This means that dictionaryId for blocks left_col1 and left_col2 can be same.

If there is now dictionary aware operator on left_col1, left_col2, then because left_col1, left_col2 have same dictionaryId we can process it once rather than N times.

However, if you unnest dictionary always, then you have to drop common dictionaryId for left_col1, left_col2 in this method (see #14092 (comment))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very rare scenario compared to the benefits due to reduced complexity and being able to avoid megamorphic calls in certain places (all part of the effort tracked under #14237)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants