Skip to content

Flush PagePartitioner dictionaries before release#19806

Merged
pettyjamesm merged 1 commit intotrinodb:masterfrom
pettyjamesm:flush-partitioner-dictionaries
Dec 15, 2023
Merged

Flush PagePartitioner dictionaries before release#19806
pettyjamesm merged 1 commit intotrinodb:masterfrom
pettyjamesm:flush-partitioner-dictionaries

Conversation

@pettyjamesm
Copy link
Copy Markdown
Member

@pettyjamesm pettyjamesm commented Nov 17, 2023

Description

PagePartitioner instances should either flatten their dictionary-mode appenders and transition to direct mode, or force their current page to preserve the dictionary encoding and force the current page to be flushed being released to the pool for reuse since it is not possible for the appender to observe the same dictionary input when reused by a different driver.

Also fixes an issue where dictionary encoded output pages significantly under-report their output size for PartitionedOutputOperator.

Additional context and related issues

Follows up from #19762 which mitigates similar issues for RLE blocks, but using a different strategy that's more appropriate for dictionary block handling.

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only dictionaries and not RLEs too?

nit: Generally, I've been thinking that we could have something like:

boolean UnnestingPositionsAppender#offer(io.trino.spi.block.Block source)

which would return false if appender would be flattened. This way we could flush RLE or dictionaries even if full page was not collected.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only dictionaries and not RLEs too?

I thought about doing that but decided against it because in theory, RLE's could see the same value on the next input from another driver after reuse whereas it's not possible for the same to occur for dictionaries- and we already have the "direct size limit if flattened" logic for RLE's too which somewhat mitigates the worst case of what will happen if we don't see the same RLE value when reused.

nit: Generally, I've been thinking that we could have something like:

boolean UnnestingPositionsAppender#offer(io.trino.spi.block.Block source)

which would return false if appender would be flattened. This way we could flush RLE or dictionaries even if full page was not collected.

I gave that some thought, but decided against it because flushing the current page instead of flattening is something that you would have to check for each column on every input and could result in constant small page flushes due to single columns. This approach is a compromise in that we're only going to force a flush at the end of a driver processing and not on each input.

@pettyjamesm pettyjamesm force-pushed the flush-partitioner-dictionaries branch from 87affd5 to 7a560bf Compare November 22, 2023 15:25
@pettyjamesm pettyjamesm marked this pull request as ready for review November 22, 2023 15:25
@pettyjamesm pettyjamesm requested a review from sopel39 November 22, 2023 16:44
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems wrong. Both should already be accounted for in partitionPage. Also flushing would make after size smaller than before. partitionPage specifically accounts before flushing because flushing can happen at later time.
PTAL @lukasz-stec

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. This is already accounted for as part of the partitioning operation, sort of. In the case of dictionaries we know the results are very incorrect because of the way that the appender size is calculated, so I guess not reporting the result of any dictionary to direct flattenings that might occur here doesn't really make the situation any worse than it already is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a result of transitioning dictionaries to direct mode

I think we need to simplify this. It's really odd that shouldForceFlushBeforeRelease() has a side effect of flattening block (better name would be flushOrFlattenIfNeccecery).

But I think it would be better to call flatten here explicitly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separated shouldForceFlushBeforeRelease() from flattenPendingDictionary() and called those as appropriate here.

@pettyjamesm pettyjamesm force-pushed the flush-partitioner-dictionaries branch 2 times, most recently from f03e732 to ffc4f6a Compare November 27, 2023 17:21
@pettyjamesm pettyjamesm requested a review from sopel39 November 27, 2023 17:26
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why to do it here? It can be lazily done by next operator instance. This will cause output tracking issues as flattening might increase outputSizeInBytes, which is correctly handled by io.trino.operator.output.PagePartitioner#partitionPage

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When done lazily, we are guaranteed to know that all dictionary mode appenders will be forced to flatten. It will correctly report the output size in that situation, but at the cost of losing the dictionary representation (if profitable).

Another thing to note, is that when dictionary appenders flush without flattening, the output size is incorrect both for partitionPage and this new method. PartitionedOutputOperator stats are always incorrect in the presence of dictionary mode outputs. I think I can fix this at the same time, but it will need to reorganize the output bytes reporting logic.

Copy link
Copy Markdown
Member

@sopel39 sopel39 Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When done lazily, we are guaranteed to know that all dictionary mode appenders will be forced to flatten. It will correctly report the output size in that situation, but at the cost of losing the dictionary representation (if profitable).

If the logic is extracted to separate flattenStuff method, then we can measure size "correctly" and don't loose next dictionary, e.g: with code like:

long positionsAppendersSizeBefore = getPositionsAppendersSizeInBytes();
flattenStuff()
long positionsAppendersSizeAfter = getPositionsAppendersSizeInBytes();
operatorContext.recordOutput(positionsAppendersSizeAfter - positionsAppendersSizeBefore, 0 /* no new positions */);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works specifically for this prepare release approach, but doesn't fix the general problem when partitionPage flushes dictionaries since those don't get flattened and never had their actual size accounted for. I think I can fix both in this PR, will publish a new revision shortly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

positionsAppender.shouldForceFlushBeforeRelease() should be in separate loop as only some later position appender could return shouldForceFlushBeforeRelease()==true. However, in that case previous appenders would already be flattened

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those dictionaries are going to end up flattened anyways during compaction as part of serialization, but this does relate to the correctness of the output bytes reported (technically, this is more correct but still wrong because all dictionary outputs are reported incorrectly).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR with the new logic. It's currently implemented with a hard assertion that we never incorrectly report the output size so that I can get validation from the CI runs, but that's probably too aggressive of a check to keep in the final version.

@pettyjamesm pettyjamesm force-pushed the flush-partitioner-dictionaries branch 4 times, most recently from 986b806 to 94ff442 Compare December 1, 2023 22:13
@pettyjamesm pettyjamesm requested a review from sopel39 December 4, 2023 16:00
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must say I don't understand this logic. Could we make it more straight-forward? Why outputSizeInBytes would be smaller than outputSizeReportedBeforeRelease?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't but this is defensive code that ensures that we never allow the "eagerly reported" quantity to go negative.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would outputSizeReportedBeforeRelease go down?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because outputSizeReportedBeforeRelease represents output bytes that have already been reported in advance of the page being flushed as a result of the partitioner being released by a previous operator. In this logic we're "consuming" from the eagerly reported size since a page flush has occurred.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen if this is not zeroed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably nothing, this is just precautionary to meet expected semantics for what close() should do.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need outputSizeReportedBeforeRelease as output page (flattened or not) is going to be produced one way or the other eventually? Why can't we just track page.getSizeInBytes() when the page is finally enqueued?

I feel like on one hand we are accounting size from produced pages, but also try to accommodate buffered pages somehow into that

With that in mind I feel like:

before = getPositionsAppendersSizeInBytes();
doSomething()
after = getPositionsAppendersSizeInBytes()
long outputSize = after - before

was easier to grasp

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just track page.getSizeInBytes() when the page is finally enqueued?

Because the final page flush / enqueue could occur outside of the context of an operator to report the size against (as a result of PagePartitionerPool and reuse). That was the reason why the logic was previously changed to eagerly report buffered bytes before flushing, but critically- that approach would incorrectly report output size for all dictionary encoded buffers because the size they report during getPositionsAppendersSizeInBytes() is not the actual size.

The previous logic may have seemed simpler, but was incorrect hence the need to fix it. The new logic can be summarized as:

  1. Report outputSizeInBytes from produced Page#getSizeInBytes() when pages are flushed, just like was previously done before partitioner reuse and just like TaskOutputOperator does.
  2. When releasing a partitioner for reuse (last chance where we are guaranteed to have an OperatorContext to report against) we report Page#getSizeInBytes() for any flushed pages, flatten any dictionary encoded appenders that didn't flush, and then report the appender buffered size in bytes eagerly (in case we don't have another chance to report that size in the future because the partitioner is not reused). Since we no longer have any dictionary encoded appenders at this point, we know that the reported buffer size accurately predicts the size of the page that would be flushed if no additional insertions occur which was not before.
  3. In the case where the partitioner is reused, and new positions are appended- we have already accounted for the buffered size on the last release. We need to subtract that amount before reporting any additional output bytes after reuse.

@pettyjamesm pettyjamesm force-pushed the flush-partitioner-dictionaries branch from 94ff442 to 8b720cd Compare December 14, 2023 19:04
@pettyjamesm pettyjamesm force-pushed the flush-partitioner-dictionaries branch from 8b720cd to e248333 Compare December 14, 2023 21:52
PagePartitioners should either flatten their dictionary mode appenders
into direct mode, or force flush their current page to preserve the
dictionary encoding before being released to the pool for reuse since it
is not possible for the appender to observe the same dictionary input
when used from a different driver and the dictionary appenders do not
accurately report their size and therefore may have been preventing a
flush from occurring up until this point.

Also fixes an issue where dictionary block outputs were severely under
reporting their output size in bytes.
@pettyjamesm pettyjamesm force-pushed the flush-partitioner-dictionaries branch from e248333 to 17045ed Compare December 14, 2023 23:27
@pettyjamesm pettyjamesm merged commit a56752e into trinodb:master Dec 15, 2023
@pettyjamesm pettyjamesm deleted the flush-partitioner-dictionaries branch December 15, 2023 00:44
@github-actions github-actions bot added this to the 436 milestone Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

2 participants