Flush PagePartitioner dictionaries before release #19806

pettyjamesm merged 1 commit into trinodb:master from
Conversation
why only dictionaries and not RLEs too?
nit: Generally, I've been thinking that we could have something like:

```java
boolean UnnestingPositionsAppender#offer(io.trino.spi.block.Block source)
```

which would return `false` if the appender would be flattened. This way we could flush RLEs or dictionaries even if a full page was not collected.
> why only dictionaries and not RLEs too?
I thought about doing that but decided against it because, in theory, an RLE could see the same value on the next input from another driver after reuse, whereas the same cannot occur for dictionaries. We also already have the "direct size limit if flattened" logic for RLEs, which somewhat mitigates the worst case of what happens if we don't see the same RLE value when reused.
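To illustrate the distinction being made here, a toy sketch (hypothetical names, not Trino's actual classes): dictionary mode depends on seeing the *same dictionary instance* again, which a different driver can never supply, while RLE mode only needs the buffered *value* to repeat, which can happen across drivers by chance.

```java
import java.util.Objects;

final class AppenderReuseSketch {
    // Dictionary mode relies on identity of the source dictionary object.
    // A new driver supplies a different dictionary instance, so the appender
    // must flatten (or flush) before reuse.
    static boolean canKeepDictionary(Object bufferedDictionary, Object nextDictionary) {
        return bufferedDictionary == nextDictionary; // identity, not equality
    }

    // RLE mode only needs the value itself to repeat, which can happen even
    // across drivers.
    static boolean canKeepRle(Object bufferedValue, Object nextValue) {
        return Objects.equals(bufferedValue, nextValue);
    }
}
```

Two equal-but-distinct dictionary instances fail the identity check and force a flatten, while the same values arriving as an RLE can still extend the run.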
> nit: Generally, I've been thinking that we could have something like:
>
> ```java
> boolean UnnestingPositionsAppender#offer(io.trino.spi.block.Block source)
> ```
>
> which would return `false` if the appender would be flattened. This way we could flush RLEs or dictionaries even if a full page was not collected.
I gave that some thought, but decided against it because flushing the current page instead of flattening is something you would have to check for each column on every input, and it could result in constant small page flushes triggered by single columns. This approach is a compromise: we only force a flush at the end of a driver's processing, not on each input.
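The tradeoff described above can be sketched as follows (hypothetical names, not the PR's actual code): flushing whenever any single column's appender would flatten can trigger many small flushes, while the release-time approach forces at most one.

```java
import java.util.function.IntPredicate;

final class FlushStrategySketch {
    // Strategy A (rejected): flush the whole page whenever any column's
    // appender would flatten on this input. anyColumnWouldFlatten decides
    // per input index.
    static int flushesPerInput(int inputs, IntPredicate anyColumnWouldFlatten) {
        int flushes = 0;
        for (int i = 0; i < inputs; i++) {
            if (anyColumnWouldFlatten.test(i)) {
                flushes++; // small page flushed early because of a single column
            }
        }
        return flushes;
    }

    // Strategy B (this PR): at most one forced flush, when the driver
    // releases the partitioner back to the pool.
    static int flushesAtRelease() {
        return 1;
    }
}
```

If, say, half of the inputs would flatten some column, strategy A flushes on every one of them, while strategy B still flushes exactly once.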
Force-pushed 87affd5 to 7a560bf
That seems wrong. Both should already be accounted for in partitionPage. Also, flushing would make the "after" size smaller than the "before" size. partitionPage specifically accounts before flushing because the flush can happen at a later time.
PTAL @lukasz-stec
I think you're right. This is already accounted for as part of the partitioning operation, more or less. In the case of dictionaries, we know the results are very incorrect because of the way the appender size is calculated, so not reporting the result of any dictionary-to-direct flattenings that might occur here doesn't really make the situation any worse than it already is.
> as a result of transitioning dictionaries to direct mode

I think we need to simplify this. It's really odd that shouldForceFlushBeforeRelease() has the side effect of flattening the block (a better name would be flushOrFlattenIfNecessary). But I think it would be better to call flatten here explicitly.
Separated shouldForceFlushBeforeRelease() from flattenPendingDictionary() and called those as appropriate here.
core/trino-main/src/main/java/io/trino/operator/output/UnnestingPositionsAppender.java
Force-pushed f03e732 to ffc4f6a
Why do it here? It can be done lazily by the next operator instance. This will cause output tracking issues, as flattening might increase outputSizeInBytes, which is correctly handled by io.trino.operator.output.PagePartitioner#partitionPage.
When done lazily, we are guaranteed to know that all dictionary-mode appenders will be forced to flatten. It will correctly report the output size in that situation, but at the cost of losing the dictionary representation (where keeping it would have been profitable).

Another thing to note is that when dictionary appenders flush without flattening, the output size is incorrect for both partitionPage and this new method. PartitionedOutputOperator stats are always incorrect in the presence of dictionary-mode outputs. I think I can fix this at the same time, but it will require reorganizing the output-bytes reporting logic.
> When done lazily, we are guaranteed to know that all dictionary mode appenders will be forced to flatten. It will correctly report the output size in that situation, but at the cost of losing the dictionary representation (if profitable).
If the logic is extracted to a separate flattenStuff method, then we can measure the size "correctly" and don't lose the next dictionary, e.g. with code like:

```java
long positionsAppendersSizeBefore = getPositionsAppendersSizeInBytes();
flattenStuff();
long positionsAppendersSizeAfter = getPositionsAppendersSizeInBytes();
operatorContext.recordOutput(positionsAppendersSizeAfter - positionsAppendersSizeBefore, 0 /* no new positions */);
```
That works specifically for this prepare-for-release approach, but it doesn't fix the general problem when partitionPage flushes dictionaries, since those don't get flattened and never had their actual size accounted for. I think I can fix both in this PR; will publish a new revision shortly.
positionsAppender.shouldForceFlushBeforeRelease() should be in a separate loop, as only some later positions appender could return shouldForceFlushBeforeRelease() == true. However, in that case the previous appenders would already be flattened.
Those dictionaries are going to end up flattened anyway during compaction as part of serialization, but this does relate to the correctness of the output bytes reported (technically, this is more correct but still wrong, because all dictionary outputs are reported incorrectly).
Updated the PR with the new logic. It's currently implemented with a hard assertion that we never incorrectly report the output size, so that I can get validation from the CI runs, but that's probably too aggressive a check to keep in the final version.
Force-pushed 986b806 to 94ff442
I must say I don't understand this logic. Could we make it more straightforward? Why would outputSizeInBytes be smaller than outputSizeReportedBeforeRelease?
It shouldn't be, but this is defensive code that ensures we never allow the "eagerly reported" quantity to go negative.
Why would outputSizeReportedBeforeRelease go down?
Because outputSizeReportedBeforeRelease represents output bytes that have already been reported in advance of the page being flushed, as a result of the partitioner being released by a previous operator. In this logic we're "consuming" from the eagerly reported size, since a page flush has occurred.
What would happen if this is not zeroed?
Presumably nothing; this is just precautionary, to meet the expected semantics of what close() should do.
Do we actually need outputSizeReportedBeforeRelease, given that the output page (flattened or not) is going to be produced one way or the other eventually? Why can't we just track page.getSizeInBytes() when the page is finally enqueued?

I feel like on one hand we are accounting size from produced pages, but also trying to accommodate buffered pages somehow into that. With that in mind, I feel like:

```java
long before = getPositionsAppendersSizeInBytes();
doSomething();
long after = getPositionsAppendersSizeInBytes();
long outputSize = after - before;
```

was easier to grasp.
> Why can't we just track page.getSizeInBytes() when the page is finally enqueued?
Because the final page flush/enqueue could occur outside the context of an operator to report the size against (as a result of PagePartitionerPool and reuse). That was the reason the logic was previously changed to eagerly report buffered bytes before flushing, but critically, that approach would incorrectly report the output size for all dictionary-encoded buffers, because the size they report during getPositionsAppendersSizeInBytes() is not the actual size.

The previous logic may have seemed simpler, but it was incorrect, hence the need to fix it. The new logic can be summarized as:

- Report outputSizeInBytes from the produced Page#getSizeInBytes() when pages are flushed, just as was done before partitioner reuse existed and just as TaskOutputOperator does.
- When releasing a partitioner for reuse (the last chance where we are guaranteed to have an OperatorContext to report against), we report Page#getSizeInBytes() for any flushed pages, flatten any dictionary-encoded appenders that didn't flush, and then eagerly report the appenders' buffered size in bytes (in case we don't get another chance to report that size because the partitioner is not reused). Since we no longer have any dictionary-encoded appenders at this point, we know that the reported buffer size accurately predicts the size of the page that would be flushed if no additional insertions occur, which was not the case before.
- In the case where the partitioner is reused and new positions are appended, we have already accounted for the buffered size on the last release, so we need to subtract that amount before reporting any additional output bytes after reuse.
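A minimal sketch of that accounting, under assumed names (the real logic lives in io.trino.operator.output.PagePartitioner): bytes reported eagerly at release are "consumed" when a later flush finally reports the real page size, so the same bytes are never reported twice and the reported delta can never go negative.

```java
final class OutputSizeAccounting {
    // Bytes already reported eagerly when the partitioner was released to the pool.
    private long outputReportedBeforeRelease;

    // Called when a page is actually flushed: report its size, minus anything
    // already reported eagerly at a previous release.
    long onPageFlushed(long pageSizeInBytes) {
        long alreadyReported = Math.min(outputReportedBeforeRelease, pageSizeInBytes);
        outputReportedBeforeRelease -= alreadyReported;
        return pageSizeInBytes - alreadyReported; // newly reported bytes, never negative
    }

    // Called when the partitioner is released to the pool: after flattening
    // the dictionary appenders, the buffered size accurately predicts the
    // eventual page size, so report it eagerly.
    long onRelease(long bufferedSizeInBytes) {
        outputReportedBeforeRelease += bufferedSizeInBytes;
        return bufferedSizeInBytes;
    }
}
```

For example, releasing with 100 buffered bytes reports 100 eagerly; if the reused partitioner later flushes a 150-byte page, only the additional 50 bytes are reported, and subsequent flushes report their full size again.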
Force-pushed 94ff442 to 8b720cd
Force-pushed 8b720cd to e248333
PagePartitioners should either flatten their dictionary-mode appenders into direct mode, or force-flush their current page to preserve the dictionary encoding, before being released to the pool for reuse. This is because the appender cannot observe the same dictionary input when used from a different driver, and because the dictionary appenders do not accurately report their size and therefore may have been preventing a flush from occurring up until this point. Also fixes an issue where dictionary block outputs were severely under-reporting their output size in bytes.
Force-pushed e248333 to 17045ed
Description

PagePartitioner instances should either flatten their dictionary-mode appenders and transition to direct mode, or force-flush their current page to preserve the dictionary encoding, before being released to the pool for reuse, since it is not possible for the appender to observe the same dictionary input when reused by a different driver.

Also fixes an issue where dictionary-encoded output pages significantly under-report their output size for PartitionedOutputOperator.

Additional context and related issues
Follows up on #19762, which mitigated similar issues for RLE blocks; this PR uses a different strategy that's more appropriate for dictionary block handling.
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: