Support async commit for ExchangeSink by arhimondr · Pull Request #10699 · trinodb/trino

arhimondr · 2022-01-20T01:03:07Z

This came up during a discussion with @linzebing . It looks like currently the noMorePages and destroy methods could be called from a tiny thread pool designed to handle lightweight task notifications, for example:

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L640
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L568
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L602

Where the notificationExecutor is shared between all tasks and by default only has 5 threads in the pool: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java#L79

The commit operation on ExchangeSink can be quite time-consuming (it may need to flush existing buffers, create files, and so on), so it seems better to provide a non-blocking ExchangeSink interface.
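
To make the motivation concrete, here is a minimal sketch of a commit that returns a future instead of blocking the calling thread. SketchSink, gatedSink and the latch-based "slow commit" are assumptions made for illustration; they are not the Trino SPI, and the real ExchangeSink signatures may differ.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; SketchSink is a hypothetical stand-in, not the Trino SPI.
class AsyncCommitSketch
{
    // A sink whose commit hands back a future instead of blocking the caller.
    interface SketchSink
    {
        CompletableFuture<Void> finish();
    }

    // Builds a sink whose commit runs on the sink's own executor and cannot
    // complete until the gate is opened (the gate stands in for slow I/O).
    static SketchSink gatedSink(ExecutorService ioExecutor, CountDownLatch gate)
    {
        return () -> CompletableFuture.runAsync(() -> {
            try {
                gate.await();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }, ioExecutor);
    }

    public static void main(String[] args)
    {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch gate = new CountDownLatch(1);
        try {
            // The "notification" thread (here: main) gets control back immediately.
            CompletableFuture<Void> commit = gatedSink(ioExecutor, gate).finish();
            System.out.println("done right after finish(): " + commit.isDone());
            gate.countDown(); // let the slow commit proceed
            commit.join();    // only the demo waits; a notification thread would not
            System.out.println("done after join(): " + commit.isDone());
        }
        finally {
            ioExecutor.shutdown();
        }
    }
}
```

The point of the shape is that the small shared pool only pays for scheduling the commit, while the sink decides where the heavy work runs.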

I will update the commit message.

losipiuk · 2022-01-20T11:26:45Z

Can you provide some rationale? Would be nice to have it in commit message anyway.

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

losipiuk · 2022-01-20T11:34:03Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

It looks like we are not using the boolean stateChanged return values most of the time. Would it make sense to return void from methods where we do not care about the returned value?

We need a boolean for most of the methods. The return value is not used for noMoreBuffers and fail. But I thought it might be better to be consistent with other methods.

losipiuk · 2022-01-20T11:45:12Z

core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java

What about ABORTED? Why is it not expected here? Worth a comment?

core/trino-main/src/main/java/io/trino/execution/buffer/BufferState.java

losipiuk

LGTM

arhimondr · 2022-01-21T17:03:16Z

Can you provide some rationale? Would be nice to have it in commit message anyway.

This came up during a discussion with @linzebing . It looks like currently the noMorePages and destroy methods could be called from a tiny thread pool designed to handle lightweight task notifications, for example:

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L640
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L568
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L602

Where the notificationExecutor is shared between all tasks and by default only has 5 threads in the pool: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java#L79

The commit operation on ExchangeSink can be quite time-consuming (it may need to flush existing buffers, create files, and so on), so it seems better to provide a non-blocking ExchangeSink interface.

I will update the commit message.

losipiuk

@martint you may want to look at changes in BufferState

losipiuk · 2022-01-21T18:10:41Z

@martint you may want to look at changes in BufferState

or @sopel39 / @findepi maybe :)

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/BroadcastOutputBuffer.java

sopel39 · 2022-01-25T13:42:16Z

core/trino-main/src/main/java/io/trino/execution/buffer/BroadcastOutputBuffer.java

This should probably be BufferState state = stateMachine.get
and then you should perform the check. Otherwise the state could move from NO_MORE_BUFFERS to FLUSHING between the stateMachine.getState() calls, which seems racy.

Yeah, it does seem weird. I also thought about that. I don't know exactly why it is implemented this way. At the end of the day I decided not to touch it and to keep the change as close to mechanical as possible.

Still, I think this should be fixed (in a separate commit). I can imagine the state transitioning from NO_MORE_BUFFERS to FLUSHING and this method destroying buffers.

It's been like that for a very long time. It doesn't seem to be likely that the implementation is incorrect. But I agree, it's super confusing. Let me add a commit that simplifies it.
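
The concern in this thread can be sketched as follows. This is a hypothetical reduction, not the actual checkFlushComplete code: the State enum, the method names and the scripted reader (which simulates another thread moving the state between two reads) are all assumptions.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical reduction of the "read the state once" suggestion; not Trino code.
class SingleStateReadSketch
{
    enum State { OPEN, NO_MORE_BUFFERS, FLUSHING, FINISHED }

    // Reads the state twice; a transition between the reads can make both comparisons fail.
    static boolean inFlushPhaseTwoReads(Supplier<State> stateReader)
    {
        return stateReader.get() == State.FLUSHING || stateReader.get() == State.NO_MORE_BUFFERS;
    }

    // Reads the state once and checks that single snapshot.
    static boolean inFlushPhaseOneRead(Supplier<State> stateReader)
    {
        State state = stateReader.get();
        return state == State.FLUSHING || state == State.NO_MORE_BUFFERS;
    }

    // Simulates a concurrent NO_MORE_BUFFERS -> FLUSHING transition between successive reads.
    static Supplier<State> transitioningReader()
    {
        Iterator<State> reads = List.of(State.NO_MORE_BUFFERS, State.FLUSHING).iterator();
        return reads::next;
    }

    public static void main(String[] args)
    {
        System.out.println(inFlushPhaseTwoReads(transitioningReader())); // prints false
        System.out.println(inFlushPhaseOneRead(transitioningReader()));  // prints true
    }
}
```

With two reads, the check misses a buffer that was in one of the two expected states the whole time; with one read, the decision is made against a consistent snapshot.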

core/trino-main/src/main/java/io/trino/execution/buffer/LazyOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/PartitionedOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

sopel39 · 2022-01-25T14:11:21Z

core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java

In this scenario the following statement is expected to be noop.

why? because task is aborted so this line should never execute?

Failing an aborted task is a noop, as the ABORTED state is a terminal state.

sopel39 · 2022-01-25T14:13:24Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

verify that failureCause is not overwritten (it is null)?

This method is allowed to be called multiple times (similar to how it is implemented in other state machines). The contract is that the method has to preserve only the first failure that made the transition.

Could you add a comment: the method has to preserve only the first failure that made the transition.?

The code seems to be self explanatory and aligns with what is done in other state machines. How strongly do you feel about having an explicit comment here?
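
For readers unfamiliar with the pattern being discussed, a minimal sketch of "keep only the first failure" might look like this. The class, its three states and the cause-before-transition ordering are assumptions for illustration, not the actual OutputBufferStateMachine.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical state machine; not the actual OutputBufferStateMachine.
class FirstFailureSketch
{
    enum State { OPEN, FAILED, ABORTED }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);
    private final AtomicReference<Throwable> failureCause = new AtomicReference<>();

    // May be called many times; only the first cause is kept.
    // Returns true only for the call that actually moved the state to FAILED.
    boolean fail(Throwable cause)
    {
        failureCause.compareAndSet(null, cause);
        return state.compareAndSet(State.OPEN, State.FAILED);
    }

    // ABORTED is terminal, so a later fail() is a no-op for the state.
    boolean abort()
    {
        return state.compareAndSet(State.OPEN, State.ABORTED);
    }

    State getState()
    {
        return state.get();
    }

    // The cause is only meaningful (and only exposed) in the FAILED state.
    Optional<Throwable> getFailureCause()
    {
        return state.get() == State.FAILED ? Optional.ofNullable(failureCause.get()) : Optional.empty();
    }

    public static void main(String[] args)
    {
        FirstFailureSketch machine = new FirstFailureSketch();
        machine.fail(new RuntimeException("first"));
        machine.fail(new RuntimeException("second"));
        System.out.println(machine.getFailureCause().get().getMessage()); // prints first
    }
}
```

In this sketch the cause may be recorded even if the buffer ends up ABORTED, which is harmless as long as callers only query the cause in the FAILED state.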

core/trino-main/src/main/java/io/trino/execution/buffer/SpoolingExchangeOutputBuffer.java

sopel39 · 2022-01-25T14:25:20Z

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSink.java

What should abort do when finish is already running?
What should abort do when another abort is running?

What should abort do when finish is already running?

I think it can be implementation specific. The implementation may decide to keep the finish running, or may decide to cancel finish and abort. It doesn't really make a difference from the engine perspective.

What should abort do when another abort is running?

Same here. It is implementation specific. As long as the sink is properly invalidated, the engine doesn't really care what happens underneath, as the task is already aborted or failed anyway. Regardless, I guess it is better to make the abort method idempotent. I will change it to first transition the buffer to the ABORTED state and then call ExchangeSink#abort, the result of which is technically ignored anyway (abort is only called when the task itself is failed or aborted).
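
The "transition first, then call abort" idea described above can be sketched as below. This is only an illustration under assumptions: SketchSink, the two-state enum and the in-memory failure list (standing in for a logger) are hypothetical and do not reflect the actual SpoolingExchangeOutputBuffer.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical buffer wrapper; not the actual SpoolingExchangeOutputBuffer.
class IdempotentAbortSketch
{
    enum State { OPEN, ABORTED }

    interface SketchSink
    {
        CompletableFuture<Void> abort();
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);
    private final List<Throwable> loggedFailures = new CopyOnWriteArrayList<>(); // stands in for a logger
    private final SketchSink sink;

    IdempotentAbortSketch(SketchSink sink)
    {
        this.sink = sink;
    }

    void abort()
    {
        // Only the caller that wins the transition talks to the sink.
        if (!state.compareAndSet(State.OPEN, State.ABORTED)) {
            return;
        }
        // The result is not awaited; a failure is only recorded.
        sink.abort().whenComplete((ignored, failure) -> {
            if (failure != null) {
                loggedFailures.add(failure);
            }
        });
    }

    List<Throwable> loggedFailures()
    {
        return loggedFailures;
    }

    public static void main(String[] args)
    {
        AtomicInteger sinkAborts = new AtomicInteger();
        IdempotentAbortSketch buffer = new IdempotentAbortSketch(() -> {
            sinkAborts.incrementAndGet();
            return CompletableFuture.failedFuture(new IllegalStateException("abort failed"));
        });
        buffer.abort();
        buffer.abort(); // second call is a no-op
        System.out.println(sinkAborts.get() + " sink abort call, " + buffer.loggedFailures().size() + " logged failure");
    }
}
```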

sopel39 · 2022-01-25T14:25:38Z

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSink.java

what should finish do when abort is already running? Contract is undefined here
what should finish do when another finish is running?

what should finish do when abort is already running?

finish should never be called after abort. If it is - it's a bug. Let me document it.

Contract is undefined here what should finish do when another finish is running?

finish shouldn't be called when another finish is running. If it is - it's a bug.

Updated java doc

sopel39 · 2022-01-25T14:29:06Z

core/trino-main/src/main/java/io/trino/execution/buffer/SpoolingExchangeOutputBuffer.java

Is it racy with setNoMorePages? E.g. abort can be called after setNoMorePages started running finish

setNoMorePages starts running finish after transitioning the state. If destroy is called before setNoMorePages it means that the task got cancelled prematurely and the buffer has to be invalidated. If there's a race and setNoMorePages is called at the same time when destroy is called it is legit to finish the sink, as the data written to the sink at that point is complete.

I don't fully understand. If you say this race is fine, then why do we need the if (stateMachine.getState().canAddPages()) { check here? In case of a race (between setNoMorePages and destroy), it would be as if this check did not exist.

In normal flow there shouldn't be a race. When the output is completely written and the setNoMorePages is called the task is only finished after ExchangeSink#finish is done and the buffer is transitioned to the FINISHED state. When the task itself is transitioned to FINISHED the destroy method is called and we don't want the sink to be aborted under normal circumstances. That's why there's a check.

However, a race is possible when all the data is written but the task is cancelled before ExchangeSink#finish completes. This shouldn't happen in practice, as the scheduler is not expected to cancel tasks that are writing to a spooling exchange. However, from the interface perspective it is possible. I was thinking about the best way to handle this situation. When the output is complete and the task is cancelled, the output itself is valid, so letting it finish should be perfectly fine. However, sending an "abort" to the sink gives the ExchangeSink implementation a chance to cancel the commit if possible.

Discussed offline.

Removing the check to ensure abort is always called if the finish hasn't succeeded.
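
One way to read the outcome of this thread is sketched below: destroy aborts the sink unless finish has already succeeded. All names, the four-state enum and the RecordingSink helper are assumptions; handling of a failed finish is deliberately left out, and the real buffer may be structured differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of "abort unless finish succeeded"; not the actual Trino buffer.
class DestroyAbortsUnlessFinishedSketch
{
    enum State { OPEN, NO_MORE_PAGES, FINISHED, ABORTED }

    interface SketchSink
    {
        CompletableFuture<Void> finish();

        CompletableFuture<Void> abort();
    }

    // Test helper: finish completes only when finishFuture is completed by the caller.
    static final class RecordingSink implements SketchSink
    {
        final CompletableFuture<Void> finishFuture = new CompletableFuture<>();
        final AtomicInteger abortCalls = new AtomicInteger();

        public CompletableFuture<Void> finish()
        {
            return finishFuture;
        }

        public CompletableFuture<Void> abort()
        {
            abortCalls.incrementAndGet();
            return CompletableFuture.completedFuture(null);
        }
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);
    private final SketchSink sink;

    DestroyAbortsUnlessFinishedSketch(SketchSink sink)
    {
        this.sink = sink;
    }

    void setNoMorePages()
    {
        if (!state.compareAndSet(State.OPEN, State.NO_MORE_PAGES)) {
            return;
        }
        // Transition first, then start the asynchronous commit.
        sink.finish().whenComplete((ignored, failure) -> {
            if (failure == null) {
                // Loses the race (and stays ABORTED) if destroy() ran in the meantime.
                state.compareAndSet(State.NO_MORE_PAGES, State.FINISHED);
            }
        });
    }

    void destroy()
    {
        State previous = state.getAndUpdate(current -> current == State.FINISHED ? State.FINISHED : State.ABORTED);
        if (previous == State.OPEN || previous == State.NO_MORE_PAGES) {
            sink.abort(); // result intentionally not awaited
        }
    }

    State getState()
    {
        return state.get();
    }

    public static void main(String[] args)
    {
        RecordingSink sink = new RecordingSink();
        DestroyAbortsUnlessFinishedSketch buffer = new DestroyAbortsUnlessFinishedSketch(sink);
        buffer.setNoMorePages();
        buffer.destroy(); // finish has not completed yet, so the sink is aborted
        System.out.println(buffer.getState() + ", abort calls: " + sink.abortCalls.get());
    }
}
```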

linzebing · 2022-01-26T19:05:58Z

It feels that abort doesn't have to be blocking, as we can just abort the multi part upload asynchronously.

arhimondr · 2022-01-26T19:31:59Z

It feels that abort doesn't have to be blocking, as we can just abort the multi part upload asynchronously.

Currently it is not blocking. It returns a future and the OutputBuffer doesn't wait for it; it only logs an exception if one occurred.

linzebing · 2022-01-27T19:06:21Z

Need to wait for futures to complete here https://github.com/trinodb/trino/blob/master/testing/trino-testing/src/main/java/io/trino/testing/AbstractTestExchangeManager.java#L167,L170

arhimondr · 2022-01-27T19:57:45Z

Need to wait for futures to complete here https://github.com/trinodb/trino/blob/master/testing/trino-testing/src/main/java/io/trino/testing/AbstractTestExchangeManager.java#L167,L170

Good catch
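
As a small illustration of why a test has to wait once finish becomes asynchronous, consider the sketch below. The SketchSink class and its staged/committed lists are hypothetical and unrelated to AbstractTestExchangeManager; the only point is the join() before reading.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical test-style sketch; not the actual exchange manager test.
class AwaitSinkFuturesSketch
{
    // Pages become visible to readers only once the future returned by finish() completes.
    static final class SketchSink
    {
        private final List<String> staged = new CopyOnWriteArrayList<>();
        private final List<String> committed = new CopyOnWriteArrayList<>();
        private final ExecutorService executor;

        SketchSink(ExecutorService executor)
        {
            this.executor = executor;
        }

        void add(String page)
        {
            staged.add(page);
        }

        CompletableFuture<Void> finish()
        {
            return CompletableFuture.runAsync(() -> committed.addAll(staged), executor);
        }

        List<String> committedPages()
        {
            return List.copyOf(committed);
        }
    }

    static List<String> writeThenRead(ExecutorService executor, List<String> pages)
    {
        SketchSink sink = new SketchSink(executor);
        pages.forEach(sink::add);
        sink.finish().join(); // without this wait, the read below could race with the commit
        return sink.committedPages();
    }

    public static void main(String[] args)
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            System.out.println(writeThenRead(executor, List.of("a", "b")));
        }
        finally {
            executor.shutdown();
        }
    }
}
```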

arhimondr · 2022-01-28T04:51:31Z

Rebased on top of #10507

Applied necessary changes to DeduplicatingDirectExchangeBuffer. @losipiuk @sopel39 @linzebing Please take a look

sopel39

lgtm % comments

sopel39 · 2022-01-28T11:05:36Z

core/trino-main/src/main/java/io/trino/operator/DirectExchangeClient.java

nit: can we have a test for this?

That would probably require creating Exchange mocks that can throw an exception on close. I wonder if it's worth it given that we don't have memory counting tests even for happy path scenarios.

sopel39 · 2022-01-28T11:58:31Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

Could you add a comment: the method has to preserve only the first failure that made the transition.?

sopel39 · 2022-01-28T11:59:16Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

failure cause can be set before state transitions to FAILED. Are we sure that won't cause any troubles?

The failureCause is only expected to be inspected when the buffer is in the FAILED state. If the buffer transitioned to ABORTED in the meantime, the failure cause is not expected to be queried.

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBuffer.java

sopel39 · 2022-01-28T12:03:45Z

core/trino-main/src/main/java/io/trino/execution/buffer/SpoolingExchangeOutputBuffer.java

that description is confusing:

This is possible when a task is cancelled early by the coordinator.

and

Task cancellation is not supported as the task output is expected to be deterministic.

Both can't be true at same time, right?

The task cancellation is not expected to be requested by coordinator. It can only be requested if there's a bug in the scheduler. However if this situation happens (e.g.: due to a bug) it is safer to invalidate the buffer with abort to avoid publishing incomplete data to the exchange service.

Added one more sentence to elaborate it.

core/trino-main/src/main/java/io/trino/operator/DeduplicatingDirectExchangeBuffer.java

sopel39 · 2022-01-28T12:37:55Z

core/trino-main/src/main/java/io/trino/operator/DeduplicatingDirectExchangeBuffer.java

sopel39 · 2022-01-28T12:38:05Z

core/trino-main/src/main/java/io/trino/operator/DeduplicatingDirectExchangeBuffer.java

sopel39 · 2022-01-28T12:40:47Z

core/trino-main/src/main/java/io/trino/operator/DeduplicatingDirectExchangeBuffer.java

It seems to be more consistent with the naming in other places in the codebase (e.g.: abortTask). Also it will help to disambiguate a failure (when something failed inside an output buffer and must be reported) and an abort (when a buffer is explicitly aborted by the engine).

ExchangeSink#finish is called to commit the ExchangeSink when noMorePages is set on the SpoolingExchangeOutputBuffer. The setNoMorePages method is assumed to be lightweight and is called from a thread pool designed to handle lightweight task notifications. By default this thread pool has only 5 threads. Simply increasing the thread pool size is not ideal, as it is hard to know which specific output buffer will be used and whether any heavyweight processing on "noMorePages" is needed. Instead, this commit changes the finish and abort operations on ExchangeSink to be non-blocking. With this approach the ExchangeSink is free to implement its own commit strategy without blocking the engine thread pools.

github-actions bot added the tests:hive label Jan 20, 2022

arhimondr requested review from linzebing, losipiuk and martint January 20, 2022 01:03

losipiuk reviewed Jan 20, 2022

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

losipiuk reviewed Jan 20, 2022

core/trino-main/src/main/java/io/trino/execution/buffer/BufferState.java

losipiuk reviewed Jan 20, 2022

core/trino-main/src/main/java/io/trino/execution/buffer/BufferState.java

losipiuk reviewed Jan 20, 2022

arhimondr force-pushed the output-buffer-failure-handling branch from e95aacb to af6b14f Compare January 21, 2022 17:23

cla-bot bot added the cla-signed label Jan 21, 2022

losipiuk approved these changes Jan 21, 2022

linzebing approved these changes Jan 21, 2022

github-actions bot removed the tests:hive label Jan 21, 2022

sopel39 requested a review from raunaqmorarka January 24, 2022 16:29

arhimondr force-pushed the output-buffer-failure-handling branch from af6b14f to d62fd13 Compare January 25, 2022 05:08

sopel39 reviewed Jan 25, 2022

arhimondr force-pushed the output-buffer-failure-handling branch from d62fd13 to 5d2d3a0 Compare January 25, 2022 22:29

arhimondr force-pushed the output-buffer-failure-handling branch 2 times, most recently from c2ec962 to df84888 Compare January 26, 2022 21:57

arhimondr force-pushed the output-buffer-failure-handling branch 2 times, most recently from e7d4c69 to 2a9a495 Compare January 28, 2022 04:50

sopel39 reviewed Jan 28, 2022

arhimondr force-pushed the output-buffer-failure-handling branch from 2a9a495 to 9c3fde7 Compare January 28, 2022 18:31

linzebing approved these changes Jan 28, 2022

sopel39 approved these changes Jan 28, 2022

arhimondr added 8 commits January 28, 2022 17:35

Add OutputBufferStateMachine

7d63de2

Encapsulate state transition logic shared between all output buffers in a single place. This will also help with extending the state machine to support failing a buffer with a specific exception that can be stored in the OutputBufferStateMachine

Expose output buffer state in the interface

7bb882f

Preparation needed to allow failure handling

Rename OutputBuffer#abort(bufferId) to destroy

7c7fbd1

To be consistent with OutputBuffer#destroy() which does essentially the same operation but for all the buffers.

Simplify checkFlushComplete for Broadcast/PartitionedOutputBuffer

ef157d2

Ensure memory is always released upon DirectExchangeClient#close

1e6cbde

Fix synchronization in DeduplicatingDirectExchangeBuffer

12ede0b

The isBlocked method accesses fields that must be accessed under a lock
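
A generic sketch of the kind of fix this commit message describes is shown below. It is not the DeduplicatingDirectExchangeBuffer code; the class, its fields and the page type are assumptions, and the only point is that isBlocked reads its guarded fields while holding the same lock as the writers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

// Hypothetical buffer; not the actual DeduplicatingDirectExchangeBuffer.
class GuardedIsBlockedSketch
{
    // Both fields are guarded by "this".
    private final Deque<String> pages = new ArrayDeque<>();
    private CompletableFuture<Void> blocked = new CompletableFuture<>();

    // Reads the guarded fields under the lock, so it never sees a half-updated state.
    synchronized CompletableFuture<Void> isBlocked()
    {
        if (!pages.isEmpty()) {
            return CompletableFuture.completedFuture(null);
        }
        return blocked;
    }

    void addPage(String page)
    {
        CompletableFuture<Void> toComplete;
        synchronized (this) {
            pages.add(page);
            toComplete = blocked;
        }
        // Complete outside the lock so listeners do not run while it is held.
        toComplete.complete(null);
    }

    synchronized String pollPage()
    {
        String page = pages.poll();
        if (pages.isEmpty() && blocked.isDone()) {
            blocked = new CompletableFuture<>(); // re-arm for the next wait
        }
        return page;
    }

    public static void main(String[] args)
    {
        GuardedIsBlockedSketch buffer = new GuardedIsBlockedSketch();
        CompletableFuture<Void> waiting = buffer.isBlocked();
        buffer.addPage("page-1");
        System.out.println("unblocked after addPage: " + waiting.isDone());
    }
}
```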

arhimondr force-pushed the output-buffer-failure-handling branch from 9c3fde7 to 12ede0b Compare January 28, 2022 22:39

losipiuk merged commit 3d593fc into trinodb:master Jan 29, 2022

github-actions bot added this to the 370 milestone Jan 29, 2022

arhimondr deleted the output-buffer-failure-handling branch January 31, 2022 17:18

mosabua mentioned this pull request Jan 31, 2022

Add Trino 370 release notes #10793

Merged
