Skip to content

Fix task failure signal race condition#15917

Merged
arhimondr merged 6 commits intotrinodb:masterfrom
pettyjamesm:fix-task-failure-race
Feb 13, 2023
Merged

Fix task failure signal race condition#15917
arhimondr merged 6 commits intotrinodb:masterfrom
pettyjamesm:fix-task-failure-race

Conversation

@pettyjamesm
Copy link
Member

@pettyjamesm pettyjamesm commented Jan 31, 2023

Description

Fixes the sequence of operations around TaskResource checking for task failure until after the buffer response future is completed. Otherwise, tasks could inappropriately announce completion rather than failure to consumers which can yield incorrect query results.

Additional context and related issues

Discovered while attempting to debug a swath of test failures producing incorrect results in #15478 which made the sequence of interleaving operations necessary to trigger the correctness bug more likely to occur.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Fix a race condition that could allow queries to indicate normal completion instead of failure and produce incorrect results.

@pettyjamesm pettyjamesm force-pushed the fix-task-failure-race branch from b707a8f to 6693557 Compare February 3, 2023 21:46
@electrum electrum requested a review from losipiuk February 6, 2023 19:14
@losipiuk losipiuk requested a review from sopel39 February 6, 2023 20:36
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move to a separate commit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is all related code. The movement to here alone would cause issues with scheduleDriversWithTaskLifecycle() not being called when taskHandle == null. This is "fine" because it doesn't break the task state machine model today, but causes problems in a world with terminating task states.

@pettyjamesm pettyjamesm force-pushed the fix-task-failure-race branch from 6693557 to 4beeb65 Compare February 6, 2023 23:59
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption is that sending emptyResults(taskInstanceId, 0, completed: false) is safe, even though the previous comment suggested otherwise, because TaskResource will respond similarly when the timeout threshold is reached. Sending these responses instead of letting them time out should propagate task failure information more quickly than allowing the timeout to expire.

@pettyjamesm pettyjamesm force-pushed the fix-task-failure-race branch 2 times, most recently from 3fb5438 to 221d471 Compare February 8, 2023 22:00
@pettyjamesm pettyjamesm force-pushed the fix-task-failure-race branch 2 times, most recently from 953a388 to 31684aa Compare February 9, 2023 22:07
Previously, SqlTaskExecution registered an outputBuffer callback for
CheckTaskCompletionOnBufferFinish in the constructor, which published
a reference to "this" that could become visible to another thread in
an unsafe way. This callback was moved to the start() method instead.

SqlTask is also reordered to call SqlTaskExecution#start() before
publishing a reference to it via the TaskHolder reference, which was
unsafe under concurrent state machine activity on another thread.
Instead, taskHolderReference updates must now synchronize on a single
taskHolderLock before mutating the holder reference.
Sends responses eagerly to pending readers in LazyOutputBuffer when a
task aborts the output buffer.
@pettyjamesm pettyjamesm force-pushed the fix-task-failure-race branch from 31684aa to 465d500 Compare February 10, 2023 15:33
@pettyjamesm pettyjamesm force-pushed the fix-task-failure-race branch from 465d500 to 93e5561 Compare February 10, 2023 15:44
@arhimondr arhimondr merged commit 7c79c4c into trinodb:master Feb 13, 2023
@pettyjamesm pettyjamesm deleted the fix-task-failure-race branch February 13, 2023 17:07
@github-actions github-actions bot added this to the 407 milestone Feb 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants