
Use lock to ensure consistency during runaway split detection#13272

Closed
groupcache4321 wants to merge 1 commit into trinodb:master from groupcache4321:FixRackConditionInterrupter

Conversation

@groupcache4321
Contributor

Improve runaway split detection and the related task-killing code to
ensure that we do not kill a thread that we believe is hung but that has
moved on to execute on behalf of another query just before we issue the
kill command.

Description

Bugfix (for a very low-probability race condition)

Related issues, pull requests, and links

Improvement on top of #12392

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot

cla-bot bot commented Jul 20, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@groupcache4321 groupcache4321 force-pushed the FixRackConditionInterrupter branch from 1be5561 to 036ac0a on July 20, 2022 23:59
@cla-bot cla-bot bot added the cla-signed label Jul 20, 2022
@groupcache4321 groupcache4321 force-pushed the FixRackConditionInterrupter branch from 036ac0a to 2ce7947 on July 21, 2022 00:01
@groupcache4321 groupcache4321 marked this pull request as ready for review July 21, 2022 00:05
@groupcache4321
Contributor Author

Same purpose as: #13262

@groupcache4321
Contributor Author

Want to make sure this is the only case we are avoiding:

Thread A has been processing JONI for 10+ minutes, finishes its part of the processing (split A), and then moves to another split (split B), so both split A's and split B's RunningSplitInfo hold the same reference to thread A. In the meantime, the interrupter thread checks the stack trace for JONI (from split A) and then fails split B.

Member

@losipiuk losipiuk left a comment

This is tricky to follow but seems to work fine. I would suggest to sprinkle it with some comments to make life of future readers easier.

@losipiuk
Member

losipiuk commented Jul 21, 2022

Want to make sure this is the only case we are avoiding:

Thread A has been processing JONI for 10+ minutes, finishes its part of the processing (split A), and then moves to another split (split B), so both split A's and split B's RunningSplitInfo hold the same reference to thread A. In the meantime, the interrupter thread checks the stack trace for JONI (from split A) and then fails split B.

@leetcode-1533 Not really. The scenario you are describing is, I think, not possible right now. With the current code, we would always kill split A (we are iterating over splits and killing the splits' tasks, not interrupting the underlying threads directly).

The problematic flow is that we may kill a task that is long-running but not really stuck in JONI code. Consider the following example:

  • We find long-running splits; we get A, B, C.
  • None of those is actually running JONI code.
  • Just before we investigate the stack trace for A, the underlying thread has already switched to some other unrelated split D, and D is actually running JONI.
  • We get the stack trace for what we believe is A, but it is actually for D, and we decide we should kill the task that A belongs to.
  • (Clash!) A wrong decision is made.

I think your solution here prevents that from happening.
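The flow described above is a classic check-then-act race: the checker snapshots a "stuck" split, but the runner thread may switch splits before the checker acts. A minimal, hypothetical sketch of the fix (illustrative names such as SplitRunner and runIfStillExecuting, not Trino's actual classes) re-validates the association under a lock before killing anything:

```java
// Minimal model of the race: a runner thread may switch splits between the
// moment the checker snapshots a "stuck" split and the moment it acts on it.
class SplitRunner
{
    private Object currentSplit;

    // The runner switches splits under the lock.
    synchronized void startSplit(Object split)
    {
        currentSplit = split;
    }

    // The checker re-validates, under the same lock, that the runner is still
    // on the split it believes is stuck before running the kill action.
    synchronized boolean runIfStillExecuting(Object supposedlyStuckSplit, Runnable killAction)
    {
        if (currentSplit == supposedlyStuckSplit) {
            killAction.run();
            return true;
        }
        return false;
    }
}

public class RunawaySplitRaceSketch
{
    public static void main(String[] args)
    {
        SplitRunner runner = new SplitRunner();
        Object splitA = new Object();
        Object splitD = new Object();

        runner.startSplit(splitA);
        // Checker snapshots splitA as a long-running candidate here...
        runner.startSplit(splitD); // ...but the runner switches to splitD.

        // Without the lock-guarded re-check, the checker would kill splitA's
        // task based on splitD's stack trace. With it, the kill is skipped.
        boolean killed = runner.runIfStillExecuting(splitA, () -> System.out.println("killing task of split A"));
        System.out.println("killed=" + killed);
    }
}
```

Because both the split switch and the re-check are guarded by the same monitor, the kill action can never run against a split the thread has already abandoned.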

@groupcache4321
Contributor Author


Makes sense, since we are failing the task, and we are finding the thread from the driver's context.

Member

@phd3 phd3 left a comment

good catch @losipiuk @leetcode-1533

Member

RunningSplitInfo currently holds this Thread state, which can change what it's doing through a mechanism external to the class. The addition of taskRunner now provides better (and sufficient) encapsulation here, so IMO we should remove the direct Thread access and instead let all use cases go through taskRunner, unless there's a particular reason not to.

Member

In fact, it may be better to make this check while getting the stack trace itself, i.e. only return a valid stack trace if the runner is indeed processing the split's code.

Contributor Author

Hey, taskRunner is a Runnable and doesn't have the thread object saved. Are you suggesting getting the current thread from TaskRunner?

Contributor Author

For the taskRunner, we still can't guarantee it is always processing the split.

Member

We do need the check to make sure that the taskRunner is processing the same split, which is what this PR is doing. My original suggestion was about simplifying it by not having to store the Thread, but I tried it out, and it seems we can't get away without stashing it somewhere.

The current code that gets the stack trace is potentially getting an "incorrect" stack trace, and this PR avoids acting on that mistake later, through a callback, when trying to kill the task. I was trying to think whether we can stop making the mistake in the first place by doing this check when returning the stack trace (checking whether we're still on the same split). In that case, you wouldn't need to pass in the callback, but could simply fail the task as your previous code was doing.

RunningSplitInfo {

        ......

        public Optional<StackTraceElement[]> getThreadStackTrace()
        {
            return taskRunner.getStackTraceIfStillExecuting(this, thread);
        }
}

TaskRunner {
        ........

        synchronized Optional<StackTraceElement[]> getStackTraceIfStillExecuting(RunningSplitInfo stuckSplitInfo, Thread thread)
        {
            if (stuckSplitInfo != null && currentRunningSplitInfo == stuckSplitInfo) {
                return Optional.of(thread.getStackTrace());
            }
            return Optional.empty();
        }
}

I guess the question is: is there an advantage to executing failTask within the TaskRunner lock, or is just getting the right stack trace enough? (I haven't looked into whether failTask may be locking something else that could potentially cause a deadlock. cc @losipiuk @arhimondr)

Even though we have to store the thread, we can, if possible, just expose the needed stack trace from the split info and leverage taskRunner for locking. However, having gone through the exercise, I don't feel super strongly about the readability of one approach over the other (the advantage would be that we avoid the callback).

Contributor Author

@groupcache4321 groupcache4321 Jul 25, 2022

Regarding "stop making this mistake in the first place": there is always a performance trade-off in taking locks earlier. Essentially, it increases the number of impacted threads and adds more work to the critical section, trading extra delay for a more accurate result. This code intentionally delays taking the lock, since a JONI hang is really a corner case.

What's more, when dealing with stack traces there is always a window for a race condition, since we can't freeze the thread while capturing the trace.

A deadlock requires two threads to hold each other's locks. Since currentRunningSplitInfo is newly added, I don't think failTask will acquire currentRunningSplitInfo and cause a deadlock; in particular, in all three places that acquire currentRunningSplitInfo, the code doesn't ask for any further locks.

Contributor

@arhimondr arhimondr Jul 25, 2022

The probability of a split that has been running for 10-15 minutes switching to a different split at the exact moment the checker is running is practically non-existent, and the consequences of this event are nowhere near significant. The worst thing that could happen is that a task running a stuck split that is not actually uninterruptible gets killed. IMO we are trying to solve a small problem that could happen with an extremely tiny probability by introducing extra synchronization complexity that may backfire in a significantly more impactful way. It looks like killing a task under a lock is probably safe today (I'm not 100% sure, though, due to the complexity of the task management code), but it may backfire if the task execution internals ever change (e.g. if failing a task is changed from asynchronous to synchronous). If it were up to me, I would suggest leaving it as is and, instead of trying to do our best to interrupt only tasks stuck in JONI, try to make other long-running operations interruptible (this can be done iteratively over time) and eventually remove the predicate.

Member

IMO we are trying to solve a small problem that could happen with an extremely tiny probability by introducing extra synchronization complexity that may backfire in a significantly more impactful way. It looks like killing a task under a lock is probably safe today (I’m not 100% sure though due to complexity of task management code), but it may backfire if the task execution internals ever change (e.g.: when failing a task is changed to be synchronous vs asynchronous)

+1. I do see value in fixing it, but I'm worried from a future-changes/maintainability perspective. One option is to leave a really extensive comment about this potential corner case and leave it at that. Or, instead of failing the task inside the lock, just log a warning inside the lock but fail outside of it.
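The "warn inside the lock, fail outside it" alternative mentioned above can be sketched as follows. This is a hypothetical illustration with made-up names (StuckSplitChecker, confirmStuck), not Trino's actual API: the lock is held only long enough to confirm the thread is still on the suspected split, and the potentially heavyweight failTask call happens after the lock is released.

```java
import java.util.Optional;

// Sketch: confirmation and warning happen under the lock; the actual task
// failure is deferred to the caller, outside the lock.
class StuckSplitChecker
{
    private final Object lock = new Object();
    private Object currentSplit;

    void switchTo(Object split)
    {
        synchronized (lock) {
            currentSplit = split;
        }
    }

    // Returns the fail action only if the runner is confirmed still stuck on
    // the suspected split; logs the warning while the lock is held.
    Optional<Runnable> confirmStuck(Object suspectedSplit, Runnable failTask)
    {
        synchronized (lock) {
            if (currentSplit == suspectedSplit) {
                System.err.println("WARN: split appears stuck; task will be failed");
                return Optional.of(failTask);
            }
            return Optional.empty();
        }
    }
}

public class FailOutsideLockSketch
{
    public static void main(String[] args)
    {
        StuckSplitChecker checker = new StuckSplitChecker();
        Object split = new Object();
        checker.switchTo(split);
        // Confirmation happens under the lock; the actual failure does not.
        checker.confirmStuck(split, () -> System.out.println("failing task"))
                .ifPresent(Runnable::run);
    }
}
```

Note that deferring failTask outside the lock reintroduces a tiny race window between confirmation and failure; per the discussion above, that residual risk is considered acceptable in exchange for not invoking task-management code while holding the lock.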

@groupcache4321
Contributor Author

Hey, I was able to reproduce the race condition @losipiuk described in a unit test; I will update later.

@groupcache4321
Contributor Author

groupcache4321 commented Jul 21, 2022

The current code logic is that none of the checks (delay, stack trace) are thread-safe. However, once a check has made the decision to interrupt a split, it uses a thread-safe method to fail that task; if the underlying thread has switched splits, it does nothing.

This ensures consistency between:
	1. the split that the interrupter thread is failing, and
	2. the thread stack trace that the interrupter thread inspected.
@groupcache4321 groupcache4321 force-pushed the FixRackConditionInterrupter branch from 2ce7947 to b795c9a on July 22, 2022 10:45
@phd3
Member

phd3 commented Jul 29, 2022

Had a discussion with @arhimondr, @losipiuk, and @leetcode-1533. We decided on leaving a comment explaining the potential race condition, but not actually fixing it. The reasons: (1) the super-low likelihood of getting into the problematic scenario, and (2) the complexity and synchronization we would introduce for this small corner case may backfire with changes in the task execution and management framework.

