
Use lock to ensure consistency during runaway split detection#13272

Closed
groupcache4321 wants to merge 1 commit into trinodb:master from groupcache4321:FixRackConditionInterrupter

Conversation

@groupcache4321
Contributor

Improve runaway split detection and the related task-killing code to
ensure that we do not kill a thread that we believe is hung but that has
moved on to execute on behalf of another query just before we issue the
kill command.

Description

Bugfix (for a very low-probability race condition)

Related issues, pull requests, and links

Improvement on top of #12392

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot

cla-bot bot commented Jul 20, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@groupcache4321 groupcache4321 force-pushed the FixRackConditionInterrupter branch from 1be5561 to 036ac0a on July 20, 2022 23:59
@cla-bot cla-bot bot added the cla-signed label Jul 20, 2022
@groupcache4321 groupcache4321 force-pushed the FixRackConditionInterrupter branch from 036ac0a to 2ce7947 on July 21, 2022 00:01
@groupcache4321 groupcache4321 marked this pull request as ready for review July 21, 2022 00:05
@groupcache4321
Contributor Author

Same purpose as: #13262

@groupcache4321
Contributor Author

Want to make sure this is the only case we are avoiding:

Thread A has been processing JONI for 10+ minutes, finishes its part of the processing (split A), and then moves to another split (split B), so both split A's and split B's RunningSplitInfo hold the same reference to thread A. In the meantime, the interrupter thread checks the stack trace for JONI (from split A) and then fails split B.

Member

@losipiuk losipiuk left a comment

This is tricky to follow but seems to work fine. I would suggest to sprinkle it with some comments to make life of future readers easier.

@losipiuk
Member

losipiuk commented Jul 21, 2022

Want to make sure this is the only case we are avoiding:

Thread A has been processing JONI for 10+ minutes, finishes its part of the processing (split A), and then moves to another split (split B), so both split A's and split B's RunningSplitInfo hold the same reference to thread A. In the meantime, the interrupter thread checks the stack trace for JONI (from split A) and then fails split B.

@leetcode-1533 Not really. The scenario you are describing is, I think, not possible right now. With the current code, we would always kill split A (we are iterating over splits and killing the splits' tasks, not interrupting the underlying threads directly).

The problematic flow is that we may kill a task that is long-running but not really stuck in JONI code. Consider the following example:

  • We find long-running splits; we get A, B, C.
  • None of those is actually running JONI code.
  • Just before we investigate the stack trace for A, the underlying thread has already switched to some other unrelated split D, and D is actually running JONI.
  • We get the stack trace for what we believe is A, but it is actually for D, and we decide we should kill the task that A belongs to.
  • (Clash!) A wrong decision is made.

I think your solution here prevents that from happening.
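The flow described above is a classic check-then-act race: the checker snapshots a "stuck" split, but the runner thread may switch splits before the checker acts. A minimal, hypothetical sketch of the fix (illustrative names such as SplitRunner and runIfStillExecuting, not Trino's actual classes) re-validates the association under a lock before killing anything:

```java
// Minimal model of the race: a runner thread may switch splits between the
// moment the checker snapshots a "stuck" split and the moment it acts on it.
class SplitRunner
{
    private Object currentSplit;

    // The runner switches splits under the lock.
    synchronized void startSplit(Object split)
    {
        currentSplit = split;
    }

    // The checker re-validates, under the same lock, that the runner is still
    // on the split it believes is stuck before running the kill action.
    synchronized boolean runIfStillExecuting(Object supposedlyStuckSplit, Runnable killAction)
    {
        if (currentSplit == supposedlyStuckSplit) {
            killAction.run();
            return true;
        }
        return false;
    }
}

public class RunawaySplitRaceSketch
{
    public static void main(String[] args)
    {
        SplitRunner runner = new SplitRunner();
        Object splitA = new Object();
        Object splitD = new Object();

        runner.startSplit(splitA);
        // Checker snapshots splitA as a long-running candidate here...
        runner.startSplit(splitD); // ...but the runner switches to splitD.

        // Without the lock-guarded re-check, the checker would kill splitA's
        // task based on splitD's stack trace. With it, the kill is skipped.
        boolean killed = runner.runIfStillExecuting(splitA, () -> System.out.println("killing task of split A"));
        System.out.println("killed=" + killed);
    }
}
```

Because both the split switch and the re-check are guarded by the same monitor, the kill action can never run against a split the thread has already abandoned.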

@groupcache4321
Contributor Author


Makes sense, since we are failing the task, and we are finding the thread from the driver's context.

Member

@phd3 phd3 left a comment

good catch @losipiuk @leetcode-1533

Member

RunningSplitInfo currently holds this Thread state, which can change what it's doing through a mechanism external to the class. The addition of taskRunner now provides better (and sufficient) encapsulation here, so IMO we should remove the direct Thread access and instead let all use cases go through taskRunner, unless there's a particular reason not to.

Member

In fact, it may be better to make this check while getting the stack trace itself, i.e. only return a valid stack trace if the runner is indeed processing the split's code.

Contributor Author

Hey, taskRunner is a Runnable and doesn't have the thread object saved. Are you suggesting getting the current thread from TaskRunner?

Contributor Author

For the taskRunner, we still can't guarantee it is always processing the split.

Member

We do need the check to make sure that the taskRunner is processing the same split, which is what this PR is doing. My original suggestion was about simplifying it by not having to store the Thread, but I tried it out, and it seems we can't get away without stashing it somewhere.

The current code that gets the stack trace is potentially getting an "incorrect" stack trace, and this PR avoids acting on that mistake later, through a callback, when trying to kill the task. I was trying to think whether we can stop making the mistake in the first place by doing this check when returning the stack trace (checking whether we're still on the same split). In that case, you wouldn't need to pass in the callback, but could simply fail the task as your previous code was doing.

RunningSplitInfo {

        ......

        public Optional<StackTraceElement[]> getThreadStackTrace()
        {
            return taskRunner.getStackTraceIfStillExecuting(this, thread);
        }
}

TaskRunner {
        ........

        synchronized Optional<StackTraceElement[]> getStackTraceIfStillExecuting(RunningSplitInfo stuckSplitInfo, Thread thread)
        {
            if (stuckSplitInfo != null && currentRunningSplitInfo == stuckSplitInfo) {
                return Optional.of(thread.getStackTrace());
            }
            return Optional.empty();
        }
}

I guess the question is: is there an advantage to executing failTask within the TaskRunner lock, or is just getting the right stack trace enough? (I haven't looked into whether failTask may be locking something else that could potentially cause a deadlock. cc @losipiuk @arhimondr)

Even though we have to store the thread, we can, if possible, just expose the needed stack trace from the split info and leverage taskRunner for locking. However, having gone through the exercise, I don't feel super strongly about the readability of one approach over the other (the advantage would be that we avoid the callback).

Contributor Author

@groupcache4321 groupcache4321 Jul 25, 2022

Regarding "stop making this mistake in the first place": there is always a performance trade-off in taking locks earlier. Essentially, it increases the number of impacted threads and adds more work to the critical section, trading extra delay for a more accurate result. This code intentionally delays taking the lock, since a JONI hang is really a corner case.

What's more, when dealing with stack traces there is always a window for a race condition, since we can't freeze the thread while capturing the trace.

A deadlock requires two threads to hold each other's locks. Since currentRunningSplitInfo is newly added, I don't think failTask will acquire currentRunningSplitInfo and cause a deadlock; in particular, in all three places that acquire currentRunningSplitInfo, the code doesn't ask for any further locks.

Contributor

@arhimondr arhimondr Jul 25, 2022

The probability of a split that has been running for 10-15 minutes switching to a different split at the exact moment the checker is running is practically non-existent, and the consequences of this event are nowhere near significant. The worst thing that could happen is that a task running a stuck split that is not actually uninterruptible gets killed. IMO we are trying to solve a small problem that could happen with an extremely tiny probability by introducing extra synchronization complexity that may backfire in a significantly more impactful way. It looks like killing a task under a lock is probably safe today (I'm not 100% sure, though, due to the complexity of the task management code), but it may backfire if the task execution internals ever change (e.g. if failing a task is changed from asynchronous to synchronous). If it were up to me, I would suggest leaving it as is and, instead of trying to do our best to interrupt only tasks stuck in JONI, try to make other long-running operations interruptible (this can be done iteratively over time) and eventually remove the predicate.

Member

IMO we are trying to solve a small problem that could happen with an extremely tiny probability by introducing extra synchronization complexity that may backfire in a significantly more impactful way. It looks like killing a task under a lock is probably safe today (I’m not 100% sure though due to complexity of task management code), but it may backfire if the task execution internals ever change (e.g.: when failing a task is changed to be synchronous vs asynchronous)

+1. I do see value in fixing it, but I'm worried from a future-changes/maintainability perspective. One option is to leave a really extensive comment about this potential corner case and leave it at that. Or, instead of failing the task inside the lock, just log a warning inside the lock but fail outside of it.
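The "warn inside the lock, fail outside it" alternative mentioned above can be sketched as follows. This is a hypothetical illustration with made-up names (StuckSplitChecker, confirmStuck), not Trino's actual API: the lock is held only long enough to confirm the thread is still on the suspected split, and the potentially heavyweight failTask call happens after the lock is released.

```java
import java.util.Optional;

// Sketch: confirmation and warning happen under the lock; the actual task
// failure is deferred to the caller, outside the lock.
class StuckSplitChecker
{
    private final Object lock = new Object();
    private Object currentSplit;

    void switchTo(Object split)
    {
        synchronized (lock) {
            currentSplit = split;
        }
    }

    // Returns the fail action only if the runner is confirmed still stuck on
    // the suspected split; logs the warning while the lock is held.
    Optional<Runnable> confirmStuck(Object suspectedSplit, Runnable failTask)
    {
        synchronized (lock) {
            if (currentSplit == suspectedSplit) {
                System.err.println("WARN: split appears stuck; task will be failed");
                return Optional.of(failTask);
            }
            return Optional.empty();
        }
    }
}

public class FailOutsideLockSketch
{
    public static void main(String[] args)
    {
        StuckSplitChecker checker = new StuckSplitChecker();
        Object split = new Object();
        checker.switchTo(split);
        // Confirmation happens under the lock; the actual failure does not.
        checker.confirmStuck(split, () -> System.out.println("failing task"))
                .ifPresent(Runnable::run);
    }
}
```

Note that deferring failTask outside the lock reintroduces a tiny race window between confirmation and failure; per the discussion above, that residual risk is considered acceptable in exchange for not invoking task-management code while holding the lock.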

@groupcache4321
Contributor Author

Hey, I was able to reproduce the race condition @losipiuk described in a unit test; I will update later.

@groupcache4321
Contributor Author

groupcache4321 commented Jul 21, 2022

The current code logic is that none of the checks (delay, stack trace) are thread-safe. However, once a check has made the decision to interrupt a split, it uses a thread-safe method to fail that task; if the underlying thread has switched splits, it does nothing.

This ensures consistency between:
	1. the split that the interrupter thread is failing, and
	2. the thread stack trace that the interrupter thread inspected.
@groupcache4321 groupcache4321 force-pushed the FixRackConditionInterrupter branch from 2ce7947 to b795c9a on July 22, 2022 10:45
@phd3
Member

phd3 commented Jul 29, 2022

Had a discussion with @arhimondr, @losipiuk, and @leetcode-1533. We decided on leaving a comment explaining the potential race condition, but not actually fixing it. The reasons: (1) the super-low likelihood of getting into the problematic scenario, and (2) the complexity and synchronization we would introduce for this small corner case may backfire with changes in the task execution and management framework.

