-
Notifications
You must be signed in to change notification settings - Fork 440
TEZ-4334: Resolve deadlock in ShuffleScheduler which occurs when ShufflePenaltyReferee waits for the lock of ShuffleScheduler #150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…Referee waits for the lock of huffleScheduler. Fix: do not hold ShuffleScheduler.this when calling exceptionReporter.reportException() remove synchronized in copyFailed()
This comment was marked as outdated.
This comment was marked as outdated.
|
💔 -1 overall
This message was automatically generated. |
| final float MAX_ALLOWED_STALL_TIME_PERCENT = maxStallTimeFraction; | ||
|
|
||
| int doneMaps = numInputs - remainingMaps.get(); | ||
| String errorMsg = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of creating synchronized block here, it's time to refactor this strange and confusing piece of code...we're reporting exceptions and returning with a vague boolean, which is not okay, instead, we should do it in copyFailed: catch all IOExceptions and report them, what about:
//Restart consumer in case shuffle is not healthy
try{
checkIfShuffleIsHealthy(fetchFailure)
} catch(IOExcepion e){
exceptionReporter.reportException(e);
return;
}
in checkIfShuffleIsHealthy we can do separate things in a synchronized way and don't have to return a boolean:
void checkIfShuffleIsHealthy(){
checkIfAbortLimitIsExceeedFor(srcAttempt)
checkWhateverTheRestOfTheMethodDoes(srcAttempt)
}
regarding checkWhateverTheRestOfTheMethodDoes:
- this is the logic you're about to make synchronized, you can make it I guess (as exception reporting is handled in caller method copyFailed as suggested above
- find a proper method name for checkWhateverTheRestOfTheMethodDoes, which reflects what it actually does
with the method refactoring, you don't have to introduce huge synchronized block, instead you can make it clear with a convenient method name what is it to be syncronized
does it make sense @glapark ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractdog The goal of the patch I submitted was to prevent deadlock, and it was not about simplifying the logic in check???() methods. At the time of submitting the patch, I didn't fully understand the details of the logic in check???() and related methods. Another plan could be to create a new patch for simplifying the logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the original goal has already been achieved I guess
what I really meant was to make a simple refactor in the very same codepath without making the patch bigger, please refer to #273
| return false; | ||
| } | ||
|
|
||
| final float MIN_REQUIRED_PROGRESS_PERCENT = minReqProgressFraction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you touch this codepath, don't forget to remove these lines of redeclaring stuff:
final float MIN_REQUIRED_PROGRESS_PERCENT = minReqProgressFraction;
final float MAX_ALLOWED_STALL_TIME_PERCENT = maxStallTimeFraction;
I don't get what's the purpose...I guess they are leftovers from a code that needed final variables (in-place Runnables or whatever)
|
merged by #273 |
Fix:
do not hold ShuffleScheduler.this when calling exceptionReporter.reportException()
remove synchronized in copyFailed()