@glapark glapark commented Sep 3, 2021

…Referee waits for the lock of ShuffleScheduler.

Fix:
  do not hold ShuffleScheduler.this when calling exceptionReporter.reportException()
  remove synchronized in copyFailed()
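
A minimal sketch of the pattern the fix describes, not the actual Tez code: decide under the lock whether something must be reported, then call the reporter only after the monitor has been released. `Reporter`, `SchedulerSketch`, `maxFailures`, and the failure counter are hypothetical names introduced for illustration.

```java
import java.io.IOException;

/** Hypothetical stand-in for the exception reporter callback. */
interface Reporter {
  void reportException(Throwable t);
}

/** Hypothetical scheduler; only the locking pattern matters here. */
class SchedulerSketch {
  private final Reporter reporter;
  private final int maxFailures; // assumed threshold
  private int failures;          // guarded by "this"

  SchedulerSketch(Reporter reporter, int maxFailures) {
    this.reporter = reporter;
    this.maxFailures = maxFailures;
  }

  // Deliberately not synchronized: the lock is held only around state updates.
  void copyFailed() {
    IOException toReport = null;
    synchronized (this) {
      failures++;
      if (failures > maxFailures) {
        toReport = new IOException("Too many fetch failures: " + failures);
      }
    }
    // The monitor has been released at this point, so a reporter that takes
    // other locks does not do so while this object's monitor is held here.
    if (toReport != null) {
      reporter.reportException(toReport);
    }
  }
}
```

If the reporter needs a lock that another thread holds while that thread waits for the scheduler's monitor, calling it inside the synchronized block can deadlock; calling it afterwards removes that cycle.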
@tez-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 12m 28s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 12m 44s | master passed |
| +1 💚 | compile | 0m 35s | master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | compile | 0m 32s | master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | checkstyle | 1m 8s | master passed |
| +1 💚 | javadoc | 0m 42s | master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 0m 29s | master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +0 🆗 | spotbugs | 1m 25s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 💚 | findbugs | 1m 23s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 19s | the patch passed |
| +1 💚 | compile | 0m 22s | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javac | 0m 22s | the patch passed |
| +1 💚 | compile | 0m 19s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | javac | 0m 19s | the patch passed |
| -0 ⚠️ | checkstyle | 0m 15s | tez-runtime-library: The patch generated 4 new + 35 unchanged - 4 fixed = 39 total (was 39) |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | javadoc | 0m 18s | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 0m 16s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | findbugs | 0m 52s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 5m 22s | tez-runtime-library in the patch passed. |
| +1 💚 | asflicense | 0m 14s | The patch does not generate ASF License warnings. |
| | | 39m 12s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-150/2/artifact/out/Dockerfile |
| GITHUB PR | #150 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux adc4fcac2a31 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / c875b82 |
| Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| checkstyle | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-150/2/artifact/out/diff-checkstyle-tez-runtime-library.txt |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-150/2/testReport/ |
| Max. process+thread count | 2099 (vs. ulimit of 5500) |
| modules | C: tez-runtime-library U: tez-runtime-library |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-150/2/console |
| versions | git=2.25.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@abstractdog abstractdog changed the title Resolve deadlock in ShuffleScheduler which occurs when ShufflePenalty… TEZ-4334: Resolve deadlock in ShuffleScheduler which occurs when ShufflePenaltyReferee waits for the lock of ShuffleScheduler Feb 26, 2023
final float MAX_ALLOWED_STALL_TIME_PERCENT = maxStallTimeFraction;

int doneMaps = numInputs - remainingMaps.get();
String errorMsg = null;
Contributor

@abstractdog abstractdog Feb 26, 2023

I think that instead of creating a synchronized block here, it's time to refactor this strange and confusing piece of code. We're reporting exceptions and returning a vague boolean, which is not okay. Instead, we should handle this in copyFailed: catch all IOExceptions and report them there. What about:

    // Restart consumer in case shuffle is not healthy
    try {
      checkIfShuffleIsHealthy(fetchFailure);
    } catch (IOException e) {
      exceptionReporter.reportException(e);
      return;
    }

in checkIfShuffleIsHealthy we can do separate things in a synchronized way and don't have to return a boolean:

    void checkIfShuffleIsHealthy() throws IOException {
      checkIfAbortLimitIsExceededFor(srcAttempt);
      checkWhateverTheRestOfTheMethodDoes(srcAttempt);
    }

regarding checkWhateverTheRestOfTheMethodDoes:

  1. this is the logic you're about to make synchronized; I guess you can do that, since exception reporting is handled in the caller method copyFailed as suggested above
  2. find a proper name for checkWhateverTheRestOfTheMethodDoes that reflects what it actually does

with this method refactoring, you don't have to introduce a huge synchronized block; instead, a well-chosen method name makes it clear what is being synchronized
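
A compact sketch of the shape suggested above, under stated assumptions: the health checks throw `IOException` instead of returning a boolean, each check is a small synchronized method, and `copyFailed` reports the exception after those methods have returned. The class name, the thresholds, the `madeProgress` flag, and `checkIfShuffleIsStalled` are hypothetical; the real method bodies and names are not shown in this conversation.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of the suggested shape; not the Tez implementation. */
class ShuffleHealthSketch {
  // Stand-in for exceptionReporter: it only records what was reported.
  private final List<IOException> reported = new CopyOnWriteArrayList<>();
  private final int abortLimit;         // assumed threshold
  private final int maxStalledFailures; // assumed threshold
  private int failures;                 // guarded by "this"
  private int stalledFailures;          // guarded by "this"

  ShuffleHealthSketch(int abortLimit, int maxStalledFailures) {
    this.abortLimit = abortLimit;
    this.maxStalledFailures = maxStalledFailures;
  }

  void copyFailed(boolean madeProgress) {
    try {
      checkIfShuffleIsHealthy(madeProgress);
    } catch (IOException e) {
      // The synchronized helpers have returned, so this method holds no monitor.
      reported.add(e);
      return;
    }
    // Normal failure handling would continue here.
  }

  private void checkIfShuffleIsHealthy(boolean madeProgress) throws IOException {
    checkIfAbortLimitIsExceeded();
    checkIfShuffleIsStalled(madeProgress);
  }

  private synchronized void checkIfAbortLimitIsExceeded() throws IOException {
    failures++;
    if (failures > abortLimit) {
      throw new IOException("Abort limit exceeded: " + failures);
    }
  }

  private synchronized void checkIfShuffleIsStalled(boolean madeProgress) throws IOException {
    stalledFailures = madeProgress ? 0 : stalledFailures + 1;
    if (stalledFailures > maxStalledFailures) {
      throw new IOException("Shuffle stalled after " + stalledFailures + " failures");
    }
  }

  List<IOException> reported() {
    return reported;
  }
}
```

Because a synchronized method releases its monitor when it exits by throwing, the catch block in `copyFailed` runs without the lock, which is what the deadlock fix requires.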

does it make sense @glapark ?

Author

@abstractdog The goal of the patch I submitted was to prevent deadlock, and it was not about simplifying the logic in check???() methods. At the time of submitting the patch, I didn't fully understand the details of the logic in check???() and related methods. Another plan could be to create a new patch for simplifying the logic here.

Contributor

yeah, the original goal has already been achieved I guess
what I really meant was to make a simple refactor in the very same codepath without making the patch bigger, please refer to #273

return false;
}

final float MIN_REQUIRED_PROGRESS_PERCENT = minReqProgressFraction;
Contributor

if you touch this codepath, don't forget to remove these lines of redeclaring stuff:

    final float MIN_REQUIRED_PROGRESS_PERCENT = minReqProgressFraction;
    final float MAX_ALLOWED_STALL_TIME_PERCENT = maxStallTimeFraction;

I don't get what the purpose is. I guess they are leftovers from code that needed final variables (in-place Runnables or whatever)
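
To illustrate the guess above: before Java 8, a local variable had to be declared `final` to be captured by an anonymous class, which is one reason such local copies appear in older code. Fields of the enclosing instance never needed this. The class and method names below are hypothetical and only demonstrate the capture rules.

```java
import java.util.function.Supplier;

/** Hypothetical holder; only the variable-capture rules matter here. */
class CaptureSketch {
  private final float minReqProgressFraction;

  CaptureSketch(float minReqProgressFraction) {
    this.minReqProgressFraction = minReqProgressFraction;
  }

  // Older habit: copy the field into a final local so an anonymous class can capture it.
  Supplier<Float> viaFinalLocal() {
    final float MIN_REQUIRED_PROGRESS_PERCENT = minReqProgressFraction;
    return new Supplier<Float>() {
      @Override
      public Float get() {
        return MIN_REQUIRED_PROGRESS_PERCENT;
      }
    };
  }

  // Fields of the enclosing instance can be read directly; no local copy is needed.
  Supplier<Float> viaField() {
    return () -> minReqProgressFraction;
  }
}
```

When nothing captures the locals, as the comment suspects, the redeclarations add no behavior and the fields can be used directly.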

@abstractdog
Contributor

merged by #273

@abstractdog abstractdog closed this Mar 6, 2023