@abstractdog
Contributor

@abstractdog abstractdog commented Feb 28, 2023

This patch resolves a deadlock between different parts of ShuffleScheduler:

  1. exceptionReporter.reportException(new IOException(errorMsg, fetchFailure.getCause()));

The patch:
a) removes the synchronized keyword from the copyFailed method, while keeping the parts of that method that need it inside synchronized blocks
b) removes the boolean return values from the methods called from copyFailed in favor of a clearer IOException pattern, so that exceptionReporter is called from a single place, outside the synchronized section
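
To illustrate the pattern described in a) and b), here is a minimal sketch. It is not the code from the patch: `SchedulerSketch`, `checkFailures`, `failureCount`, the `maxFailures` threshold and the `Consumer<Throwable>` reporter are hypothetical stand-ins. It only shows the general idea: a helper throws an IOException instead of returning a boolean, and the reporter is invoked in a single place after the monitor has been released.

```java
import java.io.IOException;
import java.util.function.Consumer;

/** Hypothetical stand-in for ShuffleScheduler; names and logic are illustrative only. */
class SchedulerSketch {
  private final int maxFailures;              // assumed threshold, not taken from the patch
  private final Consumer<Throwable> reporter; // stand-in for exceptionReporter
  private int failures;                       // guarded by "this"

  SchedulerSketch(int maxFailures, Consumer<Throwable> reporter) {
    this.maxFailures = maxFailures;
    this.reporter = reporter;
  }

  // The method itself is not synchronized; only the state update holds the lock.
  void copyFailed(String host) {
    try {
      synchronized (this) {
        failures++;
        checkFailures(host); // signals a fatal condition by throwing, not by returning a boolean
      }
    } catch (IOException e) {
      // The monitor was released when the exception left the synchronized block,
      // so the reporter runs without holding the scheduler lock.
      reporter.accept(e);
    }
  }

  // Must be called while holding the lock on "this".
  private void checkFailures(String host) throws IOException {
    if (failures >= maxFailures) {
      throw new IOException("Too many fetch failures from " + host);
    }
  }

  synchronized int failureCount() {
    return failures;
  }
}
```

Because the reporter callback never runs under the scheduler's monitor in this sketch, a callback that ends up waiting on another thread cannot block that thread from entering the scheduler's synchronized methods.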

@abstractdog abstractdog requested a review from rbalamohan March 3, 2023 09:11
@abstractdog abstractdog changed the title TEZ-4334: Fix deadlock in ShuffleScheduler - wip TEZ-4334: Fix deadlock in ShuffleScheduler Mar 3, 2023
@abstractdog abstractdog changed the title TEZ-4334: Fix deadlock in ShuffleScheduler TEZ-4334: Fix deadlock in ShuffleScheduler between ShuffleScheduler.close() and the ShufflePenaltyReferee thread Mar 3, 2023
@tez-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 25m 21s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ master Compile Tests _
+1 💚 mvninstall 15m 47s master passed
+1 💚 compile 0m 38s master passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 compile 0m 29s master passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+1 💚 checkstyle 1m 2s master passed
+1 💚 javadoc 0m 38s master passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 javadoc 0m 27s master passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+0 🆗 spotbugs 1m 22s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 18s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 22s the patch passed
+1 💚 compile 0m 23s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 javac 0m 23s the patch passed
+1 💚 compile 0m 21s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+1 💚 javac 0m 21s the patch passed
+1 💚 checkstyle 0m 14s tez-runtime-library: The patch generated 0 new + 37 unchanged - 2 fixed = 37 total (was 39)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 18s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 javadoc 0m 17s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+1 💚 findbugs 0m 59s the patch passed
_ Other Tests _
+1 💚 unit 5m 30s tez-runtime-library in the patch passed.
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
55m 11s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-273/3/artifact/out/Dockerfile
GITHUB PR #273
JIRA Issue TEZ-4334
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 7d957bdff81b 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 6bd6f9c
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-273/3/testReport/
Max. process+thread count 2090 (vs. ulimit of 5500)
modules C: tez-runtime-library U: tez-runtime-library
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-273/3/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@tez-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 42s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ master Compile Tests _
+1 💚 mvninstall 16m 7s master passed
+1 💚 compile 0m 36s master passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 compile 0m 30s master passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+1 💚 checkstyle 0m 59s master passed
+1 💚 javadoc 0m 39s master passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 javadoc 0m 25s master passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+0 🆗 spotbugs 1m 28s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 25s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 21s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 javac 0m 24s the patch passed
+1 💚 compile 0m 20s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+1 💚 javac 0m 20s the patch passed
+1 💚 checkstyle 0m 13s tez-runtime-library: The patch generated 0 new + 37 unchanged - 2 fixed = 37 total (was 39)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 17s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04
+1 💚 javadoc 0m 16s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
+1 💚 findbugs 0m 58s the patch passed
_ Other Tests _
+1 💚 unit 5m 15s tez-runtime-library in the patch passed.
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
30m 32s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-273/4/artifact/out/Dockerfile
GITHUB PR #273
JIRA Issue TEZ-4334
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 302f69617d1e 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 6bd6f9c
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu122.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~22.04-b09
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-273/4/testReport/
Max. process+thread count 2091 (vs. ulimit of 5500)
modules C: tez-runtime-library U: tez-runtime-library
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-273/4/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@rbalamohan
Contributor

Good catch. This is likely to happen more often in clusters with bad nodes (in the exception code path).

LGTM. +1

@abstractdog abstractdog merged commit 25a9536 into apache:master Mar 6, 2023
zhuxt2015 pushed a commit to zhuxt2015/tez that referenced this pull request May 14, 2024
…lose() and the ShufflePenaltyReferee thread (apache#273) (Laszlo Bodor,  Sungwoo Park, reviewed by Rajesh Balamohan)

(cherry picked from commit 25a9536)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 11, 2024
…lose() and the ShufflePenaltyReferee thread (apache#273) (Laszlo Bodor,  Sungwoo Park, reviewed by Rajesh Balamohan)

(cherry picked from commit 25a9536)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
…lose() and the ShufflePenaltyReferee thread (apache#273) (Laszlo Bodor,  Sungwoo Park, reviewed by Rajesh Balamohan)

(cherry picked from commit 25a9536)
(cherry picked from commit 6b5355a)
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
…cheduler.close() and the ShufflePenaltyReferee thread (apache#273) (Laszlo Bodor, Sungwoo Park, reviewed by Rajesh Balamohan) (#17)

(cherry picked from commit 25a9536)
(cherry picked from commit 6b5355a)

Co-authored-by: Bodor Laszlo <[email protected]>
shubhluck pushed a commit to acceldata-io/tez that referenced this pull request Nov 21, 2024
…cheduler.close() and the ShufflePenaltyReferee thread (apache#273) (Laszlo Bodor, Sungwoo Park, reviewed by Rajesh Balamohan) (#17)

(cherry picked from commit 25a9536)
(cherry picked from commit 6b5355a)

Co-authored-by: Bodor Laszlo <[email protected]>