@abstractdog
Contributor

@abstractdog abstractdog commented Dec 9, 2022

  1. Fixed issues in ShuffleHandler.
    The actual fixes are the use of "EMPTY_LAST_CONTENT" and "ch.writeAndFlush".
  2. Refactored TestShuffleHandler and introduced a new test case that exercises the ShuffleHandler thoroughly.

Tested with a couple of queries on TPCDS 100GB (which had frequent timeouts before the patch).

@tez-yetus

This comment was marked as outdated.

@abstractdog abstractdog force-pushed the TEZ-4460 branch 6 times, most recently from 84dd081 to 9ad016b Compare December 10, 2022 22:04
@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 35m 23s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 15m 59s master passed
+1 💚 compile 0m 30s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 compile 0m 28s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 checkstyle 0m 58s master passed
+1 💚 javadoc 0m 34s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 22s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+0 🆗 spotbugs 1m 9s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 5s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 22s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javac 0m 22s the patch passed
+1 💚 compile 0m 18s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 javac 0m 17s the patch passed
-0 ⚠️ checkstyle 0m 11s tez-plugins/tez-aux-services: The patch generated 1 new + 65 unchanged - 2 fixed = 66 total (was 67)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 14s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 13s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 findbugs 0m 45s the patch passed
_ Other Tests _
+1 💚 unit 3m 4s tez-aux-services in the patch passed.
+1 💚 asflicense 0m 13s The patch does not generate ASF License warnings.
61m 59s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/6/artifact/out/Dockerfile
GITHUB PR #257
JIRA Issue TEZ-4460
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 510f31b45ab0 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 34d6810
Default Java Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/6/artifact/out/diff-checkstyle-tez-plugins_tez-aux-services.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/6/testReport/
Max. process+thread count 1399 (vs. ulimit of 5500)
modules C: tez-plugins/tez-aux-services U: tez-plugins/tez-aux-services
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/6/console
versions git=2.25.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@abstractdog abstractdog requested a review from jteagles December 11, 2022 07:20
@abstractdog
Contributor Author

abstractdog commented Dec 11, 2022

@jteagles: can you please review this? It is a serious bug in ShuffleHandler; a unit test has been added, and the patch was tested on a cluster.

@abstractdog
Contributor Author

@shameersss1, @rbalamohan: can you take a look at this? It is a serious issue, a leftover from the netty3->netty4 upgrade.
The actual fix is small, a couple of lines in ShuffleHandler.

  • Added a unit test that hangs without my patch (the very same way it did on the cluster).

}
int waitCount = this.reduceContext.getMapsToWait().decrementAndGet();
if (waitCount == 0) {
LOG.debug("Finished with all map outputs");
Contributor

@shameersss1 shameersss1 Feb 15, 2023

Since this is in the hot path, should we enclose this in
if (LOG.isDebugEnabled()) { } ?

Contributor Author

I guess this happens once per shuffle request, and the logging call has no expensive parameters, so LOG.isDebugEnabled vs. LOG.debug is mostly one method call vs. another. I don't feel we need to be extremely cautious in this case.
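
The trade-off discussed above can be illustrated with a small, self-contained sketch. It uses `java.util.logging` as a stand-in for the logger in ShuffleHandler (an assumption made only so the example runs without extra dependencies), and `expensiveSummary()` is a hypothetical helper. The point it shows: a constant message costs nothing beyond the level check, while an explicit guard only pays off when building the message is itself expensive.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

class DebugGuardSketch {
    // java.util.logging stands in for the real logger; FINE plays the role of DEBUG.
    static final Logger LOG = Logger.getLogger("DebugGuardSketch");
    // Counts how often the (hypothetical) expensive message argument is built.
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    static {
        // Debug-level output is disabled, as it typically is in production.
        LOG.setLevel(Level.INFO);
    }

    // Hypothetical helper representing a costly computation for a log message.
    static String expensiveSummary() {
        expensiveCalls.incrementAndGet();
        return "summary";
    }

    // Constant message: nothing is evaluated besides the level check inside fine().
    static void logConstant() {
        LOG.fine("Finished with all map outputs");
    }

    // Unguarded call with an expensive argument: the argument is built
    // before fine() can discard the message.
    static void logUnguarded() {
        LOG.fine("state: " + expensiveSummary());
    }

    // Guarded call: the expensive argument is built only when the level is enabled.
    static void logGuarded() {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("state: " + expensiveSummary());
        }
    }

    public static void main(String[] args) {
        logConstant();
        System.out.println("after constant:  " + expensiveCalls.get());
        logUnguarded();
        System.out.println("after unguarded: " + expensiveCalls.get());
        logGuarded();
        System.out.println("after guarded:   " + expensiveCalls.get());
    }
}
```

For the constant-message case in this review thread, the guard would only replace one cheap check with another, which matches the reasoning given in the reply.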

int waitCount = this.reduceContext.getMapsToWait().decrementAndGet();
if (waitCount == 0) {
LOG.debug("Finished with all map outputs");
ch.writeAndFlush(LastHttpContent.EMPTY_LAST_CONTENT);
Contributor

Can we add a comment explaining why this is required here?

Contributor Author

Ack, this is the most important part of this patch; I added a commit with a code comment.

Contributor

Is this issue mainly due to the Netty upgrade (4.x)?

Contributor Author

@abstractdog abstractdog Feb 21, 2023

Yes, absolutely: this issue is caused by incorrect usage of the netty4 APIs (investigation details are on the Jira ticket).
Most interestingly, no unit test had exposed this issue so far (I added one now). It reproduces when we fetch multiple inputs in the same request: because of this issue, the new unit test hung completely, and a real TPCDS query on the cluster became very slow, as composite fetch requests hung and eventually timed out (this didn't cause a query failure, just an extremely slow query).
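
The countdown logic quoted in this thread can be sketched without Netty as follows. This is a simplified stand-in, not the ShuffleHandler code: the `written` list plays the role of the channel, `END_MARKER` plays the role of `LastHttpContent.EMPTY_LAST_CONTENT`, and the class and method names are hypothetical. It shows why a request for several map outputs only looks complete to the client once the last output has been sent and the end marker has been written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class LastContentSketch {
    // Stand-in for LastHttpContent.EMPTY_LAST_CONTENT (assumption for illustration).
    static final String END_MARKER = "<END>";

    // Number of map outputs still to be sent for this request.
    final AtomicInteger mapsToWait;
    // Stand-in for the channel: records everything "written" to the client.
    final List<String> written = new ArrayList<>();

    LastContentSketch(int mapCount) {
        this.mapsToWait = new AtomicInteger(mapCount);
    }

    // Called once per map output; terminates the response after the last one.
    void onMapOutputSent(String mapId) {
        written.add("data:" + mapId);
        int waitCount = mapsToWait.decrementAndGet();
        if (waitCount == 0) {
            // Without this marker the client cannot tell that the response
            // has ended and keeps waiting until it times out.
            written.add(END_MARKER);
        }
    }

    // The client-side view: the response is complete only if the marker arrived last.
    boolean responseComplete() {
        return !written.isEmpty()
            && END_MARKER.equals(written.get(written.size() - 1));
    }

    public static void main(String[] args) {
        LastContentSketch request = new LastContentSketch(3);
        request.onMapOutputSent("map_0");
        request.onMapOutputSent("map_1");
        System.out.println("complete after 2 of 3: " + request.responseComplete());
        request.onMapOutputSent("map_2");
        System.out.println("complete after 3 of 3: " + request.responseComplete());
    }
}
```

In the real handler the marker is written with `ch.writeAndFlush(...)`, as shown in the diff context above; the sketch only models the ordering.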

@shameersss1
Contributor

In general the changes LGTM, +1. Let's re-run the tests as well.

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 20s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 15m 33s master passed
+1 💚 compile 0m 30s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 compile 0m 30s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 checkstyle 1m 0s master passed
+1 💚 javadoc 0m 37s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 0m 25s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+0 🆗 spotbugs 1m 3s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 2s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 24s the patch passed
+1 💚 compile 0m 16s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javac 0m 16s the patch passed
+1 💚 compile 0m 15s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 javac 0m 15s the patch passed
-0 ⚠️ checkstyle 0m 10s tez-plugins/tez-aux-services: The patch generated 1 new + 65 unchanged - 2 fixed = 66 total (was 67)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 12s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 0m 11s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 findbugs 0m 33s the patch passed
_ Other Tests _
+1 💚 unit 2m 56s tez-aux-services in the patch passed.
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
26m 6s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/10/artifact/out/Dockerfile
GITHUB PR #257
JIRA Issue TEZ-4460
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 10ddf931d34a 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / be99489
Default Java Private Build-1.8.0_352-8u352-ga-1~22.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~22.04-b08
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/10/artifact/out/diff-checkstyle-tez-plugins_tez-aux-services.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/10/testReport/
Max. process+thread count 1519 (vs. ulimit of 5500)
modules C: tez-plugins/tez-aux-services U: tez-plugins/tez-aux-services
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-257/10/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@shameersss1 shameersss1 left a comment

LGTM +1

@abstractdog abstractdog merged commit 6bd6f9c into apache:master Feb 28, 2023
@BsoBird

BsoBird commented Nov 8, 2023

@abstractdog
Hi, could you explain why request.release() was removed? It seems that removing request.release() causes a leak in Netty.

     public void channelRead(ChannelHandlerContext ctx, Object message)
         throws Exception {
-      FullHttpRequest request = (FullHttpRequest) message;
+      HttpRequest request = (HttpRequest) message;
       handleRequest(ctx, request);
-      request.release();
     }
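
For context on the concern raised here: Netty 4 reference-counts messages that carry buffers, and an inbound handler that consumes such a message without passing it on is expected to release it. The sketch below illustrates that general discipline only. `CountedBuffer` and `handle` are hypothetical stand-ins, not Netty types, and the example does not establish whether the change in this PR actually leaks.

```java
class RefCountSketch {
    // Hypothetical stand-in for a reference-counted message; not a Netty class.
    static final class CountedBuffer {
        private int refCnt = 1; // a freshly created message starts with one reference

        int refCnt() {
            return refCnt;
        }

        // Drops one reference; returns true when the buffer can be reclaimed.
        boolean release() {
            if (refCnt <= 0) {
                throw new IllegalStateException("already released");
            }
            refCnt--;
            return refCnt == 0;
        }
    }

    // The last consumer of a message releases it, even if processing fails.
    static void handle(CountedBuffer message, Runnable work) {
        try {
            work.run();
        } finally {
            message.release();
        }
    }

    public static void main(String[] args) {
        CountedBuffer released = new CountedBuffer();
        handle(released, () -> { });
        System.out.println("refCnt with release:    " + released.refCnt());

        CountedBuffer leaked = new CountedBuffer();
        // Processing without release: the count never reaches zero.
        System.out.println("refCnt without release: " + leaked.refCnt());
    }
}
```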

prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 11, 2024
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman)

(cherry picked from commit 6bd6f9c)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 12, 2024
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman)

(cherry picked from commit 6bd6f9c)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 12, 2024
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman)

(cherry picked from commit 6bd6f9c)
(cherry picked from commit 0c3cbb1)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman)

(cherry picked from commit 6bd6f9c)
(cherry picked from commit 0c3cbb1)
(cherry picked from commit 798abf2)
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
…ge of EMPTY_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (#22)
shubhluck pushed a commit to acceldata-io/tez that referenced this pull request Nov 21, 2024
…ge of EMPTY_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (#22)