-
Notifications
You must be signed in to change notification settings - Fork 440
TEZ-4460: Read timed out in shuffle handler - incorrect usage of EMPTY_LAST_CONTENT and channel write #257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
8d77344 to
1aa0906
Compare
This comment was marked as outdated.
This comment was marked as outdated.
84dd081 to
9ad016b
Compare
This comment was marked as outdated.
This comment was marked as outdated.
9ad016b to
a5ba17c
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
🎊 +1 overall
This message was automatically generated. |
|
@jteagles : can you please review this? serious bug with shufflehandler, unit test added, tested on cluster |
|
@shameersss1 , @rbalamohan : can you take a look at this? serious issue, leftover from netty3->netty4 upgrade
|
| } | ||
| int waitCount = this.reduceContext.getMapsToWait().decrementAndGet(); | ||
| if (waitCount == 0) { | ||
| LOG.debug("Finished with all map outputs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is in the hotpath should we enclose this in
if (LOG.isDebugEnabled()) { } ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this happens once per every shuffle request I guess, and logging parameters don't include expensive operations, so LOG.isDebugEnabled vs. LOG.debug is mostly a method call vs. method call, I don't feel we need to be extremely cautious in this case
| int waitCount = this.reduceContext.getMapsToWait().decrementAndGet(); | ||
| if (waitCount == 0) { | ||
| LOG.debug("Finished with all map outputs"); | ||
| ch.writeAndFlush(LastHttpContent.EMPTY_LAST_CONTENT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add comment why this is required here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, this is the most important part of this patch, added a commit with a code comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this issue mainly due to nettty upgrade (4.x?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, absolutely, this issue is because of the incorrect usage of netty4 APIs (investigation details are on Jira ticket)
most interestingly, there were no unit tests that showed this issue so far (added one now), which reproduces when we fetch more inputs in the same request: due to this issue, the new UT completely hung, and a real TPCDS query on the cluster became very slow, as composite fetch requests hung and timed out eventually (didn't cause a query failure, just an extremely slow query)
|
In general the changes LGTM +1, Let's re-run the tests as well |
…Y_LAST_CONTENT and channel write
a5ba17c to
e321bb7
Compare
|
🎊 +1 overall
This message was automatically generated. |
shameersss1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
|
@abstractdog |
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c) (cherry picked from commit 0c3cbb1)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c) (cherry picked from commit 0c3cbb1)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c) (cherry picked from commit 0c3cbb1)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c) (cherry picked from commit 0c3cbb1)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c) (cherry picked from commit 0c3cbb1)
…Y_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (cherry picked from commit 6bd6f9c) (cherry picked from commit 0c3cbb1) (cherry picked from commit 798abf2)
…ge of EMPTY_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (#22)
…ge of EMPTY_LAST_CONTENT and channel write (apache#257) (Laszlo Bodor reviewed by Rajesh Balamohan, Syed Shameerur Rahman) (#22)
actual fixes are "EMPTY_LAST_CONTENT" and "ch.writeAndFlush"
tested with a couple of queries on TPCDS 100GB (which had frequent timeouts before the patch)