-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cluster: fix broken/hanging tests #1934
Conversation
setTimeout(function() { | ||
socket.write('.'); | ||
connectionDone = true; | ||
setTimeout(function(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wfm on linux, what system is this hanging on? why is this timeout needed? putting random timeouts in to solve race conditions is fragile, and in this case, I don't see what is being waited for... the worker is listening, the tcp connection is established... what's the pause for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, see this comment: #1400 (comment)
ADD: This is because of internal implementation details of SCHED_RR scheduling policy.
Again, this is not imaginary problem. See: #1896 (comment)
And I can reproduce this at least in my Mac OS from current master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Olegas I didn't say it was imaginary! I said that inserting arbitrary timeouts to solve undescribed race conditions is poor practice. You didn't answer my question: what is the race condition?
Can you describe what ordering of scheduling events is causing this to break on OS X? Usually, rather than timeouts, its possible to explicitly wait for precise events to avoid race conditions and timeouts. If node lacks such events, then timeouts are our last resort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed my previous comment. I found an error in my explanation.
This test has been timing out on multiple buildbots on several occasions as of late. |
@jbergstroem Any idea how to convince ci to build a PR? I tried putting the pr number in the paramaterized build, and I tried removing master, and putting the PR number in, and it always seems to be the master that is tested. |
https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/802/, for example. Anyhow, I'm fine with merging this, but since I didn't run it through CI last time, maybe I should this time. Thoughts, @Fishrock123 @jbergstroem ? |
@sam-github I always go for user/branch when setting up the parameterised build -- in this case "Olegas" as user and "fix-cluster-test" for branch. Edit: cancel the current run and get a new one going with above suggestion. Lets see how that one pans out! |
OK, figured it out, test running here: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/806/ |
I got a lint issue:
Edit: some of the bots seems to still time out (freebsd101-32, osx1010, freebsd101-64, centos5-32, smartos14-32, smartos14-64) |
setTimeout(function() { | ||
socket.write('.'); | ||
connectionDone = true; | ||
worker.disconnect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what's the difference between using setTimeout
vs passing a callback to .disconnect()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically have it how it was originally except put the call into the .disconnect()
. if that callback is never called then there must be an issue with the patch itself on some platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the issue with my approach. still working through it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trevnorris the disconnect callback will never be called, not while there is an open connection to the worker. the test calls disconnect, and waits one second to confirm that disconnect did NOT in fact, cause disconnection... only when the socket is closed, triggered by the write('.'), does the disconnect proceed.
@jbergstroem I think we should revert 9c0a1b8 on master, and re-merge this PR when CI is good. Is that the right thing here? |
@Olegas I cleaned up the lint when I merged the first version, you should |
I have an alternative test that gets around the need for the |
@trevnorris sounds good. |
@sam-github that is normally the correct thing to do. unless it's critical (e.g. a patch of mine recently caused a segfault when using one module) it's not abnormal that we give the author of the patch warning and a day or so to figure it out. not uncommon that a small thing was missed and can be easily fixed. |
CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/808/ Feel free to review my changes to the test: Change: trevnorris@768e6b9 |
crap. scrap that one. missed a '.' in a path. canceling last one, and might have brought down jenkins in the process... |
@trevnorris investigating jenkins - stalling over here as well. Unfortunately lack access to the host machine though (for now). |
Thanks. Working again. New CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/809/ |
@trevnorris its late, I must be misreading your version: no writes occur except in on data handlers... which is a chicken or the egg scenario, noone will ever write. The essential test script is this:
So, some kind of timeout is necessary... unless you wait a bit, how do you know something has not/will not happen? |
Still seems to be failing [by timeout], unfortunately. Output:
|
yup. didn't commit a line I changed. this is what I get for coding so late. have a fix. one min. @sam-github you are correct. |
Alright, one last time: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/810/ |
@sam-github @trevnorris I don't understand the difference ( And it seems we have another problem here... I've picked @trevnorris 's test and run it - evething is OK, it passed. Then I just comment out ALL I've started investigation and now I can leave all |
@Olegas what platform? |
Is there any native code involved here? That kind of problem usually shows up when doing |
@kkoopa no native code. here's some interesting output showing the test pass on freebsd when run directly, but then fails running it through I'm running another slight variation of the test now to get a better idea of where it's failing. |
@trevnorris Mac OS X |
@Olegas Thanks. Here's the latest run from my test: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/812/nodes=centos5-32/tapTestReport/test.tap-105/ So it seems there's an issue with the server accepting the client's connection. I'm not sure where to go from there in terms of troubleshooting. /cc @jbergstroem |
I'm testing @Olegas 's theory that the connection is accepted by the cluster master, but the sockfd is not sent to the worker yet: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/813/ |
My branch passed on osx 1010! That looks promising, but I have to go catch the bus to nodeconf now. |
Will the "onconnection" callback for the server not fire in that case? |
I've added Next, I can't see the
|
So, I think there are two problems:
|
My test shows the server never actually accepts the connection. Otherwise messages would have been passed between them. I don't see how that relates to (2) and unsure if it could relate to (1). |
It sounds like the bug is hitting problems with robustness of master/worker protocol by doing a disconnect during a connection. |
Can confirm that the commit in |
replaced by #1953 |
Fixes: #1933