Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout and retry logic for session creation #1232

Closed
wants to merge 1 commit into from

Conversation

NickStefan
Copy link
Contributor

This never happens locally, but in CI environments, the HTTP request that gets a session between selenium and nightwatch, may hang indefinitely. This was happening to me at least 1 to 3 times across 20 test group every single run on CI.

related:
#1223

Why:

  • Without a timeout, nigthwatch can hang for 10minutes+ trying to get a session
  • Merely failing on the timeout, is unnecessary, because the selenium server is actually still responsive!

Changes:

  • add the node.js timeout event listener to lib/http/request and emit an event called timeout.
  • add a listener for a timeout event on Nightwatch.startSession.
  • converted existing anonymous callbacks to named functions, and re-pass them to a recursive retry function
  • retries are limited to 5. timeout for session is set at 15000.
  • this timeout does not affect when selenium is off_. It only affects when the server is alive, but just hanging on response. I cannot imagine how 15000 would be too short for an existing consumer of this library.
 var retries = 5;
 /* ... */
 function createRequest(){
    if (retries > 0){
      retries = retries - 1;
      var request = new HttpRequest(options);
      request
      .on('success', createSuccessCb(request))
      .on('error', errorCb)
      .on('timeout', createRequest) 
      .send();
    } else {
      self.emit('error', {});
    }
  }

  createRequest();

@NickStefan
Copy link
Contributor Author

trying to fix the broken tests:

Mocha tests with failures
    1) demoTest
    2) demoTest2


  2 passing (568ms)
  2 failing

  1) Mocha tests with failures demoTest:
     Testing if element <#badElement> is present. - Expected "present" but got: "not present"
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:16:15)
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:10:5)

  2) Mocha tests with failures demoTest2:
     Testing if element <#badElement> is present. - Expected "present" but got: "not present"
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:23:15)
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:10:5)

These seem totally unrelated. I'm having a hard time figuring out how this is related to retrying the session.

Same for this other failing tests:

536 passing (20s)
  1 failing

  1) test NightwatchIndex test session response with status success and no sessionId:
     Error: timeout of 2000ms exceeded. Ensure the done() callback is being called in this test.

@beatfactor
Copy link
Member

The only test failing is "test session response with status success and no sessionId". The Mocha test failures are intentional. I would suggest to only add the timeout handler and leave the retry out for now.

@NickStefan
Copy link
Contributor Author

NickStefan commented Oct 21, 2016

I took another stab at it, and got all of the tests passing.

Re: only merging the timeout:

Only merging the timeout logic would only solve half of the bug I'm trying to solve. I would write a test, but it only ever happens on our private travis ci. In order to originally investigate it I did something like this:

chromedriver.sh

#!/bin/bash

while true;
    do sleep 10;
    HEALTH=$(curl -sL -w "%{http_code}\\n" "localhost:9515" -o /dev/null)
    if [ $HEALTH == 000 ]; then
        echo "chromedriver dead"
        ./node_modules/.bin/chromedriver &
    fi
done

The curl health checks to chromedriver / selenium would always return alive, and the chromedriver would always note the port was busy (e.g. it was alive), and yet nightwatch's request would hang until timeout. This would happen on travis 1 to 3 times per 30 suites every CI run.

Thanks for considering!

@naissa12
Copy link

I'm also having the same problem. Can we please merge this ASAP?

@silverrajat88
Copy link

@naissa12
Copy link

does this code cover cases where you run selenium manually separate from nightwatch? ie you do not let nightwatch start selenium? I'm seeing the same issue where tests are hanging when running selenium separate from nightwatch

@NickStefan
Copy link
Contributor Author

NickStefan commented Oct 25, 2016

@naissa12 this pr takes the existing nightwatch request logic and adds timeout + retry. So it should not make a difference whether you or nightwatch spawns selenium. You could try this PR out just to see.

@silverrajat88 it may fix that issue. Assuming you npm install nightwatch, you can do a temporary hack to test this PR for your problem by pointing your package.json to this PR's commit:

nightwatch: https://github.com/NickStefan/nightwatch/tarball/patch-1

@diegoprd
Copy link

diegoprd commented Nov 1, 2016

@beatfactor I am struggling with the same problem... may we ask you to merge this asap?
Best

@diegoprd
Copy link

diegoprd commented Nov 1, 2016

also related with #1141

@beatfactor
Copy link
Member

Yes, this is the same problem as reported in #1141, and I think it should be implemented as discussed there (with a config setting and a default higher than 15s). I am still not in favour of the retry being part of this PR, especially that it doesn't have tests. I think the retry can be dealt with in a separate fix.

@NickStefan
Copy link
Contributor Author

@beatfactor Well, I think its fair to say though that people have confirmed that retying will need to be a part of some solution, right? Simply timing out on timeout, and then failing the tests, isn't a great default. Its forcing consumers of this library to care about low level http issues between test runner and selenium. Those issues really aren't a part of their own tests failing.

So I see that the linked related issue #1141 is dated from August, and is for Circle CI (?). Is there a related fix already in development that this PR is stepping on?

If this PR can't be merged as is, can you give some guidance on what sort of solution would be acceptable?

  1. If it's coding style, making certain parameters configable, I can do that! Would you mind pointing me to another part of the codebase that makes use of those preferred config objects and coding styles?
  2. For testing, it sounds like we need to get those timeouts to happen with a test for nightwatchjs own tests in this repo. Then we'll need to have some assertions that a timeout has indeed happened, been captured, and possibly handled. Without exposing this library's http module, in order to listen for the timeout, it seems like it might be tricky to prove a timeout from the test runner. Do you have any suggestions?

@diegoprd
Copy link

diegoprd commented Nov 1, 2016

@beatfactor , @NickStefan I created this PR: #1253 adding the configurable settings for timeout prop.

@beatfactor I do believe retries should be part of this solution so that we solve the continues integration issues we are facing.

@NickStefan sorry to create another PR feel free to bring anything you find useful from that one to yours and we can continue with this one.

Best

@NickStefan
Copy link
Contributor Author

NickStefan commented Nov 2, 2016

@beatfactor I am still committed to getting some tests for this, but I did go ahead and pull out the number of retry attempts and timeouts into configurable options. (added some new code and commits!)

It would work like default_path_prefix

request_timeout_options: {
    duration: 60000,
    attempts: 1
}

If its not passed, the default is a timeout of 60000, and the number of attempts is defaulted to 1 (e.g. no retries).

Is this more along the lines of what you're looking for?

@diegoprd No worries; I was already working on this update while you were working on yours :).

@beatfactor
Copy link
Member

@NickStefan yeah, it seems to be heading into the right direction. For the options, I would suggest changing "attempts" into "retry_attempts" or "retries".

@@ -165,6 +169,10 @@ module.exports = (function() {
self.emit('error', {}, response);
});

this.request.setTimeout(this.timeoutDuration || 60000, function(){
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could/should we put the default value at the top of request.js as part of the settings (line 9)?

and then we can make the setTimeout change the settings like the proxy is set up for example:

HttpRequest.setProxy = function(proxy) {
    Settings.proxy = proxy;
  };

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diegoprd Agreed. I went ahead and implemented this for the timeout. in a latest commit

data = '';
}
var attempts = (this.options.request_timeout_options || {}).attempts || 1;
var timeout = (this.options.request_timeout_options || {}).duration || 60000;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we are setting the 60000 again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gone in the latest

@NickStefan
Copy link
Contributor Author

So I was able to pull the timeout defaults and configuring deep into the http/request file. Still thinking about how (or if) I should pull the retry logic. Right now it only retries at the startSession in lib/index. Adding a retry to the http/request is a little tricky with the use of streams. Going to poke around a bit more...

@NickStefan
Copy link
Contributor Author

NickStefan commented Nov 3, 2016

So all of the ugly new code I had added to Nightwatch.prototype.startSession is gone now. All of the retry and timeout logic, defaults and config happen inhttp/request.js.

Docs PR: nightwatchjs/nightwatch-docs#60

All of the existing tests pass. I've been manually testing the retry logic against my company's CI. Tests for this repo (to prove the timeout and retry logic) will have to wait until tomorrow.

@diegoprd
Copy link

diegoprd commented Nov 4, 2016

great job @NickStefan !

@NickStefan
Copy link
Contributor Author

We're on an old version of nock (0.47 out of 9.0.0) and in order to test for timeouts we need to use their latest version. Nock permanently changed a few things at version 0.48. In order to develop the tests here, this PR is pointing at a fork of nock (off 9.0.0): nock/nock#744 . Hopefully, they'll merge... if not there's no real solution besides a huge reworking of the nightwatch tests...

Otherwise, this PR is ready @beatfactor

@beatfactor
Copy link
Member

Could you also squash your commits?

@@ -4,6 +4,7 @@ var Logger = common.require('util/logger');
var nock = require('nock');
var assert = require('assert');
var mockery = require('mockery');
var once = require('lodash.once');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this dependency be avoided? In any case it should be added to devDependencies.

var seleniumHost = this.options.seleniumHost || this.options.selenium_host;
var useSSL = this.options.useSsl || this.options.use_ssl;
var proxy = this.options.proxy;
var timeoutOptions = (this.options.request_timeout_options || {});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary brackets.

@@ -165,6 +169,16 @@ module.exports = (function() {
self.emit('error', {}, response);
});

this.request.setTimeout(this.timeout, function(){
if (self.retryAttempts) {
console.log('retrying...');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use the Logger here.

sessionId: '123456'
};

HttpRequest.setTimeout(1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we could do with a smaller timeout here (e.g. 100ms)? One full second adds quite a bit to the tests run.

@beatfactor
Copy link
Member

What is making the latest nock version incompatible with our tests? And also you need to edit the commit message to be something descriptive.

@beatfactor
Copy link
Member

We could point to a fork of nock if necessary, but maybe they might still merge your PR? There's still the issue with the 1s timeout and the use of lodash.once that I'd like we address.

@DoHe it would be quite useful I think if you could get a verbose log of the retry problem and even a selenium debug log. Do you think that's possible?

@DoHe
Copy link

DoHe commented Nov 15, 2016

@beatfactor Of course. I created a gist with both logs here. At the end of the logs you can also see what happens when I kill the selenium server after the tests freeze. I have test_workers set to true, maybe that's connected (although I'm running a single file, so it shouldn't make much of a difference). Hope that helps.

@DoHe
Copy link

DoHe commented Nov 15, 2016

In the documentation it says on a time out we have to close the socket manually. I'll try to write a fix for that myself.

@DoHe
Copy link

DoHe commented Nov 15, 2016

@NickStefan I opened a PR against your branch that fixed that problem for me.

@beatfactor
Copy link
Member

@DoHe is the problem of the nightwatch runner hanging at the end also fixed but your PR?

@DoHe
Copy link

DoHe commented Nov 15, 2016

@beatfactor It's the only thing it fixes (in comparison to this PR)

The Nightwatch.startSession http action often times out despite the selenium server
being totally responsive. This PR adds timeout and retry parameters to
the http module, allowing the tests to continue despite a failed session
attempt.
@NickStefan
Copy link
Contributor Author

@DoHe I added the this.socket.unref, back in, but with a guard. Nock just doesn't implement that method.

@beatfactor removed the lodash.once dependency. also set the timeout in the testing to much lower than 1000ms (200ms).

@DoHe
Copy link

DoHe commented Nov 17, 2016

@NickStefan it might be nicer to add a dummy unref to the mocked socket and not have an if in the actual code. I have never worked with nock so I don't know how feasible that would be.

@naissa12
Copy link

Hey nick I tried your code but the tests are still hanging when trying to request session from selenium. What's the best way to handle?

@NickStefan
Copy link
Contributor Author

@naissa12 did you set the options? I made a PR to the docs nightwatchjs/nightwatch-docs#60

request_timeout_options: {
    timeout: 10000,
    retry_attempts: 5
}

retry attempts defaults to 0, so you need to set the options to get this PR to do anything.

@joel-airspring
Copy link
Contributor

There hasn't been any movement on this PR in a while, so I thought I'd mention that I was pursuing what I think is the same problem but from the Chrome side. I believe I've proved that this failure to get sessions began with Chrome 53 (Linux only), but I'd be interested to know if anyone has other data.

  1. Here's the case I created: https://bugs.chromium.org/p/chromedriver/issues/detail?id=1620
  2. Can anyone confirm (or disconfirm) my Linux Chrome 53+ hypothesis?
  3. My testing environment in CentOS 7.2. What versions of Linux are others seeing this problem on? On any other Linux distros?

@DoHe
Copy link

DoHe commented Dec 7, 2016

@joel-airspring I ran your script on my setup where I experienced problems with nightwatch: Ubuntu 16.04.1 LTS on EC2, using xvfb, Chrome version 55.0.2883.75 and chromedriver version 2.25.426924
I ran it three times and it always ran through.

@beatfactor
Copy link
Member

Merged this into master and managed to rewrite the tests without nock: f3bc41c. Please have a look, it will be released soon in npm.

@NickStefan
Copy link
Contributor Author

👏 nice! Good to hear we got around nock.

@beatfactor
Copy link
Member

Added to v0.9.11.

@beatfactor beatfactor closed this Dec 14, 2016
@DoHe
Copy link

DoHe commented Dec 21, 2016

@beatfactor I'm not sure your rewritten version is actually doing the same as this PR. Since moving from this branch to 0.9.11., we're getting following error (that we didn't have before):
{ [Error: socket hang up] code: 'ECONNRESET' }
I think this might be connected to using self.request.abort instead of using this.socket.unref. I remember seeing similar problems, when I wrote the fix with unref.
It could of course also be unconnected to this change. @NickStefan does the new version fix the problem reliably for you?

@beatfactor
Copy link
Member

Yes, the socket hang up is from aborting the request, which is timed out. The abort is a more suitable way of destroying the socket. Maybe the error message should be improved, but the request should be retried regardless of the error being there or not. Isn't that the case for you?

@DoHe
Copy link

DoHe commented Dec 21, 2016

It does not look as if it's retrying (I haven't had time to investigate in too much detail yet, but it happens on our productive CI tests quite regularly):

functional/tests/web Error retrieving a new session from the selenium server
functional/tests/web
functional/tests/web Connection refused! Is selenium server started?
functional/tests/web { [Error: socket hang up] code: 'ECONNRESET' }
functional/tests/web
>> functional/tests/web finished.

As you can see there is no tally of the number of successful tests.
And at the very end:

./nightwatch -e chrome returned exit code 1

@beatfactor
Copy link
Member

That's odd, retrying should work. How does your config look like?

@DoHe
Copy link

DoHe commented Dec 21, 2016

{
    "src_folders": ["functional"],
    "output_folders": "functional/reports",

    "live_output" : true,

    "selenium": {
        "start_process": true,
        "server_path": "node_modules/selenium-server-standalone-jar/jar/selenium-server-standalone-3.0.1.jar",
        "log_path": "./",
        "host:": "127.0.0.1",
        "port": 4444,
        "cli_args": {
            "webdriver.chrome.driver": "node_modules/chromedriver/bin/chromedriver"
        }
    },

    "test_settings": {
        "default": {
            "selenium_host": "localhost",
            "selenium_port": 4444,
            "request_timeout_options": {
                "timeout": 20000,
                "retry_attempts": 10
            },
            "silent": true,
            "screenshots": {
                "enabled": false,
                "path": ""
            },
            "globals": {
                "waitForConditionTimeout": 10000
            },
            "end_session_on_fail": true,
            "test_workers": true
        },

        "chrome": {
            "desiredCapabilities": {
                "browserName": "chrome"
            }
        }
    }
}

@NickStefan
Copy link
Contributor Author

@DoHe I haven't had a chance to try out v0.9.11 yet. Our Ci is still running the PR / fork.

It's on my todo list, but ive got a lot of other features to work on! Currently wrestling with celery and rabbitmq on our backend (my preferred home 😉 )

@beatfactor
Copy link
Member

@DoHe Hmm, so it looks like the runner is exiting at the socket hang up error. I'll look into it.

@beatfactor
Copy link
Member

@DoHe Pushed a fix for the problem you reported, in 0.9.12. Thanks.

@DoHe
Copy link

DoHe commented Jan 4, 2017

@beatfactor Thanks for that, it works perfectly for us now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants