Add timeout and retry logic for session creation #1232

NickStefan · 2016-10-21T04:58:03Z

This never happens locally, but in CI environments, the HTTP request that gets a session between selenium and nightwatch, may hang indefinitely. This was happening to me at least 1 to 3 times across 20 test group every single run on CI.

related:
#1223

Why:

Without a timeout, nigthwatch can hang for 10minutes+ trying to get a session
Merely failing on the timeout, is unnecessary, because the selenium server is actually still responsive!

Changes:

add the node.js timeout event listener to lib/http/request and emit an event called timeout.
add a listener for a timeout event on Nightwatch.startSession.
converted existing anonymous callbacks to named functions, and re-pass them to a recursive retry function
retries are limited to 5. timeout for session is set at 15000.
this timeout does not affect when selenium is off_. It only affects when the server is alive, but just hanging on response. I cannot imagine how 15000 would be too short for an existing consumer of this library.

 var retries = 5;
 /* ... */
 function createRequest(){
    if (retries > 0){
      retries = retries - 1;
      var request = new HttpRequest(options);
      request
      .on('success', createSuccessCb(request))
      .on('error', errorCb)
      .on('timeout', createRequest) 
      .send();
    } else {
      self.emit('error', {});
    }
  }

  createRequest();

NickStefan · 2016-10-21T05:33:27Z

trying to fix the broken tests:

Mocha tests with failures
    1) demoTest
    2) demoTest2


  2 passing (568ms)
  2 failing

  1) Mocha tests with failures demoTest:
     Testing if element <#badElement> is present. - Expected "present" but got: "not present"
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:16:15)
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:10:5)

  2) Mocha tests with failures demoTest2:
     Testing if element <#badElement> is present. - Expected "present" but got: "not present"
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:23:15)
      at Context.<anonymous> (test/mochatests/sampleWithFailures.js:10:5)

These seem totally unrelated. I'm having a hard time figuring out how this is related to retrying the session.

Same for this other failing tests:

536 passing (20s)
  1 failing

  1) test NightwatchIndex test session response with status success and no sessionId:
     Error: timeout of 2000ms exceeded. Ensure the done() callback is being called in this test.

beatfactor · 2016-10-21T09:01:31Z

The only test failing is "test session response with status success and no sessionId". The Mocha test failures are intentional. I would suggest to only add the timeout handler and leave the retry out for now.

NickStefan · 2016-10-21T21:27:16Z

I took another stab at it, and got all of the tests passing.

Re: only merging the timeout:

Only merging the timeout logic would only solve half of the bug I'm trying to solve. I would write a test, but it only ever happens on our private travis ci. In order to originally investigate it I did something like this:

chromedriver.sh

#!/bin/bash

while true;
    do sleep 10;
    HEALTH=$(curl -sL -w "%{http_code}\\n" "localhost:9515" -o /dev/null)
    if [ $HEALTH == 000 ]; then
        echo "chromedriver dead"
        ./node_modules/.bin/chromedriver &
    fi
done

The curl health checks to chromedriver / selenium would always return alive, and the chromedriver would always note the port was busy (e.g. it was alive), and yet nightwatch's request would hang until timeout. This would happen on travis 1 to 3 times per 30 suites every CI run.

Thanks for considering!

naissa12 · 2016-10-24T04:44:52Z

I'm also having the same problem. Can we please merge this ASAP?

silverrajat88 · 2016-10-24T11:25:44Z

@NickStefan , Do you think your PR will fix https://github.com/nightwatchjs/nightwatch/issues/1198 ?

naissa12 · 2016-10-24T20:14:28Z

does this code cover cases where you run selenium manually separate from nightwatch? ie you do not let nightwatch start selenium? I'm seeing the same issue where tests are hanging when running selenium separate from nightwatch

NickStefan · 2016-10-25T00:12:21Z

@naissa12 this pr takes the existing nightwatch request logic and adds timeout + retry. So it should not make a difference whether you or nightwatch spawns selenium. You could try this PR out just to see.

@silverrajat88 it may fix that issue. Assuming you npm install nightwatch, you can do a temporary hack to test this PR for your problem by pointing your package.json to this PR's commit:

nightwatch: https://github.com/NickStefan/nightwatch/tarball/patch-1

diegoprd · 2016-11-01T21:42:57Z

@beatfactor I am struggling with the same problem... may we ask you to merge this asap?
Best

diegoprd · 2016-11-01T21:53:15Z

also related with #1141

beatfactor · 2016-11-01T22:09:16Z

Yes, this is the same problem as reported in #1141, and I think it should be implemented as discussed there (with a config setting and a default higher than 15s). I am still not in favour of the retry being part of this PR, especially that it doesn't have tests. I think the retry can be dealt with in a separate fix.

NickStefan · 2016-11-01T22:52:54Z

@beatfactor Well, I think its fair to say though that people have confirmed that retying will need to be a part of some solution, right? Simply timing out on timeout, and then failing the tests, isn't a great default. Its forcing consumers of this library to care about low level http issues between test runner and selenium. Those issues really aren't a part of their own tests failing.

So I see that the linked related issue #1141 is dated from August, and is for Circle CI (?). Is there a related fix already in development that this PR is stepping on?

If this PR can't be merged as is, can you give some guidance on what sort of solution would be acceptable?

If it's coding style, making certain parameters configable, I can do that! Would you mind pointing me to another part of the codebase that makes use of those preferred config objects and coding styles?
For testing, it sounds like we need to get those timeouts to happen with a test for nightwatchjs own tests in this repo. Then we'll need to have some assertions that a timeout has indeed happened, been captured, and possibly handled. Without exposing this library's http module, in order to listen for the timeout, it seems like it might be tricky to prove a timeout from the test runner. Do you have any suggestions?

diegoprd · 2016-11-01T23:23:06Z

@beatfactor , @NickStefan I created this PR: #1253 adding the configurable settings for timeout prop.

@beatfactor I do believe retries should be part of this solution so that we solve the continues integration issues we are facing.

@NickStefan sorry to create another PR feel free to bring anything you find useful from that one to yours and we can continue with this one.

Best

NickStefan · 2016-11-02T00:12:40Z

@beatfactor I am still committed to getting some tests for this, but I did go ahead and pull out the number of retry attempts and timeouts into configurable options. (added some new code and commits!)

It would work like default_path_prefix

request_timeout_options: {
    duration: 60000,
    attempts: 1
}

If its not passed, the default is a timeout of 60000, and the number of attempts is defaulted to 1 (e.g. no retries).

Is this more along the lines of what you're looking for?

@diegoprd No worries; I was already working on this update while you were working on yours :).

beatfactor · 2016-11-02T14:09:13Z

@NickStefan yeah, it seems to be heading into the right direction. For the options, I would suggest changing "attempts" into "retry_attempts" or "retries".

diegoprd · 2016-11-02T14:36:36Z

lib/http/request.js

@@ -165,6 +169,10 @@ module.exports = (function() {
      self.emit('error', {}, response);
    });

+    this.request.setTimeout(this.timeoutDuration || 60000, function(){


could/should we put the default value at the top of request.js as part of the settings (line 9)?

and then we can make the setTimeout change the settings like the proxy is set up for example:

HttpRequest.setProxy = function(proxy) { Settings.proxy = proxy; };

@diegoprd Agreed. I went ahead and implemented this for the timeout. in a latest commit

diegoprd · 2016-11-02T14:52:05Z

lib/index.js

-      data = '';
-    }
+  var attempts = (this.options.request_timeout_options || {}).attempts || 1;
+  var timeout = (this.options.request_timeout_options || {}).duration || 60000;


here we are setting the 60000 again.

Gone in the latest

NickStefan · 2016-11-02T18:39:29Z

So I was able to pull the timeout defaults and configuring deep into the http/request file. Still thinking about how (or if) I should pull the retry logic. Right now it only retries at the startSession in lib/index. Adding a retry to the http/request is a little tricky with the use of streams. Going to poke around a bit more...

NickStefan · 2016-11-03T00:37:43Z

So all of the ugly new code I had added to Nightwatch.prototype.startSession is gone now. All of the retry and timeout logic, defaults and config happen inhttp/request.js.

Docs PR: nightwatchjs/nightwatch-docs#60

All of the existing tests pass. I've been manually testing the retry logic against my company's CI. Tests for this repo (to prove the timeout and retry logic) will have to wait until tomorrow.

diegoprd · 2016-11-04T15:17:43Z

great job @NickStefan !

NickStefan · 2016-11-04T22:39:53Z

We're on an old version of nock (0.47 out of 9.0.0) and in order to test for timeouts we need to use their latest version. Nock permanently changed a few things at version 0.48. In order to develop the tests here, this PR is pointing at a fork of nock (off 9.0.0): nock/nock#744 . Hopefully, they'll merge... if not there's no real solution besides a huge reworking of the nightwatch tests...

Otherwise, this PR is ready @beatfactor

beatfactor · 2016-11-05T13:07:33Z

Could you also squash your commits?

beatfactor · 2016-11-05T13:09:43Z

test/src/http/testRequest.js

@@ -4,6 +4,7 @@ var Logger = common.require('util/logger');
 var nock = require('nock');
 var assert = require('assert');
 var mockery = require('mockery');
+var once = require('lodash.once');


can't this dependency be avoided? In any case it should be added to devDependencies.

beatfactor · 2016-11-05T13:10:09Z

lib/index.js

+  var seleniumHost     = this.options.seleniumHost || this.options.selenium_host;
+  var useSSL           = this.options.useSsl || this.options.use_ssl;
+  var proxy            = this.options.proxy;
+  var timeoutOptions   = (this.options.request_timeout_options || {});


unnecessary brackets.

beatfactor · 2016-11-05T13:10:34Z

lib/http/request.js

@@ -165,6 +169,16 @@ module.exports = (function() {
      self.emit('error', {}, response);
    });

+    this.request.setTimeout(this.timeout, function(){
+      if (self.retryAttempts) {
+        console.log('retrying...');


Should use the Logger here.

beatfactor · 2016-11-10T20:20:14Z

test/src/http/testRequest.js

+        sessionId: '123456'
+      };
+
+      HttpRequest.setTimeout(1000);


Any chance we could do with a smaller timeout here (e.g. 100ms)? One full second adds quite a bit to the tests run.

beatfactor · 2016-11-10T20:24:15Z

What is making the latest nock version incompatible with our tests? And also you need to edit the commit message to be something descriptive.

beatfactor · 2016-11-14T19:36:41Z

We could point to a fork of nock if necessary, but maybe they might still merge your PR? There's still the issue with the 1s timeout and the use of lodash.once that I'd like we address.

@DoHe it would be quite useful I think if you could get a verbose log of the retry problem and even a selenium debug log. Do you think that's possible?

DoHe · 2016-11-15T09:40:07Z

@beatfactor Of course. I created a gist with both logs here. At the end of the logs you can also see what happens when I kill the selenium server after the tests freeze. I have test_workers set to true, maybe that's connected (although I'm running a single file, so it shouldn't make much of a difference). Hope that helps.

DoHe · 2016-11-15T10:59:19Z

In the documentation it says on a time out we have to close the socket manually. I'll try to write a fix for that myself.

DoHe · 2016-11-15T11:44:51Z

@NickStefan I opened a PR against your branch that fixed that problem for me.

beatfactor · 2016-11-15T12:43:07Z

@DoHe is the problem of the nightwatch runner hanging at the end also fixed but your PR?

DoHe · 2016-11-15T12:44:29Z

@beatfactor It's the only thing it fixes (in comparison to this PR)

The Nightwatch.startSession http action often times out despite the selenium server being totally responsive. This PR adds timeout and retry parameters to the http module, allowing the tests to continue despite a failed session attempt.

NickStefan · 2016-11-16T19:39:41Z

@DoHe I added the this.socket.unref, back in, but with a guard. Nock just doesn't implement that method.

@beatfactor removed the lodash.once dependency. also set the timeout in the testing to much lower than 1000ms (200ms).

DoHe · 2016-11-17T09:06:17Z

@NickStefan it might be nicer to add a dummy unref to the mocked socket and not have an if in the actual code. I have never worked with nock so I don't know how feasible that would be.

naissa12 · 2016-11-18T15:30:15Z

Hey nick I tried your code but the tests are still hanging when trying to request session from selenium. What's the best way to handle?

NickStefan · 2016-11-18T18:03:11Z

@naissa12 did you set the options? I made a PR to the docs nightwatchjs/nightwatch-docs#60

request_timeout_options: {
    timeout: 10000,
    retry_attempts: 5
}

retry attempts defaults to 0, so you need to set the options to get this PR to do anything.

joel-airspring · 2016-12-06T02:24:34Z

There hasn't been any movement on this PR in a while, so I thought I'd mention that I was pursuing what I think is the same problem but from the Chrome side. I believe I've proved that this failure to get sessions began with Chrome 53 (Linux only), but I'd be interested to know if anyone has other data.

Here's the case I created: https://bugs.chromium.org/p/chromedriver/issues/detail?id=1620
Can anyone confirm (or disconfirm) my Linux Chrome 53+ hypothesis?
My testing environment in CentOS 7.2. What versions of Linux are others seeing this problem on? On any other Linux distros?

DoHe · 2016-12-07T12:59:17Z

@joel-airspring I ran your script on my setup where I experienced problems with nightwatch: Ubuntu 16.04.1 LTS on EC2, using xvfb, Chrome version 55.0.2883.75 and chromedriver version 2.25.426924
I ran it three times and it always ran through.

beatfactor · 2016-12-13T11:03:15Z

Merged this into master and managed to rewrite the tests without nock: f3bc41c. Please have a look, it will be released soon in npm.

NickStefan · 2016-12-13T18:10:33Z

👏 nice! Good to hear we got around nock.

beatfactor · 2016-12-14T14:04:41Z

Added to v0.9.11.

DoHe · 2016-12-21T10:35:56Z

@beatfactor I'm not sure your rewritten version is actually doing the same as this PR. Since moving from this branch to 0.9.11., we're getting following error (that we didn't have before):
{ [Error: socket hang up] code: 'ECONNRESET' }
I think this might be connected to using self.request.abort instead of using this.socket.unref. I remember seeing similar problems, when I wrote the fix with unref.
It could of course also be unconnected to this change. @NickStefan does the new version fix the problem reliably for you?

beatfactor · 2016-12-21T13:11:59Z

Yes, the socket hang up is from aborting the request, which is timed out. The abort is a more suitable way of destroying the socket. Maybe the error message should be improved, but the request should be retried regardless of the error being there or not. Isn't that the case for you?

DoHe · 2016-12-21T13:39:55Z

It does not look as if it's retrying (I haven't had time to investigate in too much detail yet, but it happens on our productive CI tests quite regularly):

functional/tests/web Error retrieving a new session from the selenium server
functional/tests/web
functional/tests/web Connection refused! Is selenium server started?
functional/tests/web { [Error: socket hang up] code: 'ECONNRESET' }
functional/tests/web
>> functional/tests/web finished.

As you can see there is no tally of the number of successful tests.
And at the very end:

./nightwatch -e chrome returned exit code 1

beatfactor · 2016-12-21T14:52:41Z

That's odd, retrying should work. How does your config look like?

DoHe · 2016-12-21T15:23:49Z

{
    "src_folders": ["functional"],
    "output_folders": "functional/reports",

    "live_output" : true,

    "selenium": {
        "start_process": true,
        "server_path": "node_modules/selenium-server-standalone-jar/jar/selenium-server-standalone-3.0.1.jar",
        "log_path": "./",
        "host:": "127.0.0.1",
        "port": 4444,
        "cli_args": {
            "webdriver.chrome.driver": "node_modules/chromedriver/bin/chromedriver"
        }
    },

    "test_settings": {
        "default": {
            "selenium_host": "localhost",
            "selenium_port": 4444,
            "request_timeout_options": {
                "timeout": 20000,
                "retry_attempts": 10
            },
            "silent": true,
            "screenshots": {
                "enabled": false,
                "path": ""
            },
            "globals": {
                "waitForConditionTimeout": 10000
            },
            "end_session_on_fail": true,
            "test_workers": true
        },

        "chrome": {
            "desiredCapabilities": {
                "browserName": "chrome"
            }
        }
    }
}

NickStefan · 2016-12-21T15:30:09Z

@DoHe I haven't had a chance to try out v0.9.11 yet. Our Ci is still running the PR / fork.

It's on my todo list, but ive got a lot of other features to work on! Currently wrestling with celery and rabbitmq on our backend (my preferred home 😉 )

beatfactor · 2016-12-21T15:57:44Z

@DoHe Hmm, so it looks like the runner is exiting at the socket hang up error. I'll look into it.

beatfactor · 2017-01-01T21:35:50Z

@DoHe Pushed a fix for the problem you reported, in 0.9.12. Thanks.

DoHe · 2017-01-04T19:48:04Z

@beatfactor Thanks for that, it works perfectly for us now.

NickStefan force-pushed the patch-1 branch from 348e08c to 3442dd7 Compare October 21, 2016 21:16

naissa12 approved these changes Oct 24, 2016

View reviewed changes

diegoprd mentioned this pull request Nov 1, 2016

Issue/1141 overwritable request timeout #1253

Closed

diegoprd reviewed Nov 2, 2016

View reviewed changes

beatfactor reviewed Nov 5, 2016

View reviewed changes

NickStefan force-pushed the patch-1 branch from 43937fd to 55caa66 Compare November 7, 2016 19:17

beatfactor reviewed Nov 10, 2016

View reviewed changes

NickStefan force-pushed the patch-1 branch from 654f158 to 25bbe7b Compare November 15, 2016 18:27

NickStefan force-pushed the patch-1 branch from 25bbe7b to 82f46b7 Compare November 16, 2016 19:35

This was referenced Dec 14, 2016

Test hangs #1223

Closed

Nightwatch hangs when Selenium fails to create a session #1141

Closed

beatfactor closed this Dec 14, 2016

Add timeout and retry logic for session creation #1232

Add timeout and retry logic for session creation #1232

Conversation

NickStefan commented Oct 21, 2016

NickStefan commented Oct 21, 2016

beatfactor commented Oct 21, 2016

NickStefan commented Oct 21, 2016 • edited Loading

naissa12 commented Oct 24, 2016

silverrajat88 commented Oct 24, 2016

naissa12 commented Oct 24, 2016

NickStefan commented Oct 25, 2016 • edited Loading

diegoprd commented Nov 1, 2016

diegoprd commented Nov 1, 2016

beatfactor commented Nov 1, 2016

NickStefan commented Nov 1, 2016

diegoprd commented Nov 1, 2016

NickStefan commented Nov 2, 2016 • edited Loading

beatfactor commented Nov 2, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NickStefan commented Nov 2, 2016

NickStefan commented Nov 3, 2016 • edited Loading

diegoprd commented Nov 4, 2016

NickStefan commented Nov 4, 2016

beatfactor commented Nov 5, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

beatfactor commented Nov 10, 2016

beatfactor commented Nov 14, 2016

DoHe commented Nov 15, 2016

DoHe commented Nov 15, 2016

DoHe commented Nov 15, 2016 • edited Loading

beatfactor commented Nov 15, 2016

DoHe commented Nov 15, 2016

NickStefan commented Nov 16, 2016

DoHe commented Nov 17, 2016

naissa12 commented Nov 18, 2016

NickStefan commented Nov 18, 2016

joel-airspring commented Dec 6, 2016

DoHe commented Dec 7, 2016

beatfactor commented Dec 13, 2016

NickStefan commented Dec 13, 2016

beatfactor commented Dec 14, 2016

DoHe commented Dec 21, 2016

beatfactor commented Dec 21, 2016

DoHe commented Dec 21, 2016

beatfactor commented Dec 21, 2016

DoHe commented Dec 21, 2016

NickStefan commented Dec 21, 2016

beatfactor commented Dec 21, 2016

beatfactor commented Jan 1, 2017

DoHe commented Jan 4, 2017

NickStefan commented Oct 21, 2016 •

edited

Loading

NickStefan commented Oct 25, 2016 •

edited

Loading

NickStefan commented Nov 2, 2016 •

edited

Loading

NickStefan commented Nov 3, 2016 •

edited

Loading

DoHe commented Nov 15, 2016 •

edited

Loading