Use asynchronous childProcess commands in Network tests - Closes #2334 #2382

jondubois · 2018-09-07T08:05:38Z

What was the problem?

Functions used by network tests to spawn nodes were synchronous.
Timeouts were used to estimate when nodes would be ready (before running test cases).

How did I fix it?

Made test cases asynchronous.
Wait for nodes to be ready before executing tests.

How to test it?

grunt mocha:default:network

Review checklist

The PR resolves Use asynchronous childProcess commands in Network tests #2334
All new code is covered with unit tests
All new code was formatted with Prettier
Linting passes
Tests pass
Commit messages follow the commit guidelines
Documentation has been added/updated


nazarhussain · 2018-09-07T13:50:08Z

test/network/scenarios/common.js

@@ -17,6 +17,9 @@
 const childProcess = require('child_process');
 const utils = require('../utils');

+const NODE_READY_REGEX = /Finished sync/;


If you want to check the loading status by log message, 'Blockchain ready' would be a better choice.

https://github.com/LiskHQ/lisk/blob/ed88a37008d92b82260d86db956ec543c5c71eca/modules/loader.js#L430

If I only wait for 'Blockchain ready', it breaks the test since it doesn't give enough time for all the nodes to sync up.

The name of the regular expression you are using is NODE_READY_REGEX: https://github.com/LiskHQ/lisk/blob/ed88a37008d92b82260d86db956ec543c5c71eca/test/network/scenarios/common.js#L20

Which clearly is not the case: you are not waiting for the node to be ready, you are waiting for the node to sync at least once. If that's the case, please use proper naming for the variables.

I am also not a big fan of grepping the log messages to verify that the node is ready. The reliable way is either to subscribe to the blockchainReady event, which clearly indicates that the blockchain is loaded, or to use the API endpoint to check whether syncing is true or false and decide based on that boolean result.

Which clearly is not the case

It's not so black and white. 'Node ready' can mean a lot of things depending on whether we're looking at it from the perspective of the node or from the perspective of the test.

I'll rename to NODE_FINISHED_SYNC_REGEX to avoid any confusion.

nazarhussain · 2018-09-07T13:56:23Z

test/network/scenarios/common.js

+						pm2LogProcess.stdout.removeAllListeners('data');
+						resolve();
+					}
+				});


A better practice for this use case would be to use PM2's built-in feature for it. PM2 has a CLI option --wait-ready; from the process you can then call process.send('ready') to notify PM2 that the app is ready.

The best way would be to do this in the core, so that either PM2 or some other process manager can use it.

I didn't know about this feature. I'll give it a try.

That feature would be perfect but I can't get it to work for some reason.

Maybe we can sit together to check it out.
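A minimal sketch of the readiness signal suggested in this thread, assuming the node process is started under PM2 with --wait-ready. The helper name notifyProcessManager and its proc parameter are hypothetical and exist only for illustration; the thread above reports that this approach could not be made to work in this PR, so treat it as untested against the real setup.

```javascript
// Hypothetical helper: tell the parent process manager that the app is ready.
// `proc` defaults to the global process object; it is a parameter only so the
// helper can be exercised without PM2.
function notifyProcessManager(proc = process) {
	// process.send is defined only when an IPC channel to a parent exists.
	if (typeof proc.send === 'function') {
		proc.send('ready');
		return true;
	}
	return false;
}

// Call this once the application considers itself ready.
notifyProcessManager();
```

Guarding on process.send keeps the same code runnable when the node is started directly, without a process manager.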

nazarhussain · 2018-09-07T13:57:46Z

test/network/scenarios/p2p/peer_disconnect.js

+						})
+						.catch(err => {
+							done(err.message);
+						});


I believe this code will have the same effect.

before(done => { common.stopNode('node_1').then(done).catch(done); });

or simply returning the promise will also work here.

before(() => common.stopNode('node_1'))

The suggested approach looks cleaner but it's more brittle; if the stopNode() function changes in the future (e.g. returns a value from the promise), it could break stuff in an unexpected way. I think it's better to be explicit about what is passed to the done function.

If stopNode changes in the future, we should update its usage. So why not use the cleaner approach, which fits the current circumstances?

A function should be a blackbox as much as possible. The stopNode function shouldn't have to worry about how its output is being used by dependent functions (done in this case). That's why the extra fat-arrow functions are important in this case. It's also more readable because looking at the test you can see exactly what parameters go into the done function without having to dig into the implementation details of the stopNode function.

This has nothing to do with the internal functionality of the function.

Yes, a function should be a black box, but it should also follow its documented interface. If it is documented that the function returns a Promise, then it should always return a Promise, and all relevant code should be written on that assumption.

And once the function's return type changes, the code that uses it must be updated accordingly.

I'll do it since most people seem to disagree with me :'(
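The three styles debated in this thread can be compared side by side. This is only a sketch: stopNode below is a stand-in for common.stopNode that deliberately resolves with a value (the hypothetical future change raised above), and the wrapper function names are made up for illustration.

```javascript
// Stand-in for common.stopNode; here it resolves with a value to show the
// scenario raised in the discussion (hypothetical, not the real helper).
const stopNode = nodeName => Promise.resolve(`${nodeName} stopped`);

// Style 1: pass `done` directly. Whatever the promise resolves with is
// forwarded to done(), and Mocha treats a truthy argument as a failure.
function stopWithDoneDirect(done) {
	stopNode('node_1')
		.then(done)
		.catch(done);
}

// Style 2: wrap explicitly so done() is called with no argument on success.
function stopWithDoneExplicit(done) {
	stopNode('node_1')
		.then(() => done())
		.catch(err => done(err));
}

// Style 3: return the promise; Mocha waits for it and ignores the resolved value.
function stopByReturningPromise() {
	return stopNode('node_1');
}

// Record what each callback style would hand to Mocha's done callback.
const results = {};
const finished = Promise.all([
	new Promise(resolve =>
		stopWithDoneDirect(arg => {
			results.direct = arg;
			resolve();
		})
	),
	new Promise(resolve =>
		stopWithDoneExplicit(arg => {
			results.explicit = arg;
			resolve();
		})
	),
	stopByReturningPromise(),
]).then(() => results);
```

In a test file, any of these could be passed to a hook, for example before(stopByReturningPromise). Styles 2 and 3 are unaffected by what stopNode resolves with; style 1 depends on it resolving with nothing.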

nazarhussain · 2018-09-07T13:58:56Z

test/network/scenarios/p2p/peer_disconnect.js

+						})
+						.catch(err => {
+							done(err.message);
+						});


nazarhussain · 2018-09-07T14:03:39Z

test/network/scenarios/p2p/peer_disconnect.js

+						})
+						.catch(err => {
+							done(err.message);
+						});


This would be a much more compact option.

it('stop all the nodes in the network except node_0', () => {
	return Promise.map(new Array(TOTAL_PEERS), (value, index) => {
		return common.stopNode(`node_${index}`);
	});
});

That wouldn't work. There is no node_10 - We need to loop from node_0 to node_9.
Also, a for loop is easier to read in this case.

Just remove the +1 from the index.

But then it stops node_0 which we want to keep running.
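One way to reconcile the two positions is to build the list of names explicitly, so that node_0 is skipped and node_10 is never generated. This is a sketch only: it assumes TOTAL_PEERS is 10 with nodes named node_0 to node_9, uses native Promise.all rather than Bluebird's Promise.map, and replaces common.stopNode with a recording stand-in.

```javascript
const TOTAL_PEERS = 10; // assumed: nodes are named node_0 .. node_9

// Stand-in for common.stopNode (hypothetical); records which nodes were stopped.
const stopped = [];
const stopNode = nodeName => {
	stopped.push(nodeName);
	return Promise.resolve();
};

// Build the names node_1 .. node_9, skipping node_0, and stop them in parallel.
// Array.from is used because native map() skips the empty slots of new Array(n).
function stopAllExceptFirst() {
	const nodeNames = Array.from(
		{ length: TOTAL_PEERS - 1 },
		(unused, index) => `node_${index + 1}`
	);
	return Promise.all(nodeNames.map(name => stopNode(name)));
}

const allStopped = stopAllExceptFirst();
```

Inside a test this could be used as it('stop all the nodes in the network except node_0', () => stopAllExceptFirst()).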

nazarhussain · 2018-09-07T14:03:52Z

test/network/scenarios/p2p/peer_disconnect.js

+						})
+						.catch(err => {
+							done(err.message);
+						});


ManuGowda · 2018-09-10T08:52:02Z

test/network/scenarios/common.js

@@ -17,6 +17,9 @@
 const childProcess = require('child_process');
 const utils = require('../utils');

+const NODE_READY_REGEX = /Finished sync/;


I am also not a big fan of grepping the log messages to verify that the node is ready. The reliable way is either to subscribe to the blockchainReady event, which clearly indicates that the blockchain is loaded, or to use the API endpoint to check whether syncing is true or false and decide based on that boolean result.

ManuGowda · 2018-09-10T09:01:44Z

test/network/scenarios/common.js

@@ -17,6 +17,9 @@
 const childProcess = require('child_process');
 const utils = require('../utils');

+const NODE_READY_REGEX = /Finished sync/;
+const NODE_READY_TIMEOUT = 20000;


The disadvantage of using a timeout to check whether the application starts within a given time frame is that it is non-deterministic: if the performance of the application degrades, the tests will fail, and if the performance improves, they still have to wait 20 seconds. As I said before, if we subscribe to the blockchainReady event, the tests will be consistent regardless of timing.

The timeout specifies the upper bound; there needs to be a timeout in case the node is taking way too long to launch (due to an unforeseen issue with the node itself); then it should fail the test with an error. It should be a large value though. I can make it higher. 20s is a bit short.

I made it 40 seconds.

I agree that it's not the best, but we're running the node with PM2, so we can't listen to the node's events directly. I investigated using the HTTP API to check the sync status of the node, but that would add a lot of complexity: because of how we're running the nodes, it's difficult to associate them with a URL.
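The point about the timeout being only an upper bound can be illustrated with a small sketch: the promise resolves as soon as a matching log line arrives and rejects only if the limit is reached. The function name waitForLogLine and the fake stream are assumptions made for illustration; in the PR the data would come from the stdout of a spawned pm2 logs process, and the limit would be the 40 seconds mentioned above rather than the 1 second used here.

```javascript
const { EventEmitter } = require('events');

const NODE_FINISHED_SYNC_REGEX = /Finished sync/;

// Resolve when `logStream` emits a chunk matching `regex`; reject if that has
// not happened within `timeoutMs`. `logStream` is any emitter of 'data' events,
// e.g. the stdout of a spawned `pm2 logs <name>` process.
function waitForLogLine(logStream, regex, timeoutMs) {
	return new Promise((resolve, reject) => {
		const onData = chunk => {
			if (regex.test(chunk.toString())) {
				cleanup();
				resolve();
			}
		};
		const timer = setTimeout(() => {
			cleanup();
			reject(new Error(`No log line matched ${regex} within ${timeoutMs}ms`));
		}, timeoutMs);
		// Stop the timer and detach the listener so nothing fires twice.
		function cleanup() {
			clearTimeout(timer);
			logStream.removeListener('data', onData);
		}
		logStream.on('data', onData);
	});
}

// Demo with a fake stream instead of a real pm2 process.
const fakeStdout = new EventEmitter();
const ready = waitForLogLine(fakeStdout, NODE_FINISHED_SYNC_REGEX, 1000);
fakeStdout.emit('data', Buffer.from('Loading blockchain\n'));
fakeStdout.emit('data', Buffer.from('Finished sync\n'));
```

A fast node resolves the promise immediately, so the timeout value costs nothing in the normal case. One limitation of this sketch is that a log message split across two chunks would not be matched.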


SargeKhan · 2018-09-10T15:34:43Z

test/network/scenarios/common.js

+				let pm2LogProcess = childProcess.spawn('pm2', ['logs', nodeName]);
+				pm2LogProcess.once('error', err => {
+					clearTimeout(nodeReadyTimeout);
+					pm2LogProcess.stdout.removeAllListeners('data');


Why not remove the error listener here as well?

It uses the once('error', ...) handler so if the handler executes once then it gets removed implicitly. I can change it to on('error', ...) and then explicitly remove the 'error' event handler if you prefer.

Also, based on your earlier feedback, I moved up the let pm2LogProcess = ... declaration since that's more clear. Also I made it const.
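The behaviour of once() described in this reply can be checked in isolation. The snippet below is a small illustration using a plain EventEmitter rather than the real pm2LogProcess.

```javascript
const { EventEmitter } = require('events');

const emitter = new EventEmitter();
let calls = 0;

// once() registers a one-shot listener: Node removes it when the event fires.
emitter.once('error', () => {
	calls += 1;
});

const countBefore = emitter.listenerCount('error'); // 1
emitter.emit('error', new Error('spawn failed'));
const countAfter = emitter.listenerCount('error'); // 0
```

Note that once the one-shot handler has run, a second 'error' event would have no listener and would be thrown by the emitter.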

SargeKhan · 2018-09-10T15:56:42Z

test/network/scenarios/common.js

@@ -17,6 +17,9 @@
 const childProcess = require('child_process');
 const utils = require('../utils');

+const NODE_FINISHED_SYNC_REGEX = /Finished sync/;


Why are we using Finished sync as a regular expression?

I prefer regex.test(...) over string.indexOf(...) !== ...; it really doesn't matter to me though.

I added a comment in the code.

Addressed key issues and justified the ones which couldn't be resolved.

nazarhussain

I am approving the PR as it has been hanging for a long time, but I am not convinced by the approach of waiting for syncing as part of the node startup.

That part should be done explicitly, only in the cases where it is needed, with detailed comments explaining why that particular test case needs it.

MaciejBaj · 2018-09-11T15:22:24Z

test/network/scenarios/p2p/peer_disconnect.js

+					common
+						.stopNode('node_1')
+						.then(done)
+						.catch(done);


Please check if pm2 kill is being executed after the tests failed.

We were running the network tests with the mocha --bail flag which meant that our after() hook was skipped (when an error was thrown inside our before() hook) and so the test suite would not cleanup after itself.

This is an open issue with mocha: mochajs/mocha#3398

Removing the --bail flag fixed the issue.


jondubois · 2018-09-13T13:40:18Z

The purpose of the --bail flag in mocha is to terminate the test suite as soon as any one of the test cases fails; this feature saves time and it would have been nice to keep it but none of the alternative solutions I investigated could be used.

Using process.on('exit', cleanupPm2Processes) doesn't work because we need to launch the pm2 kill command in a separate process. By the time the 'exit' event triggers, the main process is already shutting down, so it doesn't have enough time to execute the pm2 kill command as a child process (listeners for 'exit' can only perform synchronous work; anything still queued is abandoned). This means that the cleanup doesn't happen and the PM2 nodes are left running.
Using process.on('SIGTERM', cleanupPm2Processes) also doesn't work; same reason as above.

The best way to detect if the tests have finished is to use Mocha's after hook but currently this is not possible if we use the --bail flag. Until the issue is fixed in Mocha, I recommend to not use the --bail flag for the network tests.
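A minimal sketch of cleanup in Mocha's after hook, as recommended above. The execFile parameter is an assumption added so the helper can be exercised without PM2, and the sketch assumes a pm2 binary on the PATH, whereas this PR uses the pm2 module from node_modules.

```javascript
const childProcess = require('child_process');

// Cleanup helper: runs `pm2 kill` and resolves when the command has exited.
// `execFile` is injectable only so the sketch can be exercised without PM2.
function cleanupPm2Processes(execFile = childProcess.execFile) {
	return new Promise((resolve, reject) => {
		execFile('pm2', ['kill'], err => (err ? reject(err) : resolve()));
	});
}

// Register the cleanup only when running under Mocha, where after() is global.
if (typeof after === 'function') {
	after(() => cleanupPm2Processes());
}
```

Because the hook returns a promise, Mocha waits for pm2 kill to finish before the process exits, which is what the 'exit' listener approach could not guarantee.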

Network tests: Modified stopNode, startNode and restartNode functions…

425b2b4

… to be asynchronous

jondubois requested review from diego-G and MaciejBaj September 7, 2018 08:05

jondubois self-assigned this Sep 7, 2018

jondubois added the 👁️ pending review label Sep 7, 2018

diego-G requested review from ManuGowda, nazarhussain and 4miners and removed request for MaciejBaj and diego-G September 7, 2018 08:43

Merge branch 'development' into 2334-network_tests_async

ed88a37

nazarhussain previously requested changes Sep 7, 2018

View reviewed changes

diego-G mentioned this pull request Sep 10, 2018

Syncing from scratch randomly slows down #352

Closed

ManuGowda previously requested changes Sep 10, 2018

View reviewed changes

jondubois added 2 commits September 10, 2018 12:36

Rename constant to be more specific

8e82426

Shorten the code related to the done callback in network tests

55fc17d

jondubois force-pushed the 2334-network_tests_async branch 2 times, most recently from 5fc911f to 55fc17d Compare September 10, 2018 14:17

diego-G requested review from nazarhussain, SargeKhan and ManuGowda and removed request for 4miners September 10, 2018 15:08

Move pm2LogProcess declaration up above the timeout block to avoid co…

8e93e43

…nfusion

SargeKhan reviewed Sep 10, 2018

View reviewed changes

Renamed constant in network tests

b5dbb4a

SargeKhan approved these changes Sep 10, 2018

View reviewed changes

SargeKhan reviewed Sep 10, 2018

View reviewed changes

Add comment about why we wait for the 'Finished sync' event on nodes

fd40d9b

nazarhussain approved these changes Sep 11, 2018

View reviewed changes

ManuGowda approved these changes Sep 11, 2018

View reviewed changes

diego-G added ✅ ready and removed 👁️ pending review labels Sep 11, 2018

diego-G requested a review from MaciejBaj September 11, 2018 12:13

MaciejBaj reviewed Sep 11, 2018

View reviewed changes

jondubois added 2 commits September 12, 2018 16:54

Use pm2 module from node_modules directory

b4220f8

Do not --bail if one of the test cases fails; this prevents the mocha…

bf16d9e

… after() hook from running which prevents cleanup of nodes

MaciejBaj added 🏗️ in progress and removed ✅ ready labels Sep 13, 2018

Merge branch 'development' into 2334-network_tests_async

b835a01

MaciejBaj added ✅ ready labels Sep 14, 2018

MaciejBaj approved these changes Sep 14, 2018

View reviewed changes

MaciejBaj removed 🏗️ in progress labels Sep 14, 2018

MaciejBaj merged commit d9a9ba9 into development Sep 14, 2018

MaciejBaj deleted the 2334-network_tests_async branch September 14, 2018 09:34
