
Prototype for running a subset of benchmarks on a pull request #58

Merged: gareth-ellis merged 5 commits into nodejs:master on Aug 3, 2017

Conversation

gareth-ellis
Member

fyi @mhdawson @AndreasMadsen
Following on from nodejs/node#8157, here is a prototype for running a subset of benchmarks on two builds of node.

We'll need the ability to set:
mandatory BRANCH
mandatory PULL_ID
mandatory CATEGORY
optional RUNS
optional FILTER

in Jenkins - the script will then use those environment variables to do the right thing.

I can't add these myself, due to access permissions on Jenkins.

I'd suggest we at least give this a go, and agree on how much work the benchmark machine can reasonably take on. I believe the priority should be the daily regression runs, though expanding those to run benchmarks from the node source should also be a priority. Running this prototype will give us some insight into how the runs perform on the benchmark machine, and how best to implement them.

I should also add some clean-up work - perhaps archiving the output files for future use.

Thoughts?

@gareth-ellis changed the title from "Pr benchmark" to "Prototype for running a subset of benchmarks on a pull request" on Aug 23, 2016
@AndreasMadsen
Member

Great, this is exactly what I would like to see. Perhaps we can estimate the running time and schedule the job accordingly. That would allow for the daily regression runs.

@mhdawson
Member

I think we want to be able to test the current head, so can we make PULL_ID optional? In terms of a comparison point, maybe we want to make BASE (currently BRANCH) and TARGET (currently PULL_REQUEST) more flexible so they can also be a git hash.

@AndreasMadsen
Member

AndreasMadsen commented Sep 2, 2016

@gareth-ellis thanks for spending your time on this.

@gareth-ellis
Member Author

gareth-ellis commented Sep 2, 2016

Right,

I started to write a response, but decided it sounded too hacky. I've reworked it so there are now two use cases. They're described in the usage; in short:

This script has two use cases:
Use case 1: We want to test the impact of a PR on a branch.
To run this, declare:

CATEGORY = a category of tests to run - folders in benchmark
BRANCH = the branch the test should be based off. e.g. master
PULL_ID = the pull request that contains changes to test


Use case 2: We want to compare two branches, tags or commits.
To run this, declare:
CATEGORY = a category of tests to run - folders in benchmark
BASE = the branch/tag/commit the test should be based off. e.g. master
TARGET = the branch/tag/commit to compare against base


The following are optional across both use cases:
RUNS = defaults to empty
FILTER = defaults to empty
MACHINE_THREADS = used for building node; defaults to all threads on the machine
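To make this concrete, a Jenkins shell step driving the script might look roughly like the sketch below (the script name ./compare.sh and all of the values are placeholders; only the environment variable names come from the description above):

# Use case 1: impact of a pull request on a branch
export CATEGORY=buffers        # a folder under benchmark/ in node core
export BRANCH=master
export PULL_ID=12345           # placeholder pull request number
export RUNS=10                 # optional: fewer runs for a quicker answer
export FILTER=buffer-creation  # optional: only run matching benchmark files
./compare.sh

# Use case 2: compare two branches, tags or commits
export CATEGORY=buffers
export BASE=v6.x
export TARGET=master
./compare.sh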

@gareth-ellis
Member Author

gareth-ellis commented Sep 4, 2016

I thought it would be a good idea to do a run and time each benchmark folder/category to give us an idea of how long this may take. It's not running on the benchmark machine, but it will give us a ballpark figure:

So far (the test is still running):
domain took 17 seconds
crypto took 9360 seconds
buffers took 107026 seconds (yes, nearly 30 hours...)
querystring took 1926 seconds
streams took 1898 seconds
assert took 2336 seconds
dgram took 21892 seconds
path took 1954 seconds
arrays took 538 seconds
net took 13691 seconds
doc_img took 0 seconds
module took 1446 seconds
process took 220 seconds

I'll update as more of the tests complete. At a minimum, however, we will need to make sure that only a small subset of the buffer tests is run, otherwise we'd risk having the benchmark machine miss the regular regression runs.

@AndreasMadsen
Member

Here is a list of the time usage for each benchmark that I made some time ago:
https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34

Note that this is for a single run, not a comparison, so you need to multiply by 60.
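To make the scaling concrete (illustrative numbers, not taken from the gist): a category that takes 1,800 seconds for a single pass would take roughly

1,800 s x 30 runs x 2 binaries = 108,000 s, or about 30 hours

for a full comparison with the default settings.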

@AndreasMadsen
Member

AndreasMadsen commented Oct 1, 2016

@gareth-ellis what is the status of this? I got reminded by nodejs/node#8873.

@gareth-ellis
Member Author

I believe we're waiting for the OK from @mhdawson, and then it will need to be added to the Jenkins CI by someone with access.

@mhdawson
Member

mhdawson commented Oct 3, 2016

I was waiting to see the full list mentioned by Gareth, and then we can decide what's reasonable to let people run. To start, I don't think we want to let people kick off 30-hour runs... I'm thinking a limit of something like an hour makes sense. Is there a way to shorten the ones that run longer yet still get useful data?

@AndreasMadsen
Member

AndreasMadsen commented Oct 3, 2016

@mhdawson please see https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34 for the full list. Just multiply by 60 for default parameters.

edit: the list can be generated with:

node benchmark/run.js --format csv -- \
  arrays assert buffers child_process crypto dgram domain \
  es events fs http misc module net path process querystring \
  streams string_decoder timers tls url util

@mhdawson
Member

mhdawson commented Oct 4, 2016

OK, when I find time I'll add the numbers together in https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34 to get the totals for the individual groups. Still interested in any input on this question:
Is there a way to shorten the ones that run longer yet still get useful data?

@AndreasMadsen
Member

Here are the aggregated results: https://gist.github.com/AndreasMadsen/9f9739514116488c27c06ebaff813af1

remember to multiply by 60 for default parameters. We can also add a --dry-run option to benchmark/run.js if that helps. That way you could run:

node benchmark/run.js $FILTER -- $CATEGORY

to get a list of all the executed files/parameters. This could then be cross-referenced with a list of estimated times. Of course, such a time estimate could be far off if something introduced a deoptimization or the default parameters were changed.
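As a rough sketch of that cross-referencing idea (everything here is hypothetical: the --dry-run option doesn't exist yet, and a timings.csv of "file,seconds" single-pass rows would have to be produced separately, e.g. from the gist above):

# Sum the single-pass timings for one category and scale by 60
# (30 runs x 2 binaries) to estimate the comparison wall time.
awk -F, '$1 ~ /^buffers\// { total += $2 }
         END { printf "%d seconds (~%.1f hours) estimated\n", total * 60, total * 60 / 3600 }' timings.csv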

@mhdawson
Member

Thanks for putting together the overview. I like the idea of a --dry-run that provides a running time estimate; we could then start off by limiting runs to something like 4 hours. I know people could launch multiple jobs, but I'm hoping they would avoid abusing the facility.

Assuming the aggregate numbers are in seconds, a multiplier of 60 still means we'd get very long runs for some of the groupings. I'm assuming that people will often want to run one of the groupings that they think might be affected by their change. Would some smaller number be possible while still getting useful results?

@AndreasMadsen
Member

AndreasMadsen commented Oct 20, 2016

we could then start off by limiting runs to something like 4 hours.

My "person memory" is quite poor, but I think I talked to @mcollina and @thealphanerd about getting more resources for benchmarking at NodeConf EU.

Assuming the aggregate numbers are in seconds a multiplier of 60 still seems like we'd get very long runs for some of the groupings.

All correct.

I'm assuming that often people will want to run for one of the groupings that they think might be affected by change.

For some groupings yes, for others no. The groupings are based on the node core modules; however, an optimization rarely affects all parts of a module, so typically only a subset of a grouping is used.

Would some shorter number be possible while still getting useful results ?

I would guess that some of the benchmarks perform many more iterations than they actually need to. But generally speaking, it is impossible to say how many runs are actually needed.

I originally chose 30 (60 = 30 runs * 2 node versions) because that is where the error between a t-distribution and a normal distribution becomes fairly small (opinions vary, but this is the most optimistic number I have heard). When this error is small, having non-normally distributed benchmark values doesn't affect the statistical calculations that much.

In plain English: 30 is the most optimistic threshold between "not safe" and "maybe safe".
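For reference (this is standard statistics, not specific to the benchmark tooling): with n = 30 runs per binary, the 95% confidence interval for a mean is

mean +/- t(n-1, 0.975) * s / sqrt(n)

and t(29, 0.975) is roughly 2.045, already close to the normal value z(0.975) of about 1.96, so below roughly 30 runs the extra uncertainty from estimating the variance starts to matter noticeably.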

@gareth-ellis
Member Author

Bump.

@mhdawson, what do we need to do next to get this available for use? I appreciate we need to make sure we don't end up colliding with other benchmark jobs, but it would be valuable to have the facility available (even if, to start with, only a small group of people are able to launch comparison jobs). If we find ourselves needing more time than is available, then perhaps we should be looking at ways to expand capacity.

Cc @AndreasMadsen

@AndreasMadsen
Member

AndreasMadsen commented Dec 25, 2016

@gareth-ellis I talked to some CTC members at NodeConf EU; they thought we should just get a dedicated server for this purpose.

@mscdex

mscdex commented Jan 4, 2017

+1 for a dedicated server

@AndreasMadsen
Member

I've created a request in the CTC group: nodejs/CTC#55

@joyeecheung
Member

The running time is only going to get longer as more benchmarks are added. Maybe we can figure out a reasonable time limit for new benchmarks added in the future and put this limit in the docs as a suggestion (and possibly try to migrate the old ones too)? I also think that some benchmarks might perform more iterations than they actually need. And when they do, people tend to cut down the n or the --runs when running them so they can finish in a reasonable amount of time anyway, which defeats the whole point of all those iterations...
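(For context, shortening a run today looks roughly like this with node core's compare.js; the flags below are as documented in the benchmark guide at the time, and the binary paths are placeholders:)

node benchmark/compare.js --old ./node-master --new ./node-pr \
  --runs 10 --set n=1000 --filter buffer-creation buffers > compare.csv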

@gareth-ellis
Member Author

So yes, the length of time will increase as more benchmarks are added; however, the "issue" is the length of time it would take to run the entire suite. I say issue, but really the benchmarks would be run as part of targeted testing, so if we find ourselves running the whole suite then something isn't right (see my comment from 4 September - I had to stop it after a full weekend of running!). Perhaps we could also add a select subset for general testing, but to start with I think we need to make sure that we narrow down the benchmarks to be run sufficiently, rather than reducing the number of iterations per test or trying to limit the amount of time an individual benchmark can spend running. If the benchmark gives us useful data in an efficient way, that should be what matters most.

I wonder, once we've got this running, whether I might put together a run-time calculator for a given set of settings - I can't imagine it being too tricky.

@joyeecheung
Member

joyeecheung commented Jan 4, 2017

@gareth-ellis I agree that we should narrow down the subset of benchmarks first, but IMO having a soft limit can help avoid things getting out of hand. And if the whole benchmark suite takes longer than 24 hours to run, it would be hard to do daily regression runs, wouldn't it?

@gareth-ellis
Member Author

Indeed, but the idea of this item, to start with at least, is to have the benchmarks available to run against, for example, a proposed PR, in order to verify potentially performance-risky changes.

I agree the next step once this works (and perhaps we learn more...) would be to implement a subset for regular regression runs.

@joyeecheung
Member

@gareth-ellis Ah yes, sorry if I've gone too far off-topic; I just haven't seen the issue of running time brought up anywhere else :P

@mhdawson
Member

Just a note that I've been told Intel will be donating a machine through Nearform. Once we get it, we can use that new machine to restart work related to this PR.

@mcollina
Member

mcollina commented May 18, 2017

We are currently in the process of getting that wired up to the network. It's taking a while, but it's out of our hands (physical things), as we are in the process of changing provider.

@mhdawson
Member

@mcollina any update on the progress of getting access to the machine?

@mhdawson
Member

@gareth-ellis do you have any cycles to work on this now? We just got 2 new machines set up, so if you have time to get together we could probably review/land this PR, and then set up a CI job that runs the scripts targeted at those machines.

@gareth-ellis
Member Author

I potentially have some time next week; failing that, the week after.

I read the background; it seems the machine is being set up to run tests at the moment, so we can assume it should have all the tools we need (we might need to install Rscript, which is used for some of the statistics).

Then it will be a matter of creating the job in Jenkins. Is there any chance of getting access to configure the benchmark jobs? I tried to view the config for a benchmark job last week but don't seem to be able to see it, so I guess I don't have the right permissions.
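(For reference, the statistics step in node core's benchmark workflow feeds the comparison CSV to R, roughly like this, so the machine would need R plus whatever packages compare.R expects:

cat compare.csv | Rscript benchmark/compare.R

with compare.csv standing in for whatever output file the job produces.)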

@gareth-ellis
Member Author

Hi @mhdawson,

When would be a good time to look at this?

@mhdawson

This comment was marked as off-topic.

@mhdawson
Member

mhdawson commented Aug 1, 2017

New machines have been donated by Intel, which we can use to run these tests without interfering with the regular benchmarking jobs.

Next steps would be:

  1. Land this PR
  2. Michael to create the new job
  3. Gareth to set up the job to run the script and experiment with how best to make the different options available.

@mhdawson
Member

mhdawson commented Aug 2, 2017

@gareth-ellis I created this job for you to edit/configure to run the micro-benchmarks, and set it so that it will only run on one of the 2 new machines from Intel:

https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/

For now it will only run on:

test-nearform_intel-ubuntu1604-x64-1

Later on we can configure it to run on either of the 2 new machines, if that's appropriate.

I think you should have everything you need to proceed. You can land the PR and then work on configuring the Jenkins job. Let me know if there is anything else you need.

@gareth-ellis
Member Author

@nodejs/benchmarking anyone else want to take a look before I land? Would be good to get a few more LGTMs

@mcollina
Member

mcollina commented Aug 3, 2017

I have a couple of PRs I would like to try this on.

@gareth-ellis merged commit 3d788c2 into nodejs:master on Aug 3, 2017
@gareth-ellis
Member Author

Landed in 3d788c2

@gareth-ellis
Member Author

opened #127 to track progress
