
Big Raspberry Pi restructure + integration w/ new Ansible scripts #1199

Merged
rvagg merged 4 commits into master from rvagg/rpi on Mar 30, 2018

Conversation

rvagg
Member

@rvagg rvagg commented Mar 27, 2018

This is all because of the pain Node 10 is about to hit us with: having to retire Wheezy support while still needing to test on ARMv6+. Our current ARM infra has the Pi's on:

  • Pi B+ = Wheezy
  • Pi 2 = Wheezy
  • Pi 3 = Jessie

It's worth noting also that Raspbian hasn't supported Jessie for quite some time now; they've moved on to Stretch for all versions (even the B+). Also, the ARMv7 machines we run on Scaleway are Wheezy too. So if we drop support for Wheezy (which we're going to have to do), we have no ARMv6 or ARMv7 infra available for testing or even building natively.

So, I've taken a totally different approach and this PR contains most of the goodies. Highlights:

  • Move all the Pi stuff into ansible/ scripts 🎉
  • Rename Pi's to have their proper arch in their name and then use arch in our scripts to differentiate (no longer just "arm"; now "armv6l", "armv7l", "arm64", which also matches the descriptors we use in Node source)
  • Switch to a new NFS-root-mounted architecture in the cluster (this is not an essential part of the change, and the Ansible scripts still allow for classic SD-card mounting, like @mhdawson's backup Pi). We can even do this now with older Pi's thanks to a new bootcode (well, it's "next", meaning it's a pre-release version). The catch is that they all still have SD cards, each of which simply contains a vfat filesystem with bootcode.bin on it (the new 3 B+'s shouldn't need one as it can be done in firmware; I think maybe the 3's don't either, but I haven't bothered testing). It does the full dhcp->tftp->nfs dance to get up and running; there's a sketch of that chain after this list. This makes management a bit easier and puts less strain on the SD cards (which I regularly have to replace). I still put the 1G swap file on the SD card, but I don't believe we even get into swap with our tests.
  • Move everything to Stretch / Debian 9, including the old B+'s
  • Install Docker on all hosts and build images for the Raspbian versions that we need to test. Right now we only need Wheezy and Jessie as per the list above, but I'm also installing Stretch on the 2's and 3's in anticipation of using that for Node 10+ (a sketch of one such image follows this list).
  • Docker images run idle, full-time waiting for work (pretty low overhead), with /home/iojs/ mounted.
  • All git and other preparatory work is done on the base machine, tests are intended to be run inside the appropriate Docker container using docker exec to attach to the already running containers.
  • The iojs user doesn't have full Docker privs, only access to run sudo docker-node-exec.sh, which handles docker exec and prevents arbitrary commands being run against Docker that might escalate privs on the host (once inside Docker it's still restricted to the iojs user).
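
To make the boot bullet above concrete, here's a minimal sketch of the boot-server side of the dhcp->tftp->nfs dance; the IPs, paths and dnsmasq options are illustrative assumptions, not our actual cluster config:

#!/bin/bash
# Hypothetical boot-server setup. bootcode.bin on the Pi's SD card fetches
# the remaining boot files over TFTP, and cmdline.txt points the kernel at
# an NFS root instead of the SD card.
cat >> /etc/dnsmasq.conf <<'EOF'
enable-tftp
tftp-root=/srv/tftpboot
dhcp-range=192.168.1.100,192.168.1.200,12h
EOF

cat > /srv/tftpboot/cmdline.txt <<'EOF'
console=ttyAMA0,115200 root=/dev/nfs nfsroot=192.168.1.1:/srv/nfs/pi-armv6l,vers=3 rw ip=dhcp rootwait
EOF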
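
And a rough sketch of what building and running one of the per-release images could look like; the base image, package list and UID are assumptions for illustration (the real templates live in ansible/):

#!/bin/bash
# Hypothetical: build a Raspbian Wheezy test image and leave it idling
# full-time with /home/iojs mounted, ready for `docker exec`.
docker build -t node-ci-wheezy - <<'EOF'
FROM resin/rpi-raspbian:wheezy
RUN apt-get update && apt-get install -y python g++ make git curl ccache
RUN useradd -m -u 1000 iojs
CMD ["sleep", "infinity"]
EOF

docker run -d --restart=always --name node-ci-wheezy \
  -v /home/iojs:/home/iojs node-ci-wheezy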

I have converted four B+'s, two Pi 2's and two Pi 3's to this new setup to prove that it works, and works well enough to replace what we have. I haven't touched the release Pi's, but this should work there too and allow us to build with different versions of Raspbian depending on the version of Node. The Scaleway ARMv7 machines probably need similar treatment to give us armv7l binaries.

I have an updated version of node-test-commit-arm-fanned here: https://ci.nodejs.org/job/rv-test-commit-arm-fanned/ — the main change being in the sub-job here: https://ci.nodejs.org/job/rv-test-binary-arm — which runs the following script on each host:

#!/bin/bash -ex

# Uses docker-node-exec.sh (source: https://github.com/nodejs/build/tree/master/ansible/roles/jenkins-worker/templates)
# on the Pis to run `docker exec`. The 'iojs' user only has sudo access to this script
# and not arbitrary `docker` commands for security reasons.
# It takes the following arguments:
#   -v <version> - the version of Raspbian: wheezy, jessie, stretch
#   -a           - run addon tests (default is js tests)
#   -f           - ignore flaky test failures
#   -s <i>       - the number of runs in this sequence
#   -r <j>       - the run number of *this* job in this sequence

case $label in
  pi1-docker) raspbian=wheezy;;
  pi2-docker) raspbian=wheezy;;
  pi3-docker) raspbian=jessie;;
  *) echo "Error: Unsupported label $label"; exit 1;;
esac

tar xavf binary/binary.tar.xz
touch config.gypi
md5sum out/Release/node

exec_args="-v $raspbian"
if test "$IGNORE_FLAKY_TESTS" = "true"; then
  exec_args="$exec_args -f"
fi

if test "$RUN_SUBSET" = "addons"; then
  sudo docker-node-exec.sh $exec_args -a
else
  sudo docker-node-exec.sh $exec_args -r $RUN_SUBSET -s 6
fi

(Note that I've switched back to addons+6 test subsets rather than addons+7 as in node-test-binary-arm; that could easily be changed back. It'll shorten one-off runs but lengthen parallel runs; addons+5 is probably optimal for the number of B+'s we have if we want to maximise throughput during busy times.)

The raspbian=wheezy bits at the top allow us to select the Raspbian version we are going to test on for each type of Pi. That could be expanded to change according to Node version (see the sketch below). For example, we might switch to Jessie, Jessie, Stretch for Node 10+; or, since we have the capacity, we could run both Jessie and Stretch on the 3's and they'd complete before the B+'s have finished.
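
For illustration, a version-aware selector might look something like this; $NODE_MAJOR is a hypothetical variable that would have to be derived from the tree being tested:

# Hypothetical: choose the Raspbian container per Pi type *and* Node major.
case $label in
  pi1-docker) if [ "$NODE_MAJOR" -ge 10 ]; then raspbian=jessie; else raspbian=wheezy; fi;;
  pi2-docker) if [ "$NODE_MAJOR" -ge 10 ]; then raspbian=jessie; else raspbian=wheezy; fi;;
  pi3-docker) if [ "$NODE_MAJOR" -ge 10 ]; then raspbian=stretch; else raspbian=jessie; fi;;
  *) echo "Error: Unsupported label $label"; exit 1;;
esac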

The one catch with this is that we're running a very new kernel on all machines. We still test against the right libc for each Raspbian, but the kernel is a very late 4.x. However, we've been running on 4.x for a while now because the firmware shipped with the newer batches of B+'s won't even boot a standard Wheezy kernel, so all our Pi's have been booting 4.x kernels into Wheezy & Jessie. I don't see a way around this, but I also don't think it's a big deal.

It's probably too much to ask for someone to go over this and both grok and care enough about the details, but I'd appreciate sign-off on at least the basic approach before moving forward.

/cc @joaocgreis cause you set up the current incarnation of the subset builds; also, I still haven't fully got my head around the cross-compiler and we'll need to adjust that for Node 10+
/cc @maclover7 cause you're such a good reviewer and I know you're itching to retire setup/
/cc @seishun cause I know you're annoyed at the lack of progress on the compiler front and this is one piece of the puzzle

joaocgreis

This comment was marked as off-topic.

maclover7

This comment was marked as off-topic.

@rvagg
Member Author

rvagg commented Mar 28, 2018

@maclover7 well, there really shouldn't be any breakage unless I've done something badly wrong; it should be transparent to collaborators, except that names may change in the Jenkins jobs (pi1-docker instead of pi1-raspbian-wheezy, for example) and the console output will look a little different at the top.

Right now node-test-binary-arm just has fewer Pi's available to it than normal. The transition will just mean switching over functionality and adding more Pi's to the new structure, but as long as there are more than zero in each class it'll still run. That's the nice thing about treating these machines as a pool.

As per the critique we got (well, I got) when adding new jobs that broke on older branches, we'll have to run all of the supported branches through this to make sure we get all greens. So far I've only done master.

@seishun
Contributor

seishun commented Mar 28, 2018

Could you explain how this affects the cross-compiling setup for Raspberry Pi (if at all)?

@rvagg
Member Author

rvagg commented Mar 28, 2018

@seishun for now it doesn't touch that, but when we start shifting Raspbian versions we're going to need to compile using different toolchains, so for Node 10 the ARMv6 builds will probably be done with Jessie, which has gcc 4.9. That'll be handled by the cross-compiler job, but I'm assuming it's going to be a little easier than this to achieve, since it can all be done on the same machine and just needs a selector for the toolchain; something like the sketch below.
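
To be clear about what I mean by a selector, roughly this; the toolchain paths and gcc versions here are placeholders, not our actual layout:

#!/bin/bash
# Hypothetical toolchain selector for the cross-compiler job.
case $raspbian in
  wheezy)  TOOLCHAIN=/opt/cross/gcc-4.8-arm-linux-gnueabihf;;
  jessie)  TOOLCHAIN=/opt/cross/gcc-4.9-arm-linux-gnueabihf;;
  stretch) TOOLCHAIN=/opt/cross/gcc-6.3-arm-linux-gnueabihf;;
  *) echo "Error: unknown Raspbian version '$raspbian'"; exit 1;;
esac
export CC="$TOOLCHAIN/bin/arm-linux-gnueabihf-gcc"
export CXX="$TOOLCHAIN/bin/arm-linux-gnueabihf-g++"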

@rvagg
Member Author

rvagg commented Mar 28, 2018

OK, I've removed everything but the container selector (-v) from the docker exec script and made it run a local file instead. This is now in Jenkins and it seems to be working fine:

#!/bin/bash -ex

# Uses docker-node-exec.sh (source: https://github.com/nodejs/build/tree/master/ansible/roles/jenkins-worker/templates)
# on the Pis to run `docker exec`. The 'iojs' user only has sudo access to this script
# and not arbitrary `docker` commands for security reasons.
# It takes the following argument: -v <version> - the version of Raspbian: wheezy, jessie, stretch
# Once running, it will execute whatever is placed in the ./out/node-ci-exec file inside the appropriate container
# under the 'iojs' user as if local.

case $label in
  pi1-docker) raspbian=wheezy;;
  pi2-docker) raspbian=wheezy;;
  pi3-docker) raspbian=jessie;;
  *) echo "Error: Unsupported label $label"; exit 1;;
esac

tar xavf binary/binary.tar.xz
touch config.gypi
md5sum out/Release/node

if test "$IGNORE_FLAKY_TESTS" = "true"; then
  FLAKY_TESTS_MODE=dontcare
else
  FLAKY_TESTS_MODE=run
fi

echo FLAKY_TESTS_MODE=$FLAKY_TESTS_MODE

if test "$RUN_SUBSET" = "addons"; then
  echo "PYTHON=python FLAKY_TESTS=$FLAKY_TESTS_MODE make test-ci-native" > ./out/node-ci-exec
else
  echo "CC=should-not-compile CXX=should-not-compile PYTHON=python FLAKY_TESTS=$FLAKY_TESTS_MODE TEST_CI_ARGS=--run=${RUN_SUBSET},7 make test-ci-js" > ./out/node-ci-exec
fi

sudo docker-node-exec.sh -v $raspbian
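
For reference, the host side of that call is roughly the following. This is a simplified sketch (the real template is at the URL in the comment above), and the container naming is an assumption:

#!/bin/bash
# Hypothetical simplification of docker-node-exec.sh: the iojs user's sudo
# access is limited to this script, and the script only ever runs a fixed
# command inside a known container as the iojs user, so no arbitrary
# `docker` invocations (and no host privilege escalation) are possible.
while getopts "v:" opt; do
  case $opt in
    v) version=$OPTARG;;
    *) exit 1;;
  esac
done

case $version in
  wheezy|jessie|stretch) ;;
  *) echo "Error: unknown Raspbian version '$version'"; exit 1;;
esac

exec docker exec -u iojs "node-ci-$version" \
  bash -c 'cd /home/iojs/build && . ./node-ci-exec'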

I've also kicked off builds for each of the main release lines we support to make sure that it can do them all properly. https://ci.nodejs.org/job/rv-test-commit-arm-fanned/

joaocgreis

This comment was marked as off-topic.

@rvagg
Member Author

rvagg commented Mar 30, 2018

I've brought online 7 of each type of Pi now, so that's enough for the 6+addons configuration we have now without waiting for availability. I've run all actively supported branches through this and only get the predictable failures from 4.x. I've also shifted to ./node-ci-exec rather than ./out/node-ci-exec just to make it more flexible.

I'm going to land this as it is now and go and switch Jenkins over to use this new setup. I will slowly convert the remaining Pi's, but I want to leave some in the existing config in case we discover a fatal flaw in this plan once everyone starts using it and we need to switch back. I also expect there'll be a bit of churn in this ansible setup going forward when I try to get the release Pi's updated too. There's also a hiccup in keeping kernel modules on disk that match the kernel version delivered for NFS booting: in the last couple of days a new kernel was released and autoremove is getting rid of the old kernel files on disk, but NFS booting still uses the original kernel I put in place. I'll have to figure out how to make the right modules sticky (perhaps something like the sketch below), but I'm not sure there'll be any ansible implications for that.
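
One possibility, assuming the kernel and firmware come from the usual Raspbian packages (the package names here are an assumption), is to pin them so autoremove leaves the matching modules alone:

# Hypothetical: hold the packages that match the kernel being served for
# NFS boot so `apt autoremove` can't delete the corresponding /lib/modules
# tree out from under the running systems.
sudo apt-mark hold raspberrypi-kernel raspberrypi-bootloader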

@rvagg rvagg merged commit 4cc84ec into master Mar 30, 2018
@rvagg rvagg deleted the rvagg/rpi branch March 30, 2018 06:01
@maclover7
Contributor

@rvagg Once this is fully rolled out, can you remove dead code from setup/raspberry-pi?
