
Set up new Ampere Altra machines in the build farm to replace older packetnet systems #2729

Closed
sxa opened this issue Aug 19, 2021 · 27 comments

sxa commented Aug 19, 2021

Equinix (formerly Packet) is replacing its older aarch64 hardware and consolidating us onto new 160-core Ampere Altra systems. This issue will track progress/deployment/migration on the new machines.

My intention is to prototype running multiple Docker containers on top of a base OS of Ubuntu 20.04, which allows a wider range of OSes to be used for testing. Unlike the ThunderX systems we had previously, these machines also support the 32-bit armv7 instruction set, so we can also look at running Docker containers that build and test our armv7l builds on them, which could reduce our reliance on the Raspberry Pi systems.

(Ref #2708 as the machine that will likely be the first to be decommissioned)

@sxa sxa self-assigned this Aug 19, 2021
@sxa sxa changed the title Set up new Ampere Altra machiens in the build farm to replace older packetnet systems Set up new Ampere Altra machines in the build farm to replace older packetnet systems Aug 19, 2021

sxa commented Aug 19, 2021

Successful test-commit run on two Docker images on the machine; however, there were issues running the playbooks:

richardlau commented:

  • On Ubuntu 20.04 there is no package with the name python-pip, only python3-pip, so this task fails

You could copy ansible/roles/jenkins-worker/tasks/partials/tap2junit/ubuntu.yml as ansible/roles/jenkins-worker/tasks/partials/tap2junit/ubuntu2004.yml and update the copy.
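
For illustration, a minimal sketch of that approach, assuming the copied partial only needs the package name updated (the cp paths come from the suggestion above; the apt command is the Ubuntu 20.04 equivalent of the failing task):

```sh
# Hypothetical sketch: clone the generic Ubuntu partial for 20.04 and
# swap the pip package name, since python-pip no longer exists there.
cp ansible/roles/jenkins-worker/tasks/partials/tap2junit/ubuntu.yml \
   ansible/roles/jenkins-worker/tasks/partials/tap2junit/ubuntu2004.yml

# In the copied file the package to install becomes python3-pip,
# i.e. the equivalent of:
sudo apt-get install -y python3-pip
```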

  • We cannot easily start the Jenkins agent as a service; the task says the service doesn't exist, so I've started them manually via nohup for now.

If using Docker it might make sense to follow a similar framework to what we do for the x64 containers. These run the Jenkins agent directly in the containers; see the templates in https://github.com/nodejs/build/tree/master/ansible/roles/docker/templates. The host running the containers gets a service that starts them.
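
As a rough sketch of that pattern (the worker name, image name, and secret handling here are placeholders, not the actual nodejs/build configuration): the agent is the container's main process, so the host-side service only needs to start the container.

```sh
# Placeholder sketch: the Jenkins agent runs as the container's main
# process; a host service then only needs "docker start" to bring it up.
docker run -d --name jenkins-worker-ubuntu2004 \
  my-worker-image:latest \
  java -jar /home/iojs/agent.jar \
    -jnlpUrl https://ci.nodejs.org/computer/WORKER-NAME/slave-agent.jnlp \
    -secret "$JENKINS_AGENT_SECRET"
```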


sxa commented Aug 19, 2021

Yep that will probably be the better option :-)


sxa commented Aug 19, 2021

Note: Also required the --sysctl net.ipv4.ip_unprivileged_port_start=1024 option when starting the container to allow the privileged-port tests to pass.
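
For reference, a hedged example of passing that flag (the image name is a placeholder):

```sh
# Docker can set net.ipv4.ip_unprivileged_port_start to 0 inside
# containers, making every port bindable by unprivileged users; setting
# it back to 1024 restores the behaviour the privileged-port tests expect.
docker run -d \
  --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
  my-test-image:ubuntu2004
```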


rvagg commented Aug 20, 2021

@sxa please try and use or extend the existing docker-host setup we have in Ansible; if you see opportunity for improvement then that's great and of course there'll be some things that are necessarily different for the new environment. What we want to avoid is duplication, doing essentially the same thing but in multiple ways .. we have enough of that with our mixed infra already!
Let me know if you need help grokking what it's doing now.


sxa commented Aug 20, 2021

Absolutely @rvagg - this was the first time I'd used the ansible scripts directly myself, so I wanted to do some experiments/education with statically generated Docker containers and separate Jenkins jobs while I had a clean machine to play with, before integrating the machine fully, since they're easy to regenerate. Certainly don't want to add unnecessary complexity.

there'll be some things that are necessarily different for the new environment

Hopefully not too many :-)


sxa commented Aug 24, 2021

@rvagg I've stuck a couple of DRAFT PRs in to support creating containers on the new system and am setting them up accordingly as I write this (nothing that can't be undone, obviously). We'll have three of these new servers to replace the previous ones. We'll need to look at what to do to replace release-packetnet-centos7-arm64-1, since I don't think the systems currently have the option for CentOS7 deployment. My preference would probably be to carve up the third one in the same way as the first one I'm working on, but it would be good to have your input on whether you agree with doing that.


sxa commented Aug 26, 2021

Current allocation of packetnet ThunderX arm64 machines:

Would @nodejs/build be ok with reducing this to two of the new systems running docker containers and the third as a release machine? I don't believe they can have CentOS7 directly installed just now, so we may have to run the release image within a container too.

For the normal machines we then have to choose between using Ubuntu 20.04 as the host OS on each, or using CentOS8 (as long as it's supported!) on one and Ubuntu 20.04 on the other. As far as possible I'd probably choose not to run things directly on the host, since that would be quite wasteful of such a large machine, unless we want to be completely sure that we're running some tests on bare metal in case certain problems are masked in docker containers.

I've set one of them up with eight containers - four each of Ubuntu 18.04 and Ubuntu 20.04. In each group of four, one is from the normal dockerfile and three are from the shared_libs version. If we're likely to run exclusively in containers on these machines, I'd also like to set up some CentOS containers for continuity with what we currently have, but that doesn't necessarily need to be in the first phase. I'm assuming no-one minds dropping the out-of-support Ubuntu 16.04 test systems.


rvagg commented Aug 27, 2021

+1 to dropping 16.04

+1 to setting up the release machine with docker and releasing from within a centos container, although there might be some difficulties getting that working because there are additional steps to getting the ssh config set up properly for releasing, some of which are typically done manually! I don't think we even have an Ansible way to get the ssh key in there.. So doing it in a Docker container might have challenges. Although we have cross-compile containers in our mix doing releases, so I suppose I got that working somehow!

The rest sounds good, along with the other comments I had via email. Sorry I haven't been very responsive, a lot going on at the moment.


rvagg commented Aug 27, 2021

ok, so another thought - the 18.04 containers may not even be necessary, I wouldn't object if you just wanted to go with 20.04 and multiply them a bit. Your call.


sxa commented Aug 27, 2021

Sorry I haven't been very responsive, a lot going on at the moment.

No worries - that response is great and means I'll look at getting this stuff merged and live today, and try to get some of the older systems decommissioned for Packet to take them back. I'll have a chat with Richard on the release machine setup.

I wouldn't object if you just wanted to go with 20.04

My gut feel is always that it's nice to at least try to support things that are currently in service, so I'll stick with the current split on this machine and possibly only do 20.04 on the second machine.

mhdawson commented:

@sxa, all sounds good to me as well. One thing is that with the security release next week, it might make sense to try to hold on to the old machines until just after that goes out?


sxa commented Sep 1, 2021

As of yesterday one of the Ubuntu 16.04 machines has been decommissioned (it had been marked offline for a few weeks): #2708 - the rest were still fully active during the security release.

The following docker images on the first Altra are now live:

The Ubuntu ones have been live for a few days and seem to be working ok, subject to needing the python-is-python3 package installed to avoid problems with builds of earlier node versions.
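
(For context, a quick illustration of that workaround: build scripts in older node versions invoke plain python, which Ubuntu 20.04 no longer provides by default.)

```sh
# Ubuntu 20.04 ships no /usr/bin/python; this package adds a
# python -> python3 symlink so older build scripts keep working.
sudo apt-get install -y python-is-python3
python --version   # now resolves to Python 3.x
```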

The CentOS one has just been made live today after adding the centos7-arm64-gcc* labels required to do the compiler selection properly. The PR for the dockerfile used to set these up is at #2738; while I don't like the way I'm pulling down the gcc6 compiler stuff in there, it does work, so I propose merging that and then looking at whether we can find a better solution for the future.

I have marked the old CentOS7 boxes offline in Jenkins for now, so that all runs are tested on the new CentOS7 system.

I've also removed the ubuntu1604-arm64 label from the job matrix in node-test-commit-arm, so we will no longer be testing on the out-of-support Ubuntu 16.04 now that we have 18.04 and 20.04 (we could also add 21.04...). If the CentOS image deployed today shows no problems, I'll look at extending the number of those that we're running, then look at deploying a second Altra.


richardlau commented Sep 1, 2021

Is there a plan to run something on the sharedlib containers?


sxa commented Sep 1, 2021

The node-test-commit-arm jobs should be able to run on them - the first one I'd set up with the tags was -3 on both of them. We can also rebalance towards CentOS more on the second one. I was planning to run that one with CentOS8 as the host OS instead of Ubuntu, just for a mix (and to use the CentOS kernel version to give us more coverage), unless anyone thinks that's not a good idea.


mhdawson commented Sep 1, 2021

Just one container for centos7? Sounds like we had a couple of machines before?


richardlau commented Sep 1, 2021

@mhdawson That would (eventually) be one container per machine, giving us two containers in the test CI (replacing the two existing Packet centos7 machines).


mhdawson commented Sep 1, 2021

@richardlau k, got it.


sxa commented Sep 7, 2021

Updated the following jobs so they will be ok on the new systems:

The libuv job needed some extra prereqs as per #2744
The node-stress-single-tests job was pinned to Ubuntu 16.04, so I've switched it to run on 20.04.

I'm going to formally decommission the following today:

They have been offline for a few days and no extra problems have been observed.


sxa commented Sep 7, 2021

At this point the outstanding actions will be to set up the second Altra for build/test and the extra machine to replace the release ThunderX box.


sxa commented Sep 7, 2021

(Splitting off the release machine into a separate issue as shown above)

The second build/test Altra, test-equinix-centos8-arm64-1 (139.178.85.13), has now been provisioned. It will need to be tested to see how well the dockerhost setup works on CentOS8, but worst case we can switch it to Ubuntu 20.04 like the other one :-)


sxa commented Sep 27, 2021

Release machine now decommissioned:

  • release-packetnet-centos7-arm64-1


sxa commented Sep 28, 2021

I've been experimenting today with one of the other things I mentioned in the description of this issue - running armv7l OS images in containers on these new hosts. We've done this successfully at the Adoptium project and it's looking promising here so far too: #2775


sxa commented Sep 29, 2021

The arm32 container is live and included in the main node-test-commit-arm job under the ubuntu2004-armv7l configuration. We will probably want to expand the list of distributions included so that we can run jobs against node 12.x on a distribution that supports it (Ref this PR)
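
As a hedged illustration of the general technique (the image tag is an example, not necessarily what the CI uses): these CPUs execute the 32-bit instruction set natively, so a 32-bit userland container needs no emulation.

```sh
# Run a 32-bit armv7 userland on the 64-bit Ampere host; no qemu is
# involved because the CPU runs 32-bit arm code natively.
docker run --rm --platform linux/arm/v7 arm32v7/ubuntu:20.04 \
  dpkg --print-architecture
# -> armhf (a 32-bit userland on top of the arm64 host kernel)
```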

richardlau commented:

I've set up the second altra (#2820 and #2828).

richardlau commented:

@sxa We're currently setting up RHEL 8 CI instances to replace CentOS 7 for Node.js. Any thoughts on how we should handle the release machine for ARM 64? We'll need to be able to keep building Node.js 12/14/16 on CentOS 7 but also build Node.js 18 onwards on RHEL 8.

We have hit one issue in V8 canary (nodejs/node-v8#220) that suggested an incompatibility with the (too old) kernel in CentOS 7, which I think means we wouldn't be able to run a RHEL 8 container on top of CentOS 7, as it would still be using the older kernel.
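
(A quick sketch of why, with example images: containers share the host kernel, so a RHEL 8 userland on a CentOS 7 host still sees the host's 3.10 kernel.)

```sh
# Containers do not virtualise the kernel. On a CentOS 7 host:
uname -r                                          # e.g. 3.10.0-1160...
docker run --rm registry.access.redhat.com/ubi8/ubi uname -r
# -> the same 3.10.0 kernel, despite the RHEL 8 userland
```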

richardlau added a commit that referenced this issue Nov 3, 2022
Remove from the inventory machines that were hosted at Packet/Equinix
that have subsequently been migrated to other machines at Equinix and
OSUOSL.

Refs: #3028
Refs: #2729
github-actions commented:

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
