ci: add self-hosted runners #278

Closed
mudler opened this issue Jun 15, 2021 · 7 comments
Labels: kind/enhancement (New feature or request)

@mudler (Contributor) commented Jun 15, 2021

We are at capacity with the GHA concurrent job limits, and this slows down development quite a lot.

Let's see if we can configure AWS spot instances, or whatever else can provide a bunch of runners, for this repository.

e.g. by following along the lines of https://github.com/philips-labs/terraform-aws-github-runner to run our tests on. We also have to figure out whether we can run VirtualBox (or QEMU) on top of them to run our test suite.

mudler added the kind/enhancement label on Jun 15, 2021
mudler self-assigned this on Jun 15, 2021
mudler removed their assignment on Jun 15, 2021
@kkaempf (Contributor) commented Jun 15, 2021

A self-hosted runner is ready. I just need to boot it.

@mudler (Contributor, Author) commented Jun 15, 2021

This issue is about having more than one. We run quite a few jobs in parallel, and a single runner would quickly become another bottleneck.

mudler self-assigned this on Jun 21, 2021
mudler assigned Itxaka and mudler and unassigned mudler on Jun 21, 2021
@mudler (Contributor, Author) commented Jun 21, 2021

@Itxaka I'll take care of the single-node-only setup then 👍

@Itxaka (Contributor) commented Jun 21, 2021

I will have a look at some Heat templates to easily add/remove ECP self-hosted runners.
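
If it helps, here is a rough sketch of what such a template could look like, assuming one OS::Nova::Server per runner; the image, flavor and keypair names are hypothetical, and the runner setup script is only outlined (config.sh/run.sh are the stock scripts shipped with GitHub's actions/runner):

heat_template_version: 2016-10-14
description: Single self-hosted GitHub runner on ECP (illustrative sketch)
parameters:
  runner_token:
    type: string
    hidden: true                      # repo registration token, passed at stack creation time
resources:
  runner_vm:
    type: OS::Nova::Server
    properties:
      name: gh-runner
      image: openSUSE-Leap-15.2       # hypothetical image name
      flavor: m1.large                # hypothetical flavor
      key_name: ci-keypair            # hypothetical keypair
      user_data_format: RAW
      user_data:
        str_replace:
          template: |
            #!/bin/bash
            # assumes GitHub's actions/runner tarball is already unpacked in /opt/runner
            cd /opt/runner
            ./config.sh --unattended --url https://github.com/rancher-sandbox/cOS-toolkit --token $TOKEN
            ./run.sh
          params:
            $TOKEN: { get_param: runner_token }

Adding/removing runners would then just be creating or deleting stacks (or wrapping the server in an OS::Heat::ResourceGroup to scale it).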

@mudler (Contributor, Author) commented Jun 21, 2021

@Itxaka (Contributor) commented Jun 21, 2021

This is a POC of a script to deploy GitHub runners in ECP: #304

Tested and working; see the README for details.

It works... OK-ish. Workflows would need some adaptations to fully work, and the user-data might also need adaptations if we want to use this, but it makes no sense to develop it further.

  • Need to remove the build dir on each job, as it can be left over from a previous job.
  • No vagrant/qemu/packer installed by default, so not all jobs can run.
  • We can add those packages and run everything on the same base; no more macOS required for qemu builds.
  • We could have workers based on size and OS and then use labels to route jobs to them; this still requires more configuration and work on the deployment scripts and workflows.
  • We could have a "master" node that receives one job, creates an on-demand instance with these scripts, uses it for that one job, and deletes it afterwards.
    • Makes it behave more like GitHub's hosted runners.
    • Requires 1 "master" machine to sync everything.
    • Should be easy to automate the creation, deletion, token acquisition, etc.
    • Requires investing in automation.
    • Requires creating custom images with software preinstalled to avoid a 5-10 minute machine boot and configuration per job.
  • We could also have a big node that spawns several docker containers with the workers (see the compose sketch after this list).
    • Allows us more flexibility.
    • Supports a wide range of base OSes.
    • REALLY easy to grow or decrease.
    • If managed via these scripts, it's really easy to duplicate/delete/recreate.
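
A minimal sketch of that last option, assuming a plain docker-compose setup; the image name, labels and the way the registration token is obtained are illustrative and not part of the POC in #304 (config.sh/run.sh are the stock scripts shipped with GitHub's actions/runner):

# docker-compose.yml - hypothetical sketch, not the #304 POC
version: "3"
services:
  runner:
    image: our-runner-image:latest      # hypothetical image with the stock actions/runner unpacked in its workdir
    environment:
      RUNNER_TOKEN: ${RUNNER_TOKEN}     # repo registration token, obtained out of band
    command: >
      bash -c "./config.sh --unattended
               --url https://github.com/rancher-sandbox/cOS-toolkit
               --token $${RUNNER_TOKEN}
               --labels docker,cos-worker
               && ./run.sh"

Growing or shrinking the pool would then be a single command, e.g. docker-compose up -d --scale runner=8.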

@mudler (Contributor, Author) commented Jun 24, 2021

#319 should fix the problem for the time being.

The templating mechanism supports switching to local runners - I've added ~8 of them without noticing notable performance gains beyond the increased parallelism, although that won't last long as we run many more than 8 parallel jobs per run.

The pipeline has been reworked and build times have shrunk - the template supports using the local runners only as build nodes and not as test nodes (as testing requires virtualization and such).
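
As a rough illustration of that split (the job names, labels and steps below are hypothetical, not taken from #319), a workflow can pin the build job to the self-hosted runners while keeping the virtualization-heavy tests on hosted runners:

# Hypothetical workflow sketch: builds on self-hosted runners, tests on hosted macOS
name: ci (illustrative)
on: [push]
jobs:
  build:
    runs-on: [ self-hosted, linux ]   # picked up by one of the local runners
    steps:
      - uses: actions/checkout@v2
      - run: make build               # placeholder for the real build steps
  test:
    needs: build
    runs-on: macos-10.15              # hosted macOS runner, where QEMU/VirtualBox is available
    steps:
      - uses: actions/checkout@v2
      - run: make test                # placeholder for the real test suite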

To bring up the workers, I've created a cOS VM with the following cloud-init config:

name: "Default user"
stages:
   boot:
     - name: "Hostname and setup"
       hostname: "cos-node-1"
       commands:
       - echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
       dns:
        nameservers:
        - X
#       commands:
#       - passwd -d root
   network:
     - name: "Setup SSH keys"
       authorized_keys:
         admincos:
         - github:mudler
         root:
         - github:mudler
     - if: '[ -z "$(blkid -L COS_SYSTEM || true)" ]'
       name: "Load persisted ssh fingerprint"
       commands:
       - |
            # load ssh fingerprint
            if [ ! -d /usr/local/etc/ssh ]; then
            systemctl start sshd
            mkdir /usr/local/etc/ssh || true
            for i in /etc/ssh/*.pub; do cp -rf $i /usr/local/etc/ssh; done
            fi
     - name: "Setup k3s"
       if: '[ -z "$(blkid -L COS_SYSTEM || true)" ]'
       directories:
       - path: "/usr/local/bin"
         permissions: 0755
         owner: 0
         group: 0
       commands:
       - |
            curl -sfL https://get.k3s.io | \
            INSTALL_K3S_VERSION="v1.20.4+k3s1" \
            INSTALL_K3S_EXEC="--tls-san additional-outside-ip" \
            INSTALL_K3S_SELINUX_WARN="true" \
            sh -
   initramfs:
     - if: '[ -z "$(blkid -L COS_SYSTEM || true)" ]'
       name: "Persist"
       commands:
       - |
            target=/usr/local/.cos-state

            # Always want the latest update of systemd conf from the image
            mkdir -p ${target}/etc/systemd/
            rsync -av /etc/systemd/ ${target}/etc/systemd/
            # Only populate ssh conf once
            if [ ! -e ${target}/etc/ssh ]; then
            mkdir -p ${target}/etc/ssh/
            rsync -av /etc/ssh/ ${target}/etc/ssh/
            fi
            # make /tmp tmpfs
            cp -f /usr/share/systemd/tmp.mount ${target}/etc/systemd/system/
            # undo /home /opt mount from cos immutable-rootfs module
            sed -i '/overlay \/home /d' /etc/fstab
            sed -i '/overlay \/opt /d' /etc/fstab
            umount /home
            umount /opt
            # setup directories as persistent
            for i in root opt home var/lib/rancher var/lib/kubelet etc/systemd etc/rancher etc/ssh usr/libexec; do
            mkdir -p ${target}/$i /$i
            mount ${target}/$i /$i -t none -o bind
            done
            # This is hidden so that if you run some selinux label checking or relabeling the bind
            # mount won't screw up things.  If you have two files at different paths they will get
            # labeled with two different labels.
            mkdir -p ${target}/empty
            mount ${target}/empty ${target} -o bind,ro
            # persist machine-id
            if [ -s /usr/local/etc/machine-id ]; then
            cat /usr/local/etc/machine-id > /etc/machine-id
            else
            mkdir -p /usr/local/etc
            cp /etc/machine-id /usr/local/etc
            fi
            # ensure /var/log/journal exists so it's labeled correctly
            mkdir -p /var/log/journal
     - name: "Setup users"
       users: 
          admincos: 
            homedir: "/home/admincos"
     - name: "groups"
       ensure_entities: 
       - entity: |
                 kind: "group"
                 group_name: "wheel"
                 password: "x"
                 gid: 1020
                 users: "admincos"
       files: 
       - path: "/etc/sudoers.d/wheel"
         owner: 0
         group: 0
         permission: 0600   
         content: |
                   %wheel ALL=(ALL) NOPASSWD: ALL
       - path: "/etc/modprobe.d/ipv6.conf"
         owner: 0
         group: 0
         permission: 0664  
         content: |
                    alias net-pf-10 off
                    alias ipv6 off
                    options ipv6 disable_ipv6=1

For the GH deployment, I've followed https://github.com/rancher-sandbox/cOS-toolkit/wiki/Github-runner-on-k8s
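
For reference, a minimal manifest along those lines, assuming the wiki's setup is based on actions-runner-controller; the name, replica count and label below are illustrative, not taken from the wiki:

# Hypothetical RunnerDeployment (actions-runner-controller), assuming that's what
# the wiki page deploys on the k3s cluster set up above.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: cos-toolkit-runners                    # illustrative name
spec:
  replicas: 8                                  # roughly the ~8 runners mentioned above
  template:
    spec:
      repository: rancher-sandbox/cOS-toolkit  # repository the runners register against
      labels:
        - cos-build                            # hypothetical label for build jobs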

mudler closed this as completed on Jun 24, 2021
frelon pushed a commit to frelon/elemental-toolkit that referenced this issue May 12, 2023
…her#278)

This commit makes upgrade|reset|install create and update the
`state.yaml` file, including system-wide data (deployed images,
partition labels, etc.).

It introduces the concept of installation state and stores such
data in the `state.yaml` file in two different locations: the state partition root
and the recovery partition root.

The purpose of this duplication is to always be able to find the
state.yaml file in a known location regardless of the image we are
booting.