Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multiple self-hosted runner instances. #2722

Merged
merged 11 commits into from
Aug 2, 2023

Conversation

rectified95
Copy link
Collaborator

@rectified95 rectified95 commented Aug 1, 2023

Description

Fixes #2229

This PR modifies the CI so it can use multiple GH runner service instances on the self-hosted runner machines running the kernel tests (driver and driver_native_only jobs).

For now, we're doubling the capacity by having 2 GH service instances per host. This change capitalizes on #2699, which converted our multi-VM tests to only need one VM.

Key change: runner to VM mapping
Each of the runner machines hosts 2 VMs named vm1 and vm2 as before. The key mechanism is the mapping of known runner service names to VM names, resolved at runtime by passing the runner.name from the GH context via the new test script parameter SelfHostedRunnerName. This can scale to any number of machines, subject to available RAM on the hosts, only requiring an update to test_execution.json.

Runner service separation was achieved by installing them in separate working directories. Since our test collateral is copied onto the VMs and run there, there is little potential for state leaking between jobs running on the same host. The only part during which jobs store their state in a shared location is during artifact compression, because they continue using [System.IO.Path]::GetTempFileName() to create the file name. However, that function guarantees name uniqueness for over 65k invocations, and our jobs remove the files they create regardless.
Ref: https://learn.microsoft.com/en-us/dotnet/api/system.io.path.gettempfilename?view=net-7.0#remarks

Bugs fixed and other changes

  • tcp_udp_listener.exe was not getting built for the NativeOnly* configs - this was masked by our runner checkpoints storing stale binaries from months ago. Since this EXE has not been modified since then, its old version still made the connect_redirect test pass. I exposed this bug accidentally when deleting the c:\eBPF folder from the runners and updating the checkpoints.
  • APPLICATION_VERIFIER_HANDLES_INVALID_HANDLE crash in socket_tests.exe due to invalidated socket handles being closed again in the destructor. Since AppVerifier was enabled on the runners long ago, this showing up now may have to do with me deleting the stale binaries and also forcing a less predictable (serialized) assignment of runners to jobs.
  • Optimized the JSON schema to remove the redundant storing of VM names in two JSON files, and prevent them growing linearly with the number of runners added.
  • Perform staged migration to new runners: all runners use the self-hosted label, while the new ones don't have the ebpf_cicd_label; other PRs will not be assigned the new runners until this PR is merged, after which all runners will be used.

Testing

Verified in the 'Set up job' sections of the CI output that the driver and driver_native_only jobs are scheduled in parallel using 4 different runners.

Documentation

Cosmetic change to SelfHostedRunnerSetup.md

Installation

N/A

@rectified95 rectified95 marked this pull request as ready for review August 2, 2023 05:42
@rectified95 rectified95 changed the title Multiple ebpf runners Use multiple self-hosted runner instances. Aug 2, 2023
@gtrevi
Copy link
Collaborator

gtrevi commented Aug 2, 2023

nit: can you reformat the PR description with the template?

dthaler
dthaler previously approved these changes Aug 2, 2023
Alan-Jowett
Alan-Jowett previously approved these changes Aug 2, 2023
gtrevi
gtrevi previously approved these changes Aug 2, 2023
@rectified95 rectified95 dismissed stale reviews from Alan-Jowett and dthaler via 6234979 August 2, 2023 19:01
@rectified95
Copy link
Collaborator Author

@dthaler / @Alan-Jowett would you mind approving again? I fixed up the labels everywhere and now all checks are again passing for me.

@rectified95 rectified95 added this pull request to the merge queue Aug 2, 2023
Merged via the queue into microsoft:main with commit 77191a8 Aug 2, 2023
66 checks passed
@rectified95 rectified95 deleted the multiple_ebpf_runners branch August 2, 2023 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move CI/CD Kernel Tests to Github runner
4 participants