Introduce sanity/compatibility test for live clusters#12175
ryoqun wants to merge 57 commits into solana-labs:master from
Conversation
25d39ce to a405b68
```diff
  // Eager rent collection repeats in cyclic manner.
- // Each cycle is composed of <partiion_count> number of tiny pubkey subranges
+ // Each cycle is composed of <partition_count> number of tiny pubkey subranges
```
(while intentionally triggering the CI) let's increase my karma. :)
```diff
 all_test_steps() {
-  command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
+  #command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
   wait_step
```
Obviously, I need to revert these before merging!
```bash
# for your pain-less copy-paste

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "testnet" \
```
When backporting to v1.2, I'll remove this line.
@mvines I think this PR is getting into pretty good shape. Could you review this? I changed it to launch and run on an ad hoc GCE instance, and the test duration is pretty short (~10 min) for both testnet and mainnet-beta.
```bash
$ solana-validator \
  --entrypoint entrypoint.devnet.solana.com:8001 \
```
Reordered to follow the logical order of use: entrypoint (contact the cluster) => [trusted] validator (fetch genesis/snapshot) => expected-... (finally, assert the expected things)
```bash
instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
  if [[ -z $instance_deleted ]]; then
```
global variables! \ o /
I think it's safe to just try to delete here
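The pattern under discussion can be sketched roughly like this. This is a hedged, hypothetical stand-in (not the PR's actual code): `delete_instance` substitutes for the real `./net/gce.sh delete -p "$instance_prefix"` call.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a trap plus a global flag so cleanup runs exactly
# once, whether the script exits normally or dies mid-way.
set -u

instance_prefix="testnet-live-sanity-demo"
instance_deleted=

delete_instance() {
  # The flag keeps the explicit delete and the trap from both firing.
  if [[ -z $instance_deleted ]]; then
    echo "deleting instances with prefix: $instance_prefix"
    instance_deleted=yes
  fi
}

# Fires on normal exit, Ctrl-C, or kill, so leftover GCE instances
# are reclaimed even when the test fails part-way through.
trap delete_instance EXIT INT TERM

# ... run the actual live-cluster test here ...

delete_instance  # explicit cleanup; the EXIT trap is then a no-op
```

Registering the handler on EXIT/INT/TERM means a mid-run failure still triggers deletion, which is the "safe to just try to delete here" behavior suggested above.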
```
##### Example `solana-validator` command-line

[comment]: <> (UPDATE ci/live-cluster-sanity.sh TOO!!)
```
I noticed Docusaurus can't handle this correctly; this is the reason for the failing Travis build.
```bash
  -d '{"jsonrpc":"2.0","id":1, "method":"validatorExit"}' \
  http://localhost:18899
```
```bash
(sleep 3 && kill "$tail_pid") &
```
This trick realizes an elegant `set +e`-less wait.
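The trick relies on a Bash detail: `wait` given several PIDs returns the exit status of the *last* PID listed. A minimal sketch with stand-in processes (not the PR's actual code):

```shell
#!/usr/bin/env bash
# Sketch of the `set +e`-less wait: the watchdog subshell is listed last,
# so `wait` reports its status (0) rather than the killed job's (143),
# and `set -e` does not abort the script.
set -eu

sleep 30 &                        # stand-in for the long-running log tail
tail_pid=$!

(sleep 1 && kill "$tail_pid") &   # watchdog: stop the tail after a delay
kill_pid=$!

wait "$tail_pid" "$kill_pid"      # returns the watchdog's status: 0
wait_status=$?
echo "wait returned $wait_status"
```

If the killed job's PID were listed last instead, `wait` would return 143 and `set -e` would kill the script, which is exactly what this ordering avoids.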
```bash
./net/ssh.sh "$instance_ip" mkdir cluster-sanity

validator_log="$cluster_label-validator.log"
./net/ssh.sh "$instance_ip" -Llocalhost:18899:localhost:18899 ./solana-validator \
```
Combined with `--private-rpc` and `--rpc-bind-address`, the exposure to the public internet is minimized by `-L...`.
```bash
  show_log
done

echo "--- Monitoring validator $cluster_label"
```
should I also add the catchup phase?
let's skip this for now. This will increase the test time.
```bash
  --trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
  --expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
  # for your pain-less copy-paste
```
I wonder whether it would be nice to upload the fetched snapshots to Buildkite as artifacts, for reproducible testing if anything odd happens.
Done by uploading snapshots only if the build failed.
```bash
(sleep 3 && kill "$tail_pid") &
kill_pid=$!
wait "$ssh_pid" "$tail_pid" "$kill_pid"
```
guard with timeout N
Well, it turned out this is rather complicated. Hint: `wait` must be a shell builtin, but `timeout` is just a normal command. Let's skip this.
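A quick, hypothetical demonstration of why this is hard: `timeout` is an external command, so it must run its argument as a new process, and any freshly spawned shell has no job-table entry for our background PID, so its builtin `wait` is useless there.

```shell
#!/usr/bin/env bash
# `wait` only knows about children of the shell that spawned them; a new
# shell (such as one an external `timeout` would launch) is not that shell.
set -u

sleep 5 &        # stand-in for the long-running ssh job
pid=$!

# A fresh shell cannot wait on our job; bash rejects the non-child PID.
if ! bash -c "wait $pid" 2>/dev/null; then
  echo "a fresh shell cannot wait on pid $pid"
fi

kill "$pid"      # tidy up the stand-in job
```

So `timeout N wait "$pid"` cannot work as written, which is the complication noted above.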
```bash
source ci/_
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
```
If BUILDKITE_BRANCH is empty (like when ci/live-cluster-sanity.sh is run locally), set escaped_branch to $(whoami), perhaps?
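The suggested fallback could look like this. This is a hedged sketch: the `tr`/`sed` escaping is copied from the script above, and GNU `sed -r` is assumed.

```shell
#!/usr/bin/env bash
# Fall back to the local user name when BUILDKITE_BRANCH is unset or empty,
# so local runs of ci/live-cluster-sanity.sh still get a usable label.
set -u

branch="${BUILDKITE_BRANCH:-$(whoami)}"
# Same escaping as the script under review: non-alphanumerics become dashes,
# then leading/trailing dashes and "head" runs are stripped.
escaped_branch=$(echo "$branch" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
echo "$instance_prefix"
```

The `${VAR:-default}` form also plays nicely with `set -u`, since it substitutes the default even when the variable is entirely unset.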
```bash
# ensure to delete leftover cluster
./net/gce.sh delete -p "$instance_prefix" || true
# only bootstrap, no normal validator
./net/gce.sh create -p "$instance_prefix" -n 0
```
Let's ensure the instances are shut down promptly if something goes wrong:

```diff
-./net/gce.sh create -p "$instance_prefix" -n 0
+./net/gce.sh create -p "$instance_prefix" -n 0 --self-destruct-hours 1
```
Thanks for the tip about this nice option. I didn't know it.
```bash
_ cargo +"$rust_stable" build --bins --release
_ ./net/scp.sh ./target/release/solana-validator "$instance_ip:."
echo 500000 | ./net/ssh.sh "$instance_ip" sudo tee /proc/sys/vm/max_map_count > /dev/null
```
Instead of this, let's copy solana-sys-tuner in so it can set max_map_count and we verify that code path too
````diff
@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com
 ```bash
````
The doc/ and bank.rs changes in here look just fine, why don't you just land those as a separate PR while we work through the ci/ files in this PR
There is no strong reason to create separate PRs; I just thought they weren't worth their own PRs. The bank.rs changes are needed to trigger the live-cluster tests (yeah, I could improve ci/buildkite-pipeline.sh), and the docs changes somewhat mention this PR in the update notice. So separating them introduces a bit of work.
```yaml
    artifact_paths: "log-*.txt"
    agents:
      - "queue=cuda"
  - command: "ci/live-cluster-sanity.sh"
```
Do we need to run this on every PR? It seems like a nightly would be sufficient.
Yeah, I think this is worth running on every PR. Here are some reasons:

- nightly is a bit too infrequent in my opinion.
- According to the insights, it seems that we're merging 20 PRs per business day (100 per week / 500 per month). Assume roughly half of them are Rust (validator) related (quick guess from https://buildkite.com/solana-labs/solana/builds?branch=master&page=2). With those numbers in mind, bisecting regressions would take about 3 steps (2 ** 3 =~ 10) on average with nightly. This is tedious in my opinion; bisecting is very effective for a very wide window, but not so effective for a small one.
- I could tolerate hourly, but then why not every PR? ;)
- This doesn't make the whole CI longer from the PR author's perspective (`local-cluster` is the longest at this pipeline phase...).
- It's less ideal compared to unit tests, but this test could serve as a smoke test around process startup, whose tests are currently particularly weak.
- Running on every PR could work as a last-minute sanity check in the case of a hotfix.
- `live-cluster` occupies `queue=gce-deploy`, which isn't so crowded compared to `queue=default`.
- gossip/turbine/BPF execution code changes will benefit from testing against the actual production environment as part of the normal CI build. These areas currently lack integration tests with fixture data extracted from the real environment. So, no need to manually run a validator each time for minor changes.
> `live-cluster` occupies `queue=gce-deploy`, which isn't so crowded compared to `queue=default`.
I believe there are only one or two agents running gce-deploy ATM. So we'll want to bump that up first. It should just be a matter of ensuring the gcloud CLI tools are installed and pointed at the correct project, then adding a systemd service for the new agent
```bash
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
```
I think metrics will complain about this since there won't be a database named $instance_prefix. (I hit similar trying to get cute with the rolling upgrades instance names)
Fortunately, this PR doesn't use much of net/*.sh, so it isn't affected. Anyway, I've specifically set up a metrics database for this job.
This reverts commit ae24ab6.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
Problem
bisecting is hurting... (hard-labored fruit this time: #12176)
Summary of Changes
I think if CI time and resources allow, this should be run on each PR instead of as a nightly CI job. And it seems that running this doesn't take much time.
- [ ] todo: what to do if the tested cluster is dead? Maybe an easy turn-off knob like GitHub's `skip-live-cluster` label? (EDIT: Well, let's skip this for now? Clusters are pretty stable nowadays.)
- [ ] todo: if the cluster is dead, fall back to some periodic backup of snapshot + minimum ledger? (EDIT: Well, let's skip this for now? Clusters are pretty stable nowadays.)