Introduce sanity/compatibility test for live clusters#12175
ryoqun wants to merge 57 commits into solana-labs:master from
Conversation
25d39ce to a405b68
```diff
  // Eager rent collection repeats in cyclic manner.
- // Each cycle is composed of <partiion_count> number of tiny pubkey subranges
+ // Each cycle is composed of <partition_count> number of tiny pubkey subranges
```
(while intentionally triggering the CI) let's increase my karma. :)
```diff
 all_test_steps() {
-  command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
+  #command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
   wait_step
```
Obviously, I need to revert these before merging!
```bash
# for your pain-less copy-paste

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "testnet" \
```
When backporting to v1.2, I'll remove this line.
@mvines I think this PR is getting into pretty good shape. Could you review this? I changed it to launch and run on an ad hoc GCE instance, and the test duration is pretty short (~10 min) for both testnet and mainnet-beta.
```bash
$ solana-validator \
  --entrypoint entrypoint.devnet.solana.com:8001 \
```
Reordered to follow the logical order of use: entrypoint (contact the cluster) => [trusted] validator (fetch genesis/snapshot) => expected-... (finally, assert the expected things)
```bash
instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
  if [[ -z $instance_deleted ]]; then
```
global variables! \ o /
I think it's safe to just try to delete here
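The pattern under discussion can be sketched roughly like this. This is a hedged, hypothetical stand-in (not the PR's actual code): `delete_instance` substitutes for the real `./net/gce.sh delete -p "$instance_prefix"` call.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a trap plus a global flag so cleanup runs exactly
# once, whether the script exits normally or dies mid-way.
set -u

instance_prefix="testnet-live-sanity-demo"
instance_deleted=

delete_instance() {
  # The flag keeps the explicit delete and the trap from both firing.
  if [[ -z $instance_deleted ]]; then
    echo "deleting instances with prefix: $instance_prefix"
    instance_deleted=yes
  fi
}

# Fires on normal exit, Ctrl-C, or kill, so leftover GCE instances
# are reclaimed even when the test fails part-way through.
trap delete_instance EXIT INT TERM

# ... run the actual live-cluster test here ...

delete_instance  # explicit cleanup; the EXIT trap is then a no-op
```

Registering the handler on EXIT/INT/TERM means a mid-run failure still triggers deletion, which is the "safe to just try to delete here" behavior suggested above.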
```
##### Example `solana-validator` command-line

[comment]: <> (UPDATE ci/live-cluster-sanity.sh TOO!!)
```
I noticed Docusaurus can't handle this correctly; this is the reason for the failing Travis build.
```bash
  -d '{"jsonrpc":"2.0","id":1, "method":"validatorExit"}' \
  http://localhost:18899
```
```bash
(sleep 3 && kill "$tail_pid") &
```
This trick realizes an elegant `set +e`-less wait.
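The trick relies on a Bash detail: `wait` given several PIDs returns the exit status of the *last* PID listed. A minimal sketch with stand-in processes (not the PR's actual code):

```shell
#!/usr/bin/env bash
# Sketch of the `set +e`-less wait: the watchdog subshell is listed last,
# so `wait` reports its status (0) rather than the killed job's (143),
# and `set -e` does not abort the script.
set -eu

sleep 30 &                        # stand-in for the long-running log tail
tail_pid=$!

(sleep 1 && kill "$tail_pid") &   # watchdog: stop the tail after a delay
kill_pid=$!

wait "$tail_pid" "$kill_pid"      # returns the watchdog's status: 0
wait_status=$?
echo "wait returned $wait_status"
```

If the killed job's PID were listed last instead, `wait` would return 143 and `set -e` would kill the script, which is exactly what this ordering avoids.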
```bash
./net/ssh.sh "$instance_ip" mkdir cluster-sanity

validator_log="$cluster_label-validator.log"
./net/ssh.sh "$instance_ip" -Llocalhost:18899:localhost:18899 ./solana-validator \
```
Combined with `--private-rpc` and `--rpc-bind-address`, the exposure to the public internet is minimized by `-L...`.
```bash
  show_log
done

echo "--- Monitoring validator $cluster_label"
```
should I also add the catchup phase?
let's skip this for now. This will increase the test time.
```bash
  --trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
  --expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
  # for your pain-less copy-paste
```
I wonder whether it would be nice to upload the fetched snapshots to Buildkite as artifacts, for reproducible testing if anything odd happens.
Done by uploading snapshots only if the build failed.
```bash
(sleep 3 && kill "$tail_pid") &
kill_pid=$!
wait "$ssh_pid" "$tail_pid" "$kill_pid"
```
guard with timeout N
Well, it turned out this is rather complicated. Hint: `wait` must be a shell builtin, but `timeout` is just a normal command. Let's skip this.
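A quick, hypothetical demonstration of why this is hard: `timeout` is an external command, so it must run its argument as a new process, and any freshly spawned shell has no job-table entry for our background PID, so its builtin `wait` is useless there.

```shell
#!/usr/bin/env bash
# `wait` only knows about children of the shell that spawned them; a new
# shell (such as one an external `timeout` would launch) is not that shell.
set -u

sleep 5 &        # stand-in for the long-running ssh job
pid=$!

# A fresh shell cannot wait on our job; bash rejects the non-child PID.
if ! bash -c "wait $pid" 2>/dev/null; then
  echo "a fresh shell cannot wait on pid $pid"
fi

kill "$pid"      # tidy up the stand-in job
```

So `timeout N wait "$pid"` cannot work as written, which is the complication noted above.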
```bash
source ci/_
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
```
If BUILDKITE_BRANCH is empty (like when ci/live-cluster-sanity.sh is run locally), set escaped_branch to $(whoami), perhaps?
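The suggested fallback could look like this. This is a hedged sketch: the `tr`/`sed` escaping is copied from the script above, and GNU `sed -r` is assumed.

```shell
#!/usr/bin/env bash
# Fall back to the local user name when BUILDKITE_BRANCH is unset or empty,
# so local runs of ci/live-cluster-sanity.sh still get a usable label.
set -u

branch="${BUILDKITE_BRANCH:-$(whoami)}"
# Same escaping as the script under review: non-alphanumerics become dashes,
# then leading/trailing dashes and "head" runs are stripped.
escaped_branch=$(echo "$branch" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
echo "$instance_prefix"
```

The `${VAR:-default}` form also plays nicely with `set -u`, since it substitutes the default even when the variable is entirely unset.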
```bash
# ensure to delete leftover cluster
./net/gce.sh delete -p "$instance_prefix" || true
# only bootstrap, no normal validator
./net/gce.sh create -p "$instance_prefix" -n 0
```
Let's ensure the instances are shut down promptly if something goes wrong:

```diff
-./net/gce.sh create -p "$instance_prefix" -n 0
+./net/gce.sh create -p "$instance_prefix" -n 0 --self-destruct-hours 1
```
Thanks for the tip about this nice option. I didn't know it.
```bash
_ cargo +"$rust_stable" build --bins --release
_ ./net/scp.sh ./target/release/solana-validator "$instance_ip:."
echo 500000 | ./net/ssh.sh "$instance_ip" sudo tee /proc/sys/vm/max_map_count > /dev/null
```
Instead of this, let's copy solana-sys-tuner in so it can set max_map_count and we verify that code path too
````diff
@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com
 ```bash
````
The doc/ and bank.rs changes in here look just fine, why don't you just land those as a separate PR while we work through the ci/ files in this PR
There is no strong reason to create separate PRs; I just thought they weren't worth their own PRs. The bank.rs changes are needed to trigger the live-cluster tests (yeah, I could improve ci/buildkite-pipeline.sh), and the docs changes somewhat mention this PR in the update notice. So separating them introduces a bit of work.
```yaml
    artifact_paths: "log-*.txt"
    agents:
      - "queue=cuda"
  - command: "ci/live-cluster-sanity.sh"
```
Do we need to run this on every PR? It seems like a nightly would be sufficient.
Yeah, I think this is worth running on every PR. Here are some reasons:

- nightly is a bit too infrequent in my opinion.
- According to the insights, it seems that we're merging 20 PRs per business day (100 per week / 500 per month). Assume roughly half of them are Rust (validator) related (quick guess from https://buildkite.com/solana-labs/solana/builds?branch=master&page=2). With those numbers in mind, bisecting regressions would take about 3 steps (2 ** 3 =~ 10) on average with nightly. This is tedious in my opinion; bisecting is very effective for a very wide window, but not so effective for a small one.
- I could tolerate hourly, but then why not every PR? ;)
- This doesn't make the whole CI longer from the PR author's perspective (`local-cluster` is the longest at this pipeline phase...).
- It's less ideal compared to unit tests, but this test could serve as a smoke test around process startup, whose tests are currently particularly weak.
- Running on every PR could work as a last-minute sanity check in the case of a hotfix.
- `live-cluster` occupies `queue=gce-deploy`, which isn't so crowded compared to `queue=default`.
- gossip/turbine/BPF execution code changes will benefit from testing against the actual production environment as part of the normal CI build. These areas currently lack integration tests with fixture data extracted from the real environment. So, no need to manually run a validator each time for minor changes.
> `live-cluster` occupies `queue=gce-deploy`, which isn't so crowded compared to `queue=default`.
I believe there are only one or two agents running gce-deploy ATM. So we'll want to bump that up first. It should just be a matter of ensuring the gcloud CLI tools are installed and pointed at the correct project, then adding a systemd service for the new agent
```bash
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
```
I think metrics will complain about this since there won't be a database named $instance_prefix. (I hit similar trying to get cute with the rolling upgrades instance names)
Fortunately, this PR doesn't use much of net/*.sh, so it isn't affected. Anyway, I've specifically set up a metrics database for this job.
This reverts commit ae24ab6.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
Problem
bisecting is hurting... (hard-labored fruit this time: #12176)
Summary of Changes
I think if CI time and resources allow, this should be run on each PR instead of as a nightly CI job. And it seems that running this doesn't take much time.
- [ ] todo: what to do if the tested cluster is dead? Maybe an easy turn-off knob like GitHub's `skip-live-cluster` label? (EDIT: Well, let's skip this for now? Clusters are pretty stable nowadays.)
- [ ] todo: if the cluster is dead, fall back to some periodic backup of snapshot + minimum ledger? (EDIT: Well, let's skip this for now? Clusters are pretty stable nowadays.)