Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Introduce sanity/compatibility test for live clusters#12175

Closed
ryoqun wants to merge 57 commits intosolana-labs:masterfrom
ryoqun:live-cluster-sanity
Closed

Introduce sanity/compatibility test for live clusters#12175
ryoqun wants to merge 57 commits intosolana-labs:masterfrom
ryoqun:live-cluster-sanity

Conversation

@ryoqun
Copy link
Copy Markdown
Contributor

@ryoqun ryoqun commented Sep 10, 2020

Problem

bisecting is hurting... (hard-labored fruit this time: #12176)

Summary of Changes

I think if ci time and resource is allowed, this should be run on each prs instead of nightly ci job. and it seems that running this doesn't take much time.

- [ ] todo what to do if the tested cluster is dead? Maybe easy turn-off knob like github's skip-live-cluster label? (EDIT: Well, let's skip this for now? clusters are pretty stable nowadays)

- [ ] todo if the cluster is dead, fallback to some periodic backup of snapshot + minimum ledger? (EDIT: Well, let's skip this for now? clusters are pretty stable nowadays)

@ryoqun ryoqun force-pushed the live-cluster-sanity branch from 25d39ce to a405b68 Compare September 18, 2020 12:37
Comment thread runtime/src/bank.rs

// Eager rent collection repeats in cyclic manner.
// Each cycle is composed of <partiion_count> number of tiny pubkey subranges
// Each cycle is composed of <partition_count> number of tiny pubkey subranges
Copy link
Copy Markdown
Contributor Author

@ryoqun ryoqun Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(while intentionally triggering the CI) let's increase my karma. :)

Comment thread ci/buildkite-pipeline.sh Outdated
all_test_steps() {
command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
wait_step
#command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, I need to revert these before merging!

Comment thread ci/live-cluster-sanity.sh
# for your pain-less copy-paste

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "testnet" \
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When backporting to v1.2, I'll remove this line.

@ryoqun ryoqun requested a review from mvines September 18, 2020 13:23
@ryoqun ryoqun marked this pull request as ready for review September 18, 2020 13:26
@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Sep 18, 2020

@mvines I think this pr is getting in pretty good shape. Could you review this? I changed to launch and run on adhoc GCE instance and the test duration is pretty short (~10 min) for both testnet and mainnet-beta.

@ryoqun ryoqun changed the title [wip] Live cluster sanity Introduce sanity/compatibility test for live clusters Sep 18, 2020
Comment thread docs/src/clusters.md Outdated

```bash
$ solana-validator \
--entrypoint entrypoint.devnet.solana.com:8001 \
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reordered for logical use order: entrypoint (contact to the cluster) => [trusted] validator (fetch genesis/snapshot) => expected-... (let's assert expected things finally)

Comment thread ci/live-cluster-sanity.sh Outdated
instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
if [[ -z $instance_deleted ]]; then
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global variables! \ o /

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to just try to delete here

Comment thread docs/src/clusters.md Outdated

##### Example `solana-validator` command-line

[comment]: <> (UPDATE ci/live-cluster-sanity.sh TOO!!)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed docusaus can't handle this correctly, this is the reason of failing travis build.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now fixed.

Comment thread ci/live-cluster-sanity.sh Outdated
-d '{"jsonrpc":"2.0","id":1, "method":"validatorExit"}' \
http://localhost:18899

(sleep 3 && kill "$tail_pid") &
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this trick realizes set +e-less elegant wait.

Comment thread ci/live-cluster-sanity.sh Outdated
./net/ssh.sh "$instance_ip" mkdir cluster-sanity

validator_log="$cluster_label-validator.log"
./net/ssh.sh "$instance_ip" -Llocalhost:18899:localhost:18899 ./solana-validator \
Copy link
Copy Markdown
Contributor Author

@ryoqun ryoqun Sep 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combined with --private-rpc and --rpc-bind-address, the exposure to the public internet is minimized by -L....

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed

Comment thread ci/live-cluster-sanity.sh Outdated
show_log
done

echo "--- Monitoring validator $cluster_label"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I also add the catchup phase?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's skip this for now. This will increase the test time.

Comment thread ci/live-cluster-sanity.sh Outdated
--trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
--expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
# for your pain-less copy-paste

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder it's nice to have to upload fetched snapshots to the buildkite as artifacts for reproducible testing if anything odd happens.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done by uploading snapshots only if the build failed.

@mvines mvines requested a review from t-nelson September 19, 2020 02:29
Comment thread ci/live-cluster-sanity.sh Outdated

(sleep 3 && kill "$tail_pid") &
kill_pid=$!
wait "$ssh_pid" "$tail_pid" "$kill_pid"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guard with timeout N

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well it turned out this is rather complicated. hint: wait must be shell builtin but timeout is just a normal command. let's skip this.

Comment thread ci/live-cluster-sanity.sh Outdated
source ci/_
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BUILDKITE_BRANCH is empty (like if ci/live-cluster-sanity.sh is run locally), set escaped_branch to $(whoami) perhaps?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread ci/live-cluster-sanity.sh Outdated
# ensure to delete leftover cluster
./net/gce.sh delete -p "$instance_prefix" || true
# only bootstrap, no normal validator
./net/gce.sh create -p "$instance_prefix" -n 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ensure the instances are shut down promptly if something goes wrong:

Suggested change
./net/gce.sh create -p "$instance_prefix" -n 0
./net/gce.sh create -p "$instance_prefix" -n 0 --self-destruct-hours 1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for tipping this nice option. I didn't know it.

Comment thread ci/live-cluster-sanity.sh Outdated

_ cargo +"$rust_stable" build --bins --release
_ ./net/scp.sh ./target/release/solana-validator "$instance_ip:."
echo 500000 | ./net/ssh.sh "$instance_ip" sudo tee /proc/sys/vm/max_map_count > /dev/null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, let's copy solana-sys-tuner in so it can set max_map_count and we verify that code path too

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread ci/live-cluster-sanity.sh
Comment thread docs/src/clusters.md
@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com

```bash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc/ and bank.rs changes in here look just fine, why don't you just land those as a separate PR while we work through the ci/ files in this PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no strong reason to create separate PRs. I just thought it's not worth to be its own prs. bank.rs changes are needed to trigger live-cluster tests. (yeah, I could improve the ci/buildkite-pipeline.sh). And docs chagnes somewhat mentions this pr about the update notice. So separating them introduces a bit of work.

Comment thread ci/buildkite-pipeline.sh
artifact_paths: "log-*.txt"
agents:
- "queue=cuda"
- command: "ci/live-cluster-sanity.sh"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to run this on every PR? It seems like a nightly would be suffcient

Copy link
Copy Markdown
Contributor Author

@ryoqun ryoqun Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this worth on every PR. These are some reasons:

  • nightly is a bit too infrequent in my opinion;
    • According to the insights, it seems that we're merging 20 PRs per business day (100 per weak/500 per month). Then, assume roughly half of it is rust (validator) related (quick guess from https://buildkite.com/solana-labs/solana/builds?branch=master&page=2). Under that numbers in mind, bisecting regressions will take about 3 steps (2 ** 3 =~ 10) in average with nightly. This is tedious in my opinion; bisecting is very effective for the very-wide window, it's not so much effective in small window.
      • I can tolerate with hourly, but then why not every-pr? ;)
    • This doesn't make the whole CI longer from the PR author's perspective (local-cluster is the longest at this pipeline phase...)
  • it's less ideal compared to unit-tests, but this test could serve as a smoke test around process startup, whose tests are currently particularly weak.
  • Running every PR could work as a last minute sanity check in the case of hotfix.
  • live-cluster occupies queue=gce-deploy which isn't so crowded compared to the queue=default.
  • gossip/turbine/bpf exeuction code changes will benefit from testing with actual production environment as part of normal CI build. These area currently lacks integration tests with fixture data extracted from the real environment. So, no need to manually run validator each time for minor changes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

live-cluster occupies queue=gce-deploy which isn't so crowded compared to the queue=default.

I believe there are only one or two agents running gce-deploy ATM. So we'll want to bump that up first. It should just be a matter of ensuring the gcloud CLI tools are installed and pointed at the correct project, then adding a systemd service for the new agent

Comment thread ci/live-cluster-sanity.sh Outdated
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think metrics will complain about this since there won't be a database named $instance_prefix. (I hit similar trying to get cute with the rolling upgrades instance names)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, this pr doesn't use much of net/*.shs. This isn't affected. Anyway, I've specifically setup a metric database for this job.

Comment thread ci/live-cluster-sanity.sh Outdated
instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
if [[ -z $instance_deleted ]]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to just try to delete here

@ryoqun ryoqun changed the title Introduce sanity/compatibility test for live clusters [NOTE TO SELF; UNCOMMENT]Introduce sanity/compatibility test for live clusters Sep 28, 2020
@ryoqun ryoqun changed the title [NOTE TO SELF; UNCOMMENT]Introduce sanity/compatibility test for live clusters [NOTE TO SELF; UNCOMMENT] Introduce sanity/compatibility test for live clusters Sep 28, 2020
@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Sep 28, 2020

@mvines @t-nelson I think this is good for another review in code-wise. The only remaining job is to increase build-agent for gce-deploy.

t-nelson
t-nelson previously approved these changes Sep 28, 2020
Copy link
Copy Markdown
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, now! Rolling out more BK agents is currently blocked on #12527, though

@stale
Copy link
Copy Markdown

stale Bot commented Jul 20, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale
Copy link
Copy Markdown

stale Bot commented Jul 30, 2021

This stale pull request has been automatically closed. Thank you for your contributions.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

stale [bot only] Added to stale content; results in auto-close after a week.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants