Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Improve bench-tps keypair generation#7723

Merged
jstarry merged 7 commits intosolana-labs:masterfrom
jstarry:improve-bench-tps-generation
Jan 17, 2020
Merged

Improve bench-tps keypair generation#7723
jstarry merged 7 commits intosolana-labs:masterfrom
jstarry:improve-bench-tps-generation

Conversation

@jstarry
Copy link
Copy Markdown
Contributor

@jstarry jstarry commented Jan 9, 2020

Problem

Account generation at the beginning of bench-tps can sometimes be really slow. This could be due to a number of reasons:

  1. If a large portion of a batch of transfers fails, generation has no way to bail out of attempting to verify each transfer.
  2. RPC requests are slow due to geographic latency

Summary of Changes

  • Parallelize fund verification so that RPC requests are less of a bottleneck
  • Short circuit fund verification if more than half of the tx's failed
  • Short circuit verification of a tx if any of the "to" keypairs is funded or not funded
  • Transfer direction changes more frequently so that source/dest keypair funds stay balanced
  • Made the initial funds check much more lenient so that we can avoid re-running fund generation
  • General code cleanup

Fixes: #7597

@jstarry jstarry requested review from mvines and sakridge January 9, 2020 16:48
@jstarry jstarry force-pushed the improve-bench-tps-generation branch 2 times, most recently from 6735635 to ad668c0 Compare January 9, 2020 17:03
@jstarry jstarry added the v0.22 label Jan 9, 2020
@jstarry jstarry force-pushed the improve-bench-tps-generation branch from ad668c0 to d5f22ba Compare January 10, 2020 01:06
@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 10, 2020

Codecov Report

Merging #7723 into master will decrease coverage by <.1%.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           master   #7723     +/-   ##
========================================
- Coverage    81.7%   81.7%   -0.1%     
========================================
  Files         241     241             
  Lines       50731   50731             
========================================
- Hits        41491   41486      -5     
- Misses       9240    9245      +5

@jstarry
Copy link
Copy Markdown
Contributor Author

jstarry commented Jan 10, 2020

@sakridge @mvines this is ready for review now

@mvines
Copy link
Copy Markdown
Contributor

mvines commented Jan 13, 2020

@sakridge has a better eye than I do in for this part of the code, I defer review to him

@mvines mvines removed their request for review January 13, 2020 15:55
Comment thread bench-tps/src/bench.rs Outdated
let keypair_chunks = source_keypair_chunks.len() as u64;
let mut reclaim_lamports_back_to_source_account = false;
let mut i = keypair0_balance;
let mut i = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that mechanism was in place to quickly restart bench-tps with the same parameters as a previous run. We would detect the balance of the last keypair to check how many lamports had been transferred from one half of they keypairs to the other half. But I found that the assumptions were not valid for running with ramp-tps.

In this PR, I removed the quick-start mechanism but I think that was a mistake. I'll think of another approach that works well for repeated bench-tps runs as well as incremental ramp-tps runs

Comment thread bench-tps/src/bench.rs Outdated
if timer.elapsed() >= Duration::from_secs(5) {
if failed_verify > 0 {
debug!("total txs failed verify: {}", failed_verify);
let failed_verify = Arc::new(AtomicUsize::new(0));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about breaking this up into another function. The complexity was bad before, now it's even worse.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally in favour, I'll clean this up

@sakridge
Copy link
Copy Markdown
Contributor

Overall question, why are the funding transactions failing in the original?

Copy link
Copy Markdown
Contributor Author

@jstarry jstarry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall question, why are the funding transactions failing in the original?

To be honest, not exactly sure but it's likely either:

  1. Chosen node for funding txs goes down or is unresponsive.
  2. Another bench-tps client is already running and funding txs get dropped by the network

This PR doesn't aim to fix the root cause for failed funding txs but it does aim to make funding faster and it will bail on verifying the fund txs if it comes across too many failures. In that case, the funding txs would be sent again (helping with 2.) and the multi client might pick a new RPC node (helping with 1.)

Comment thread bench-tps/src/bench.rs Outdated
if timer.elapsed() >= Duration::from_secs(5) {
if failed_verify > 0 {
debug!("total txs failed verify: {}", failed_verify);
let failed_verify = Arc::new(AtomicUsize::new(0));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally in favour, I'll clean this up

Comment thread bench-tps/src/bench.rs Outdated
let keypair_chunks = source_keypair_chunks.len() as u64;
let mut reclaim_lamports_back_to_source_account = false;
let mut i = keypair0_balance;
let mut i = 0;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that mechanism was in place to quickly restart bench-tps with the same parameters as a previous run. We would detect the balance of the last keypair to check how many lamports had been transferred from one half of they keypairs to the other half. But I found that the assumptions were not valid for running with ramp-tps.

In this PR, I removed the quick-start mechanism but I think that was a mistake. I'll think of another approach that works well for repeated bench-tps runs as well as incremental ramp-tps runs

@jstarry jstarry force-pushed the improve-bench-tps-generation branch from 0f187f5 to 69cbb82 Compare January 14, 2020 14:27
@jstarry
Copy link
Copy Markdown
Contributor Author

jstarry commented Jan 14, 2020

@sakridge I updated the PR and it's ready for another pass. I updated the logic around transfer direction to be a lot simpler. I tested the changes briefly on a cpu testnet and behaviour looks good. I'll run with gpus later today to make sure tps isn't affected.

@sakridge
Copy link
Copy Markdown
Contributor

@sakridge I updated the PR and it's ready for another pass. I updated the logic around transfer direction to be a lot simpler. I tested the changes briefly on a cpu testnet and behaviour looks good. I'll run with gpus later today to make sure tps isn't affected.

ok cool, i'll take a look

Comment thread bench-tps/src/bench.rs Outdated
// 100 lamports should give enough wiggle room to handle source / dest keypair sets getting unbalanced
let minimum_balance = 100;

if first_keypair_balance < minimum_balance || last_keypair_balance < minimum_balance {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get this minimum balance check fixed at 100. The user could have specified 1000 or 10,000 lamports per account, then if they have only 100 they would not be funded, is that correct? The cluster fees can be in the 1000 lamport range also, so I'm not sure a fixed low value here will work in all cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sakridge you're right, this was oversimplified and didn't take tx fees into account. I updated the PR with a new approach: 78516ca

@jstarry
Copy link
Copy Markdown
Contributor Author

jstarry commented Jan 15, 2020

I wasn't able to get a gpu testnet working but ran a gce cpu testnet with 25k avg tps and tx errors were pretty quiet so I feel pretty confident in this new approach of switching transfer directions more frequently.

EDIT: Metrics link: https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?orgId=2&from=1579052950236&to=1579053426330&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=testnet-dev-jstarry&var-hostid=All

@jstarry
Copy link
Copy Markdown
Contributor Author

jstarry commented Jan 16, 2020

@sakridge can you take another look? Thanks!

Copy link
Copy Markdown
Contributor

@sakridge sakridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, nice cleanup!

@jstarry jstarry merged commit b78b1bb into solana-labs:master Jan 17, 2020
@jstarry jstarry deleted the improve-bench-tps-generation branch January 17, 2020 02:35
mergify Bot pushed a commit that referenced this pull request Jan 17, 2020
* Improve bench-tps keypair generation

* Fix tests

* Fix move test

* cargo fmt

* Split up funding function into smaller functions

* Support restarting bench-tps without re-funding

* Change quick start logic and remove noisy log

(cherry picked from commit b78b1bb)
solana-grimes pushed a commit that referenced this pull request Jan 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bench-tps: Account funding slows or stalls completely

3 participants