Remove ThinClient from LocalCluster (#1300)
Force-pushed 13c4b28 to 3aa9a50, then 3aa9a50 to dd7ad73.
Codecov report: all modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff            @@
##           master    #1300     +/-   ##
=========================================
  Coverage    82.1%    82.1%
=========================================
  Files         899      899
  Lines      237263   237334     +71
=========================================
+ Hits       194810   194920    +110
+ Misses      42453    42414     -39
```
```rust
pub fn send_and_confirm_transaction_with_retries<T: Signers + ?Sized>(
    &self,
    keypairs: &T,
    transaction: &mut Transaction,
```

> Do we really need a reference here, or can we consume the transaction?

I think it needs a reference because the function is retrying, and re-signing the tx on each attempt.
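The shape of that constraint can be sketched in a few lines. This is a minimal illustration, not the real solana-sdk API: `Transaction`, `sign`, and both closures are simplified stand-ins.

```rust
// Hypothetical stand-in for a transaction bound to a blockhash.
#[derive(Debug)]
pub struct Transaction {
    pub blockhash: u64,
}

impl Transaction {
    // Re-signing binds the transaction to a (new) blockhash.
    pub fn sign(&mut self, blockhash: u64) {
        self.blockhash = blockhash;
    }
}

// Each failed attempt re-signs the transaction over a fresh blockhash,
// which is only possible because the caller passed `&mut Transaction`
// instead of giving up ownership.
pub fn send_with_retries(
    transaction: &mut Transaction,
    tries: usize,
    mut latest_blockhash: impl FnMut() -> u64,
    mut try_send: impl FnMut(&Transaction) -> bool,
) -> bool {
    for _ in 0..tries {
        if try_send(transaction) {
            return true;
        }
        // Attempt failed; refresh the blockhash and re-sign before retrying.
        transaction.sign(latest_blockhash());
    }
    false
}
```

Consuming the transaction instead would leave nothing to re-sign on the second attempt.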
```rust
    self.invoke(self.tpu_client.try_send_transaction(transaction))
}
```
```rust
/// Serialize and send transaction to the current and upcoming leader TPUs according to fanout
```

> I would use "." and "s" third-person verb endings everywhere or nowhere (not sure how we do it in other places):
>
> /// Serialize and send transaction to the current and upcoming leader TPUs according to fanout
> /// size.
> /// Attempt to send and confirm tx `tries` times.
> /// Wait for signature confirmation before returning.
> /// Return the transaction signature.

Good call on the tense! Will fix. None of the other TPU client descriptions use a period, so I left them out.
```rust
let mut num_confirmed = 0;
let mut wait_time = MAX_PROCESSING_AGE;
// resend the same transaction until the transaction has no chance of succeeding
let wire_transaction =
```

> Maybe move serialization outside of the loop?

As is, I can't move it out of the loop, because at the end of the loop (if the transaction fails to land), the transaction is modified to contain the latest blockhash before being sent again. See: https://github.com/gregcusack/solana/blob/dd7ad734635ae427916706438c3c0013165b1afe/tpu-client/src/tpu_client.rs#L157-L159
```rust
    tries: usize,
    pending_confirmations: usize,
) -> TransportResult<Signature> {
    for x in 0..tries {
```

> I would rename `x` to something more informative.
```rust
for tpu_address in &leaders {
    let cache = self.tpu_client.get_connection_cache();
    let conn = cache.get_connection(tpu_address);
    conn.send_data_async(wire_transaction.clone())?;
```

I think the clone is needed here because send_data_async() takes ownership of wire_transaction, and we are looping over the leaders, sending wire_transaction to each upcoming one. Plus, it won't compile without the clone.
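That ownership constraint can be sketched outside the Solana codebase. Here `send_data_async` is a hypothetical stand-in that, like the real fire-and-forget send, takes its buffer by value:

```rust
// Stand-in for the real send: ownership of `wire` ends inside this call.
pub fn send_data_async(sent: &mut Vec<Vec<u8>>, wire: Vec<u8>) {
    sent.push(wire);
}

// Fanning the same wire bytes out to several leaders forces a clone per
// send; without it, `wire_transaction` would move on the first iteration
// and the second iteration would not compile.
pub fn fan_out_to_leaders(leaders: &[&str], wire_transaction: Vec<u8>) -> Vec<Vec<u8>> {
    let mut sent = Vec::new();
    for _tpu_address in leaders {
        send_data_async(&mut sent, wire_transaction.clone());
    }
    sent
}
```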
```rust
    bincode::serialize(&transaction).expect("transaction serialization failed");
```

```rust
while now.elapsed().as_secs() < wait_time as u64 {
    let leaders = self
```

> I don't fully understand why you want to use leaders and the connection cache directly instead of using TpuClient::send_wire_transaction.

Good question; this is part of the issue I don't fully understand. send_wire_transaction() uses the nonblocking TPUClient. Down the stack it calls:

```rust
let conn = connection_cache.get_nonblocking_connection(addr);
conn.send_data(&wire_transaction).await
```

And this gives the error described above in the PR description. So I changed tpu_client to call cache.get_connection(tpu_address), which calls connection.new_blocking_connection(). The blocking connection works, but the nonblocking one does not.
> That is sort of worrying. Do you know if this new method is going to be used only for integration tests?

It is concerning for sure, although my concern is not necessarily with send_and_confirm_transaction_with_retries() but with try_send_transaction(), send_wire_transaction(), and the nonblocking tpu client, since calling send_wire_transaction() even on its own, or with a couple of retries, fails.

> If this function is supposed to be used for integration tests, I would add some comments explaining the situation and proceed. If this PR is not blocking your work, we can try to look into it together next week.

Yeah, it's just used with, and has only been tested in, integration tests. I added a note to the function explaining this.
```rust
        );
    }
}
log::info!("{x} tries failed transfer");
```

> I would import `log` if needed, instead of qualifying the macro.
```rust
transaction: VersionedTransaction,
) -> TransportResult<Signature> {
    let wire_transaction =
        bincode::serialize(&transaction).expect("serialize Transaction in send_batch");
```

> I think the expect message should use "should" by convention.
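The convention being referenced is the one from the std library docs for `expect`: phrase the message as the expectation that failed, so the panic output reads naturally. A minimal stdlib-only illustration (the function and its name are made up for the example):

```rust
// With the "should" style, a failure panics with
// "input should be a valid port number: <parse error>",
// which reads as a statement of the broken expectation.
pub fn parse_port(input: &str) -> u16 {
    input
        .parse::<u16>()
        .expect("input should be a valid port number")
}
```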
```rust
    M: ConnectionManager<ConnectionPool = P, NewConnectionConfig = C>,
    C: NewConnectionConfig,
{
    fn async_send_versioned_transaction(
```

> Looks like these 2 functions are not used anywhere in this PR. Could you comment on why they are included?

Yeah, these methods are required by and used in client.async_transfer(). See: https://github.com/gregcusack/solana/blob/57cede46cdd58e5f67e38bcef5783c670f12171d/local-cluster/tests/local_cluster.rs#L2925 and https://github.com/gregcusack/solana/blob/57cede46cdd58e5f67e38bcef5783c670f12171d/local-cluster/tests/local_cluster.rs#L2936

async_transfer() is a function in the AsyncClient trait, and AsyncClient is implemented for ThinClient but not for TpuClient. So when we switched to using TpuClient, we needed to implement AsyncClient for TpuClient.
```rust
    Arc::new(RpcClient::new(rpc_url)),
    rpc_pubsub_url.as_str(),
    TpuClientConfig::default(),
    cache.clone(),
```

Yeah, it is because cache is of type &Arc<ConnectionCache>, and new_with_connection_cache() requires ownership of the Arc, i.e. Arc<ConnectionCache>. If I try to match on connection_cache instead of &*connection_cache, I end up matching on the Arc instead of the underlying ConnectionCache.
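The deref point is a general Rust pattern. A minimal sketch, with a stand-in enum in place of the real ConnectionCache:

```rust
use std::sync::Arc;

// Hypothetical two-variant cache, standing in for ConnectionCache.
pub enum ConnectionCache {
    Quic,
    Udp,
}

// To match on the value inside the Arc, deref through it with `&**`;
// matching on `connection_cache` directly would match the Arc binding.
pub fn cache_kind(connection_cache: &Arc<ConnectionCache>) -> &'static str {
    match &**connection_cache {
        ConnectionCache::Quic => "quic",
        ConnectionCache::Udp => "udp",
    }
}
```

Passing ownership where an `Arc<ConnectionCache>` is required then costs only a refcount bump via `cache.clone()`.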
Force-pushed 57cede4 to bb85f32.
CriesofCarrots left a comment:

I'd like to see if we can slim down the number or complexity of pub fns being added.
```rust
/// They both invoke the nonblocking TPUClient and both fail when calling "transfer_with_client()" multiple times.
/// I do not fully understand WHY the nonblocking TPUClient fails in this specific case, but the method defined below
/// does work, although it has only been tested in LocalCluster integration tests.
pub fn send_and_confirm_transaction_with_retries<T: Signers + ?Sized>(
```

> What is the minimal amount of logic in this function that is necessary to avoid the ConnectError(EndpointStopping) error in local cluster? Is it getting the ConnectionCache and using send_data_async() directly that resolves the error? I am wondering if we can put the retry/resign/confirm logic in a local-cluster function, instead of adding such a new, heavy API to TpuClient.

Good question/point, I'll look into this. I had designed this to be similar to what ThinClient did with its retries. But I guess there is a reason we are getting rid of ThinClient.

Update: we do need some retry logic here; using the ConnectionCache and send_data_async() alone does not resolve the error. Going to try to put the retry logic in a separate LocalCluster function.

EDIT: we actually do NOT need retries here. This has been updated to just send transactions to the upcoming leaders in the schedule. The local cluster tests themselves poll for signature confirmation if they want to. The TpuClient API is now simple: take in a tx and send it to the next leaders.
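The resulting split of responsibilities can be sketched as follows: the client only sends, and the test polls for confirmation itself. The `is_confirmed` closure is a stand-in for an RPC signature-status query; none of the names below are the real LocalCluster API.

```rust
use std::time::{Duration, Instant};

// Poll an arbitrary confirmation check until it succeeds or the
// deadline passes, the way a local-cluster test might poll for a
// transaction signature after the client's fire-and-forget send.
pub fn poll_for_confirmation(
    mut is_confirmed: impl FnMut() -> bool,
    timeout: Duration,
) -> bool {
    let start = Instant::now();
    loop {
        if is_confirmed() {
            return true;
        }
        if start.elapsed() >= timeout {
            return false;
        }
        // A real test would sleep briefly between queries here.
    }
}
```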
Force-pushed 3d3f905 to fab2cbe, then fab2cbe to d7a650d.
* setup tpu client methods required for localcluster to use TpuClient
* add new_tpu_quic_client() for local cluster tests
* update local-cluster src files to use TpuClient. tests next
* finish removing thinclient from localcluster
* address comments
* add note for send_and_confirm_transaction_with_retries
* remove retry logic from tpu-client. Send directly to upcoming leaders without retry.
A follow-up PR to #258.
This is the 10th PR on the way to removing ThinClient completely.
Problem

ThinClient is deprecated. Replacing it with TpuClient.

Summary of Changes

Replace ThinClient with TpuClient in LocalCluster. Specifically, we use a Quic TpuClient, not a UDP TpuClient.

Notes

This works, but I can't say for sure why it works.

I had initially replaced all of ThinClient within LocalCluster with the nonblocking version of TpuClient. I had made the switch such that transfer_with_client() calls TpuClient.try_send_transaction(), which calls send_wire_transaction_to_addr(). send_wire_transaction_to_addr() uses a nonblocking connection from the connection_cache to send the transaction.

However, using the changes above, I consistently ran into an issue with any test that called add_validator() more than once. Under the hood, add_validator() calls transfer_with_client(), which transfers some amount of lamports to a destination account (i.e. the account associated with the validator we are trying to add). In the LocalCluster tests, transfer_with_client() would consistently fail when adding the second validator with add_validator(). It would always fail with: transport custom error: "ConnectError(EndpointStopping)".

I noticed that ThinClient used a blocking connection when sending transactions with retry_transfer() (which was used to send the transaction within transfer_with_client()). So, I switched over to using the blocking TpuClient, and that seems to work well. I had to make a few changes to expose the leader schedule, connection cache, and fanout slots from the nonblocking/tpu_client up to the blocking tpu_client.

I am still investigating, but I think it's probably a good idea to get some eyes on this, since it does work as expected now.