Add changes to fix issues running the Spacewalk client in Kubernetes #519
@pendulum-chain/devs this is now ready for review.
Looks good to me! Even a simpler and more transparent solution than the previous `retry`.
Regarding the error due to the closed connection, could we retry with a new one after a failure here by simply creating a new client instance with `Ok(OnlineClient::from_url(url).await?)`? Similar to what we do in the testing service in node.
Of course this would be a less robust solution than the native subxt solution.
Good point. I looked into this again and realized that this logic was kind of in place before. In this loop, the runner is restarted if
Okay, understood. I didn't consider this behavior, makes sense!
Great changes! 👍🏼
Over the weekend, the runner client again encountered the error.
Because I slightly changed the phrasing in the log messages, I was able to pin it down to this line, which means that the runner was only 'hanging' in this loop statement. I then noticed that the call to maybe_restart_client doesn't do anything in this case because the child process (the actual vault client binary) is still running and working fine. I tested it: it is still able to process issue and redeem requests, and the RPC connection of the vault client is also working. This is interesting, but my assumption is that the RPC connection of the client is fine because of its periodic restart every 3 hours, so its connection is always fairly fresh, whereas the runner never restarts or refreshes its RPC connection. That's why I added some logic to just try and create a new RPC client in the runner.
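The "create a new RPC client on failure" idea from this comment can be illustrated with a minimal sketch. It is not the code from this PR: `call_with_reconnect` and `FakeClient` are hypothetical names, the sketch is synchronous and standard-library only (the real runner is async and uses subxt), and it reconnects on any error rather than only on a closed connection.

```rust
/// Hypothetical stand-in for an RPC client; NOT a subxt type.
pub struct FakeClient {
    pub alive: bool,
}

impl FakeClient {
    /// Pretend RPC call that fails once the connection is closed.
    pub fn block_number(&self) -> Result<u32, String> {
        if self.alive {
            Ok(42)
        } else {
            Err("connection closed".to_string())
        }
    }
}

/// Runs `op` against the current client. If it fails, the client is replaced
/// with a fresh one obtained from `connect`, and `op` is tried exactly once more.
/// The error of the first attempt is discarded (assumption for this sketch).
pub fn call_with_reconnect<C, T, E>(
    client: &mut C,
    connect: impl Fn() -> Result<C, E>,
    op: impl Fn(&C) -> Result<T, E>,
) -> Result<T, E> {
    match op(client) {
        Ok(value) => Ok(value),
        Err(_) => {
            // Replace the stale client; propagate the error if reconnecting fails.
            *client = connect()?;
            op(client)
        }
    }
}
```

A caller would keep one long-lived client and route each request through `call_with_reconnect`, so a stale connection is replaced transparently the next time a call fails.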
Cargo.toml
Outdated
# We need to patch this to https://github.com/tkaitchuck/aHash/releases/tag/v0.8.11 to prevent a build error
# 'error[E0635]: unknown feature `stdsimd`' that occurs because this feature was removed in the latest nightly versions
ahash = { git = "https://github.com/tkaitchuck/aHash", rev = "db36e4c4f0606b786bc617eefaffbe4ae9100762" }
I also experienced this, that's why I moved up to `nightly-2024-04-18`. 😔
Pendulum has 2 ahash dependencies:
https://github.com/pendulum-chain/pendulum/pull/463/files#diff-13ee4b2252c9e516a0547f2891aa2105c3ca71c6d7a1e682c69be97998dfc87eR150-R172
Do you think `nightly-2024-04-18` would work in Spacewalk too? Or was there another issue? I don't remember.
I tested with `+nightly` and it was fine.
Should we then define `nightly-2024-04-18` in the rust-toolchain file and in all references (README/GitHub Actions), and remove this patch for `ahash` from the `Cargo.toml` file? Or what do you think @b-yap?
@b-yap any thoughts on this?
Mmm, I think the patch isn't needed?
I did a `cargo update` in my previous PR; the `ahash` in Cargo.lock should be ok for now.
I mentioned the minimum nightly version in the readme; but I did not explain why.
https://github.com/pendulum-chain/pendulum?tab=readme-ov-file#how-to-run-tests
We could add the reason over there.
I removed the patch statement again and updated all references (also in the CI file) to point to the new nightly version. Let's see if the CI passes, and then I'll merge.
as there is no reason not to parallelize the jobs anymore
@ebma you can squash and merge this now. I reverted to an older version, as Pendulum's CI is using
Removes the `backoff` crate. Assuming that the `retry()` function introduces some side effects, we manually implement the retry logic now. The existing constants for `RETRY_TIMEOUT` and `RETRY_INTERVAL` are kept the same way, except now they are only used to derive the number of retries.
Retrying for Subxt RPC error due to closed connection
Sometimes, the runner encounters the following error:
I found this PR which adds a new experimental implementation of an RPC client that automatically reconnects. This implementation is only available in subxt v0.35 or later. I tried bumping the subxt dependencies we use in Spacewalk to that version but I encountered conflicts because our Polkadot dependencies are too outdated.
-> I created #521 as a follow-up and we'll ignore this issue for now.
Related to https://github.com/pendulum-chain/tasks/issues/207.
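For readers who want a picture of the manual retry logic mentioned at the top of this description, here is a minimal sketch of deriving a retry count from `RETRY_TIMEOUT` and `RETRY_INTERVAL` and looping by hand. It is not the code from this PR: the constant values, their `Duration` type, and the helper names `max_retries` and `retry_with_interval` are assumptions for illustration, and the sketch blocks the thread with `std::thread::sleep` instead of using an async timer.

```rust
use std::thread::sleep;
use std::time::Duration;

// Assumed values for illustration only; the real constants live in the Spacewalk code.
pub const RETRY_TIMEOUT: Duration = Duration::from_secs(60);
pub const RETRY_INTERVAL: Duration = Duration::from_secs(1);

/// Derives the number of attempts from a total timeout and a pause between
/// attempts. Always returns at least 1 so the operation runs once.
pub fn max_retries(timeout: Duration, interval: Duration) -> u32 {
    if interval.is_zero() {
        return 1;
    }
    let attempts = (timeout.as_nanos() / interval.as_nanos()).max(1);
    u32::try_from(attempts).unwrap_or(u32::MAX)
}

/// Calls `op` until it succeeds or `retries` attempts have been made,
/// sleeping for `interval` between attempts. Returns the last error on failure.
pub fn retry_with_interval<T, E>(
    retries: u32,
    interval: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the most recent error back to the caller.
            Err(err) if attempt >= retries => return Err(err),
            Err(_) => {
                attempt += 1;
                sleep(interval);
            }
        }
    }
}
```

A call site would then look roughly like `retry_with_interval(max_retries(RETRY_TIMEOUT, RETRY_INTERVAL), RETRY_INTERVAL, || do_request())`, where `do_request` is a placeholder for the fallible operation. Keeping the loop explicit makes the number of attempts and the pause between them easy to see in one place.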