Cardano-wallet-shelley workers crash #1708
Comments
I saw this when running latest master revision on
Only one of the two wallet workers has crashed. It is still possible to get the details of the wallet which did not crash. But listing wallets fails. Luckily I have been running with
More details to come... |
@rvl I observe it pretty regularly on the Ubuntu test env I'm using. If there are any DEBUG-level logs that are worth turning on, let me know. Attaching an info-level wallet log from an "incident": Relevant excerpt (I think):
|
I saw a manifestation of this in Daedalus a few times. In Daedalus it then looks as if all wallets suddenly disappeared. That is because listing Shelley wallets in case of this error results in:
Luckily restarting Daedalus usually helps and everything is back to normal. |
@rvl From a few experiments I conducted the other day (observing crashes in the integration test cluster), I observed 3 failures that had the same symptoms and led to a "wallet_not_responding" kind of situation. Symptoms were, at some point in time, the wallet would fail to acquire a point on chain:
And, within a few seconds, the worker would suddenly exit with:
With no clear sign that any re-connection attempts did occur or anything really abnormal happened (the acquire failure is itself so-to-speak expected under some circumstances and shouldn't have any major impact on the life of the worker). From there, any subsequent request would return a 500 Internal Server Error with |
If I run 3 instances of
set -x CARDANO_NODE_SOCKET_PATH /private/var/folders/h_/mlzym27d5lz_fx3ph2d547700000gn/T/test-7d0bcbd6e21b57e6/node/node.socket
while [ 1 ]
do
cardano-cli shelley query tip --testnet-magic 764824073
done
against the relay node while the integration tests are running, I get many
, but that is different from
My original suspicion was whether our strategy of closing and re-opening the connection when the wallet worker rolls back was actually sound, as we can roll back often. |
1961: Handle Network.Socket.connect: resource exhausted exceptions r=KtorZ a=rvl
### Issue Number
Relates to #1708
### Overview
- Adds regression test
- Fixes unhandled exception resulting from parallel local snocket connections (see issue description for more details).
Co-authored-by: Rodney Lorrimar
Co-authored-by: IOHK
1969: Add more HasCallStack to shelley network layer r=KtorZ a=rvl
### Issue Number
Relates to #1708
### Overview
- Adds HasCallStack annotations where possible, based on the principle that future bugs often appear in the same place as past bugs.
- Makes the parallel socket connections test more robust by removing the threadDelay.
### Comments
- I was unable to add `HasCallStack` to `connectClient`. This requires a change in the ouroboros-network library.
Co-authored-by: Rodney Lorrimar
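For readers unfamiliar with the annotations mentioned in #1969, here is a minimal sketch of how `HasCallStack` lets a log message carry the call sites that led to it. The names `renderWithStack` and `workerStep` are hypothetical and are not taken from cardano-wallet; only `HasCallStack`, `callStack` and `prettyCallStack` come from `GHC.Stack` in base.

```haskell
import GHC.Stack (HasCallStack, callStack, prettyCallStack)

-- | Hypothetical helper: append the current call stack to a message.
-- The stack only contains frames of functions that themselves carry a
-- 'HasCallStack' constraint, which is why the annotation has to be
-- threaded through every layer of interest.
renderWithStack :: HasCallStack => String -> String
renderWithStack msg = msg ++ "\n" ++ prettyCallStack callStack

-- | Hypothetical worker step. Because it is annotated as well, its own
-- call site is included in the rendered stack.
workerStep :: HasCallStack => String
workerStep = renderWithStack "Worker has exited unexpectedly"

main :: IO ()
main = putStrLn workerStep
```

A function without the constraint (such as `connectClient`, per the comment above) cuts the stack off at that point, so frames above it are not reported.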
lgtm. I haven't seen that lately 🎉 |
After a year it's still relevant.
and then the worker crashes |
Thanks. We are aware of this, it's been tracked in our Jira backlog (ADP-871) and is currently our highest priority bug... unfortunately a little bit on hold recently due to Alonzo work. |
plz update if possible |
The issue has been fixed since v2021-11-11, and should also be fine in the current wallet version v2022-01-18. |
Context
1. I observed that workers on cardano-wallet tend to die while the wallet is just running on top of cardano-node. After running the wallet overnight on a VPS server I had: 500:
2. I was able to reproduce the situation by starting and stopping cardano-node while the wallet was running (however, it was intermittent, i.e. not every cardano-node restart caused it). Upon restarting cardano-node, the wallet.log showed the following (and workers crashed as a result):
3. Finally, I managed to observe the situation where the worker of my wallet dies without any intervention from my side. The worker crashes on 2020-06-02 09:59:26.14 UTC. Here are the corresponding logs from the wallet and the node around that time.
Wallet.log:
Node.log:
Steps to Reproduce
Expected behavior
Wallet should be running with no crashes.
Actual behavior
Wallet workers tend to crash, which leaves them non-operational until the wallet server is restarted.
Resolution
Analysis
The error Network.Socket.connect: <socket: 24>: resource exhausted (Resource temporarily unavailable) means that connect() returns EAGAIN.
This is because localSnocket sets up a listen queue of length 1. By contrast, the TCP version socketSnocket uses a listen queue of length 8.
Our code uses localSnocket through the connectClient function. This function is used multiple times at once: one node client per wallet, one "global" client, and another node client for stake pools monitoring. We handle some errors, but the "resource exhausted" error is considered fatal.
Solution
- On connect() returning EAGAIN, we should retry, backing off for a small random interval ⇒ Handle Network.Socket.connect: resource exhausted exceptions #1961.
- Add HasCallStack through the network layer, and print backtraces in the "Worker has exited unexpectedly" log message ⇒ Add more HasCallStack to shelley network layer #1969.
QA
lib/shelley/test/unit/Cardano/Wallet/Shelley/NetworkSpec.hs
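The retry idea from the Solution list can be sketched roughly as follows. This is not the code merged in #1961, only an illustration using base. `retryOnExhausted`, its retry limit and the delay action are hypothetical names and parameters. The delay is passed in as an `IO Int` so that a caller could supply a random jitter (for instance via `randomRIO` from the `random` package); the demo below uses a fixed delay to stay dependency-free.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (throwIO, try)
import Data.IORef (modifyIORef', newIORef, readIORef)
import GHC.IO.Exception (IOErrorType (ResourceExhausted), IOException (..))

-- | Run an action, retrying while it fails with a "resource exhausted"
-- IOException (which is how GHC reports EAGAIN). Other IOExceptions,
-- or running out of retries, rethrow the error unchanged.
retryOnExhausted
    :: Int      -- ^ maximum number of retries (hypothetical parameter)
    -> IO Int   -- ^ yields the next delay in microseconds, e.g. random jitter
    -> IO a     -- ^ action to attempt, e.g. opening the node connection
    -> IO a
retryOnExhausted retries nextDelay action = do
    result <- try action
    case result of
        Right a -> pure a
        Left e
            | ioe_type e == ResourceExhausted && retries > 0 -> do
                nextDelay >>= threadDelay
                retryOnExhausted (retries - 1) nextDelay action
            | otherwise -> throwIO e

-- Demo: a stand-in for connect() that fails twice with the same kind of
-- error as in this issue, then succeeds on the third attempt.
main :: IO ()
main = do
    attempts <- newIORef (0 :: Int)
    let exhausted = IOError Nothing ResourceExhausted "connect"
            "Resource temporarily unavailable" Nothing Nothing
        flaky = do
            modifyIORef' attempts (+ 1)
            n <- readIORef attempts
            if n < 3 then ioError exhausted else pure n
    n <- retryOnExhausted 5 (pure 1000) flaky
    putStrLn ("connected after " ++ show n ++ " attempts")
```

Keeping the retry limit finite means a node that is really gone still surfaces as an error instead of hanging the worker forever.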