Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cardano-wallet-shelley workers crash #1708

Closed
piotr-iohk opened this issue Jun 2, 2020 · 10 comments
Closed

Cardano-wallet-shelley workers crash #1708

piotr-iohk opened this issue Jun 2, 2020 · 10 comments
Assignees

Comments

@piotr-iohk
Copy link
Contributor

piotr-iohk commented Jun 2, 2020

Context

Information -
Version cardano-wallet: 2020.5.6 (git revision: 3fb783f), cardano-node: 1.12 044e82d68a5e5d05258c93eda6c00d8541e4d977
Platform Linux
Installation From source

1.

I observed that workers on cardano-wallet tend to die while wallet is just running on top of cardano-node. After running wallet overnight on a VPS server I had:

  • 2 out of 3 of my Byron wallets showing status: not_responding
  • Listing of Shelley wallets was giving 500 :
Code = 500,
Message = {"code":"wallet_not_responding","message":"That's embarassing. My associated worker for e9bc1ad5c4273eed7813483a40342e01ad47e243 is no longer responding. This is not something that is supposed to happen. The worker must have left a trace in the logs of severity 'Error' when it died which might explain the cause. Said differently, this wallet won't be accessible until the server is restarted but there are good chances it'll recover itself upon restart."}

2.

I was able to reproduce the situation by starting and stopping cardano-node while wallet was running. (however it was intermittent, i.e. not every cardano-node restart was causing that).

There was in the wallet.log upon restarting cardano-node (and workers crash as a result):

[cardano-wallet.network:Info:119][2020-06-02 08:16:13.26 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:121][2020-06-02 08:16:13.26 UTC] Couldn't connect to node (x11). Retrying in a bit...
[cardano-wallet.network:Info:121][2020-06-02 08:16:13.26 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:125][2020-06-02 08:16:13.27 UTC] Couldn't connect to node (x11). Retrying in a bit...
[cardano-wallet.network:Warning:127][2020-06-02 08:16:13.27 UTC] Couldn't connect to node (x11). Retrying in a bit...
[cardano-wallet.network:Warning:129][2020-06-02 08:16:13.27 UTC] Couldn't connect to node (x11). Retrying in a bit...
[cardano-wallet.network:Warning:123][2020-06-02 08:16:13.27 UTC] Couldn't connect to node (x11). Retrying in a bit...
[cardano-wallet.network:Info:119][2020-06-02 08:16:13.27 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:121][2020-06-02 08:16:13.27 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:16][2020-06-02 08:16:13.27 UTC] Intersection found: 5e708589
[cardano-wallet.network:Info:72][2020-06-02 08:16:13.27 UTC] Intersection found: 5e708589
[cardano-wallet.wallet-engine:Error:30][2020-06-02 08:16:13.28 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 127 Network.Socket.connect: <socket: 23>: resource exhausted (Resource temporarily unavailable)
[cardano-wallet.wallet-engine:Error:51][2020-06-02 08:16:13.29 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 129 Network.Socket.connect: <socket: 23>: resource exhausted (Resource temporarily unavailable)
[cardano-wallet.wallet-engine:Error:42][2020-06-02 08:16:13.29 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 125 Network.Socket.connect: <socket: 23>: resource exhausted (Resource temporarily unavailable)
[cardano-wallet.wallet-engine:Error:61][2020-06-02 08:16:13.29 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 123 Network.Socket.connect: <socket: 23>: resource exhausted (Resource temporarily unavailable)
[cardano-wallet.wallet-engine:Info:16][2020-06-02 08:16:13.36 UTC] c3ce146c: Applying blocks [34.18962 ... 34.18962]
[cardano-wallet.wallet-engine:Info:16][2020-06-02 08:16:13.36 UTC] c3ce146c: Creating checkpoint at 5e708589<-[52a91198-34.18962#31204]

3.

Finally I managed to observed the situation when the worker of my wallet dies without any intervention from my side.

The worker crashes on 2020-06-02 09:59:26.14 UTC. Here are corresponding logs from wallet and the node around that time.

Wallet.log:

[cardano-wallet.api-server:Info:209][2020-06-02 09:59:25.10 UTC] [RequestId 27] 200 OK in 0.035328973s
[cardano-wallet.network:Warning:11][2020-06-02 09:59:25.12 UTC] Couldn't connect to node (x22). Retrying in a bit...
[cardano-wallet.network:Info:11][2020-06-02 09:59:25.12 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Info:11][2020-06-02 09:59:25.12 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Warning:159][2020-06-02 09:59:26.13 UTC] Couldn't connect to node (x18). Retrying in a bit...
[cardano-wallet.network:Info:159][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:157][2020-06-02 09:59:26.13 UTC] Couldn't connect to node (x18). Retrying in a bit...
[cardano-wallet.network:Info:157][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:162][2020-06-02 09:59:26.13 UTC] Couldn't connect to node (x18). Retrying in a bit...
[cardano-wallet.network:Info:162][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:155][2020-06-02 09:59:26.13 UTC] Couldn't connect to node (x18). Retrying in a bit...
[cardano-wallet.network:Info:159][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:157][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Warning:161][2020-06-02 09:59:26.13 UTC] Couldn't connect to node (x18). Retrying in a bit...
[cardano-wallet.network:Info:161][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:165][2020-06-02 09:59:26.13 UTC] Couldn't connect to node (x18). Retrying in a bit...
[cardano-wallet.network:Info:165][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Info:165][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:162][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:161][2020-06-02 09:59:26.13 UTC] Mux ConnectionId {localAddress = LocalAddress {getFilePath = ""}, remoteAddress = LocalAddress {getFilePath = "/root/node/relay/node.socket"}} Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:23][2020-06-02 09:59:26.13 UTC] Intersection found: 46ad783d
[cardano-wallet.network:Info:35][2020-06-02 09:59:26.13 UTC] Intersection found: 46ad783d
[cardano-wallet.network:Info:26][2020-06-02 09:59:26.13 UTC] Intersection found: 46ad783d
[cardano-wallet.network:Info:38][2020-06-02 09:59:26.13 UTC] Intersection found: 46ad783d
[cardano-wallet.network:Info:14][2020-06-02 09:59:26.13 UTC] Intersection found: 46ad783d
[cardano-wallet.wallet-engine:Error:19][2020-06-02 09:59:26.14 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 155 Network.Socket.connect: <socket: 36>: resource exhausted (Resource temporarily unavailable)
[cardano-wallet.wallet-engine:Info:23][2020-06-02 09:59:26.14 UTC] 3f532c05: Applying blocks [35.3545 ... 35.3545]

Node.log:

[testnet:cardano.node.ChainDB:Info:95][2020-06-02 09:59:25.12 UTC] New reader was created
[testnet:cardano.node.ChainDB:Info:95][2020-06-02 09:59:25.12 UTC] ReaderNoLongerInMem
[testnet:cardano.node.ChainDB:Debug:95] [2020-06-02 09:59:25.12 UTC] TraceImmDBEvent
[testnet:cardano.node.ChainDB:Debug:95] [2020-06-02 09:59:25.12 UTC] TraceImmDBEvent
[testnet:cardano.node.ChainDB:Debug:95] [2020-06-02 09:59:25.12 UTC] TraceImmDBEvent
[testnet:cardano.node.ChainDB:Info:36][2020-06-02 09:59:25.17 UTC] Valid candidate (Point 759545, HashHeader {unHashHeader = b3c4529ef1be49f701d85f967c8e09a32f30fb2639de80451c5311fbd2a0acf8})
[testnet:cardano.node.ChainDB:Notice:36][2020-06-02 09:59:25.17 UTC] Chain extended, new tip: (Point 759545, HashHeader {unHashHeader = b3c4529ef1be49f701d85f967c8e09a32f30fb2639de80451c5311fbd2a0acf8})
[testnet:cardano.node.ChainDB:Debug:105] [2020-06-02 09:59:26.13 UTC] before next, messages elided = 1802406411496672
[testnet:cardano.node.ChainDB:Debug:105] [2020-06-02 09:59:26.13 UTC] Scheduled a garbage collection for 757251
[testnet:cardano.node.ChainDB:Info:105][2020-06-02 09:59:26.13 UTC] New reader was created
[testnet:cardano.node.ChainDB:Info:108][2020-06-02 09:59:26.13 UTC] New reader was created
[testnet:cardano.node.ChainDB:Info:114][2020-06-02 09:59:26.13 UTC] New reader was created
[testnet:cardano.node.ChainDB:Info:120][2020-06-02 09:59:26.13 UTC] New reader was created
[testnet:cardano.node.ChainDB:Info:125][2020-06-02 09:59:26.13 UTC] New reader was created
[testnet:cardano.node.ChainDB:Debug:88] [2020-06-02 09:59:33.16 UTC] Block added to queue: (Point 759572, HashHeader {unHashHeader = b2126d1f495a976d73eb9135550faa86c6b9d6cdff5f90fd20649d61c0ddf4a8}) queue size 1
[testnet:cardano.node.ChainDB:Notice:36][2020-06-02 09:59:33.16 UTC] Chain extended, new tip: (Point 759572, HashHeader {unHashHeader = b2126d1f495a976d73eb9135550faa86c6b9d6cdff5f90fd20649d61c0ddf4a8})
[testnet:cardano.node.ChainDB:Debug:36] [2020-06-02 09:59:33.19 UTC] before next, messages elided = 1802414403318408
[testnet:cardano.node.ChainDB:Debug:36] [2020-06-02 09:59:33.19 UTC] Chain added block (Point 759571, HashHeader {unHashHeader = db7db23ce97c4a399835c13bb854ad95b00416880b61c1d38b7aecfce951908e})
[testnet:cardano.node.ChainDB:Info:36][2020-06-02 09:59:33.19 UTC] Block fits onto some fork: (Point 759571, HashHeader {unHashHeader = db7db23ce97c4a399835c13bb854ad95b00416880b61c1d38b7aecfce951908e})

Steps to Reproduce

  1. Start a wallet connected to F&F testnet and create a few wallets (Byron and Shelley)

Expected behavior

Wallet should be running with no crashes.

Actual behavior

Wallet workers tend to crash which makes them not operational until wallet server is restarted.


Resolution

Analysis

The error Network.Socket.connect: <socket: 24>: resource exhausted (Resource temporarily unavailable) means that connect() returns EAGAIN.

This is because localSnocket sets up a listen queue of length 1. By contrast the TCP version socketSnocket uses a listen queue of length 8.

Our code uses localSnocket through the connectClient function. This function is used multiple times at once - one node client per wallet, one "global" client, and another node client for stake pools monitoring. We handle some errors, but the "resource exhausted" error is considered fatal.

Solution

  1. In the event of connect() returning EAGAIN, we should retry, backing off for a small random interval => Handle Network.Socket.connect: resource exhausted exceptions #1961.
  2. We can add a regression test which connects multiple network layers in parallel to the same node socket => Handle Network.Socket.connect: resource exhausted exceptions #1961.
  3. We can add HasCallStack through the network layer, and print backtraces in the "Worker has exited unexpectedly" log message ⇒ Add more HasCallStack to shelley network layer #1969.

QA

  1. Regression test added in lib/shelley/test/unit/Cardano/Wallet/Shelley/NetworkSpec.hs
@piotr-iohk piotr-iohk added BUG? and removed BUG? labels Jun 2, 2020
@piotr-iohk piotr-iohk changed the title Cardano-wallet workers crash Cardano-wallet-shelley workers crash Jun 9, 2020
@rvl
Copy link
Contributor

rvl commented Jul 2, 2020

I saw this when running latest master revision on shelley_testnet last night.

Only one of the two wallet workers has crashed. cardano-node is still running.

It is still possible to get the details of the wallet which did not crash. But listing wallets fails.

Luckily I have been running with | tee -a cardano-wallet.log, so at least there is an Info-level log to look through.

More details to come...

@piotr-iohk
Copy link
Contributor Author

@rvl I observe it pretty regularly on Ubuntu test env I'm using. If there are any DEBUG level logs that are worth turning on let me know. Attaching info - level wallet log from an "incident":

wallet.log

Relevant excerpt (I think):

[cardano-wallet.network:Warning:670] [2020-07-02 18:37:58.44 UTC] Connection lost with the node.
[cardano-wallet.wallet-engine:Error:29] [2020-07-02 18:37:58.44 UTC] 73d38c71: Unexpected error following the chain: ExceptionInLinkedThread ThreadId 665 (MuxError {errorType = MuxBearerClosed, errorMsg = "<socket: 16> closed when reading data, waiting on next header True", errorStack = []})
[cardano-wallet.network:Notice:29] [2020-07-02 18:37:58.44 UTC] Destroying cursor connection at ThreadId 665
[cardano-wallet.wallet-engine:Error:55] [2020-07-02 18:37:58.44 UTC] 1ceb45b3: Unexpected error following the chain: ExceptionInLinkedThread ThreadId 666 (MuxError {errorType = MuxBearerClosed, errorMsg = "<socket: 23> closed when reading data, waiting on next header True", errorStack = []})
[cardano-wallet.network:Notice:55] [2020-07-02 18:37:58.44 UTC] Destroying cursor connection at ThreadId 666
[cardano-wallet.wallet-engine:Error:71] [2020-07-02 18:37:58.44 UTC] bed3fc77: Unexpected error following the chain: ExceptionInLinkedThread ThreadId 668 (MuxError {errorType = MuxBearerClosed, errorMsg = "<socket: 27> closed when reading data, waiting on next header True", errorStack = []})
[cardano-wallet.network:Notice:71] [2020-07-02 18:37:58.44 UTC] Destroying cursor connection at ThreadId 668
[cardano-wallet.wallet-engine:Error:45] [2020-07-02 18:37:58.44 UTC] 1b0aa249: Unexpected error following the chain: ExceptionInLinkedThread ThreadId 670 (MuxError {errorType = MuxBearerClosed, errorMsg = "<socket: 31> closed when reading data, waiting on next header True", errorStack = []})
[cardano-wallet.network:Notice:45] [2020-07-02 18:37:58.44 UTC] Destroying cursor connection at ThreadId 670
[cardano-wallet.network:Info:29] [2020-07-02 18:37:58.44 UTC] Looking for an intersection with the node's local chain with: 01010101<-[3bfd2a29-0.0#0], b4503aca<-[df9984c1-15.10480#15000], 81cffd86<-[14e09790-15.12360#15100], 7a0a4f91<-[2c1c5d7b-15.12840#15123], 2c1c5d7b<-[22e52c78-15.12880#15124], 22e52c78<-[f2fb4d76-15.12889#15125], f2fb4d76<-[8c49d6ed-15.12911#15126], 8c49d6ed<-[c15f16be-15.12960#15127], c15f16be<-[1838faad-15.12965#15128], 1838faad<-[41787df4-15.12993#15129], 41787df4<-[f33a143f-15.13000#15130], f33a143f<-[09eaac01-15.13039#15131], 09eaac01<-[2ee8ff25-15.13040#15132], 2ee8ff25<-[67da4ece-15.13053#15133]
[cardano-wallet.network:Info:45] [2020-07-02 18:37:58.45 UTC] Looking for an intersection with the node's local chain with: 01010101<-[3bfd2a29-0.0#0], b4503aca<-[df9984c1-15.10480#15000], 81cffd86<-[14e09790-15.12360#15100], 7a0a4f91<-[2c1c5d7b-15.12840#15123], 2c1c5d7b<-[22e52c78-15.12880#15124], 22e52c78<-[f2fb4d76-15.12889#15125], f2fb4d76<-[8c49d6ed-15.12911#15126], 8c49d6ed<-[c15f16be-15.12960#15127], c15f16be<-[1838faad-15.12965#15128], 1838faad<-[41787df4-15.12993#15129], 41787df4<-[f33a143f-15.13000#15130], f33a143f<-[09eaac01-15.13039#15131], 09eaac01<-[2ee8ff25-15.13040#15132], 2ee8ff25<-[67da4ece-15.13053#15133]
[cardano-wallet.network:Info:71] [2020-07-02 18:37:58.45 UTC] Looking for an intersection with the node's local chain with: 01010101<-[3bfd2a29-0.0#0], b4503aca<-[df9984c1-15.10480#15000], 81cffd86<-[14e09790-15.12360#15100], 7a0a4f91<-[2c1c5d7b-15.12840#15123], 2c1c5d7b<-[22e52c78-15.12880#15124], 22e52c78<-[f2fb4d76-15.12889#15125], f2fb4d76<-[8c49d6ed-15.12911#15126], 8c49d6ed<-[c15f16be-15.12960#15127], c15f16be<-[1838faad-15.12965#15128], 1838faad<-[41787df4-15.12993#15129], 41787df4<-[f33a143f-15.13000#15130], f33a143f<-[09eaac01-15.13039#15131], 09eaac01<-[2ee8ff25-15.13040#15132], 2ee8ff25<-[67da4ece-15.13053#15133]
[cardano-wallet.network:Info:55] [2020-07-02 18:37:58.45 UTC] Looking for an intersection with the node's local chain with: 01010101<-[3bfd2a29-0.0#0], b4503aca<-[df9984c1-15.10480#15000], 81cffd86<-[14e09790-15.12360#15100], 7a0a4f91<-[2c1c5d7b-15.12840#15123], 2c1c5d7b<-[22e52c78-15.12880#15124], 22e52c78<-[f2fb4d76-15.12889#15125], f2fb4d76<-[8c49d6ed-15.12911#15126], 8c49d6ed<-[c15f16be-15.12960#15127], c15f16be<-[1838faad-15.12965#15128], 1838faad<-[41787df4-15.12993#15129], 41787df4<-[f33a143f-15.13000#15130], f33a143f<-[09eaac01-15.13039#15131], 09eaac01<-[2ee8ff25-15.13040#15132], 2ee8ff25<-[67da4ece-15.13053#15133]
[cardano-wallet.network:Warning:673] [2020-07-02 18:37:58.45 UTC] Connection lost with the node.
[cardano-wallet.network:Warning:716] [2020-07-02 18:37:58.45 UTC] Connection lost with the node.
[cardano-wallet.pools-engine:Error:102] [2020-07-02 18:37:58.45 UTC] Unexpected error following the chain: ExceptionInLinkedThread ThreadId 673 (MuxError {errorType = MuxBearerClosed, errorMsg = "<socket: 35> closed when reading data, waiting on next header True", errorStack = []})
[cardano-wallet.network:Notice:102] [2020-07-02 18:37:58.45 UTC] Destroying cursor connection at ThreadId 673
[cardano-wallet.pools-engine:Error:102] [2020-07-02 18:37:58.45 UTC] Chain follower exited with error. Worker will no longer monitor stake pools.
[cardano-wallet.pools-engine:Notice:102] [2020-07-02 18:37:58.45 UTC] Worker has exited: main action is over.
[cardano-wallet.wallet-engine:Error:90] [2020-07-02 18:37:58.46 UTC] b1fb8632: Unexpected error following the chain: ExceptionInLinkedThread ThreadId 716 (MuxError {errorType = MuxBearerClosed, errorMsg = "<socket: 36> closed when reading data, waiting on next header True", errorStack = []})
[cardano-wallet.network:Notice:90] [2020-07-02 18:37:58.46 UTC] Destroying cursor connection at ThreadId 716
[cardano-wallet.network:Info:90] [2020-07-02 18:37:58.46 UTC] Looking for an intersection with the node's local chain with: 01010101<-[3bfd2a29-0.0#0], b4503aca<-[df9984c1-15.10480#15000], 81cffd86<-[14e09790-15.12360#15100], 7a0a4f91<-[2c1c5d7b-15.12840#15123], 2c1c5d7b<-[22e52c78-15.12880#15124], 22e52c78<-[f2fb4d76-15.12889#15125], f2fb4d76<-[8c49d6ed-15.12911#15126], 8c49d6ed<-[c15f16be-15.12960#15127], c15f16be<-[1838faad-15.12965#15128], 1838faad<-[41787df4-15.12993#15129], 41787df4<-[f33a143f-15.13000#15130], f33a143f<-[09eaac01-15.13039#15131], 09eaac01<-[2ee8ff25-15.13040#15132], 2ee8ff25<-[67da4ece-15.13053#15133]
[cardano-wallet.network:Notice:11] [2020-07-02 18:37:58.69 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Notice:13] [2020-07-02 18:37:58.69 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Notice:814] [2020-07-02 18:37:58.70 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Notice:816] [2020-07-02 18:37:58.70 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Notice:817] [2020-07-02 18:37:58.70 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Notice:820] [2020-07-02 18:37:58.70 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Notice:824] [2020-07-02 18:37:58.71 UTC] Couldn't connect to node (x2). Retrying in a bit...
[cardano-wallet.network:Warning:13] [2020-07-02 18:37:58.94 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:11] [2020-07-02 18:37:58.94 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:814] [2020-07-02 18:37:58.95 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:816] [2020-07-02 18:37:58.95 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:817] [2020-07-02 18:37:58.95 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:820] [2020-07-02 18:37:58.95 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:824] [2020-07-02 18:37:58.96 UTC] Couldn't connect to node (x3). Retrying in a bit...
[cardano-wallet.network:Warning:11] [2020-07-02 18:37:59.44 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:13] [2020-07-02 18:37:59.44 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:816] [2020-07-02 18:37:59.45 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:820] [2020-07-02 18:37:59.45 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:817] [2020-07-02 18:37:59.45 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:814] [2020-07-02 18:37:59.45 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:824] [2020-07-02 18:37:59.46 UTC] Couldn't connect to node (x4). Retrying in a bit...
[cardano-wallet.network:Warning:13] [2020-07-02 18:38:00.19 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:11] [2020-07-02 18:38:00.19 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:816] [2020-07-02 18:38:00.20 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:820] [2020-07-02 18:38:00.20 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:817] [2020-07-02 18:38:00.20 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:814] [2020-07-02 18:38:00.20 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:824] [2020-07-02 18:38:00.21 UTC] Couldn't connect to node (x5). Retrying in a bit...
[cardano-wallet.network:Warning:11] [2020-07-02 18:38:01.44 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Info:11] [2020-07-02 18:38:01.44 UTC] conn::/root/node/relay/node.socket Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:13] [2020-07-02 18:38:01.44 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Info:13] [2020-07-02 18:38:01.44 UTC] conn::/root/node/relay/node.socket Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Info:13] [2020-07-02 18:38:01.44 UTC] conn::/root/node/relay/node.socket Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:11] [2020-07-02 18:38:01.44 UTC] conn::/root/node/relay/node.socket Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Warning:816] [2020-07-02 18:38:01.45 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Info:816] [2020-07-02 18:38:01.45 UTC] conn::/root/node/relay/node.socket Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:820] [2020-07-02 18:38:01.45 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Info:820] [2020-07-02 18:38:01.45 UTC] conn::/root/node/relay/node.socket Send MsgProposeVersions (fromList [(NodeToClientV_2,TInt 42)])
[cardano-wallet.network:Warning:817] [2020-07-02 18:38:01.45 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Warning:814] [2020-07-02 18:38:01.45 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Warning:824] [2020-07-02 18:38:01.46 UTC] Couldn't connect to node (x6). Retrying in a bit...
[cardano-wallet.network:Info:816] [2020-07-02 18:38:01.47 UTC] conn::/root/node/relay/node.socket Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:820] [2020-07-02 18:38:01.47 UTC] conn::/root/node/relay/node.socket Recv MsgAcceptVersion NodeToClientV_2 (TInt 42)
[cardano-wallet.network:Info:55] [2020-07-02 18:38:01.47 UTC] Intersection found: 67da4ece
[cardano-wallet.network:Info:71] [2020-07-02 18:38:01.47 UTC] Intersection found: 67da4ece
[cardano-wallet.wallet-engine:Error:19] [2020-07-02 18:38:01.47 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 814 Network.Socket.connect: <socket: 27>: resource exhausted (Resource temporarily unavailable)
[cardano-wallet.wallet-engine:Error:33] [2020-07-02 18:38:01.47 UTC] Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 817 Network.Socket.connect: <socket: 27>: resource exhausted (Resource temporarily unavailable)

@piotr-iohk
Copy link
Contributor Author

I saw manifestation of this in Daedalus few times. In Daedalus it looks then as if all wallets suddenly disappeared. That is because listing Shelley wallets in case of this error results in:

Code = 500,
Message = {"code":"wallet_not_responding","message":"That's embarassing. My associated worker for e9bc1ad5c4273eed7813483a40342e01ad47e243 is no longer responding. 
This is not something that is supposed to happen. 
The worker must have left a trace in the logs of severity 'Error' when it died which might explain the cause. 
Said differently, this wallet won't be accessible until the server is restarted but there are good chances it'll recover itself upon restart."}

Luckily restarting Daedalus usually helps and everything is back to normal.

@KtorZ
Copy link
Member

KtorZ commented Jul 27, 2020

@rvl From a few experiments I conducted the other day (observing crashes in the integration test cluster), I observed 3 failures that had the same symptoms and led to a "wallet_not_responding" kind of situation.

Symptoms were, at some point in time, the wallet would fail to acquire a point on chain:

Error when querying local state parameters for DelegationRewardsClient: AcquireFailurePointNotOnChain

And, within a few seconds later, the worker would suddenly exits with a:

Worker has exited unexpectedly: ExceptionInLinkedThread ThreadId 1972 Network.Socket.connect: <socket: 31>: resource exhausted (Resource temporarily unavailable)

With no clear sign that any re-connection attempts did occur or anything really abnormal happened (the acquire failure is itself so-to-speak expected under some circumstances and shouldn't have any major impact on the life of the worker).

From there, any subsequent request would return a 500 Internal Server Error with wallet_not_responding as a code, including for newly created wallets after the worker of an existing wallet initially crashed. This suggests some sort of failure on a shared components between wallets (possibly, the network layer?).

@Anviking
Copy link
Member

Anviking commented Jul 27, 2020

If I run 3 instances of

set -x CARDANO_NODE_SOCKET_PATH /private/var/folders/h_/mlzym27d5lz_fx3ph2d547700000gn/T/test-7d0bcbd6e21b57e6/node/node.socket
while [ 1 ]
do
  cardano-cli shelley query tip --testnet-magic 764824073
done

against the relay node while the integration tests are running, I get many

cardano-cli: Network.Socket.connect: <socket: 12>: does not exist (Connection refused)

, but that is different from Resource temporarily unavailable

My original suspicious was whether our strategy to close-and re-open the connection when the wallet worker rolls back actually was sound, as we can rollback often.

@rvl rvl self-assigned this Jul 27, 2020
iohk-bors bot added a commit that referenced this issue Jul 27, 2020
1961: Handle Network.Socket.connect: resource exhausted exceptions r=KtorZ a=rvl

### Issue Number

Relates to #1708

### Overview

- Adds regression test
- Fixes unhandled exception resulting from parallel local snocket connections (see issue description for more details).


Co-authored-by: Rodney Lorrimar <[email protected]>
Co-authored-by: IOHK <[email protected]>
rvl added a commit that referenced this issue Jul 28, 2020
rvl added a commit that referenced this issue Jul 28, 2020
iohk-bors bot added a commit that referenced this issue Jul 28, 2020
1969: Add more HasCallStack to shelley network layer r=KtorZ a=rvl

### Issue Number

Relates to #1708

### Overview

- Adds HasCallStack annotations where possible, based on the principle that future bugs often appear in the same place as past bugs.
- Make the parallel socket connections test more robust by removing the threadDelay.

### Comments

- I was unable to add `HasCallStack` to `connectClient`. This requires a change in the ouroboros-network library.


Co-authored-by: Rodney Lorrimar <[email protected]>
@piotr-iohk
Copy link
Contributor Author

lgtm. I haven't seen that lately 🎉

@Fell-x27
Copy link

Fell-x27 commented Jul 30, 2021

After a year its still actual.
[cardano-wallet.network:Warning:958] [2021-07-30 13:26:58.03 UTC] Connection lost with the node. Network.Socket.recvBuf: resource vanished (Connection reset by peer)

and then worker becomes crushed

@piotr-iohk
Copy link
Contributor Author

Thanks. We are aware of this, it's been tracked in our Jira backlog (ADP-871) and is currently our highest priority bug... unfortunately a little bit on hold recently due to Alonzo work.

@34r7h
Copy link

34r7h commented Feb 8, 2022

plz update if possible

@piotr-iohk
Copy link
Contributor Author

The issue is fixed since v2021-11-11, and should be good also in current wallet version v2022-01-18.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants