Client v6 occasionally crashes with concurrent map read and map write #399

Open
Gaudeamus opened this issue Mar 26, 2023 · 11 comments

Gaudeamus commented Mar 26, 2023

Hello,
Working with a remote cluster on a slow network, I regularly hit a problem that crashes the entire app.
Client version: v6.12.0

fatal error: concurrent map read and map write

goroutine 87917046 [running]:
github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137

The same issue occurred with v5:

github.com/aerospike/aerospike-client-go/v5.PartitionForWrite(0x0?, 0xc0a44b3830, 0xc4d432d800)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/partition.go:52 +0x96
github.com/aerospike/aerospike-client-go/v5.newExecuteCommand(0x7?, 0xc0a44b3830, 0x3?, {0x146b1ff, _}, {_, _}, _)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/execute_command.go:35 +0x9a
github.com/aerospike/aerospike-client-go/v5.(*Client).Execute(0xc3ec36ab40, 0xc01bc20970?, 0xc3bc5f?, {0x146b1ff, 0x9}, {0x146dec9, 0xb}, {0xc4d69cecb0, 0x1, 0x1})
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/client.go:738 +0x25c

khaf (Collaborator) commented Mar 27, 2023

Thanks for your feedback; I've been working on this since yesterday. Could you provide a bit more information about your cluster? How many nodes do you have, and what usually prompts the issue (nodes departing the cluster, etc.)?

khaf (Collaborator) commented Mar 27, 2023

Also, could you please paste the whole panic message so that I can see where the concurrent reads and writes are occurring? The current panic message does not include the part where the write is happening.

khaf added the bug label on Mar 27, 2023

Gaudeamus (Author) commented

Hi @khaf!

Cluster:
4 nodes with Ubuntu 20.04
Aerospike Enterprise Edition build 6.2.0.3

Here are all 4 Aerospike-related goroutines from the panic log:

        goroutine 87917046 [running]:
        github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
        github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc11530fa54?, {0xc25e780d20?, 0x12424a0?, 0xc7f41bb2d0?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc11530fa54, 0x6}, {0xc380a96618?, 0x12?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc11530fa54?, 0x0?}, {0xc380a96618?, 0x1?})
                /src/services/.../db/aerospike.go:64 +0xea


        goroutine 88112780 [runnable]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized.func1()
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489
        created by github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489 +0xcd

        goroutine 3062 [select]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized(0xc29bc9efc0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:515 +0x13a
        github.com/aerospike/aerospike-client-go/v6.NewCluster(0xc0a23a6d00, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:144 +0x72a
        github.com/aerospike/aerospike-client-go/v6.NewClientWithPolicyAndHost(0x12ecce0?, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:88 +0xf9
        .../aerospike.(*Session).connect.func1()
                /src/aerospike/aerospike.go:198 +0xbc
        .../aerospike.(*Session).connect(0xc0999c7a40, {0xc6095e6468, 0x3, 0x3})
                /src/aerospike/aerospike.go:209 +0x119
        .../aerospike.(*Session).watch.func1()
                /src/aerospike/aerospike.go:271 +0x229
        .../aerospike.(*Session).watch
                /src/aerospike/aerospike.go:241 +0x16a

        goroutine 65471388 [sleep]:
        time.Sleep(0xf4240)
                /usr/local/go/src/runtime/time.go:195 +0x135
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).executeAt(0xc056e81e40, {0x17aadb0, 0xc056e81e40}, 0xc0999c7a90, 0x0?, {0x0?, 0x0?, 0x217c960?}, 0x0?, 0x0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2472 +0x239
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).execute(0x0?, {0x17aadb0, 0xc056e81e40}, 0x0?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2440 +0x8a
        github.com/aerospike/aerospike-client-go/v6.(*readCommand).Execute(...)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:264
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012400400?, {0xc2aecac730, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:360 +0x259
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc2ee1a42b4?, {0xc2aecac730?, 0x12424a0?, 0xc29303ee70?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc2ee1a42b4, 0x6}, {0xc65be927e0?, 0x2084400?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc2ee1a42b4?, 0x0?}, {0xc65be927e0?, 0x18?})
                /src/services/.../db/aerospike.go:64 +0xea

Below are two logs with some events and timestamps surrounding the crash.
main log:

        34311  2023/03/24 22:48:50 last logger message
        ...crash...
        41762  2023/03/24 22:49:09 New logger started

aerospike.log

        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:32 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18548->10.152.129.108:3000: i/o timeout
        2023/03/24 22:41:54 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18734->10.152.129.108:3000: i/o timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:46:50 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:52:20 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: Timeout

khaf (Collaborator) commented Mar 30, 2023

Just a heads up that I think I have identified the root cause of this issue, and the potential fix is coming with the next release early next week.

Gaudeamus (Author) commented

Hi @khaf! Could you share any updates, please?

khaf (Collaborator) commented May 31, 2023

@Gaudeamus Sorry to have been unresponsive; I thought I had replied to you and don't know how I managed to miss your message. We were dealing with a few other issues and releasing the Go client kind of fell through the cracks. I will release the fix this week.

khaf (Collaborator) commented Oct 30, 2024

Sorry that this issue has fallen through the cracks. I just went through the client code, and it is not immediately clear to me how this issue can happen outside of constantly closing and opening a new client object.

@Gaudeamus Are you creating a new Client object on each request? Or are you creating one and reusing that same object for all your requests?

Gaudeamus (Author) commented

The Client is stored and reused after creation, although a watching routine periodically creates a new client and closes the old one. Because of poor connectivity, some nodes often become unavailable.
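
A minimal sketch of the pattern described above, assuming hypothetical names (Session, swapClient) rather than the reporter's actual code: request handlers reuse the stored client while a watch routine builds a new client from the updated host list and swaps it in. Closing the old client immediately, as this sketch does, is the step that can race with commands still in flight on it.

        package session

        import (
            "sync"

            as "github.com/aerospike/aerospike-client-go/v6"
        )

        // Session holds the client that request handlers reuse.
        type Session struct {
            mu     sync.RWMutex
            client *as.Client
        }

        // Client returns the currently active client.
        func (s *Session) Client() *as.Client {
            s.mu.RLock()
            defer s.mu.RUnlock()
            return s.client
        }

        // swapClient connects to the new host list and swaps in the new client.
        func (s *Session) swapClient(policy *as.ClientPolicy, hosts []*as.Host) error {
            newClient, err := as.NewClientWithPolicyAndHost(policy, hosts...)
            if err != nil {
                return err
            }

            s.mu.Lock()
            old := s.client
            s.client = newClient
            s.mu.Unlock()

            if old != nil {
                old.Close() // racy: Get/Execute calls may still be using the old client
            }
            return nil
        }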

khaf (Collaborator) commented Nov 8, 2024

@Gaudeamus I see two issues here:

  1. It seems that you are closing the client while there are still transactions in flight. That would cause the race condition you are seeing here (one possible workaround is sketched after this list).
  2. The issue of some nodes becoming unavailable is concerning to me. Is that a client issue or an infra issue? Do you mean that the client does not keep up with the state of the cluster and never connects to some new nodes?
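
A hedged sketch of one way to address point 1; this is not an official client API, just an illustration. The idea is to count in-flight calls on the retiring client and close it only after they drain, assuming new requests have already been switched over to the replacement client before retire is called.

        package session

        import (
            "sync"

            as "github.com/aerospike/aerospike-client-go/v6"
        )

        // trackedClient wraps *as.Client so a retired instance is closed only
        // after every call that started on it has returned. Names are hypothetical.
        type trackedClient struct {
            inner    *as.Client
            inFlight sync.WaitGroup
        }

        // Get runs a read while holding an in-flight reference.
        func (t *trackedClient) Get(policy *as.BasePolicy, key *as.Key, bins ...string) (*as.Record, as.Error) {
            t.inFlight.Add(1)
            defer t.inFlight.Done()
            return t.inner.Get(policy, key, bins...)
        }

        // retire waits for in-flight calls to finish, then closes the client.
        // Run it in a goroutine after the new client has been swapped in.
        func (t *trackedClient) retire() {
            t.inFlight.Wait()
            t.inner.Close()
        }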

Gaudeamus (Author) commented

  1. After your last questions it occurred to me that I need to be more careful with switching clients, so I changed this part so that the old client is closed only after a one-minute delay from when the new one was created (see the sketch after this list). I don't know if there is a better solution, but it seems to be working. Please point me to any related documentation.

  2. It's an infra issue. We have a Consul client that holds Aerospike's active hosts, and if a network or other issue changes the list we reconnect as described above. After deeper investigation I found that the majority of node-list changes are very brief, so I can rely on the Aerospike client's own node-state tracking without setting up a new client; the client does seem to track node states and reconnect to old nodes once a network issue resolves. I assume the client works fine with a fixed list of hosts, but we still need this logic in case permanent changes to the cluster leave the hosts different from the original list. Could you please suggest a reliable way to manage this?
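
A minimal sketch of the delayed close described in point 1, assuming the same hypothetical Session holder as in the earlier sketch and a one-minute grace period: the new client is swapped in immediately, and the old one is closed only after the delay so in-flight commands can finish.

        package session

        import (
            "sync"
            "time"

            as "github.com/aerospike/aerospike-client-go/v6"
        )

        // gracePeriod mirrors the one-minute delay described above.
        const gracePeriod = time.Minute

        // Session is the same illustrative holder as in the earlier sketch.
        type Session struct {
            mu     sync.RWMutex
            client *as.Client
        }

        // swapClientDelayedClose installs a new client and closes the old one
        // only after the grace period has elapsed.
        func (s *Session) swapClientDelayedClose(policy *as.ClientPolicy, hosts []*as.Host) error {
            newClient, err := as.NewClientWithPolicyAndHost(policy, hosts...)
            if err != nil {
                return err
            }

            s.mu.Lock()
            old := s.client
            s.client = newClient
            s.mu.Unlock()

            if old != nil {
                // Give in-flight commands time to finish before closing.
                time.AfterFunc(gracePeriod, func() { old.Close() })
            }
            return nil
        }

This remains a time-based heuristic; a reference count like the trackedClient sketch above gives a deterministic guarantee instead of a fixed delay.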

khaf (Collaborator) commented Nov 8, 2024

  1. So I assume the issue is resolved.
  2. People usually use a DNS round-robin record and rotate nodes in and out of it, so you just use the one seed (or two, for availability) and the rest will work from there (a sketch follows below).
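
A minimal sketch of that single-seed approach, assuming a hypothetical round-robin DNS name aerospike.service.consul and the default port 3000; the client discovers the rest of the cluster from that one seed.

        package main

        import (
            "log"
            "time"

            as "github.com/aerospike/aerospike-client-go/v6"
        )

        func main() {
            policy := as.NewClientPolicy()
            policy.Timeout = 10 * time.Second // generous initial-connect timeout for a slow network (illustrative)

            // aerospike.service.consul is a hypothetical round-robin DNS name
            // that resolves to one or more live cluster nodes.
            seed := as.NewHost("aerospike.service.consul", 3000)

            client, err := as.NewClientWithPolicyAndHost(policy, seed)
            if err != nil {
                log.Fatal(err)
            }
            defer client.Close()

            log.Printf("connected, nodes: %v", client.GetNodeNames())
        }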
