etcd 3.5 and upgrade grpc/protobuf #3051

Merged
merged 10 commits into from
Mar 7, 2022

crazy-max
Member

@crazy-max crazy-max commented Jan 30, 2022

follow-up #3039

closes #2911
closes #3039
closes #3043

update etcd to 3.5 to make grpc/protobuf compatible.

errordeveloper and others added 6 commits January 30, 2022 14:53
Files in GOPATH have read-only permissions, so simply copying
them can result in an access error.

Signed-off-by: Ilya Dmitrichenko <[email protected]>
(cherry picked from commit 1646833)
Signed-off-by: Ilya Dmitrichenko <[email protected]>
(cherry picked from commit 4ee4f47)
Signed-off-by: Ilya Dmitrichenko <[email protected]>
(cherry picked from commit 25e88b5)
Signed-off-by: Ilya Dmitrichenko <[email protected]>
(cherry picked from commit 4b579ac)
@errordeveloper
Contributor

errordeveloper commented Jan 31, 2022

I attempted an etcd update earlier, and I recall that it worked before I went on to introduce vendor.mod.

However, at the time I was aiming to make etcd 3.4 work with a newer version of gRPC, to avoid making major changes in the SwarmKit code.

@crazy-max crazy-max force-pushed the grpc-update branch 2 times, most recently from dcbee72 to 01ba48a Compare January 31, 2022 22:42
@crazy-max
Member Author

@dperny Only 3 tests failing now if you have an idea: https://app.circleci.com/pipelines/github/docker/swarmkit/385/workflows/91ffe05e-57f8-4435-969b-f4327a02cbaf/jobs/10750?invite=true#step-108-156

--- FAIL: TestRaftSnapshotRestart (14.19s)
    testutils.go:97: 
        	Error Trace:	testutils.go:97
        	            				storage_test.go:191
        	Error:      	Received unexpected error:
        	            	state does not match on all nodes
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster.func1
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:81
        	            	github.com/docker/swarmkit/testutils.PollFuncWithTimeout
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:22
        	            	github.com/docker/swarmkit/testutils.PollFunc
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:60
        	            	github.com/docker/swarmkit/manager/state/raft_test.TestRaftSnapshotRestart
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/storage_test.go:191
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1259
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1581
        	            	polling failed
        	            	github.com/docker/swarmkit/testutils.PollFuncWithTimeout
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/docker/swarmkit/testutils.PollFunc
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:60
        	            	github.com/docker/swarmkit/manager/state/raft_test.TestRaftSnapshotRestart
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/storage_test.go:191
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1259
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1581
        	Test:       	TestRaftSnapshotRestart
--- FAIL: TestRaftSnapshot (10.33s)
    testutils.go:97: 
        	Error Trace:	testutils.go:97
        	            				testutils.go:457
        	            				storage_test.go:75
        	Error:      	Received unexpected error:
        	            	state does not match on all nodes
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster.func1
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:81
        	            	github.com/docker/swarmkit/testutils.PollFuncWithTimeout
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:22
        	            	github.com/docker/swarmkit/testutils.PollFunc
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:60
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.AddRaftNode
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:457
        	            	github.com/docker/swarmkit/manager/state/raft_test.TestRaftSnapshot
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/storage_test.go:75
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1259
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1581
        	            	polling failed
        	            	github.com/docker/swarmkit/testutils.PollFuncWithTimeout
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/docker/swarmkit/testutils.PollFunc
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:60
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.AddRaftNode
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:457
        	            	github.com/docker/swarmkit/manager/state/raft_test.TestRaftSnapshot
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/storage_test.go:75
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1259
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1581
        	Test:       	TestRaftSnapshot
--- FAIL: TestRaftSnapshotForceNewCluster (10.30s)
    testutils.go:97: 
        	Error Trace:	testutils.go:97
        	            				storage_test.go:314
        	Error:      	Received unexpected error:
        	            	state does not match on all nodes
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster.func1
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:81
        	            	github.com/docker/swarmkit/testutils.PollFuncWithTimeout
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:22
        	            	github.com/docker/swarmkit/testutils.PollFunc
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:60
        	            	github.com/docker/swarmkit/manager/state/raft_test.TestRaftSnapshotForceNewCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/storage_test.go:314
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1259
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1581
        	            	polling failed
        	            	github.com/docker/swarmkit/testutils.PollFuncWithTimeout
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/docker/swarmkit/testutils.PollFunc
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:60
        	            	github.com/docker/swarmkit/manager/state/raft_test.TestRaftSnapshotForceNewCluster
        	            		/home/circleci/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/storage_test.go:314
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1259
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1581
        	Test:       	TestRaftSnapshotForceNewCluster
FAIL

@dperny
Collaborator

dperny commented Feb 7, 2022

I started tracking this test error down. Specifically, the problem occurs in these lines:

https://github.com/docker/swarmkit/blob/d67fed1575fc219da2f512f3485d0a8ccad8fe67/manager/state/raft/testutils/testutils.go#L80-L82

By modifying this to output the actual values, I see that specifically the problem is this:

state does not match on all nodes: [ cur.Lead: 4558996606769285775, prev.Lead: 4558996606769285775; cur.Term: 2, prev.Term: 2; cur.Applied: 0, prev.Applied: 10 ]

so, cur.Applied is not equal to prev.Applied.
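
As a rough illustration of the kind of check being described, here is a minimal sketch that compares leader, term, and applied index across nodes and reports the actual values on a mismatch. The `nodeStatus` type and `checkConsistent` function are hypothetical stand-ins written for this example; they are not the SwarmKit test helper or the etcd raft status type.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeStatus is a hypothetical stand-in for the few fields of a raft node's
// status that the check compares; it is not the etcd raft status type.
type nodeStatus struct {
	Lead    uint64
	Term    uint64
	Applied uint64
}

// checkConsistent returns nil when every node reports the same leader, term,
// and applied index. Otherwise it returns an error that includes the values
// of the first pair of nodes that disagree.
func checkConsistent(statuses []nodeStatus) error {
	if len(statuses) == 0 {
		return errors.New("no nodes to compare")
	}
	prev := statuses[0]
	for _, cur := range statuses[1:] {
		if cur.Lead != prev.Lead || cur.Term != prev.Term || cur.Applied != prev.Applied {
			return fmt.Errorf(
				"state does not match on all nodes: [ cur.Lead: %d, prev.Lead: %d; cur.Term: %d, prev.Term: %d; cur.Applied: %d, prev.Applied: %d ]",
				cur.Lead, prev.Lead, cur.Term, prev.Term, cur.Applied, prev.Applied)
		}
		prev = cur
	}
	return nil
}

func main() {
	statuses := []nodeStatus{
		{Lead: 1, Term: 2, Applied: 10},
		{Lead: 1, Term: 2, Applied: 0}, // a node that has not applied any entries yet
	}
	fmt.Println(checkConsistent(statuses))
}
```

With the values in the error message, it is immediately visible which of the three fields diverges, which is how the Applied mismatch above was spotted.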

Eliding some of the intermediate steps, the value of Applied is ultimately only ever set by calling the raftLog.appliedTo method, which itself is only ever called, both before and after the update, in this code branch:

https://github.com/etcd-io/etcd/blob/986a2b51f4e87fe72e3fa2e85394dd659268dfcb/raft/node.go#L399-L402

Digging through the changes, this code gets shuffled around and changed a lot between these updates, but most notably, this is the first time the actual behavior of the code changes independent of refactoring:

etcd-io/etcd@7a8ab37

See raft/rawnode.go

This is introduced in etcd-io/etcd#10063. In the comments, the PR author says this:

The applied index should be allowed to lag behind the commit index

This leads me to believe, even with almost no understanding of the raft library, that something about this change may be the root of the issue. I will have to spend more time digging into understanding the Raft library to truly understand if this is the case and if so what the fix is, though.

@thaJeztah
Member

Hm... I'm definitely not an expert in this area myself. I did a bit of searching through git history for this code.

Looks like this check was originally added by @LK4D4 in 92a3aa6 (part of #180)

raft: wait for cluster readiness on each added node
Somehow concurrent join creates deadlock; raft event loop stops with
state:
node 1: term 2, leader 1
node 2: term 2, leader 1
node 3: term 2, leader 3

Also:
* changed polling to less cpu-expensive
* changed back raft tick because on slow machines it leads to
  reelections

Some tweaks/changes in this area were made after that by @aaronlehmann in 2b62941 and 547188b (#255) to address #182

Those commits:

  • first commit: Use an artificial timebase for raft tests

  • second commit added the cur.Applied != prev.Applied check, and mentions:

    Make waitForCluster wait until all members have applied the same portion of the log to their state machine.

(sorry for the ping, @LK4D4 @aaronlehmann - just in case you recall more about "how this works", and/or have ideas on the things that @dperny mentioned)

@thaJeztah
Member

Ticket opened in etcd; etcd-io/etcd#13741

@dperny
Collaborator

dperny commented Mar 7, 2022

@crazy-max The fix, it turns out, is like 2 lines.

Here is the link to the comparison. The commit message has the explanation.

crazy-max/swarmkit@grpc-update...dperny:grpc-update-raft

@crazy-max
Member Author

> @crazy-max The fix, it turns out, is like 2 lines.
>
> Here is the link to the comparison. The commit message has the explanation.
>
> crazy-max/swarmkit@grpc-update...dperny:grpc-update-raft

Ok thx! Was thinking dbd9df1 would be ok.

@dperny
Collaborator

dperny commented Mar 7, 2022

Your commit works as well, either way is fine.

@codecov

codecov bot commented Mar 7, 2022

Codecov Report

Merging #3051 (12b19ee) into master (d67fed1) will increase coverage by 0.10%.
The diff coverage is 84.21%.

❗ Current head 12b19ee differs from pull request most recent head 310cbe4. Consider uploading reports for the commit 310cbe4 to get more accurate results

@@            Coverage Diff             @@
##           master    #3051      +/-   ##
==========================================
+ Coverage   62.18%   62.29%   +0.10%     
==========================================
  Files         155      155              
  Lines       24533    24539       +6     
==========================================
+ Hits        15257    15286      +29     
+ Misses       7678     7643      -35     
- Partials     1598     1610      +12     

@crazy-max crazy-max force-pushed the grpc-update branch 2 times, most recently from 92449d7 to 12b19ee Compare March 7, 2022 20:22
@crazy-max crazy-max marked this pull request as ready for review March 7, 2022 20:35
@crazy-max
Member Author

PTAL @dperny @thaJeztah

After 3.3.x, etcd made a small change to the raft library that broke
Swarmkit. It also, as it turns out, broke their raft example.

The core issue is that a snapshot has an embedded ConfState from when
the snapshot is created. This ConfState, as it turns out, is not
supposed to be the one from when the snapshot was made. It should be the
one from when the snapshot is sent, the current ConfState.

When adding a new node to the quorum, the node must be caught up using a
snapshot. Previously, we were sending the snapshot exactly as it was
taken. However, because the snapshot predates the node's membership
in the cluster, the ConfState does not have the new node in it.

The change to the raft library was that it began checking the
snapshot ConfState and rejecting snapshots where the node was missing
from the ConfState. The fix, as mentioned above, is simply to overwrite
the ConfState from the snapshot with the current ConfState before
sending.

Signed-off-by: Drew Erny <[email protected]>
(cherry picked from commit b7c49a6)
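
The fix described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged change: `snapshot`, `snapshotMetadata`, and `confState` are simplified, hypothetical stand-ins for the raft library's messages, the `Voters` field name is used only for illustration, and `withCurrentConfState` is a made-up helper name.

```go
package main

import "fmt"

// confState, snapshotMetadata, and snapshot are simplified, hypothetical
// stand-ins for the raft library's messages; only the fields needed for
// this illustration are modelled.
type confState struct {
	Voters []uint64
}

type snapshotMetadata struct {
	ConfState confState
	Index     uint64
	Term      uint64
}

type snapshot struct {
	Data     []byte
	Metadata snapshotMetadata
}

// withCurrentConfState returns a copy of snap whose embedded ConfState is
// replaced by the current cluster membership. The stored snapshot itself is
// left unchanged.
func withCurrentConfState(snap snapshot, current confState) snapshot {
	out := snap
	out.Metadata.ConfState = confState{
		Voters: append([]uint64(nil), current.Voters...), // copy to avoid aliasing
	}
	return out
}

// contains reports whether id is present in ids.
func contains(ids []uint64, id uint64) bool {
	for _, v := range ids {
		if v == id {
			return true
		}
	}
	return false
}

func main() {
	// Snapshot taken while the cluster had three members.
	stored := snapshot{Metadata: snapshotMetadata{
		ConfState: confState{Voters: []uint64{1, 2, 3}},
		Index:     10,
		Term:      2,
	}}
	// Node 4 joined after the snapshot was taken.
	current := confState{Voters: []uint64{1, 2, 3, 4}}

	outgoing := withCurrentConfState(stored, current)
	fmt.Println(contains(stored.Metadata.ConfState.Voters, 4),
		contains(outgoing.Metadata.ConfState.Voters, 4))
}
```

Returning a modified copy rather than mutating the stored snapshot is a design choice of this sketch; the point is only that the snapshot sent to the new node carries a ConfState that includes that node.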
@dperny dperny merged commit 616e8db into moby:master Mar 7, 2022
Member

@thaJeztah thaJeztah left a comment


LGTM

@crazy-max crazy-max deleted the grpc-update branch March 7, 2022 22:30
4 participants