Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix state.commit is out of range on restart #11888

Merged
merged 8 commits into from
May 15, 2020

Conversation

jpbetz
Copy link
Contributor

@jpbetz jpbetz commented May 14, 2020

This is a rebase of #10356 (which fixes #10219), with a couple differences:

  • Write .snap files as the very first step of the snapshot process
  • On startup, directly cross check the .snap file with the WAL log snap entries to find the most recent valid snapshot to use, to make sure orphaned .snap files are ignored
  • Remove orphaned .snap.db files when they are obviously out-of-date

cc @brk0v, @jingyih

@jpbetz jpbetz changed the title [WIP] Fix state.commit is out of range on restart Fix state.commit is out of range on restart May 14, 2020
@jpbetz jpbetz requested a review from jingyih May 14, 2020 16:25
@jpbetz
Copy link
Contributor Author

jpbetz commented May 14, 2020

cc @YoyinZyc

@jpbetz jpbetz self-assigned this May 14, 2020
Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix.

etcdserver/server_test.go Outdated Show resolved Hide resolved
wal/wal.go Outdated Show resolved Hide resolved
wal/wal_test.go Show resolved Hide resolved
etcdserver/storage.go Show resolved Hide resolved
etcdserver/raft.go Outdated Show resolved Hide resolved
etcdserver/raft.go Outdated Show resolved Hide resolved
wal/wal_test.go Show resolved Hide resolved
@gyuho
Copy link
Contributor

gyuho commented May 14, 2020

Overall looks great. Some minor changes requested. Could have been caught in our failpoints testing :)

/cc @xiang90 @wenjiaswe @spzala

@jpbetz
Copy link
Contributor Author

jpbetz commented May 14, 2020

Just for confirmation, I replicated this on master (without this fix in place):

13:06:05 etcd2 | panic: 91bc3c398fb3c146 state.commit 1065 is out of range [352, 354]
13:06:05 etcd2 | goroutine 1 [running]:
13:06:05 etcd2 | go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000170e70, 0x0, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/vendor/go.uber.org/zap/zapcore/entry.go:230 +0x545
13:06:05 etcd2 | go.uber.org/zap.(*SugaredLogger).log(0xc00000efc8, 0x4, 0x107c099, 0x2b, 0xc0001ec840, 0x4, 0x4, 0x0, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/vendor/go.uber.org/zap/sugar.go:234 +0x100
13:06:05 etcd2 | go.uber.org/zap.(*SugaredLogger).Panicf(...)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/vendor/go.uber.org/zap/sugar.go:159
13:06:05 etcd2 | go.etcd.io/etcd/v3/pkg/logutil.(*zapRaftLogger).Panicf(0xc00019fb90, 0x107c099, 0x2b, 0xc0001ec840, 0x4, 0x4)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/pkg/logutil/zap_raft.go:101 +0x7d
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.(*raft).loadState(0xc0001b4a00, 0x2, 0xfd422379fda50e48, 0x429, 0x0, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/raft.go:1576 +0x1be
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.newRaft(0xc0003dfb18, 0x1)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/raft.go:371 +0xd90
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.NewRawNode(0xc0003dfb18, 0xc00000efc8, 0xc0001e1620, 0xc0001e1680)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/rawnode.go:48 +0x2f
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.RestartNode(0xc0003dfb18, 0x1245740, 0xc00019fb90)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/node.go:237 +0x43
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdserver.restartNode(0x7ffc42ddce86, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0000db680, 0x1, 0x1, 0xc0000db980, ...)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdserver/raft.go:505 +0x6ed
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdserver.NewServer(0x7ffc42ddce86, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0000db680, 0x1, 0x1, 0xc0000db980, ...)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdserver/server.go:446 +0x2c0f
13:06:05 etcd2 | go.etcd.io/etcd/v3/embed.StartEtcd(0xc0000c9b80, 0xc0001ee000, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/embed/etcd.go:203 +0xcd3
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdmain.startEtcd(0xc0000c9b80, 0x1051cca, 0x6, 0xc0000dbf01, 0x2)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdmain/etcd.go:212 +0x40
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdmain.startEtcdOrProxyV2()
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdmain/etcd.go:107 +0x25f0
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdmain.Main()
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdmain/main.go:46 +0x37
13:06:05 etcd2 | main.main()
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/main.go:28 +0x20
13:06:05 etcd2 | Terminating etcd2

To replicate the Procfile change should be GOFAIL_HTTP="127.0.0.1:1111" GOFAIL_FAILPOINTS='etcdserver/raftBeforeSaveSnap=panic("raftBeforeSaveSnap")' which differs slightly from what is documented in #10219

@jpbetz
Copy link
Contributor Author

jpbetz commented May 14, 2020

Overall looks great. Some minor changes requested.

Feedback applied. Thanks for the review!

Could have been caught in our failpoints testing :)

Where do we test that, by the way? I'd like to review it and see if there is anything that can be done to make it test this case better.

@gyuho
Copy link
Contributor

gyuho commented May 14, 2020

Where do we test that, by the way?

We've been running failpoints in our CI https://github.com/etcd-io/etcd/blob/master/functional.yaml#L4, but maybe it wasn't enough.

lgtm.

I will merge after CIs and release them with new 3.3 and 3.4.

@jpbetz
Copy link
Contributor Author

jpbetz commented May 14, 2020

Thanks @gyuho Hold on merging for just a bit. I just noticed that .snap.db files might not be getting properly cleaned up if orphaned.

@jpbetz
Copy link
Contributor Author

jpbetz commented May 14, 2020

Added one more commit to remove orphaned .snap.db files during Release().

hexIndex := strings.TrimSuffix(filepath.Base(filename), ".snap.db")
index, err := strconv.ParseUint(hexIndex, 16, 64)
if err != nil {
return fmt.Errorf("failed to parse index from .snap.db filename '%s': %w", filename, err)
Copy link
Contributor

@tedyu tedyu May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.snap.db seems to be redundant since it is suffix of filename

Also, we should continue the loop and try to release other files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitted #11899

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't just call it .snap since there are already different files with that suffix. But the purpose of this PR isn't to rethink the file suffix anyway...

Continuing if we hit an error sounds fine. This is merged now though, so that would need a separate PR.

}
// snapshot files can be orphaned if etcd crashes after writing them but before writing the corresponding
// wal log entries
snapshot, err := ss.LoadNewestAvailable(walSnaps)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change created a local snapshot variable. Now the snapshot declared on line 337 is always passed empty to validation on line 535.

serathius added a commit to serathius/etcd that referenced this pull request Dec 20, 2023
…stent index and closing database on defer

`err` variable shared throughout the NewServer function and used on line
396 to defer decision whether backend should be closed when starting
the server failed.

`snapshot` variable is first defined 407, redeclared locally on line 496 and later
again used on line 625. Creation of local variable is a bug introduced
in etcd-io#11888.

Signed-off-by: Marek Siarkowicz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Corrupted WAL and snapshot restoring process
4 participants