
roachtest: clearrange failed on master #28995

Closed
cockroach-teamcity opened this issue Aug 23, 2018 · 22 comments · Fixed by #29067
Labels: A-kv-client Relating to the KV client and the KV interface. · C-test-failure Broken test (automatically or manually discovered). · O-robot Originated from a bot. · S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading.
Milestone: 2.1

Comments

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/eb54cb65ec8da407c8ce5e971157bb1c03efd9e8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=clearrange PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=855884&tab=buildLog

	test.go:494,cluster.go:1104,clearrange.go:126: unexpected node event: 7: dead

@cockroach-teamcity cockroach-teamcity added this to the 2.1 milestone Aug 23, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Aug 23, 2018
@tbg tbg added the S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. label Aug 23, 2018
@tbg (Member) commented Aug 23, 2018

Uh oh!

F180823 06:12:32.418853 3640704 storage/replica_consistency.go:154 [n7,s7,r12021/3:/Table/53/1/41{067853-138385}] consistency check failed with 1 inconsistent replicas
goroutine 3640704 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4200e0e00, 0xc4200e0e40, 0x3de8c00, 0x1e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:965 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x46cc120, 0xc400000004, 0x3de8c5a, 0x1e, 0x9a, 0xc42edbbf80, 0x64)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:834 +0x804

@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Aug 23, 2018
tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
Got confused by this message in cockroachdb#28995. Prior to this
commit on a merge it looks like the subsuming range is
removing itself, but it's actually the subsumee.

Release note: None
@tbg (Member) commented Aug 23, 2018

More exfiltrated logs:

Node 7 ran the consistency check; n2 just notices the discrepancy (unclear who's wrong):

8931:E180823 06:12:32.138367 1126699 storage/stores_server.go:68  consistency check failed on range r12021: expected checksum 3936dca286df2eef04cb2c243ed90696ff48ecf1a792e739f9a6e8ec51ed63e5da8fb60f5c68113f2c3b1680b7418affb433191d117bf8f3051a422b610f64c2, got 551d7fa8dedcc32d213d89e0d3f9ed8f8d033d8f21e39a7a7785d701c6928b42f3ec8830ef157fee8d6ad56a3c8bfdc5fffb4e3afb52cccba137c29d9ef20d4d
8937:E180823 06:12:32.145326 1126727 storage/stores_server.go:68  consistency check failed on range r12021: expected checksum 2d45d81d6c620d6ef3ae60356446815b50622b4ea5880de1a6f468a04d97f4c68f9bcd5e51da668be03fab91e895130ac9ba494bb4ec6c5effcaff13438ed34c, got 1e020cef830a95a3372c0aa03986faa988d174953fa82d30591f7674288e0371643450201c6d78afb734c7d13289f3e254629f74a0de6ebb803c756a5334d175

A minute before that check, we see the below on n7. The soon-to-fail range subsumes its right neighbor, and shortly thereafter the log makes it look like the range is removing itself, but it's actually removing the subsumee, as it's supposed to (#29000).

clearrange/7.logs/cockroach.teamcity-855884-clearrange-0007.root.2018-08-23T06_09_31Z.012817.log
15544:I180823 06:11:21.209840 3599648 storage/replica_command.go:421  [merge,n7,s7,r12021/3:/Table/53/1/410{67853-95982}] initiating a merge of r12202:/Table/53/1/41{095982-138385} [(n6,s6):1, (n2,s2):2, (n7,s7):3, next=4, gen=842568420848] into this range
15545:I180823 06:11:21.243438 342 storage/store.go:2632  [n7,s7,r12021/3:/Table/53/1/410{67853-95982}] removing replica

Btw, the logs also have busy loops like this. I think this is perhaps due to the bad impl of RelocateRange we're using right now?

W180823 06:12:32.368173 1126175 storage/replica_command.go:1063  [merge,n2,s2,r11316/2:/Table/53/1/407{17161-26977}] while adding target n7,s7: snapshot failed: (n7,s7):?: remote couldn't accept snapshot with error: [n7,s7],r12523: cannot apply snapshot: snapshot intersects existing range [n7,s7,r12529/3:/Table/53/1/407{35450-40870}]
W180823 06:12:32.369590 1126175 storage/replica_command.go:1063  [merge,n2,s2,r11316/2:/Table/53/1/407{17161-26977}] while adding target n7,s7: snapshot failed: (n7,s7):?: remote couldn't accept snapshot with error: [n7,s7],r12523: cannot apply snapshot: snapshot intersects existing range [n7,s7,r12529/3:/Table/53/1/407{35450-40870}]
W180823 06:12:32.370987 1126175 storage/replica_command.go:1063  [merge,n2,s2,r11316/2:/Table/53/1/407{17161-26977}] while adding target n7,s7: snapshot failed: (n7,s7):?: remote couldn't accept snapshot with error: [n7,s7],r12523: cannot apply snapshot: snapshot intersects existing range [n7,s7,r12529/3:/Table/53/1/407{35450-40870}]
[the same warning repeats every millisecond or two, through 06:12:32.403839]

@tbg (Member) commented Aug 23, 2018

I was confused that I wasn't finding the consistency checker diff, and eventually figured out that the diff had some binary junk in it, so rg would ignore the file 🙈 rg -a to the rescue, and we get the following (the values had newlines in them, which I removed for legibility and because they confuse the markdown renderer -- the original file is attached in case we need to reconstruct the actual values: cockroach.log):

--- leaseholder
+++ follower
+1535003892.132212906,0 /Local/Range/Table/53/1/41119998/RangeDescriptor
+  ts:2018-08-23 05:58:12.132212906 +0000 UTC
+  value:xxx
+  raw_key:016b12bd89f9027370fe000172647363 raw_value:4e6899c40308975d1207bd89f9027370fe1a07bd89f902739756220608061006180122060808100818022206080a100a180528063002
+0.000000000,0 /Local/Range/Table/53/1/41119998/Transaction/"a992116e-bad8-42a7-9f6c-04b33de40ea2"
+  ts:<zero>
+  value:xxx
+  raw_key:016b12bd89f9027370fe000174786e2da992116ebad842a79f6c04b33de40ea2 raw_value:120408001000180020002800328101e5167c7c030a360a10a992116ebad842a79f6c04b33de40ea21a10016b12bd89f9027370fe0001726473632a0a08aae187bb83bddba6153093a705380112056d657267652a0a0888d8fe9787bddba615320a08aae187bb83bddba6153a0a08aae187bb83bddba615420e080a120a08aae187bb83bddba615480172007a00800101

Since we have three replicas, and since only n2 reported a diff, I think this means that n7 (the leaseholder) and n6 (a follower) had an empty diff. The above thus states that n2 had two extra MVCC keys that shouldn't be there any more, namely:

  • a regular version (i.e. not an intent) of /Local/Range/Table/53/1/41119998/RangeDescriptor and
  • a transaction record anchored at /Local/Range/Table/53/1/41119998

Of course since we all think this has something to do with merges, one theory is:

  1. the range with the above key (...41119998) got merged away at some point, which explains why its range descriptor version is missing on the "sane" nodes but doesn't explain why a follower still has it
  2. there's a txn anchored on that key, which might be a red herring -- perhaps we're really seeing the full diff of what this range was and those are just the two main keys; merging this range away anchors a transaction on the subsumer, not on the subsumee, so the txn record here just means that this range itself at some point subsumed someone (or split).

Time to find that range in the logs. The key pretty printer doesn't make this easy, but let's see.

tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
This very spammy log message got in the way of the investigation in cockroachdb#28995.

Release note: None
@tbg (Member) commented Aug 23, 2018

Unfortunately I think that the history of this range has been rotated out of the logs because of the spammy "remote couldn't accept snapshot with error" busy loop, which easily rotates through 20mb every minute 🙈 Added a commit to #29000.

@tbg (Member) commented Aug 23, 2018

Two more observations: ComputeChecksumRequest seems to be implemented in a bad way. It uses a key range and it seems like it could be split by DistSender:

func (*ComputeChecksumRequest) flags() int  { return isWrite | isRange }

Not sure what will happen if that actually occurs. Nothing good! I think we'll get this error from DistSender:

} else if lOK != rOK {
	return errors.Errorf("can not combine %T and %T", valLeft, valRight)
}

so it doesn't seem completely terrible, but either way, this sucks for splits, and it's even weirder for merges because the command will now declare a write only to parts of the keyspace. Again this seems fine because it really is a noop (and it shouldn't have to declare any part of the keyspace) but there's some cleanup to do.
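For illustration, here's a minimal standalone Go sketch of that distinction (the flag constants and the splittable helper are stand-ins, not the real roachpb definitions): only a request that declares isRange is a candidate for being split across cached range boundaries by DistSender, which is why the eventual fix (cockroachdb#29002, below) turns ComputeChecksum into a point request.

package main

import "fmt"

// Stand-in flag bits mirroring the request-flag pattern quoted above; the
// real constants live in pkg/roachpb.
const (
	isWrite = 1 << iota
	isRange
)

// Current declaration, as quoted above: a ranged write, which DistSender
// may split along cached range boundaries.
func computeChecksumFlagsRanged() int { return isWrite | isRange }

// Fix direction (see cockroachdb#29002 below): drop isRange so the request
// addresses a single key and can never be split.
func computeChecksumFlagsPoint() int { return isWrite }

// splittable models the relevant DistSender behavior: only ranged requests
// can span, and therefore be split across, multiple ranges.
func splittable(flags int) bool { return flags&isRange != 0 }

func main() {
	fmt.Println("ranged request splittable:", splittable(computeChecksumFlagsRanged())) // true
	fmt.Println("point request splittable: ", splittable(computeChecksumFlagsPoint()))  // false
}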

I was also worried about something else but checked that it isn't a problem: We have an optimization that avoids sending noops through Raft, and ComputeChecksum looked like a noop. But the code does the right thing and actually sends it through Raft.

// 1. proposal.command == nil indicates that the evaluation was a no-op
// and that no Raft command needs to be proposed.

@tbg (Member) commented Aug 23, 2018

Back to the original problem - the merge a minute before the checksum fatal picked up the keyspace owning the offending range descriptor key. My guess is that the problem was present in that subsumed range all along, but that the consistency checker just didn't run there.

To repro this quickly, it might be enough to add a synchronous consistency checker run from the merge queue after each merge operation. The root problem that I anticipate seeing there is that we subsume a range but that it then fails to show up on the followers (perhaps after involving snapshots?)

tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
This was another source of noise in cockroachdb#28995. It makes more sense to log a
snapshot in either the error when it fails or when we actually start
streaming it.

Release note: None
@tbg (Member) commented Aug 23, 2018

Repro'ed locally. I'm going to move on to other things, but @benesch I'm sure you can get somewhere with this script :-))

#!/bin/bash


killall -9 cockroach
rm -rf cockroach-data
mkdir cockroach-data

export COCKROACH_SCAN_INTERVAL=1m COCKROACH_SCAN_MIN_IDLE_TIME=10ms
./cockroach start --insecure --port=26257 --http-port=26258 --store=cockroach-data/1 &
./cockroach start --insecure --port=26259 --http-port=26260 --store=cockroach-data/2 --join=localhost:26257 --background
./cockroach start --insecure --port=26261 --http-port=26262 --store=cockroach-data/3 --join=localhost:26257 --background
./cockroach start --insecure --port=26263 --http-port=26264 --store=cockroach-data/4 --join=localhost:26257 --background

./bin/workload run kv --init --splits 10000 --max-ops 1
echo 'ALTER TABLE kv.kv SCATTER;' | ./cockroach sql --insecure
echo 'SET CLUSTER SETTING kv.range_merge.queue_enabled = true;' | ./cockroach sql --insecure

tail -f cockroach-data/*/logs/cockroach.log

Unfortunately I didn't get a diff thanks to the aggressive merging:

9254ac7fc78c08be603f0ea6d81516af5d149bf3a36050d14b5f607eb425596bd791c9c11bc1fe83fa7ff5b24, got 3ee4c581b968958ef2c19ac0636a787ce5d6a9a3a3a8ff1b0bc691a93d1838f3f5507f36567a011a6ce7b5d44c31cd3dabc4dabc27d0ee7a2c576d0ac2720c48
E180823 12:39:53.786829 169747 storage/replica_consistency.go:160  [merge,n2,s2,r5710/2:/Table/53/1/3{861667…-928077…}] consistency check failed with 1 inconsistent replicas; fetching details
I180823 12:39:53.792209 188 storage/store.go:3748  [n2,s2,r5710/2:/Table/53/1/3{861667…-928077…}] added to replica GC queue (peer suggestion)
I180823 12:39:53.794090 169783 storage/store.go:2656  [replicaGC,n2,s2,r5710/2:/Table/53/1/3{861667…-928077…}] removing replica r5710/2
I180823 12:39:53.794906 169783 storage/replica.go:880  [replicaGC,n2,s2,r5710/2:/Table/53/1/3{861667…-928077…}] removed 16 (0+16) keys in 1ms [clear=0ms commit=0ms]
E180823 12:39:53.850341 169747 util/log/crash_reporting.go:477  [merge,n2,s2,r5710/2:/Table/53/1/3{861667…-928077…}] Reported as error cd082357fc254624afb6d5281d3973a9
F180823 12:39:53.863366 169747 storage/replica_consistency.go:163  [merge,n2,s2,r5710/2:/Table/53/1/3{861667…-928077…}] replica inconsistency detected; could not obtain actual diff: computing own checksum: rpc error: code = Unknown desc = r5710 was not found
goroutine 169747 [running]:

This seems to repro every time and only takes <10m, so I'm not too worried about this bug any more. Diff is just like before, full log attached:

cockroach.toblerone.tschottdorf.2018-08-23T14_24_01Z.077191.log

E180823 14:29:05.440729 188297 storage/replica_consistency.go:106  [merge,n4,s4,r2723/2:/Table/53/1/-45{74042…-44863…}] replica (n3,s3):3 is inconsistent: expected checksum 05917dac5d9d974ba3819333c685a1c6873d39a14af0a82ae501b597461b9b93e4962fd1b7460c75de328270c87383248ff05ab6511bf9da6ddaf8c5277104b7, got 593e8e1e282434532b3d3f5678fbe17a5f561aad1c56d521b381a0e89a0f36bce1ec3bffa1d275ef8be9180e6327f8127e9779131403a273eecf06db2c190ed3
--- leaseholder
+++ follower
+1535034484.120969097,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor
+  ts:2018-08-23 14:28:04.120969097 +0000 UTC
+  value:xxx
+  raw_key:016b12bd8980c0b6c2e211ba5182000172647363 raw_value:884d186d03089b09120bbd8980c0b6c2e211ba51821a0bbd8980c0b9e7c5c610e99622060801100118012206080410041802220608021002180428053004
+0.000000000,0 /Local/Range/Table/53/1/-4560243296450227838/Transaction/"360b9fbf-b7d0-4577-8768-09060db15cea"
+  ts:<zero>
+  value:xxx
+  raw_key:016b12bd8980c0b6c2e211ba5182000174786e2d360b9fbfb7d04577876809060db15cea raw_value:120408001000180020002800328501c09c0f02030a3a0a10360b9fbfb7d04577876809060db15cea1a14016b12bd8980c0b6c2e211ba51820001726473632a0a0889bf85c0afb7e2a61530cb8d36380112056d657267652a0a08e7c8f2f9b6b7e2a615320a0889bf85c0afb7e2a6153a0a0889bf85c0afb7e2a615420e0802120a0889bf85c0afb7e2a615480172007a00800101
E180823 14:29:05.442104 188297 util/log/crash_reporting.go:477  [merge,n4,s4,r2723/2:/Table/53/1/-45{74042…-44863…}] Reported as error 30748ae38f1e44f285fda57da7da76a0

@tbg (Member) commented Aug 23, 2018

Just added something to always produce consistency diffs on the first attempt, running another experiment. Branch is https://github.com/tschottdorf/cockroach/tree/repro/rangemergediff (the previous repro already had the change that aggressively ran consistency checks after each merge).

@tbg (Member) commented Aug 23, 2018

timeline of interleaved logs. TL;DR: we remove the RHS, it gets GC'ed on all three replicas, consistency check fails with range descriptor and transaction anchored on ... an unexpected key, /Table/53/1/-4560243296450227838, which is ... not a boundary of either the pre-merge ranges or the post-merge range 😕

180823 14:36:43.454267 393786 storage/replica_command.go:421  [merge,n1,s1,r5276/4:/Table/53/1/-4{62044…-57404…}] initiating a merge of r2723:/Table/53/1/-45{74042344143301909-44863131432842502} [(n1,s1):1, (n2,s2):4, (n3,s3):3, next=5, gen=842431082168] into this range
180823 14:36:43.465893 255 storage/store.go:2656  [n1,s1,r5276/4:/Table/53/1/-4{62044…-57404…}] removing replica r2723/1
180823 14:36:43.465896 220 storage/store.go:2656  [n2,s2,r5276/2:/Table/53/1/-4{62044…-57404…}] removing replica r2723/4
180823 14:36:43.466052 248 storage/store.go:2656  [n3,s3,r5276/5:/Table/53/1/-4{62044…-57404…}] removing replica r2723/3
180823 14:36:43.523116 228319 storage/stores_server.go:68  consistency check failed on range r5276: expected checksum c9d56171d60017b8033fdfb4901fe5b80b87abea3dce43bea21b42365eb54c2acd3dc3eb4d70ebe8b864fc38ead5e235dc9d1baba637354f5c5fcf1ae2199c38, got cebddf05ea7ca532940f9eb696174627d0044e9e5ca2f9b468f2c9248d782669d889a2c28caf78cfbe027edafc367be49be249ac538a54e0df0bfa4282408bf4

@tbg (Member) commented Aug 23, 2018

Ok, I just confused myself. That key is the start key of r1179, which merges into r2723 just a hair before disaster strikes on r2723 (there are two such failed ranges in the logs, and I must've mixed them up, r5276 doesn't matter here). So it's really as easy as

  1. ranges [a,b) and [b,c) merge
  2. [a,c) fails consistency checks because a follower has /Txn/b and a version of /RangeDesc/b
I180823 14:29:05.344104 188297 storage/replica_command.go:421  [merge,n4,s4,r2723/2:/Table/53/1/-45{74042…-60243…}] initiating a merge of r1179:/Table/53/1/-45{60243296450227838-44863131432842502} [(n1,s1):1, (n4,s4):2, (n3,s3):5, next=6, gen=842637714136] into this range

First of all, what's up with the super high generation of 842637714136?

As an interesting datapoint, the version of the range descriptor has timestamp 2018-08-23 14:28:04.120969097 +0000 UTC, whereas the merge is initiated later, at 14:29:05.344104.

At pretty much 14:28:04, we see r1179 merge one of its neighbors:

I180823 14:28:04.120963 386055 storage/replica_command.go:421  [merge,n2,s2,r1179/4:/Table/53/1/-45{60243…-59805…}] initiating a merge of r5257:/Table/53/1/-4559{805083004873075-358311118345834} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=0] into this range

The generation counter at that time is zero, which is just something I'm observing.

Aha! I think it gets interesting. Get a load of what exactly transpires at 14:28:04.

  • 14:28:04: r1179 merges its neighbor (doesn't matter who, but it presumably leaves the intent)
  • 14:28:06: while that is ongoing, we send a preemptive snapshot to n3 (under the pre-merge keys, which will contain the txn record and presumably the range descriptor version we're seeing though I'm not sure what that is yet, but now one would guess that it's an intent)
  • the merge commits (but n3 now has this snapshot sitting around)
  • the replica change fails
  • 14:28:08: r1179/4 triggering stats recomputation to resolve delta of {ContainsEstimates:false LastUpdateNanos:1535034487961481545 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SysBytes:-21 SysCount:-1}
    Note that this is a bit odd; we don't expect these, but perhaps there's a bug in the merge computations.
  • r1179 subsumes some more (don't think this matters)
  • 14:29:05: upreplication attempted again (streaming snapshot and everything), this time it completes.
180823 14:28:04.120963 386055 storage/replica_command.go:421  [merge,n2,s2,r1179/4:/Table/53/1/-45{60243…-59805…}] initiating a merge of r5257:/Table/53/1/-4559{805083004873075-358311118345834} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=0] into this range
180823 14:28:06.781367 387468 storage/store_snapshot.go:615  [n2,s2,r1179/4:/Table/53/1/-45{60243…-59805…}] sending preemptive snapshot 30b0974d at applied index 35
180823 14:28:06.781699 387468 storage/store_snapshot.go:657  [n2,s2,r1179/4:/Table/53/1/-45{60243…-59805…}] streamed snapshot to (n3,s3):?: kv pairs: 14, log entries: 25, rate-limit: 2.0 MiB/sec, 1ms
180823 14:28:06.782464 166700 storage/replica_raftstorage.go:784  [n3,s3,r1179/?:{-}] applying preemptive snapshot at index 35 (id=30b0974d, encoded size=10351, 1 rocksdb batches, 25 log entries)
180823 14:28:06.783242 166700 storage/replica_raftstorage.go:790  [n3,s3,r1179/?:/Table/53/1/-45{60243…-59805…}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=0ms]
180823 14:28:07.960829 387468 storage/replica_command.go:803  [n2,s2,r1179/4:/Table/53/1/-45{60243…-59805…}] change replicas (ADD_REPLICA (n3,s3):5): read existing descriptor r1179:/Table/53/1/-45{60243296450227838-59805083004873075} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=842610040056]
180823 14:28:07.967378 239 storage/store.go:2656  [n2,s2,r1179/4:/Table/53/1/-45{60243…-59805…}] removing replica r5257/4
180823 14:28:07.967497 256 storage/store.go:2656  [n4,s4,r1179/2:/Table/53/1/-45{60243…-59805…}] removing replica r5257/2
180823 14:28:07.967513 245 storage/store.go:2656  [n1,s1,r1179/1:/Table/53/1/-45{60243…-59805…}] removing replica r5257/1

I still can't claim to understand what's going on, but it seems that a preemptive snapshot picked up an in-flight merge during a failed upreplication, and that this snapshot somehow isn't wiped out by a later successful upreplication. Perhaps there is a bug when applying a preemptive snapshot over an older, narrower preemptive snapshot, where we should clear out existing data but don't? Hard to believe, but now I think the time has come for @benesch to async rubber duck the above.

@tbg (Member) commented Aug 23, 2018

Ah, and I understand now why I got two consistency failures and why I was so confused for a bit. After the first consistency failure, the next replica that subsumes r2723 also barfs with the same diff, which makes sense. So it's just a cascading failure and the first one is the only one that matters.

tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
Got confused by this message in cockroachdb#28995. Prior to this
commit on a merge it looks like the subsuming range is
removing itself, but it's actually the subsumee.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
This very spammy log message got in the way of the investigation in cockroachdb#28995.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
This was another source of noise in cockroachdb#28995. It makes more sense to log a
snapshot in either the error when it fails or when we actually start
streaming it.

Release note: None
craig bot pushed a commit that referenced this issue Aug 23, 2018
29000: storage: misc logging improvements r=a-robinson a=tschottdorf

Got confused by this message in #28995. Prior to this
commit on a merge it looks like the subsuming range is
removing itself, but it's actually the subsumee.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg (Member) commented Aug 23, 2018

Reviewing the consistency checker diff: it's not particularly helpful in telling you the values of things. It won't let us know if there's an intent, for example, though if memory serves correctly an intent would imply the existence of a meta key, i.e. one at timestamp zero. We're not seeing that here; we're seeing only the stray version. I'll hack something together that decodes these.

@tbg (Member) commented Aug 23, 2018

Decoding the values from the diff here shows:

1535034484.120969097,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
	Raw:r1179:/Table/53/1/-45{60243296450227838-59358311118345834} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=842353474984]

0.000000000,0 /Local/Range/Table/53/1/-4560243296450227838/Transaction/"360b9fbf-b7d0-4577-8768-09060db15cea": "merge" id=360b9fbf key=/Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor rw=true pri=0.04127971 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535034484.120969097,0 orig=1535034484.120969097,0 max=1535034484.120969097,0 wto=false rop=false seq=1

@tbg (Member) commented Aug 23, 2018

That is the version of the range descriptor you'd expect at the point in time when the first snapshot is sent (look for "while that is ongoing" above). And the question I have is: isn't that version supposed to survive on all nodes? Sure, we'll put a deletion tombstone on top of it, but I didn't see us explicitly clear out these values anywhere. Is n3,s3 perhaps the only replica that is doing the "correct" thing, whereas the other two somehow lose the range descriptor (though deleting it also seems adequate)?

Or, of course, the consistency checker is misleading me and the problem is really that n3 is the only one lacking these keys.

But the transaction key gives me some confidence that this isn't it: the timestamp of the range descriptor version matches up precisely with that of the transaction. Perhaps the snapshot didn't pick up the merge in progress at the time (there's no intent), but it picked up the one before it that had just committed but not yet been cleaned up (we expect the txn records to go away after the intents are resolved).

@tbg (Member) commented Aug 23, 2018

More spelunking against the data directory. The offending timestamp is the only one which is present exactly once. Under the assumption that we never intentionally GC range descriptor versions during merges (GC definitely shouldn't be doing it in this short test), this is unexpected and is in line with the consistency checker's complaint:

$ for i in 1 2 3 4; do ./cockroach debug range-descriptors ./cockroach-data/$i/ | grep 4560243296450227838; done | sort | uniq -c
   4 1535034246.965795502,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4552334657046302649)
   4 1535034246.965795502,0 /Local/Range/Table/53/1/-4575831326115147147/RangeDescriptor: [/Table/53/1/-4575831326115147147, /Table/53/1/-4560243296450227838)
   4 1535034249.693565990,0 /Local/Range/Table/53/1/-4571444146205427697/RangeDescriptor: [/Table/53/1/-4571444146205427697, /Table/53/1/-4560243296450227838)
   4 1535034254.094999965,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4557375216103792942)
   4 1535034259.075947732,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
   4 1535034259.125589455,0 /Local/Range/Table/53/1/-4568854684944728787/RangeDescriptor: [/Table/53/1/-4568854684944728787, /Table/53/1/-4560243296450227838)
   4 1535034267.192375037,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559805083004873075)
   4 1535034267.332880127,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559805083004873075)
   4 1535034267.388544287,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559805083004873075)
   4 1535034403.057928275,0 /Local/Range/Table/53/1/-4574042344143301909/RangeDescriptor: [/Table/53/1/-4574042344143301909, /Table/53/1/-4560243296450227838)
  ---> 1 1535034484.120969097,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
   4 1535034487.958837164,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
   4 1535034500.047021360,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4544863131432842502)
   4 1535034545.323696450,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4544863131432842502)
   4 1535034545.332151675,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4544863131432842502)

And that single copy of the key sits on n3, as the consistency checker claimed:

$ for i in 1 2 3 4; do echo $i; ./cockroach debug range-descriptors ./cockroach-data/$i/ | grep 4560243296450227838 | grep 1535034484.120969097,0; done
1
2
3
1535034484.120969097,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
4

@tbg (Member) commented Aug 23, 2018

What's interesting is that on n3, the next higher version of the range descriptor (above the ---> line), which is present on all nodes, is identical to the one-off version except in timestamp and generation:

1535034487.958837164,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
        Raw:r1179:/Table/53/1/-45{60243296450227838-59358311118345834} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=842354297296]

1535034484.120969097,0 /Local/Range/Table/53/1/-4560243296450227838/RangeDescriptor: [/Table/53/1/-4560243296450227838, /Table/53/1/-4559358311118345834)
        Raw:r1179:/Table/53/1/-45{60243296450227838-59358311118345834} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=842354297448]

and in particular the generation moves backwards (i.e. the outlier is ahead of its successor), which I thought isn't supposed to happen.

Ugh, I've been staring at this for too long. Time to take a break.

@tbg (Member) commented Aug 23, 2018

Ah, the generation printing is messed up. This line

	fmt.Fprintf(&buf, ", next=%d, gen=%d]", r.NextReplicaID, r.Generation)

prints the Generation integer pointer as %d which basically gives you random stuff.
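To make that concrete, here's a minimal standalone Go sketch (the descriptor struct is a stand-in, not the real RangeDescriptor): %d applied to a pointer formats the address as a decimal integer, which is exactly the kind of huge, random-looking "generation" quoted above, while dereferencing prints the real value.

package main

import (
	"bytes"
	"fmt"
)

// descriptor is a stand-in for the real range descriptor; the nullable
// proto field Generation is generated as a pointer.
type descriptor struct {
	NextReplicaID int32
	Generation    *int64
}

func main() {
	gen := int64(3)
	r := descriptor{NextReplicaID: 5, Generation: &gen}

	var buggy, fixed bytes.Buffer

	// Buggy: %d on a *int64 formats the pointer's address in base 10,
	// which is where huge "generations" like 842637714136 come from.
	fmt.Fprintf(&buggy, ", next=%d, gen=%d]", r.NextReplicaID, r.Generation)

	// Fixed: dereference (with a nil check) to print the actual value.
	if r.Generation != nil {
		fmt.Fprintf(&fixed, ", next=%d, gen=%d]", r.NextReplicaID, *r.Generation)
	} else {
		fmt.Fprintf(&fixed, ", next=%d]", r.NextReplicaID)
	}

	fmt.Println("buggy:", buggy.String()) // e.g. ", next=5, gen=824634441744]"
	fmt.Println("fixed:", fixed.String()) // ", next=5, gen=3]"
}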

tbg added a commit to tbg/cockroach that referenced this issue Aug 23, 2018
Inspired by cockroachdb#28995, where I needed this.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 24, 2018
Inspired by cockroachdb#28995, where I needed this.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 24, 2018
Got confused by this message in cockroachdb#28995. Prior to this
commit on a merge it looks like the subsuming range is
removing itself, but it's actually the subsumee.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 24, 2018
This very spammy log message got in the way of the investigation in cockroachdb#28995.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 24, 2018
This was another source of noise in cockroachdb#28995. It makes more sense to log a
snapshot in either the error when it fails or when we actually start
streaming it.

Release note: None
@benesch (Contributor) commented Aug 25, 2018

Ok, check this out. Repro'd the failure with some additional log messages around when consistency computations occur. Full log is in this gist; relevant section reproduced below:

I180825 19:50:18.967983 99051 storage/replica_command.go:421  [merge,n4,s4,r9061/3:/Table/53/1/-3{28096…-12259…}] initiating a merge of r8559:/Table/53/1/-3{12259068284300941-03695145971405602} [(n2,s2):1, (n1,s1):4, (n4,s4):3, next=5, gen=1] into this range
I180825 19:50:18.995039 263 storage/store.go:2656  [n4,s4,r9061/3:/Table/53/1/-3{28096…-12259…}] removing replica r8559/3
I180825 19:50:18.995067 185 storage/store.go:2656  [n2,s2,r9061/1:/Table/53/1/-3{28096…-12259…}] removing replica r8559/1
I180825 19:50:18.995150 212 storage/store.go:2656  [n1,s1,r9061/4:/Table/53/1/-3{28096…-12259…}] removing replica r8559/4
E180825 19:50:18.996822 99051 storage/replica_consistency.go:63  [merge,n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] beginning check consistency for [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}]
E180825 19:50:18.998019 261 storage/replica_proposal.go:171  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] computing checksum at applied index 41
E180825 19:50:18.998288 234 storage/replica_proposal.go:171  [n1,s1,r9061/4:/Table/53/1/-3{28096…-03695…}] computing checksum at applied index 41
E180825 19:50:18.998339 216 storage/replica_proposal.go:171  [n2,s2,r9061/1:/Table/53/1/-3{28096…-03695…}] computing checksum at applied index 41
I180825 19:50:19.008823 99201 storage/store_snapshot.go:615  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] sending preemptive snapshot ecbec0fa at applied index 42
I180825 19:50:19.009352 99201 storage/store_snapshot.go:657  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] streamed snapshot to (n3,s3):?: kv pairs: 53, log entries: 32, rate-limit: 2.0 MiB/sec, 1ms
I180825 19:50:19.010084 120109 storage/replica_raftstorage.go:784  [n3,s3,r9061/?:{-}] applying preemptive snapshot at index 42 (id=ecbec0fa, encoded size=13305, 1 rocksdb batches, 32 log entries)
I180825 19:50:19.011125 120109 storage/replica_raftstorage.go:790  [n3,s3,r9061/?:/Table/53/1/-3{28096…-03695…}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=1ms]
I180825 19:50:19.011704 99201 storage/replica_command.go:803  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] change replicas (ADD_REPLICA (n3,s3):5): read existing descriptor r9061:/Table/53/1/-3{28096451031754326-03695145971405602} [(n2,s2):1, (n1,s1):4, (n4,s4):3, next=5, gen=2]
I180825 19:50:19.021256 99201 storage/replica.go:3743  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] proposing ADD_REPLICA((n3,s3):5): updated=[(n2,s2):1 (n1,s1):4 (n4,s4):3 (n3,s3):5] next=6
I180825 19:50:19.023110 99219 storage/replica_range_lease.go:554  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] transferring lease to s2
I180825 19:50:19.025157 192 storage/replica_proposal.go:216  [n2,s2,r9061/1:/Table/53/1/-3{28096…-03695…}] new range lease repl=(n2,s2):1 seq=8 start=1535226619.023118716,0 epo=1 pro=1535226619.023122231,0 following repl=(n4,s4):3 seq=7 start=1535226491.251059910,0 epo=1 pro=1535226491.251061975,0
I180825 19:50:19.025415 99219 storage/replica_range_lease.go:617  [n4,s4,r9061/3:/Table/53/1/-3{28096…-03695…}] done transferring lease to s2: <nil>
I180825 19:50:19.026365 117390 storage/replica_range_lease.go:554  [n2,s2,r9061/1:/Table/53/1/-3{28096…-03695…}] transferring lease to s2
I180825 19:50:19.026759 117390 storage/replica_command.go:803  [n2,s2,r9061/1:/Table/53/1/-3{28096…-03695…}] change replicas (REMOVE_REPLICA (n1,s1):4): read existing descriptor r9061:/Table/53/1/-3{28096451031754326-03695145971405602} [(n2,s2):1, (n1,s1):4, (n4,s4):3, (n3,s3):5, next=6, gen=2]
I180825 19:50:19.039757 117390 storage/replica.go:3743  [n2,s2,r9061/1:/Table/53/1/-3{28096…-03695…}] proposing REMOVE_REPLICA((n1,s1):4): updated=[(n2,s2):1 (n3,s3):5 (n4,s4):3] next=6
I180825 19:50:19.041185 767 storage/store.go:3748  [n1,s1,r9061/4:/Table/53/1/-3{28096…-03695…}] added to replica GC queue (peer suggestion)
I180825 19:50:19.041311 117390 storage/replica_command.go:421  [merge,n2,s2,r6630/1:/Table/53/1/-3{34967…-28096…}] initiating a merge of r9061:/Table/53/1/-3{28096451031754326-03695145971405602} [(n2,s2):1, (n3,s3):5, (n4,s4):3, next=6, gen=2] into this range
E180825 19:50:19.051785 218 storage/replica_proposal.go:171  [n3,s3,r9061/5:/Table/53/1/-3{28096…-03695…}] computing checksum at applied index 53
E180825 19:50:19.053416 120072 storage/stores_server.go:68  consistency check failed on range r9061: expected checksum 64f92492ec59f950aaa65b41c4a4890f3fa8c811b9816d024e517bb0ca50714617d0769bac59dca04dd001a72af88998d68cc15880ecc494cccfdb6e000a5b5f, got e3875f14e2c7f8c703da94a528911781a6587b33f573fc17bc1d4eac7888221d28bff3c811b587bf52b0f2829fd2a473995539b5e55ebadbc2b8f2eb7277ed19

So:

  1. r9061 initiates a merge with r8559. Current replicas are n1, n2, n4. Leaseholder is n4.
  2. n4 initiates a consistency check.
  3. n1, n2, and n4 compute a checksum at applied index 41.
  4. n4 sends a preemptive snapshot to n3 at applied index 42.
  5. n3 applies the snapshot.
  6. Another replica change/lease transfer occurs. Doesn't seem to matter.
  7. n3 computes a checksum at applied index 53.
  8. n4 sees this checksum, which is obviously going to differ, and explodes.

So why is n3 computing a checksum at applied index 53?! That seems extremely wrong.

It's possible, though unlikely, that the log message about computing a checksum at applied index 53 is actually a different consistency check. I'm going to add additional logging of the consistency check ID to rule that case out.

@benesch (Contributor) commented Aug 25, 2018

Theory appears to be confirmed:

cockroach-data/4/logs/cockroach.log:E180825 20:50:30.863380 254 storage/replica_proposal.go:171  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 27 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/3/logs/cockroach.log:E180825 20:50:30.863511 223 storage/replica_proposal.go:171  [n2,s2,r5750/1:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 27 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/2/logs/cockroach.log:E180825 20:50:30.863532 198 storage/replica_proposal.go:171  [n3,s3,r5750/2:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 27 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/4/logs/cockroach.log:I180825 20:50:30.873379 137945 storage/store_snapshot.go:615  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] sending preemptive snapshot 46f10776 at applied index 28
cockroach-data/4/logs/cockroach.log:I180825 20:50:30.873710 137945 storage/store_snapshot.go:657  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] streamed snapshot to (n1,s1):?: kv pairs: 39, log entries: 18, rate-limit: 2.0 MiB/sec, 1ms
cockroach-data/1/logs/cockroach.log:I180825 20:50:30.875197 399050 storage/replica_raftstorage.go:784  [n1,s1,r5750/?:{-}] applying preemptive snapshot at index 28 (id=46f10776, encoded size=8381, 1 rocksdb batches, 18 log entries)
cockroach-data/1/logs/cockroach.log:I180825 20:50:30.876085 399050 storage/replica_raftstorage.go:790  [n1,s1,r5750/?:/Table/53/1/-3{60825…-59823…}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=0ms]
cockroach-data/4/logs/cockroach.log:I180825 20:50:30.876726 137945 storage/replica_command.go:803  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] change replicas (ADD_REPLICA (n1,s1):4): read existing descriptor r5750:/Table/53/1/-3{608257894217454861-598237801595841486} [(n2,s2):1, (n3,s3):2, (n4,s4):3, next=4, gen=3]
cockroach-data/4/logs/cockroach.log:I180825 20:50:30.887205 137945 storage/replica.go:3743  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] proposing ADD_REPLICA((n1,s1):4): updated=[(n2,s2):1 (n3,s3):2 (n4,s4):3 (n1,s1):4] next=5
cockroach-data/4/logs/cockroach.log:I180825 20:50:30.889653 137980 storage/replica_range_lease.go:554  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] transferring lease to s3
cockroach-data/4/logs/cockroach.log:I180825 20:50:30.893841 137980 storage/replica_range_lease.go:617  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] done transferring lease to s3: <nil>
cockroach-data/2/logs/cockroach.log:I180825 20:50:30.893979 206 storage/replica_proposal.go:216  [n3,s3,r5750/2:/Table/53/1/-3{60825…-59823…}] new range lease repl=(n3,s3):2 seq=8 start=1535230230.889658704,0 epo=1 pro=1535230230.889661651,0 following repl=(n4,s4):3 seq=7 start=1535230003.009678966,0 epo=1 pro=1535230003.009681162,0
cockroach-data/2/logs/cockroach.log:I180825 20:50:30.895007 154081 storage/replica_range_lease.go:554  [n3,s3,r5750/2:/Table/53/1/-3{60825…-59823…}] transferring lease to s3
cockroach-data/2/logs/cockroach.log:I180825 20:50:30.895360 154081 storage/replica_command.go:803  [n3,s3,r5750/2:/Table/53/1/-3{60825…-59823…}] change replicas (REMOVE_REPLICA (n2,s2):1): read existing descriptor r5750:/Table/53/1/-3{608257894217454861-598237801595841486} [(n2,s2):1, (n3,s3):2, (n4,s4):3, (n1,s1):4, next=5, gen=3]
cockroach-data/2/logs/cockroach.log:I180825 20:50:30.910605 154081 storage/replica.go:3743  [n3,s3,r5750/2:/Table/53/1/-3{60825…-59823…}] proposing REMOVE_REPLICA((n2,s2):1): updated=[(n1,s1):4 (n3,s3):2 (n4,s4):3] next=5
cockroach-data/3/logs/cockroach.log:I180825 20:50:30.912817 438 storage/store.go:3748  [n2,s2,r5750/1:/Table/53/1/-3{60825…-59823…}] added to replica GC queue (peer suggestion)
cockroach-data/2/logs/cockroach.log:I180825 20:50:30.913405 154081 storage/replica_command.go:421  [merge,n3,s3,r5834/2:/Table/53/1/-36{10823…-08257…}] initiating a merge of r5750:/Table/53/1/-3{608257894217454861-598237801595841486} [(n1,s1):4, (n3,s3):2, (n4,s4):3, next=5, gen=3] into this range
cockroach-data/3/logs/cockroach.log:I180825 20:50:30.915757 143839 storage/store.go:2656  [replicaGC,n2,s2,r5750/1:/Table/53/1/-3{60825…-59823…}] removing replica r5750/1
cockroach-data/3/logs/cockroach.log:I180825 20:50:30.918531 143839 storage/replica.go:880  [replicaGC,n2,s2,r5750/1:/Table/53/1/-3{60825…-59823…}] removed 16 (0+16) keys in 2ms [clear=2ms commit=1ms]
cockroach-data/1/logs/cockroach.log:E180825 20:50:30.920436 254 storage/replica_proposal.go:171  [n1,s1,r5750/4:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 36 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/1/logs/cockroach.log:E180825 20:50:30.933251 399115 storage/stores_server.go:68  consistency check failed on range r5750: expected checksum af8d3a76b96709540fa9a3897dd588b50164aa7181e7b8124590cf5ce4339a81f5947c8fc7a675afbf06c014b8a5a6689047f9bc3f878aed5a65b4374427b429, got 15d7488c4559e6db2f1cf065df9600d4ae584f1f683334ddeb633f1e5995b73d3960b4628508bf47426bc06d6a2a9511ae7c31a4cc6b2b4b71b34d2c2061c2ee
cockroach-data/4/logs/cockroach.log:E180825 20:50:30.934132 137941 storage/replica_consistency.go:107  [merge,n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] replica (n1,s1):4 is inconsistent: expected checksum af8d3a76b96709540fa9a3897dd588b50164aa7181e7b8124590cf5ce4339a81f5947c8fc7a675afbf06c014b8a5a6689047f9bc3f878aed5a65b4374427b429, got 15d7488c4559e6db2f1cf065df9600d4ae584f1f683334ddeb633f1e5995b73d3960b4628508bf47426bc06d6a2a9511ae7c31a4cc6b2b4b71b34d2c2061c2ee
cockroach-data/4/logs/cockroach.log:E180825 20:50:30.935919 137941 util/log/crash_reporting.go:477  [merge,n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] Reported as error 8fcb356685d94db3a050a170b7832530
cockroach-data/4/logs/cockroach.log:F180825 20:50:30.935945 137941 storage/replica_consistency.go:155  [merge,n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] consistency check failed with 1 inconsistent replicas
cockroach-data/1/logs/cockroach.log:I180825 20:50:30.938248 250 storage/store.go:2656  [n1,s1,r5834/4:/Table/53/1/-36{10823…-08257…}] removing replica r5750/4
cockroach-data/2/logs/cockroach.log:I180825 20:50:30.939516 213 storage/store.go:2656  [n3,s3,r5834/2:/Table/53/1/-36{10823…-08257…}] removing replica r5750/2

The relevant lines:

cockroach-data/4/logs/cockroach.log:E180825 20:50:30.863380 254 storage/replica_proposal.go:171  [n4,s4,r5750/3:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 27 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/3/logs/cockroach.log:E180825 20:50:30.863511 223 storage/replica_proposal.go:171  [n2,s2,r5750/1:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 27 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/2/logs/cockroach.log:E180825 20:50:30.863532 198 storage/replica_proposal.go:171  [n3,s3,r5750/2:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 27 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)
cockroach-data/1/logs/cockroach.log:E180825 20:50:30.920436 254 storage/replica_proposal.go:171  [n1,s1,r5750/4:/Table/53/1/-3{60825…-59823…}] computing checksum at applied index 36 (id: 15a9487e-d941-499f-9eba-a2ccddc1fd3d)

@benesch (Contributor) commented Aug 25, 2018

Well this seems incredibly wrong. Check out the raft log on this guy:

0.000000000,0 /Local/RangeID/5750/u/RaftLog/logIndex:27: Term:7 Index:27  by lease #7
proposer_replica:<node_id:4 store_id:4 replica_id:3 > proposer_lease_sequence:7 max_lease_index:15 replicated_eval_result:<compute_checksum:<header:<key:"\275\211\200\315\354\344\220\005+j\363" end_key:"\275\211\200\316\001k\215\277K\034\212" > version:2 checksum_id:<15a9487e-d941-499f-9eba-a2ccddc1fd3d> snapshot:true > timestamp:<wall_time:1535230230862575432 > delta:<> > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
0.000000000,0 /Local/RangeID/5750/u/RaftLog/logIndex:36: Term:7 Index:36  by lease #8
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:8 max_lease_index:24 replicated_eval_result:<compute_checksum:<header:<key:"\275\211\200\316\001k\215\277K\034\212" end_key:"\275\211\200\316\007\020\246\300\301\r\016" > version:2 checksum_id:<15a9487e-d941-499f-9eba-a2ccddc1fd3d> snapshot:true > timestamp:<wall_time:1535230230918949588 > delta:<> > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
0.000000000,0 /Local/RangeID/5750/u/RaftLog/logIndex:37: Term:7 Index:37  by lease #8
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:8 max_lease_index:25 replicated_eval_result:<compute_checksum:<header:<key:"\275\211\200\316\007\020\246\300\301\r\016" end_key:"\275\211\200\316\020}\310\242\177T2" > version:2 checksum_id:<15a9487e-d941-499f-9eba-a2ccddc1fd3d> snapshot:true > timestamp:<wall_time:1535230230920061852 > delta:<> > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 

@benesch (Contributor) commented Aug 25, 2018

I suspect this is some sort of ambiguous retry error gone wrong. This comment seems ominous:

// We're relying on the fact that all commands are either idempotent
// (generally achieved through the wonders of MVCC) or, if they aren't,
// the 2nd application will somehow fail to apply (e.g. a command
// resulting from a RequestLease is not idempotent, but the 2nd
// application will be rejected by the ProposerLease verification).

I haven't verified yet that these bad consistency checks are running into this case, though.

@benesch (Contributor) commented Aug 25, 2018

Oh my god. Look at this craziness:

compute checksum request received: ComputeChecksum [/Table/53/1/1779302890593386200,/Table/53/1/1779454299280769289) (r5322)
compute checksum request received: ComputeChecksum [/Table/53/1/1779454299280769289,/Table/53/1/1781242436017112233) (r5322)

When the consistency check is executed, the node's DistSender cache thinks there are two separate ranges. So it splits the ComputeChecksum batch! And two checksum computations get executed with the same checksum ID. That is utterly awful. In most cases, it's actually OK, because all replicas will execute both compute checksums and the second will be a no-op. But if the timing happens to be just right, you'll end up adding a new replica between the two compute checksums. That new replica won't have executed the first ComputeChecksum, but it will execute the second ComputeChecksum -- at a different applied index! 💥

So the proximate cause here is the fact that ComputeChecksum is ranged instead of being keyed on the replica's start key. But I think there are deeper problems lurking with retries. ComputeChecksum doesn't appear to have any replay protection, but it definitely needs some.
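To make the retry/split hazard concrete, here's a minimal standalone Go sketch (hypothetical types, not the real roachpb/storagebase messages) contrasting a client-assigned checksum ID, which a DistSender retry or a split batch replays verbatim, with an ID minted at evaluation time -- the approach the fix below takes -- so that every evaluation gets a fresh ID.

package main

import (
	"crypto/rand"
	"fmt"
)

// newChecksumID stands in for a UUID generator.
func newChecksumID() string {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%x", b)
}

// Buggy shape: the client picks the ID when it builds the request, so any
// replay of the request (DistSender retry, split batch) reuses the same ID.
type checksumRequest struct {
	ChecksumID string
}

// Fixed shape: the ID is allocated when the request is evaluated into a
// command, so each evaluation -- including a retried one -- mints its own.
type checksumCommand struct {
	ChecksumID string
}

func evaluate(checksumRequest) checksumCommand {
	return checksumCommand{ChecksumID: newChecksumID()}
}

func main() {
	// Client-assigned: the "retried" request carries the identical ID.
	req := checksumRequest{ChecksumID: newChecksumID()}
	retried := req
	fmt.Println("client-assigned, IDs equal:", req.ChecksumID == retried.ChecksumID) // true

	// Evaluation-assigned: two evaluations of the same request differ.
	first, second := evaluate(req), evaluate(req)
	fmt.Println("eval-assigned, IDs equal:", first.ChecksumID == second.ChecksumID) // false
}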

benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018
Previously, a ComputeChecksum command could apply twice with the same
ID. Consider the following sequence of events:

  1. DistSender sends a ComputeChecksum request to a replica.
  2. The request is successfully evaluated and proposed, but a connection
     error occurs.
  3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum
ID in the Raft log. Somewhat amazingly, this typically wasn't
problematic. If all replicas were online and reasonably up-to-date,
they'd see the first ComputeChecksum command, compute its checksum, and
store it in the checksums map. When they saw the duplicated
ComputeChecksum command, they'd see that a checksum with that ID already
existed and ignore it. In effect, only the first ComputeChecksum command
for a given checksum ID mattered.

The problem occurred when one replica saw one ComputeChecksum command but
not the other. There were two ways this could occur. A replica could go
offline after computing the checksum the first time; when it came back
online, it would have an empty checksum map, and the checksum computed
for the second ComputeChecksum command would be recorded instead. Or a
replica could receive a snapshot that advanced it past one
ComputeChecksum but not the other. In both cases, the replicas could
spuriously fail a consistency check.

A very similar problem occurred with range merges because ComputeChecksum
requests are incorrectly ranged (see cockroachdb#29002). That means DistSender
might split a ComputeChecksum request in two. Consider what happens when
a consistency check occurs immediately after a merge: the
ComputeChecksum request is generated using the up-to-date, post-merge
descriptor, but DistSender might have the pre-merge descriptors cached,
and so it splits the batch in two. Both halves of the batch would get
routed to the same range, and both halves would have the same command
ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the
ComputeChecksum request is evaluated. If the request is retried, it will
be properly assigned a new checksum ID. Note that we don't need to worry
about reproposals causing duplicate commands, as the MaxLeaseIndex
prevents proposals from replay.

The version compatibility story here is straightforward. The
ReplicaChecksumVersion is bumped, so v2.0 nodes will turn
ComputeChecksum requests proposed by v2.1 nodes into a no-op, and
vice-versa. The consistency queue will spam some complaints into the log
about this--it will time out while collecting checksums--but this will
stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate
storagebase.ReplicatedEvalResult.ComputeChecksum from
roachpb.ComputeChecksumRequest to a dedicated
storagebase.ComputeChecksum message. Separate types are more in line
with how the merge/split/change replicas triggers work and avoid
shipping unnecessary fields through Raft. Note that even though this
migration changes logic downstream of Raft, it's safe. v2.1 nodes will
turn any ComputeChecksum commands that were committed by v2.0 nodes into
no-ops, and vice-versa, but the only effect of this will be some
temporary consistency queue spam. As an added bonus, because we're
guaranteed that we'll never see duplicate v2.1-style ComputeChecksum
commands, we can properly fatal if we ever see a ComputeChecksum request
with a checksum ID that we've already computed.

† It would be possible to put the late-ID allocation behind a cluster
version to avoid the log spam, but that amounts to allowing v2.1 to
initiate known-buggy consistency checks. A bit of log spam seems
preferable.

Fix cockroachdb#28995.
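A rough sketch of the fix's core idea follows: mint the checksum ID at server-side evaluation so that a client-level retry produces a fresh ID. The type and function names (computeChecksumRequest, evalComputeChecksum) and the use of github.com/google/uuid are illustrative assumptions, not the actual cockroach code.

package main

import (
	"fmt"

	"github.com/google/uuid"
)

// Hypothetical stand-ins for the request/result types.
type computeChecksumRequest struct{ version uint32 }

type computeChecksumResult struct {
	checksumID uuid.UUID
	version    uint32
}

// evalComputeChecksum models server-side evaluation: the ID is assigned here,
// below the client but above Raft, so a DistSender-level retry re-evaluates
// and produces a distinct ID instead of replaying the old one.
func evalComputeChecksum(req computeChecksumRequest) computeChecksumResult {
	return computeChecksumResult{checksumID: uuid.New(), version: req.version}
}

func main() {
	req := computeChecksumRequest{version: 2}
	first := evalComputeChecksum(req)
	retry := evalComputeChecksum(req)                 // the same request, retried by the client
	fmt.Println(first.checksumID == retry.checksumID) // false: no colliding IDs
}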
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018
ComputeChecksum was previously implemented as a range request, which
meant it could be split by DistSender, resulting in two ComputeChecksum
requests with identical IDs! If those split sub-requests get routed to the
same range (e.g. because the ranges were just merged), spurious checksum
failures could occur (cockroachdb#28995). Plus, the ComputeChecksum request would
not actually look at the range boundaries in the request header; it
always operated on the range's entire keyspace at the time the request
was applied.

The fix is simple: make ComputeChecksum a point request. There are no
version compatibility issues here; nodes with this commit are simply
smarter about routing ComputeChecksum requests to only one range.

Fix cockroachdb#29002.

Release note: None
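To make the routing difference concrete, here is a simplified model (not the real DistSender or roachpb code) of why a ranged request can be split along stale cached range boundaries while a point request cannot; splitByCachedRanges and span are illustrative names.

package main

import "fmt"

type span struct{ key, endKey string }

// splitByCachedRanges mimics a sender with stale descriptors: a span that
// crosses a cached range boundary is split into one sub-request per cached
// range; a point request (empty endKey) is always routed as a single piece.
func splitByCachedRanges(req span, cachedBoundaries []string) []span {
	if req.endKey == "" {
		return []span{req}
	}
	out := []span{}
	start := req.key
	for _, b := range cachedBoundaries {
		if start < b && b < req.endKey {
			out = append(out, span{start, b})
			start = b
		}
	}
	return append(out, span{start, req.endKey})
}

func main() {
	// The pre-merge boundary "m" is still cached, so a ranged request over
	// [a, z) is split in two; a point request at "a" is not.
	fmt.Println(splitByCachedRanges(span{"a", "z"}, []string{"m"})) // [{a m} {m z}]
	fmt.Println(splitByCachedRanges(span{"a", ""}, []string{"m"}))  // [{a }]
}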
tbg added a commit to tbg/cockroach that referenced this issue Aug 27, 2018
Inspired by cockroachdb#28995, where I needed this.

Release note: None
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018
Previously, a ComputeChecksum command could apply twice with the same
ID. Consider the following sequence of events:

  1. DistSender sends a ComputeChecksum request to a replica.
  2. The request is successfully evaluated and proposed, but a connection
     error occurs.
  3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum
ID in the Raft log. Somewhat amazingly, this typically wasn't
problematic. If all replicas were online and reasonably up-to-date,
they'd see the first ComputeChecksum command, compute its checksum, and
store it in the checksums map. When they saw the duplicated
ComputeChecksum command, they'd see that a checksum with that ID already
existed and ignore it. In effect, only the first ComputeChecksum command
for a given checksum ID mattered.

The problem occurred when one replica saw one ComputeChecksum command but
not the other. There were two ways this could occur. A replica could go
offline after computing the checksum the first time; when it came back
online, it would have an empty checksum map, and the checksum computed
for the second ComputeChecksum command would be recorded instead. Or a
replica could receive a snapshot that advanced it past one
ComputeChecksum but not the other. In both cases, the replicas could
spuriously fail a consistency check.

A very similar problem occurred with range merges because ComputeChecksum
requests are incorrectly ranged (see cockroachdb#29002). That means DistSender
might split a ComputeChecksum request in two. Consider what happens when
a consistency check occurs immediately after a merge: the
ComputeChecksum request is generated using the up-to-date, post-merge
descriptor, but DistSender might have the pre-merge descriptors cached,
and so it splits the batch in two. Both halves of the batch would get
routed to the same range, and both halves would have the same command
ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the
ComputeChecksum request is evaluated. If the request is retried, it will
be properly assigned a new checksum ID. Note that we don't need to worry
about reproposals causing duplicate commands, as the MaxLeaseIndex
prevents proposals from replay.

The version compatibility story here is straightforward. The
ReplicaChecksumVersion is bumped, so v2.0 nodes will turn
ComputeChecksum requests proposed by v2.1 nodes into a no-op, and
vice-versa. The consistency queue will spam some complaints into the log
about this--it will time out while collecting checksums--but this will
stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate
storagebase.ReplicatedEvalResult.ComputeChecksum from
roachpb.ComputeChecksumRequest to a dedicated
storagebase.ComputeChecksum message. Separate types are more in line
with how the merge/split/change replicas triggers work and avoid
shipping unnecessary fields through Raft. Note that even though this
migration changes logic downstream of Raft, it's safe. v2.1 nodes will
turn any ComputeChecksum commands that were committed by v2.0 nodes into
no-ops, and vice-versa, but the only effect of this will be some
temporary consistency queue spam. As an added bonus, because we're
guaranteed that we'll never see duplicate v2.1-style ComputeChecksum
commands, we can properly fatal if we ever see a ComputeChecksum request
with a checksum ID that we've already computed.

† It would be possible to put the late-ID allocation behind a cluster
version to avoid the log spam, but that amounts to allowing v2.1 to
initiate known-buggy consistency checks. A bit of log spam seems
preferable.

Fix cockroachdb#28995.
craig bot pushed a commit that referenced this issue Aug 27, 2018
29067: storage: protect ComputeChecksum commands from replaying r=tschottdorf a=benesch

Previously, a ComputeChecksum command could apply twice with the same
ID. Consider the following sequence of events:

  1. DistSender sends a ComputeChecksum request to a replica.
  2. The request is successfully evaluated and proposed, but a connection
     error occurs.
  3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum
ID in the Raft log. Somewhat amazingly, this typically wasn't
problematic. If all replicas were online and reasonably up-to-date,
they'd see the first ComputeChecksum command, compute its checksum, and
store it in the checksums map. When they saw the duplicated
ComputeChecksum command, they'd see that a checksum with that ID already
existed and ignore it. In effect, only the first ComputeChecksum command
for a given checksum ID mattered.

The problem occurred when one replica saw one ComputeChecksum command but
not the other. There were two ways this could occur. A replica could go
offline after computing the checksum the first time; when it came back
online, it would have an empty checksum map, and the checksum computed
for the second ComputeChecksum command would be recorded instead. Or a
replica could receive a snapshot that advanced it past one
ComputeChecksum but not the other. In both cases, the replicas could
spuriously fail a consistency check.

A very similar problem occurred with range merges because ComputeChecksum
requests are incorrectly ranged (see #29002). That means DistSender
might split a ComputeChecksum request in two. Consider what happens when
a consistency check occurs immediately after a merge: the
ComputeChecksum request is generated using the up-to-date, post-merge
descriptor, but DistSender might have the pre-merge descriptors cached,
and so it splits the batch in two. Both halves of the batch would get
routed to the same range, and both halves would have the same command
ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the
ComputeChecksum request is evaluated. If the request is retried, it will
be properly assigned a new checksum ID. Note that we don't need to worry
about reproposals causing duplicate commands, as the MaxLeaseIndex
prevents proposals from replay.

The version compatibility story here is straightforward. The
ReplicaChecksumVersion is bumped, so v2.0 nodes will turn
ComputeChecksum requests proposed by v2.1 nodes into a no-op, and
vice-versa. The consistency queue will spam some complaints into the log
about this--it will time out while collecting checksums--but this will
stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate
storagebase.ReplicatedEvalResult.ComputeChecksum from
roachpb.ComputeChecksumRequest to a dedicated
storagebase.ComputeChecksum message. Separate types are more in line
with how the merge/split/change replicas triggers work and avoid
shipping unnecessary fields through Raft. Note that even though this
migration changes logic downstream of Raft, it's safe. v2.1 nodes will
turn any ComputeChecksum commands that were committed by v2.0 nodes into
no-ops, and vice-versa, but the only effect of this will be some
temporary consistency queue spam. As an added bonus, because we're
guaranteed that we'll never see duplicate v2.1-style ComputeChecksum
commands, we can properly fatal if we ever see a ComputeChecksum request
with a checksum ID that we've already computed.

† It would be possible to put the late-ID allocation behind a cluster
version to avoid the log spam, but that amounts to allowing v2.1 to
initiate known-buggy consistency checks. A bit of log spam seems
preferable.

Fix #28995.

29083: storage: fix raft snapshots that span merges and splits r=tschottdorf a=benesch

The code that handles Raft snapshots that span merges did not account
for snapshots that spanned merges AND splits. Handle this case by
allowing snapshot subsumption even when the snapshot's end key does not
exactly match the end of an existing replica. See the commits within the
patch for details.

Fix #29080.

Release note: None

29117: opt: fix LookupJoinDef interning, add tests r=RaduBerinde a=RaduBerinde

Fixing an omission I noticed in internLookupJoinDef and adding missing
tests for interning defs.

Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@craig craig bot closed this as completed in #29067 Aug 27, 2018
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018
ComputeChecksum was previously implemented as a range request, which
meant it could be split by DistSender, resulting in two ComputeChecksum
requests with identical IDs! If those split sub-requests get routed to the
same range (e.g. because the ranges were just merged), spurious checksum
failures could occur (cockroachdb#28995). Plus, the ComputeChecksum request would
not actually look at the range boundaries in the request header; it
always operated on the range's entire keyspace at the time the request
was applied.

The fix is simple: make ComputeChecksum a point request. There are no
version compatibility issues here; nodes with this commit are simply
smarter about routing ComputeChecksum requests to only one range.

Fix cockroachdb#29002.

Release note: None
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018
Previously, a ComputeChecksum command could apply twice with the same
ID. Consider the following sequence of events:

  1. DistSender sends a ComputeChecksum request to a replica.
  2. The request is successfully evaluated and proposed, but a connection
     error occurs.
  3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum
ID in the Raft log. Somewhat amazingly, this typically wasn't
problematic. If all replicas were online and reasonably up-to-date,
they'd see the first ComputeChecksum command, compute its checksum, and
store it in the checksums map. When they saw the duplicated
ComputeChecksum command, they'd see that a checksum with that ID already
existed and ignore it. In effect, only the first ComputeChecksum command
for a given checksum ID mattered.

The problem occurred when one replica saw one ComputeChecksum command but
not the other. There were two ways this could occur. A replica could go
offline after computing the checksum the first time; when it came back
online, it would have an empty checksum map, and the checksum computed
for the second ComputeChecksum command would be recorded instead. Or a
replica could receive a snapshot that advanced it past one
ComputeChecksum but not the other. In both cases, the replicas could
spuriously fail a consistency check.

A very similar problem occurred with range merges because ComputeChecksum
requests are incorrectly ranged (see cockroachdb#29002). That means DistSender
might split a ComputeChecksum request in two. Consider what happens when
a consistency check occurs immediately after a merge: the
ComputeChecksum request is generated using the up-to-date, post-merge
descriptor, but DistSender might have the pre-merge descriptors cached,
and so it splits the batch in two. Both halves of the batch would get
routed to the same range, and both halves would have the same command
ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the
ComputeChecksum request is evaluated. If the request is retried, it will
be properly assigned a new checksum ID. Note that we don't need to worry
about reproposals causing duplicate commands, as the MaxLeaseIndex
prevents proposals from replay.

The version compatibility story here is straightforward. The
ReplicaChecksumVersion is bumped, so v2.0 nodes will turn
ComputeChecksum requests proposed by v2.1 nodes into a no-op, and
vice-versa. The consistency queue will spam some complaints into the log
about this--it will time out while collecting checksums--but this will
stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate
storagebase.ReplicatedEvalResult.ComputeChecksum from
roachpb.ComputeChecksumRequest to a dedicated
storagebase.ComputeChecksum message. Separate types are more in line
with how the merge/split/change replicas triggers work and avoid
shipping unnecessary fields through Raft. Note that even though this
migration changes logic downstream of Raft, it's safe. v2.1 nodes will
turn any ComputeChecksum commands that were committed by v2.0 nodes into
no-ops, and vice-versa, but the only effect of this will be some
temporary consistency queue spam. As an added bonus, because we're
guaranteed that we'll never see duplicate v2.1-style ComputeChecksum
commands, we can properly fatal if we ever see a ComputeChecksum request
with a checksum ID that we've already computed.

† It would be possible to put the late-ID allocation behind a cluster
version to avoid the log spam, but that amounts to allowing v2.1 to
initiate known-buggy consistency checks. A bit of log spam seems
preferable.

Fix cockroachdb#28995.
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018
ComputeChecksum was previously implemented as a range request, which
meant it could be split by DistSender, resulting in two ComputeChecksum
requests with identical IDs! If those split sub-requests get routed to the
same range (e.g. because the ranges were just merged), spurious checksum
failures could occur (cockroachdb#28995). Plus, the ComputeChecksum request would
not actually look at the range boundaries in the request header; it
always operated on the range's entire keyspace at the time the request
was applied.

The fix is simple: make ComputeChecksum a point request. There are no
version compatibility issues here; nodes with this commit are simply
smarter about routing ComputeChecksum requests to only one range.

Fix cockroachdb#29002.

Release note: None
craig bot pushed a commit that referenced this issue Aug 28, 2018
29079: storage: make ComputeChecksum a point request r=tschottdorf a=benesch

ComputeChecksum was previously implemented as a range request, which
meant it could be split by DistSender, resulting in two ComputeChecksum
requests with identical IDs! If those split sub-requests get routed to the
same range (e.g. because the ranges were just merged), spurious checksum
failures could occur (#28995). Plus, the ComputeChecksum request would
not actually look at the range boundaries in the request header; it
always operated on the range's entire keyspace at the time the request
was applied.

The fix is simple: make ComputeChecksum a point request. There are no
version compatibility issues here; nodes with this commit are simply
smarter about routing ComputeChecksum requests to only one range.

Fix #29002.

Release note: None

29145: workload: make split concurrency constant r=nvanbenschoten a=nvanbenschoten

I had originally made the split concurrency for workload
dynamic, based on the concurrency of the workload itself.
This turned out to be a bad idea as it allowed for too
much contention during pre-splitting and resulted in
lots of split retries. The end result was that splits
slowed down over time instead of staying at a constant
rate.

This change makes the split concurrency constant like it
already is with fixture restoration:

https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/fixture.go#L449

This results in pre-splits on large clusters being more stable
and taking much less time (~50%).

Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
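For context on the second change, here is a hedged sketch of constant split concurrency: bound in-flight pre-splits with a fixed-size semaphore rather than deriving the limit from the workload's own concurrency. doSplit, presplit, and splitConcurrency are illustrative names, not the workload package's API.

package main

import (
	"fmt"
	"sync"
)

const splitConcurrency = 32 // fixed, independent of workload concurrency

func doSplit(point int) {
	// Placeholder for issuing an ALTER TABLE ... SPLIT AT against the cluster.
	_ = point
}

func presplit(points []int) {
	sem := make(chan struct{}, splitConcurrency)
	var wg sync.WaitGroup
	for _, p := range points {
		sem <- struct{}{} // acquire a slot; blocks once splitConcurrency splits are in flight
		wg.Add(1)
		go func(p int) {
			defer func() { <-sem; wg.Done() }()
			doSplit(p)
		}(p)
	}
	wg.Wait()
	fmt.Printf("issued %d splits\n", len(points))
}

func main() { presplit([]int{10, 20, 30, 40}) }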