roachtest: clearrange failed on master #28995
F180823 06:12:32.418853 3640704 storage/replica_consistency.go:154 [n7,s7,r12021/3:/Table/53/1/41{067853-138385}] consistency check failed with 1 inconsistent replicas
Got confused by this message in cockroachdb#28995. Prior to this commit, on a merge it looks like the subsuming range is removing itself, but it's actually the subsumee. Release note: None
More exfiltrated logs: n7 ran the consistency check, n2 just noticed the discrepancy (unclear who's wrong).
A minute before that check, we see the below on n7. The soon-to-fail range subsumes its right neighbor. And shortly thereafter, the log claims that the range is removing itself, but it's actually removing the subsumee, as it's supposed to (#29000).
Btw, the logs also have busy loops like this. I think this is perhaps due to the bad impl of RelocateRange we're using right now?
I was confused that I wasn't finding the consistency checker diff and eventually figured out that the diff had some binary junk in it.
Since we have three replicas, and since only n2 reported a diff, I think this means that n7 (the leaseholder) and n6 (a follower) had an empty diff. The above thus states that n2 had two extra MVCC keys that shouldn't be there any more, namely:
Of course since we all think this has something to do with merges, one theory is:
Time to find that range in the logs. The key pretty printer doesn't make this easy, but let's see.
This very spammy log message got in the way of the investigation in cockroachdb#28995. Release note: None
Unfortunately, I think the history of this range has been rotated out of the logs because of the spammy "remote couldn't accept snapshot with error" busy loop, which easily rotates through 20MB every minute 🙈. Added a commit to #29000.
Two more observations: ComputeChecksumRequest seems to be implemented in a bad way. It uses a key range and it seems like it could be split by DistSender:
Not sure what will happen if that actually occurs. Nothing good! I think we'll get this error from DistSender: `} else if lOK != rOK { return errors.Errorf("can not combine %T and %T", valLeft, valRight) }`, so it doesn't seem completely terrible, but either way, this sucks for splits, and it's even weirder for merges, because the command will now declare a write to only parts of the keyspace. Again, this seems fine because it really is a no-op (and it shouldn't have to declare any part of the keyspace), but there's some cleanup to do. I was also worried about something else but checked that it isn't a problem: we have an optimization that avoids sending no-ops through Raft, and ComputeChecksum looked like a no-op. But the code does the right thing and actually sends it through Raft (cockroach/pkg/storage/replica.go, lines 3527 to 3528 in 80812f2).
Back to the original problem: the merge a minute before the checksum fatal picked up the keyspace owning the offending range descriptor key. My guess is that the problem was present in that subsumed range all along, but that the consistency checker just didn't run there. To repro this quickly, it might be enough to add a synchronous consistency checker run from the merge queue after each merge operation. The root problem that I anticipate seeing there is that we subsume a range but that it then fails to show up on the followers (perhaps after involving snapshots?).
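To make the repro idea concrete, here's a minimal sketch in Go of what such a hook could look like. The `performMerge` and `checkConsistency` callbacks are hypothetical stand-ins for the real merge-queue plumbing (the check would presumably be a CheckConsistency-style request over the merged span); this is not the code on any branch.

```go
package main

import (
	"context"
	"fmt"
)

// checkAfterMerge sketches the repro idea: treat a merge as incomplete until a
// synchronous consistency check over the merged range has passed, so a
// merge-induced divergence surfaces within seconds instead of whenever the
// consistency queue happens to visit the range.
func checkAfterMerge(
	ctx context.Context,
	performMerge func(context.Context) error,
	checkConsistency func(context.Context) error,
) error {
	if err := performMerge(ctx); err != nil {
		return err
	}
	return checkConsistency(ctx)
}

func main() {
	noop := func(context.Context) error { return nil }
	fmt.Println(checkAfterMerge(context.Background(), noop, noop)) // prints <nil>
}
```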
This was another source of noise in cockroachdb#28995. It makes more sense to log a snapshot either in the error when it fails or when we actually start streaming it. Release note: None
Repro'ed locally. I'm going to move on to other things, but @benesch I'm sure you can get somewhere with this script :-))
Unfortunately I didn't get a diff thanks to the aggressive merging:
This seems to repro every time and only takes <10m, so I'm not too worried about this bug any more. Diff is just like before, full log attached: cockroach.toblerone.tschottdorf.2018-08-23T14_24_01Z.077191.log
Just added something to always produce consistency diffs on the first attempt, running another experiment. Branch is https://github.com/tschottdorf/cockroach/tree/repro/rangemergediff (the previous repro already had the change that aggressively ran consistency checks after each merge).
Timeline of interleaved logs. TL;DR: we remove the RHS, it gets GC'ed on all three replicas, and the consistency check fails with a range descriptor and transaction anchored on ... an unexpected key.
Ok, I just confused myself. That key is the start key of r1179, which merges into r2723 just a hair before disaster strikes on r2723 (there are two such failed ranges in the logs, and I must've mixed them up, r5276 doesn't matter here). So it's really as easy as
First of all, what's up with the super-high generation of 842637714136? As an interesting datapoint, note the timestamp on that version of the range descriptor. At pretty much 14:28:04, we see r1179 merge one of its neighbors:
The generation counter at that time is zero, which is just something I'm observing. Aha! I think it gets interesting. Get a load of what exactly transpires at 14:28:04.
I still can't claim to understand what's going on, but it seems that a preemptive snapshot picked up an in-flight merge during a failed upreplication, and that this snapshot somehow isn't wiped out by a later successful upreplication. Perhaps there is a bug when applying a preemptive snapshot over an older, narrower preemptive snapshot, where we should clear out existing data but don't? Hard to believe, but now I think the time has come for @benesch to async rubber duck the above.
Ah, and I understand now why I got two consistency failures and why I was so confused for a bit. After the first consistency failure, the next replica that subsumes r2723 also barfs with the same diff, which makes sense. So it's just a cascading failure and the first one is the only one that matters.
29000: storage: misc logging improvements r=a-robinson a=tschottdorf Got confused by this message in #28995. Prior to this commit, on a merge it looks like the subsuming range is removing itself, but it's actually the subsumee. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
Reviewing the consistency checker diff, it's not particularly helpful in telling you the values of things. It won't let us know if there's an intent, for example, though if memory serves correctly an intent would imply the existence of a meta key, i.e. timestamp zero. We're not seeing that here; we're seeing only the stray version. I'll hack something together that decodes these.
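Roughly the kind of throwaway decoder I have in mind, assuming the hex-encoded values from the diff and CockroachDB's `roachpb`/`protoutil` packages; `stripValueHeader` is a hypothetical helper for peeling off whatever checksum/tag envelope wraps the proto payload, and the real value encoding would need to be handled for this to actually work:

```go
import (
	"encoding/hex"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/protoutil"
)

// decodeRangeDescriptor is a sketch, not the real tool: given the raw value
// bytes copied out of a consistency-checker diff (hex-encoded), attempt to
// interpret them as a roachpb.RangeDescriptor.
func decodeRangeDescriptor(
	hexValue string, stripValueHeader func([]byte) []byte,
) (*roachpb.RangeDescriptor, error) {
	raw, err := hex.DecodeString(hexValue)
	if err != nil {
		return nil, err
	}
	var desc roachpb.RangeDescriptor
	if err := protoutil.Unmarshal(stripValueHeader(raw), &desc); err != nil {
		return nil, err
	}
	return &desc, nil
}
```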
Decoding the values from the diff here shows:
That is the version of the range descriptor you expect at the point in time when the first snapshot is sent. Or, of course, the consistency checker is misleading me and the problem is really that n3 is the only one lacking these keys. But the transaction key gives me some confidence that this isn't it: the timestamp of the range descriptor version matches up precisely with that of the transaction. Perhaps the snapshot didn't pick up the merge in progress at the time (there's no intent), but it picked up the one before that had just committed but not yet been cleaned up after (we expect the txn records to go away after the intents are resolved).
More spelunking against the data directory. The offending timestamp is the only one which is present exactly once. Under the assumption that we never intentionally GC range descriptor versions during merges (GC definitely shouldn't be doing it in this short test), this is unexpected and is in line with the consistency checker's diff.
And that single copy of the key sits on n3, as the consistency checker claimed:
What's interesting is that on n3, the next higher version of the range descriptor (above the offending one) looks off, and in particular the generation moves backwards (i.e. the outlier is ahead of its successor), which I thought isn't supposed to happen. Ugh, I've been staring at this for too long. Time to take a break.
Ah, the generation printing is messed up. This line `fmt.Fprintf(&buf, ", next=%d, gen=%d]", r.NextReplicaID, r.Generation)` prints the Generation integer pointer as its address rather than the value it points to, which is where the bogus-looking generation above comes from.
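For anyone following along, this standalone Go program reproduces the formatting pitfall; the variable names are made up, only the %d-on-a-pointer behavior matters:

```go
package main

import "fmt"

func main() {
	gen := int64(7) // the actual generation value
	p := &gen       // the descriptor's Generation field is a pointer here

	// %d applied to a pointer formats the pointer's address as an integer,
	// which is where bogus-looking values like gen=842637714136 come from.
	fmt.Printf("gen=%d\n", p)

	// Dereferencing (with a nil check) prints the intended value: gen=7.
	if p != nil {
		fmt.Printf("gen=%d\n", *p)
	}
}
```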
Inspired by cockroachdb#28995, where I needed this. Release note: None
Ok, check this out. Repro'd the failure with some additional log messages around when consistency computations occur. Full log is in this gist; relevant section reproduced below:
So:
So why is n3 computing a checksum at applied index 53?! That seems extremely wrong. It's possible, though unlikely, that the log message about computing a checksum at applied index 53 is actually a different consistency check. I'm going to add additional logging of the consistency check ID to rule that case out.
Theory appears to be confirmed:
The relevant lines:
Well this seems incredibly wrong. Check out the raft log on this guy:
I suspect this is some sort of ambiguous retry error gone wrong. This comment seems ominous: cockroach/pkg/storage/replica.go, lines 4775 to 4779 in 272a1bc.
I haven't verified yet that these bad consistency checks are running into this case, though.
Oh my god. Look at this craziness:
When the consistency check is executed, the node's DistSender cache thinks there are two separate ranges. So it splits the ComputeChecksum batch! And two checksum computations get executed with the same checksum ID. That is utterly awful. In most cases, it's actually ok, because all replicas will execute both compute checksums and the second will be a no-op. But if the timing happens to be just right, you'll end up adding a new replica between the two compute checksums. That new replica won't have executed the first ComputeChecksum, but it will execute the second ComputeChecksum—at a different applied index! 💥 So the proximate cause here is the fact that ComputeChecksum is ranged instead of being keyed on the replica's start key. But I think there are deeper problems lurking with retries. ComputeChecksum doesn't appear to have any replay protection, but it definitely needs some.
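To see the failure mode in isolation, here's a toy model of the split. It is emphatically not DistSender code and every name in it is invented; the point is just that a ranged request split along stale cached boundaries carries the same checksum ID in both halves:

```go
package main

import "fmt"

// checksumReq is a stand-in for a ranged ComputeChecksum request; only the
// fields relevant to the splitting hazard are modeled.
type checksumReq struct {
	ChecksumID string // the same ID ends up in every split-off piece
	StartKey   string
	EndKey     string
}

// splitByCachedBoundaries mimics a sender that still has pre-merge descriptors
// cached and carves one ranged request into per-(stale-)range pieces.
func splitByCachedBoundaries(req checksumReq, boundaries []string) []checksumReq {
	var out []checksumReq
	start := req.StartKey
	for _, b := range boundaries {
		if b > start && b < req.EndKey {
			out = append(out, checksumReq{req.ChecksumID, start, b})
			start = b
		}
	}
	return append(out, checksumReq{req.ChecksumID, start, req.EndKey})
}

func main() {
	// The consistency queue issues one request over the post-merge span [a, z)...
	req := checksumReq{ChecksumID: "id-1", StartKey: "a", EndKey: "z"}
	// ...but the sender still believes there's a boundary at "m", so the (already
	// merged) range receives two commands with the same ID. A replica added in
	// between only ever executes the second one, at a different applied index.
	for _, piece := range splitByCachedBoundaries(req, []string{"m"}) {
		fmt.Printf("%+v\n", piece)
	}
}
```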
Previously, a ComputeChecksum command could apply twice with the same ID. Consider the following sequence of events:

1. DistSender sends a ComputeChecksum request to a replica.
2. The request is successfully evaluated and proposed, but a connection error occurs.
3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum ID in the Raft log. Somewhat amazingly, this typically wasn't problematic. If all replicas were online and reasonably up-to-date, they'd see the first ComputeChecksum command, compute its checksum, and store it in the checksums map. When they saw the duplicated ComputeChecksum command, they'd see that a checksum with that ID already existed and ignore it. In effect, only the first ComputeChecksum command for a given checksum ID mattered.

The problem occurred when one replica saw one ComputeChecksum command but not the other. There were two ways this could occur. A replica could go offline after computing the checksum the first time; when it came back online, it would have an empty checksum map, and the checksum computed for the second ComputeChecksum command would be recorded instead. Or a replica could receive a snapshot that advanced it past one ComputeChecksum but not the other. In both cases, the replicas could spuriously fail a consistency check.

A very similar problem occurred with range merges because ComputeChecksum requests are incorrectly ranged (see cockroachdb#29002). That means DistSender might split a ComputeChecksum request in two. Consider what happens when a consistency check occurs immediately after a merge: the ComputeChecksum request is generated using the up-to-date, post-merge descriptor, but DistSender might have the pre-merge descriptors cached, and so it splits the batch in two. Both halves of the batch would get routed to the same range, and both halves would have the same command ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the ComputeChecksum request is evaluated. If the request is retried, it will be properly assigned a new checksum ID. Note that we don't need to worry about reproposals causing duplicate commands, as the MaxLeaseIndex prevents proposals from replay.

The version compatibility story here is straightforward. The ReplicaChecksumVersion is bumped, so v2.0 nodes will turn ComputeChecksum requests proposed by v2.1 nodes into a no-op, and vice-versa. The consistency queue will spam some complaints into the log about this--it will time out while collecting checksums--but this will stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate storagebase.ReplicatedEvalResult.ComputeChecksum from roachpb.ComputeChecksumRequest to a dedicated storagebase.ComputeChecksum message. Separate types are more in line with how the merge/split/change replicas triggers work and avoid shipping unnecessary fields through Raft. Note that even though this migration changes logic downstream of Raft, it's safe. v2.1 nodes will turn any ComputeChecksum commands that were committed by v2.0 nodes into no-ops, and vice-versa, but the only effect of this will be some temporary consistency queue spam.

As an added bonus, because we're guaranteed that we'll never see duplicate v2.1-style ComputeChecksum commands, we can properly fatal if we ever see a ComputeChecksum request with a checksum ID that we've already computed.
† It would be possible to put the late-ID allocation behind a cluster version to avoid the log spam, but that amounts to allowing v2.1 to initiate known-buggy consistency checks. A bit of log spam seems preferable. Fix cockroachdb#28995.
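A minimal sketch of the gist of that fix, assuming CockroachDB's `util/uuid` package; the type and function names here are illustrative, not the ones in the actual patch:

```go
import "github.com/cockroachdb/cockroach/pkg/util/uuid"

// computeChecksumTrigger stands in for the dedicated downstream-of-Raft
// payload; the point is that its ID is minted at evaluation time.
type computeChecksumTrigger struct {
	ChecksumID uuid.UUID
	Version    uint32
}

// evalComputeChecksum assigns the checksum ID when the request is evaluated,
// so a DistSender-level retry (which evaluates again) gets a fresh ID instead
// of replaying the old one. The ID is also returned so the caller can hand it
// to whoever later collects the checksums from the replicas.
func evalComputeChecksum(version uint32) (computeChecksumTrigger, uuid.UUID) {
	id := uuid.MakeV4()
	return computeChecksumTrigger{ChecksumID: id, Version: version}, id
}
```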
ComputeChecksum was previously implemented as a range request, which meant it could be split by DistSender, resulting in two ComputeChecksum requests with identical IDs! If those split ranges get routed to the same range (e.g. because the ranges were just merged), spurious checksum failures could occur (cockroachdb#28995). Plus, the ComputeChecksum request would not actually look at the range boundaries in the request header; it always operated on the range's entire keyspace at the time the request was applied. The fix is simple: make ComputeChecksum a point request. There are no version compatibility issues here; nodes with this commit are simply smarter about routing ComputeChecksum requests to only one range. Fix cockroachdb#29002. Release note: None
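And the complementary point-request idea from this commit, again as a simplified sketch with stand-in types rather than the real roachpb headers:

```go
// pointChecksumHeader sketches the addressing change: the request names only
// the range's start key (no end key), so the batch maps onto exactly one range
// and there is nothing for the sender to split, while the command itself still
// checksums the replica's entire keyspace when it applies.
type pointChecksumHeader struct {
	Key    string
	EndKey string // left empty on purpose: this is a point request
}

func makeComputeChecksumHeader(rangeStartKey string) pointChecksumHeader {
	return pointChecksumHeader{Key: rangeStartKey}
}
```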
29067: storage: protect ComputeChecksum commands from replaying r=tschottdorf a=benesch (commit message as above). Fix #28995.

29083: storage: fix raft snapshots that span merges and splits r=tschottdorf a=benesch The code that handles Raft snapshots that span merges did not account for snapshots that spanned merges AND splits. Handle this case by allowing snapshot subsumption even when the snapshot's end key does not exactly match the end of an existing replica. See the commits within the patch for details. Fix #29080. Release note: None

29117: opt: fix LookupJoinDef interning, add tests r=RaduBerinde a=RaduBerinde Fixing an omission I noticed in internLookupJoinDef and adding missing tests for interning defs. Release note: None

Co-authored-by: Nikhil Benesch <[email protected]> Co-authored-by: Radu Berinde <[email protected]>
29079: storage: make ComputeChecksum a point request r=tschottdorf a=benesch (commit message as above). Fix #29002. Release note: None

29145: workload: make split concurrency constant r=nvanbenschoten a=nvanbenschoten I had originally made the split concurrency for workload dynamic, based on the concurrency of the workload itself. This turned out to be a bad idea, as it allowed for too much contention during pre-splitting and resulted in lots of split retries. The end result was that splits slowed down over time instead of staying at a constant rate. This change makes the split concurrency constant, like it already is for fixture restoration: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/fixture.go#L449 This results in pre-splits on large clusters being more stable and taking much less time (~50%). Release note: None

Co-authored-by: Nikhil Benesch <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/eb54cb65ec8da407c8ce5e971157bb1c03efd9e8
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=855884&tab=buildLog