-
Notifications
You must be signed in to change notification settings - Fork 4k
mma: convert kvpb.ReplicationChanges to a single change per store #157118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! It's annoying that we don't have a clear contract on what shapes the incoming replication changes can take. I see more than one TODO comment musing about real world applicability of various scenarios. I'll take a look, maybe we can use a [2]slice in the plan code and clarify this once and for all (and make it an obvious breaking change should this ever be relaxed). Will follow up on that separately.
@tbg reviewed 5 of 5 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
-- commits line 8 at r1:
Was this causing any noticeable issues? What was the effect of the previous status quo? I assume it would've triggered an assertion? I don't see a regression test. We should have one.
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 56 at r1 (raw file):
// convertReplicaChangeToMMA converts a replica change to a mma range change. // It will be passed to mma.RegisterExternalChange. func convertReplicaChangeToMMA(
Can you make a unit test for this? We don't need the entire harness thing, an echotest is probably the sweet spot for keeping things easy for you but tested and easily extended in the future. Something like
cockroach/pkg/util/admission/io_load_listener_test.go
Lines 335 to 348 in 797e646
| for _, tt := range tests { | |
| buf.Printf("%s:\n", tt.name) | |
| ioll := &ioLoadListener{ | |
| settings: cluster.MakeTestingClusterSettings(), | |
| l0CompactedBytes: metric.NewCounter(l0CompactedBytes), | |
| l0TokensProduced: metric.NewCounter(l0TokensProduced), | |
| } | |
| res := ioll.adjustTokensInner( | |
| ctx, tt.prev, tt.l0Metrics, 12, cumStoreCompactionStats{numOutLevelsGauge: 1}, 0, | |
| pebble.ThroughputMetric{}, 100, 10, 0, 0.50, 10, 100, false) | |
| buf.Printf("%s\n", res) | |
| } | |
| echotest.Require(t, string(redact.Sprint(buf)), filepath.Join(datapathutils.TestDataPath(t, "format_adjust_tokens_stats.txt"))) | |
| } |
Adding the test in a first commit (persisting the "before") and then updating it with the main commit.
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 124 at r1 (raw file):
} for _, chg := range changes { switch {
switch chg.ChangeType {
case roachpb.ADD_VOTER, roachpb.ADD_NON_VOTER:
// ...
case roachpb.REMOVE_VOTER, roachpb.REMOVE_NON_VOTER:
// ...
default:
}pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 216 at r1 (raw file):
// leaseholder status. This includes promotion/demotion changes. func (rc ReplicaChange) isUpdate() bool { return rc.prev.ReplicaID >= 0 && (rc.next.ReplicaID >= 0 || rc.next.ReplicaID == unknownReplicaID)
What is going on here?
I don't understand this logic:
- In
rc.prev >= 0, what's the 0 for? Don't replicaIDs start at 1? Or are we interpreting 0 as a valid replicaID in mma? - we're saying the change is an update if "updates replica type or leaseholder status" but then we don't look at the replica type or leaseholder status at all.
Taking a step back, "update" probably contrasts with removal or addition: it's a case in which the replica was there and remains there. Now, does a change in ReplicaID count as an update? I think in the "real world" promoting, say, an incoming voter to a voter doesn't change the replicaID. So replicaID changes imply that there was an intermittent removal. But the conditional here is all about replicaIDs, though they never get compared.
I understand this so little that I can't even figure out what it should do. Could you help clear this up by improving code and/or comment?
I see reading below that unknownReplicaID is used for the replica type changes. Okay. That part makes sense then, though it's also pretty opaque.
Taking another step back - why are we shying away from explicitly encoding what a ReplicaChange is trying to be? Rather than methods trying to reverse engineer it, the change could simply have an enum, no? We have an enum already. I don't see why we're not using it. Import cycle should be easy to avoid if there is one.
Have finished the review now, and still unsure why a change of the next.ReplicaID semantics was necessary as of this PR.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 363 at r1 (raw file):
target roachpb.ReplicationTarget, ) ReplicaChange { next.ReplicaID = unknownReplicaID
Does
cockroach/pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go
Lines 169 to 185 in 797e646
| // NB: 0 is not a valid ReplicaID, but this component does not care about | |
| // this level of detail (the special consts defined above use negative | |
| // ReplicaID values as markers). | |
| // | |
| // Only following cases can happen: | |
| // | |
| // - prev.replicaID >= 0 && next.replicaID == noReplicaID: outgoing replica. | |
| // prev.IsLeaseholder can be true or false, since we can transfer the | |
| // lease as part of moving the replica. | |
| // | |
| // - prev.replicaID == noReplicaID && next.replicaID == unknownReplicaID: | |
| // incoming replica, next.ReplicaType must be VOTER_FULL or NON_VOTER. | |
| // next.IsLeaseholder can be true or false. | |
| // | |
| // - prev.replicaID >= 0 && next.replicaID >= 0: can be a change to | |
| // IsLeaseholder, or ReplicaType. next.ReplicaType must be VOTER_FULL or | |
| // NON_VOTER. |
need an update?
Noting my preference of finding a way to make this complicated way of expressing a type change unnecessary.
// - prev.replicaID >= 0 && next.replicaID >= 0: can be a change to
// IsLeaseholder, or ReplicaType. next.ReplicaType must be VOTER_FULL or
// NON_VOTER.
It would be unfortunate if these previously intuitive rules changed.
pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 835 at r1 (raw file):
var changes []*pendingReplicaChange for _, c := range change.pendingReplicaChanges { ch, ok := a.cs.pendingChanges[c.changeID]
This rename would've made an excellent separate commit, stripping down the main review. For the future.
|
Actually I can follow up right now. Take a look here: cockroach/pkg/kv/kvserver/client_atomic_membership_change_test.go Lines 113 to 122 in 797e646
This is a length-four slice of replication changes. This should work, even though we may not issue them like this anywhere in production code. I wouldn't be surprised if we will, one day. While we're here, I'd appreciate chewing through both of the TODOs: and make sure things "just work". Looking at cockroach/pkg/kv/kvserver/replica_command.go Lines 1397 to 1418 in 797e646
but it's unfortunate that there is really nothing that ties these two together. It is something that could simply be documented on the replication changes type, and referenced from both the above code and the conversion code. The conversion code should then be extended to fix the TODOs by designating the first voter as leaseholder but updating the actual code to also use the first incoming one (it currently uses the last one I think but there's only ever one except in tests I'm pretty sure) Apologies about piling up more on your plate here, but this seems like the right opportunity to clean this up. |
wenyihu6
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall I audited this once before - I saw possibilities of 1 or 2 or 4 replica changes
cockroach/pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go
Lines 1875 to 1879 in af3267c
| if len(changes) != 1 && len(changes) != 2 && len(changes) != 4 { | |
| panic(errors.AssertionFailedf( | |
| "applying replica changes must be of length 1, 2, or 4 but got %v in %v", | |
| len(changes), changes)) | |
| } |
cockroach/pkg/kv/kvserver/allocator/plan/util.go
Lines 55 to 57 in c83c57d
| promo := kvpb.ReplicationChangesForPromotion(addTarget) | |
| demo := kvpb.ReplicationChangesForDemotion(removeTarget) | |
| chgs = append(promo, demo...) |
convertReplicaChangeToMMA with every possible combination in data driven tests like #151510 (comment). I can take this next since it seems important.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
797e646 to
9e3741b
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like the right opportunity to clean this up.
Done
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
Was this causing any noticeable issues?
It was violating the invariant, because the code in mma_conversion.go was incorrect. We don't go around asserting downstream of PendingRangeChange that there is at most one change per store. And there are only two places that construct a PendingRangechange:
- Code in allocator_state that calls
makeRebalanceReplicaChangesthat structurally only has 2 changes, and that callsMakeLeaseTransferChangeswhich also has 2 changes. - The code external to mma that calls the code in mma_conversion.go.
So I think it is sufficient to fix the places that construct.
I've added a comment in PendingRangeChange and rangeState stating this invariant -- I mistakenly thought it was already stated.
Regarding what it triggered -- there wasn't an assertion. It was discovered when I tried to remove the ability to handle 4 changes (see one of the TODOs that this PR deletes).
pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 835 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This rename would've made an excellent separate commit, stripping down the main review. For the future.
Ack. I accidentally did this first when fiddling with the code and then forgot to roll it back.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 216 at r1 (raw file):
Have finished the review now, and still unsure why a change of the next.ReplicaID semantics was necessary as of this PR.
We were never modeling a VOTER => NON_VOTER transition or vice versa as a single change before, so the isUpdate was only ever being called when the lease was being shed or gained. Once that changed, returning false from here in some transitive way caused an incorrect panic.
Or are we interpreting 0 as a valid replicaID in mma?
Yes, we assume 0 is not invalid. I didn't want to assume anything given even the declaration of ReplicaID says nothing
// ReplicaID is a custom type for a range replica ID.
type ReplicaID int32
Which is why the {unknown,no}ReplicaIDs were defined as negative values.
Taking another step back - why are we shying away from explicitly encoding what a
ReplicaChangeis trying to be? Rather than methods trying to reverse engineer it, the change could simply have an enum, no? We have an enum already. I don't see why we're not using it.
The enum is coarse. For example the ChangeReplica I added also can include the case where it is acquiring or shedding the lease. I've added a comment to that effect in the enum. Historically, the enum was added later for logging and for some integration code. I've changed the code to now use the enum, since we don't care about the narrow case of whether the lease is being gained or lost, which would require looking at prev and next.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 363 at r1 (raw file):
need an update?
This was always not quite right, but we didn't notice it because we were not doing promotions/demotions.
I've made a big update here.
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 56 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you make a unit test for this? We don't need the entire harness thing, an
echotestis probably the sweet spot for keeping things easy for you but tested and easily extended in the future. Something like.cockroach/pkg/util/admission/io_load_listener_test.go
Lines 335 to 348 in 797e646
for _, tt := range tests { buf.Printf("%s:\n", tt.name) ioll := &ioLoadListener{ settings: cluster.MakeTestingClusterSettings(), l0CompactedBytes: metric.NewCounter(l0CompactedBytes), l0TokensProduced: metric.NewCounter(l0TokensProduced), } res := ioll.adjustTokensInner( ctx, tt.prev, tt.l0Metrics, 12, cumStoreCompactionStats{numOutLevelsGauge: 1}, 0, pebble.ThroughputMetric{}, 100, 10, 0, 0.50, 10, 100, false) buf.Printf("%s\n", res) } echotest.Require(t, string(redact.Sprint(buf)), filepath.Join(datapathutils.TestDataPath(t, "format_adjust_tokens_stats.txt"))) } Adding the test in a first commit (persisting the "before") and then updating it with the main commit.
Will do.
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 124 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
switch chg.ChangeType { case roachpb.ADD_VOTER, roachpb.ADD_NON_VOTER: // ... case roachpb.REMOVE_VOTER, roachpb.REMOVE_NON_VOTER: // ... default: }
Done
ef5ab59 to
bc16e41
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 56 at r1 (raw file):
Previously, sumeerbhola wrote…
Will do.
I've added a datadriven unit test, including testing for errors and panics. In addition to the test cases, the output also required a code change, so there is no before-after this PR. We know this was incorrect before, so it seems fine to skip testing the before state.
I've also validated that test coverage is solid, except for two ""unimplemented change type" panics that will only happen if someone is not using the enum consts.
wenyihu6
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me - why does a store need to have only one pending change at a time (does this break any assumptions we’re relying on)?
I only skimmed through this and don’t plan to do a detailed review unless you think I should give a closer look.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does a store need to have only one pending change at a time (does this break any assumptions we’re relying on)?
For a range, we only want one change per store, so we can look at each change and look at the initial state and simply apply that change. We don't have to worry about multiple changes and ordering of the changes to figure out which is the intermediate state for the store and which is the final state. Similarly, it makes it easy to undo. I think the "A Note about Pending Change", covers some of this.
I only skimmed through this and don’t plan to do a detailed review unless you think I should give a closer look.
Up to you and @tbg on how to split the reviewing work for this PR.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
bc16e41 to
ee9734e
Compare
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing looks good, thanks! A few more comments where I don't feel like I 100% understand what's going on. But I think this is starting to wind down, thanks for pushing on.
@tbg reviewed 4 of 8 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 216 at r1 (raw file):
Yes, we assume 0 is not invalid. I didn't want to assume anything given even the declaration of
ReplicaIDsays nothing
Yep, that's fair and no need to change. You could add that explicitly on our internal replica ID - saying that we treat zero as a just any other regular value even though that isn't true in production at time of writing. This will clarify things for readers coming from kvserver.
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 56 at r1 (raw file):
Previously, sumeerbhola wrote…
I've added a datadriven unit test, including testing for errors and panics. In addition to the test cases, the output also required a code change, so there is no before-after this PR. We know this was incorrect before, so it seems fine to skip testing the before state.
I've also validated that test coverage is solid, except for two ""unimplemented change type" panics that will only happen if someone is not using the enum consts.
Nice, thank you!
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 161 at r3 (raw file):
if lhBeingRemoved && !pickedLeaseholder { return mmaprototype.PendingRangeChange{}, errors.Errorf("leaseholder being removed but no new leaseholder picked from %v", changes)
were you intentional about this being an error, not an assertion failed error?
pkg/kv/kvserver/mmaintegration/allocator_sync.go line 189 at r3 (raw file):
mmaChange, err = convertReplicaChangeToMMA(desc, usage, changes, leaseholderStoreID) if err != nil { log.KvDistribution.Errorf(ctx, "failed to convert replica change to mma: %v", err)
Is this being handled correctly? I'm noticing that we're falling through. mmaChange is probably empty. Needs a comment if this is truly what we want, including explaining the consequences.
pkg/kv/kvserver/mmaintegration/mma_conversion_test.go line 37 at r3 (raw file):
RaftCPUNanosPerSecond: 1, } datadriven.RunTest(t, datapathutils.TestDataPath(t, "convert_replica_change_to_mma"),
Could you generally use t.Name() as the path component for dd fixtures (instead of convert_replica_change_to_mma in this case)? That makes it easy to navigate to the testdata similar to how one would navigate to the test.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 1911 at r1 (raw file):
// test produces a change (when running under SMA) which is: // // r10 type: RemoveReplica target store n3,s3 (replica-id=5 type=NON_VOTER)->(replica-id=none type=VOTER_FULL)
👍
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 176 at r3 (raw file):
rangeID roachpb.RangeID // NB: 0 is not a valid ReplicaID, but this component does not care about
👍 nice comment, thanks!
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 253 at r3 (raw file):
// isPromoDemo returns true if the change is a promotion or demotion of a // replica.
// (potentially one sandwiched with a lease change).
Is there a world where we simplify things by treating a lease change as a ChangeReplica? It's a transition from one type to the same type, but while transferring the lease.
Currently we have this wrinkle that leases are represented differently depending on whether there's also a type change. The above suggestion would avoid that. Any reason to think this can't work?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 546 at r3 (raw file):
} // TODO(sumeer): A single PendingRangeChange can model a bunch of replica
👍
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 591 at r3 (raw file):
// ReplicationChanges returns the replication changes for the pending range // change. It panics if the pending range change is not a change replicas
I know this references the method, but really it's a bit hard to understand. There are multiple changes in there, and some may be promo/demos. But additions and removals are also in there. But not lease transfers. It's a weird set of things that is allowed here, no?
Knowing how kvserver code works, this makes sense. We're translating the pending changes into replication actions, but kvserver doesn't represent the lease transfers as replication changes (which is probably something to improve, but not something we're going to fix right here). So we just want to "skip" anything lease related.
But now with the amalgamation of lease transfers and promos/demos, this method is in sort of an awkward middle ground: it panics when it sees a lease transfer, but just ignores the lease transfer if it's represented through a promo/demo. Wouldn't it be more consistent to allow any incoming slice of pending changes, and to simply skip over the lease parts in general?
The above explanation for this behavior (kvserver doesn't represent lease transfers as part of the mapped decision but decides on them ad-hoc later), if correct, would make a good explanatory comment to this method IMO.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 603 at r3 (raw file):
panic("unknown change type") case AddLease, RemoveLease: panic("lease changes are not a change replicas")
nit: the below is more standard and also works when the enum extends.
default:
panic(errors.AssertionFailedf("change type %v is not a change replicas", c.replicationChangeType)
case ChangeReplica, AddReplica, RemoveReplica:
// These are the only permitted cases.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 604 at r3 (raw file):
case AddLease, RemoveLease: panic("lease changes are not a change replicas") }
Do you understand why it's add-then-remove and not the other way around? Is it inconsequential?
Suggested comment:
// KV represents a change in replica type as an addition and a removal of the same replica.
// For example, if a replica changes from VOTER_FULL to VOTER_OUTGOING, we will emit
// an ADD_VOTER, followed by a REMOVE_VOTER.
// The order of add and remove is [insert explanation here, either it doesn't matter, or it's just kvserver convention?]
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 607 at r3 (raw file):
if c.replicaChangeType == ChangeReplica || c.replicaChangeType == AddReplica { chg := kvpb.ReplicationChange{Target: c.target} switch c.next.ReplicaType.ReplicaType {
so the type can't be VOTER_OUTGOING because that's leaving a joint config, and we don't do that here, right? A short comment to this effect could help readers who are trying to understand why not all cases are handled here (and explaining why). My understanding:
we don't
- remove VOTER_OUTGOING
- remove VOTER_DEMOTING_LEARNER
- remove VOTER_DEMOTING
because they're all states that are only relevant when a joint quorum is left, and that's not an operation we emit.
We do - remove VOTER_FULL
- remove NON_VOTER
but shouldn't we
- be able to remove LEARNER? I'm guessing we don't have to because the kvserver machinery will always remove learners before emitting another change, so they're not really a visible state.
- not have to handle removing _INCOMING etc? I thought we would only emit changes that start from a non-joint quorum.
- Similar vein, I'd expect to have to handle adding a VOTER_INCOMING (etc).
What am I missing?
ee9734e to
216a443
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 216 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yes, we assume 0 is not invalid. I didn't want to assume anything given even the declaration of
ReplicaIDsays nothingYep, that's fair and no need to change. You could add that explicitly on our internal replica ID - saying that we treat zero as a just any other regular value even though that isn't true in production at time of writing. This will clarify things for readers coming from
kvserver.
Added a comment
// NB: In practice, the kvserver code never generates 0 as a valid ReplicaID.
// The MMA code does not special case 0 to be an invalid value, because as of
// the time of writing this comment, the 0 value being invalid was an
// undocumented invariant of the kvserver code. Instead, the code here uses
// two negative values to represent special cases.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 253 at r3 (raw file):
// (potentially one sandwiched with a lease change).
Done
Is there a world where we simplify things by treating a lease change as a ChangeReplica? It's a transition from one type to the same type, but while transferring the lease.
Currently we have this wrinkle that leases are represented differently depending on whether there's also a type change. The above suggestion would avoid that. Any reason to think this can't work?
It probably can, but it isn't clear it is a unification that will simplify code since (a) integration code, both inside MMA and at higher layers that make those changes are different, (b) logging for "only lease" changes is slightly different, so that will need to demux. Definitely not in scope of this PR. And there are other more important pending changes modeling changes to be made after this PR, that I would like to tackle.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 591 at r3 (raw file):
Wouldn't it be more consistent to allow any incoming slice of pending changes, and to simply skip over the lease parts in general?
Skipping over something makes me nervous, since it is a silent incomplete translation.
But now with the amalgamation of lease transfers and promos/demos, this method is in sort of an awkward middle ground: it panics when it sees a lease transfer, but just ignores the lease transfer if it's represented through a promo/demo. Wouldn't it be more consistent to allow any incoming slice of pending changes, and to simply skip over the lease parts in general?
I don't think it is much of an amalgamation. It is only allowed to have AddReplica, RemoveReplica, ChangeReplica. The deficiency is in the kvserver integration code, that wants to (a) think in terms of a PendingRangeChange being IsChangeReplicas (existing todo where IsChangeReplicas is declared), (b) has no way to explicitly model which ADD_VOTER is the new leaseholder. For (b), I've added a todo here:
// TODO(tbg): The ReplicationChanges can include a new leaseholder replica,
// but the incoming leaseholder is not explicitly represented in
// kvpb.ReplicationChanges. This is an existing modeling deficiency in the
// kvserver code. In Replica.maybeTransferLeaseDuringLeaveJoint the first
// VOTER_INCOMING is considered the new leaseholder. So the code below places
// the new leaseholder (an ADD_VOTER) at index 0. It is not clear whether this
// is sufficient for all integration use-cases. Verify and fix as needed.
I've also changed the code
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 603 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: the below is more standard and also works when the enum extends.
default: panic(errors.AssertionFailedf("change type %v is not a change replicas", c.replicationChangeType) case ChangeReplica, AddReplica, RemoveReplica: // These are the only permitted cases.
Done
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 604 at r3 (raw file):
Do you understand why it's add-then-remove and not the other way around? Is it inconsequential?
I assumed it is inconsequential.
I've added the following comment (I assumed your comment was intended to say NON_VOTER and not VOTER_OUTGOING)
// The kvserver code represents a change in replica type as an
// addition and a removal of the same replica. For example, if a
// replica changes from VOTER_FULL to NON_VOTER, we will emit a pair
// of {ADD_NON_VOTER, REMOVE_VOTER} for the replica. The ordering of
// this pair does not matter.
//
// TODO(tbg): confirm that the ordering does not matter.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 607 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
so the type can't be VOTER_OUTGOING because that's leaving a joint config, and we don't do that here, right? A short comment to this effect could help readers who are trying to understand why not all cases are handled here (and explaining why). My understanding:
we don't
- remove VOTER_OUTGOING
- remove VOTER_DEMOTING_LEARNER
- remove VOTER_DEMOTING
because they're all states that are only relevant when a joint quorum is left, and that's not an operation we emit.
We do- remove VOTER_FULL
- remove NON_VOTER
but shouldn't we
- be able to remove LEARNER? I'm guessing we don't have to because the kvserver machinery will always remove learners before emitting another change, so they're not really a visible state.
- not have to handle removing _INCOMING etc? I thought we would only emit changes that start from a non-joint quorum.
- Similar vein, I'd expect to have to handle adding a VOTER_INCOMING (etc).
What am I missing?
Note, this is the next value (since this is the add case), which is required to be VOTER or NON_VOTER. I've added a comment in the declaration of ReplicaChange:
// The ReplicaType in next is either the zero value (for removals), or
// {VOTER_FULL, NON_VOTER} for additions/change, i.e., it represents the
// final goal state.
I've also added a mapReplicaTypeToVoterOrNonVoter method that is used when constructing the ReplicaChange.
For removals, we are handling all types. Well, I forgot to handle VOTER_DEMOTING_NON_VOTER, but that is fixed now, since it uses the same map function.
// Note: We're not using VOTER_OUTGOING since 20.1. We're using VOTER_DEMOTING
// instead. See #42251.
VOTER_OUTGOING ReplicaType = 3
pkg/kv/kvserver/mmaintegration/allocator_sync.go line 189 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this being handled correctly? I'm noticing that we're falling through.
mmaChangeis probably empty. Needs a comment if this is truly what we want, including explaining the consequences.
Good catch. Bug. Fixed it. But not tested since there isn't a test for this code yet.
pkg/kv/kvserver/mmaintegration/mma_conversion.go line 161 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
were you intentional about this being an error, not an assertion failed error?
This is defensive. It is the only error case that doesn't panic. Feel free to strengthen this in the future.
216a443 to
c45f41f
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/mmaintegration/mma_conversion_test.go line 37 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you generally use
t.Name()as the path component for dd fixtures (instead of convert_replica_change_to_mma in this case)? That makes it easy to navigate to the testdata similar to how one would navigate to the test.
Done
c45f41f to
acacc58
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 591 at r3 (raw file):
Previously, sumeerbhola wrote…
Wouldn't it be more consistent to allow any incoming slice of pending changes, and to simply skip over the lease parts in general?
Skipping over something makes me nervous, since it is a silent incomplete translation.
But now with the amalgamation of lease transfers and promos/demos, this method is in sort of an awkward middle ground: it panics when it sees a lease transfer, but just ignores the lease transfer if it's represented through a promo/demo. Wouldn't it be more consistent to allow any incoming slice of pending changes, and to simply skip over the lease parts in general?
I don't think it is much of an amalgamation. It is only allowed to have AddReplica, RemoveReplica, ChangeReplica. The deficiency is in the kvserver integration code, that wants to (a) think in terms of a
PendingRangeChangebeingIsChangeReplicas(existing todo where IsChangeReplicas is declared), (b) has no way to explicitly model which ADD_VOTER is the new leaseholder. For (b), I've added a todo here:// TODO(tbg): The ReplicationChanges can include a new leaseholder replica, // but the incoming leaseholder is not explicitly represented in // kvpb.ReplicationChanges. This is an existing modeling deficiency in the // kvserver code. In Replica.maybeTransferLeaseDuringLeaveJoint the first // VOTER_INCOMING is considered the new leaseholder. So the code below places // the new leaseholder (an ADD_VOTER) at index 0. It is not clear whether this // is sufficient for all integration use-cases. Verify and fix as needed.I've also changed the code
I've modified TestConvertReplicaChangeToMMA to also test that the new leaseholder is first in the index.
acacc58 to
e75ce59
Compare
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have a few comments open, but approving - thanks for your work on this. I'm happy to defer all other discussions to follow-up work, except the small ones to avoid this PR clashing with my in-process refactor. And of course you'll want to review the commits I added.
The discussion item that I'm most curious about understanding at this point is the one about the panic in rebalanceStores. But that is not for this PR imo.
@tbg reviewed 1 of 8 files at r2, 1 of 3 files at r4, 3 of 3 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 253 at r3 (raw file):
Previously, sumeerbhola wrote…
// (potentially one sandwiched with a lease change).
Done
Is there a world where we simplify things by treating a lease change as a ChangeReplica? It's a transition from one type to the same type, but while transferring the lease.
Currently we have this wrinkle that leases are represented differently depending on whether there's also a type change. The above suggestion would avoid that. Any reason to think this can't work?It probably can, but it isn't clear it is a unification that will simplify code since (a) integration code, both inside MMA and at higher layers that make those changes are different, (b) logging for "only lease" changes is slightly different, so that will need to demux. Definitely not in scope of this PR. And there are other more important pending changes modeling changes to be made after this PR, that I would like to tackle.
I've made the simplification and find it worthwhile - adding a commit. Please take a look, this all went off without a hitch. Having just one way to represent lease transfers is much clearer to me. Determining whether a particular pending change is a lease transfer is also easy and we had that code already (and were calling it all over the place).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 607 at r3 (raw file):
Previously, sumeerbhola wrote…
Note, this is the
nextvalue (since this is the add case), which is required to be VOTER or NON_VOTER. I've added a comment in the declaration ofReplicaChange:// The ReplicaType in next is either the zero value (for removals), or // {VOTER_FULL, NON_VOTER} for additions/change, i.e., it represents the // final goal state.I've also added a
mapReplicaTypeToVoterOrNonVotermethod that is used when constructing theReplicaChange.For removals, we are handling all types. Well, I forgot to handle
VOTER_DEMOTING_NON_VOTER, but that is fixed now, since it uses the same map function.// Note: We're not using VOTER_OUTGOING since 20.1. We're using VOTER_DEMOTING // instead. See #42251. VOTER_OUTGOING ReplicaType = 3
Thanks, this looks much cleaner.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 415 at r6 (raw file):
} // TODO: we will also try to rebalance away from this store if it is // currently in VOTER_DEMOTING_LEARNER, which seems unnecessary.
Could you please limit modifications to rebalanceStores (writing notes and TODOs is fine and appreciated, but do me a favor and keep them in a separate commit so I can deal with them easily in #157224 which starts taking apart this large function, thank you!)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 423 at r6 (raw file):
// lack of unit tests. // // TODO: should this panic be removed now?
Trying to wrap my head around this. First, we now that rstate has no pending changes here, so it can't be some sequencing issue between RegisterExternalChange/AdjustPendingChangeDisposition.
So I think this implies that if we have the lease, but then it externally gets transferred away to a non-local store and mma gets notified via AdjustPendingChangeDisposition, we'll be in an inconsistent state where cs.ranges[rangeID] has some problem. I'm not sure what the nature of the problem would be. Wouldn't it "just" have the same replica set, with the leaseholder on whatever store it got transferred to? I.e. the replicaRole for store.StoreID should be unchanged - after all, nobody cares about the lease in replicaRole. If the replica got removed in addition to this - why would it still stay around in cs.ranges? I could see a world in which it got removed from cs.ranges but is still in topKReplicas, so we'll get a nil element from cs.ranges[rangeID] but if that happened we'd already have panicked by now.
So what is the real scenario here, since I'm clearly missing something?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 570 at r6 (raw file):
} // TODO(sumeer): A single PendingRangeChange can model a bunch of replica
➕
2fe7a40 to
e777230
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 415 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you please limit modifications to rebalanceStores (writing notes and TODOs is fine and appreciated, but do me a favor and keep them in a separate commit so I can deal with them easily in #157224 which starts taking apart this large function, thank you!)
Ack. I've removed these changes. Will send a separate PR.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 423 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Trying to wrap my head around this. First, we now that
rstatehas no pending changes here, so it can't be some sequencing issue between RegisterExternalChange/AdjustPendingChangeDisposition.
So I think this implies that if we have the lease, but then it externally gets transferred away to a non-local store and mma gets notified viaAdjustPendingChangeDisposition, we'll be in an inconsistent state wherecs.ranges[rangeID]has some problem. I'm not sure what the nature of the problem would be. Wouldn't it "just" have the same replica set, with the leaseholder on whatever store it got transferred to? I.e. the replicaRole forstore.StoreIDshould be unchanged - after all, nobody cares about the lease inreplicaRole. If the replica got removed in addition to this - why would it still stay around incs.ranges? I could see a world in which it got removed fromcs.rangesbut is still intopKReplicas, so we'll get a nil element fromcs.ranges[rangeID]but if that happened we'd already have panicked by now.
So what is the real scenario here, since I'm clearly missing something?
I didn't think through when adding the TODO -- it was mostly a note to come back to this.
We currently have the leaseholder provide the source-of-truth and then do rebalancing decision making, in an atomic step. And we recompute the top-k. So in that case this panic is valid. Our invariants around this are not well stated. I'll send a PR.
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 253 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I've made the simplification and find it worthwhile - adding a commit. Please take a look, this all went off without a hitch. Having just one way to represent lease transfers is much clearer to me. Determining whether a particular pending change is a lease transfer is also easy and we had that code already (and were calling it all over the place).
I'd prefer to undo this change from this PR and have you tackle this separately. The main reason is that I am nervous about prev.ReplicaType == next.ReplicaType, given that we always normalize the next.ReplicaType to be VOTER_FULL or NON_VOTER.
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg reviewed 1 of 1 files at r7, 2 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 253 at r3 (raw file):
Previously, sumeerbhola wrote…
I'd prefer to undo this change from this PR and have you tackle this separately. The main reason is that I am nervous about
prev.ReplicaType == next.ReplicaType, given that we always normalize thenext.ReplicaTypeto be VOTER_FULL or NON_VOTER.
SGTM, I'll send this separately for discussion.
Previously, a VOTER => NON_VOTER transition, or vice versa, was producing multiple changes for a store, which is invalid according to the contract of PendingRangeChange. This is now fixed by changing the logic in mmaintegration.convertReplicaChangeToMMA, and adding a mmaprototype.MakeReplicaTypeChange function. Informs cockroachdb#157049 Epic: CRDB-55052 Release note: None
e777230 to
ff8481a
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 234 at r11 (raw file):
// TODO(tbg): in MakeLeaseTransferChanges, next.ReplicaType.ReplicaType is // simply the current value, and not necessarily {VOTER_FULL, NON_VOTER}. // So the above comment is incorrect. We should clean this up.
Added this todo.
|
bors r=tbg |
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 253 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
SGTM, I'll send this separately for discussion.
Previously, a VOTER => NON_VOTER transition, or vice versa, was producing multiple changes for a store, which is invalid according to the contract of PendingRangeChange. This is now fixed by changing the logic in mmaintegration.convertReplicaChangeToMMA, and adding a mmaprototype.MakeReplicaTypeChange function.
Informs #157049
Epic: CRDB-55052
Release note: None