Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rac2: introduce log storage tracker #129751

Merged
merged 1 commit into from
Sep 3, 2024

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Aug 27, 2024

This PR introduces the data structure that observes all raft log storage writes, syncs and logical admissions, and tracks the stable/admitted log state correspondingly.

Log writes are ordered by leader term, then by index. Log writes have no gaps, except when a snapshot clears the log. All writes/syncs must be registered. Admissions can be registered selectively, for entries that are subject to RACv2.

Part of #129508

@pav-kv pav-kv requested review from a team as code owners August 27, 2024 20:13
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv
Copy link
Collaborator Author

pav-kv commented Aug 27, 2024

First commit is #129747.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Aug 27, 2024

@sumeerbhola The flow will be roughly as follows:

  1. Initialize from raft's LogMark, next to RawNode/RaftNode initialization.
  2. Call Append in ready handler, so that all subsequent log writes are registered with the tracker.
  3. Call Snap in ready handler, after persisting a snapshot, so that stable index is maintained correctly.
  4. Call Sync in OnLogSync, which is exhaustive (both blocking and non-blocking log writes call it). The synced LogMark can be deduced from msgs in OnLogSync for now.
  5. (in a follow-up PR) extend Append call with pri (or maybe there will be a separate call for this) to register per-priority entries; call a method from AdmitRaftEntries to admit them in the tracker.

Then wrap/integrate this data structure with mutexes and "dirty bits" so that index changes are scheduled for downstream processing.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

type LogMark = raft.LogMark

// LogTracker tracks the durable state of a raft log.

Does this need to be public, given it is used inside rac2?

I am assuming this corresponds to admittedTracker in the prototype. If yes, I am unsure whether these methods are actually exactly what we want. For instance we did not need NextIndex or Stable in admittedTracker and we were appending single entries (since it needed to call waitingForAdmissionState) which doesn't correspond to Append. Also, we didn't bother with Snap since if there were entries before the snapshot that were waiting for admission, we are incurring the write cost for them, so we want to keep them. One could argue that this choice is not actually useful since the leader will have returned the tokens anyway (which happens to be an implementation detail of the leader simply because it cannot guarantee liveness of admitted advancing for a replica in StateSnapshot -- we could potentially improve that).

For such a class, I would prefer to see the final version with the interface that we actually need, so that we minimize the interface churn, and minimize the interface surface area that the unit test needs to exercise.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

Previously, sumeerbhola wrote…

Does this need to be public, given it is used inside rac2?

I am assuming this corresponds to admittedTracker in the prototype. If yes, I am unsure whether these methods are actually exactly what we want. For instance we did not need NextIndex or Stable in admittedTracker and we were appending single entries (since it needed to call waitingForAdmissionState) which doesn't correspond to Append. Also, we didn't bother with Snap since if there were entries before the snapshot that were waiting for admission, we are incurring the write cost for them, so we want to keep them. One could argue that this choice is not actually useful since the leader will have returned the tokens anyway (which happens to be an implementation detail of the leader simply because it cannot guarantee liveness of admitted advancing for a replica in StateSnapshot -- we could potentially improve that).

For such a class, I would prefer to see the final version with the interface that we actually need, so that we minimize the interface churn, and minimize the interface surface area that the unit test needs to exercise.

I'll add a draft PR on top with the integration and fine-tune the methods.

Rationale for Stable/NextIndex is to replace the similar methods in RaftNode, at least in some intermediate version of my integration code this was useful. Maybe in the final version this will go away. There was no intention for the exact API match with your PR (it can be fine-tuned in follow-ups), this implements the first approximation of the concept and documents some expectations from caller/storage (the writes ordering etc).

Re Snap, the purpose is not to advance admitted indices (they can lag behind the snapshot or be updated if we want; we can change this aspect later if needed - having an explicit Snap call makes it trivial), but rather to update the stable index. If we don't do this, snapshots are implicit and indistinguishable from regular log appends. However, regular log appends can't have gaps, while snapshots can introduce them - these are different cases. I am inclined to be strict here.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

I'll add a draft PR on top with the integration and fine-tune the methods.

sounds good

There was no intention for the exact API match with your PR

That's totally fine, in that it doesn't need to match. What I would like is for it to match the purpose it needs to fulfill. Extra code is not worth the maintenance burden -- we can add it when it becomes necessary.

Rationale for Stable/NextIndex is to replace the similar methods in RaftNode, at least in some intermediate version of my integration code

I think I am starting to see the reason for the slight difference in approach here. The admittedTracker had a stable LogMark that could be ahead or behind the RawNode, and even the ahead case was best-effort in that it happened when sending the MsgAppResp. Perhaps what you have in mind is something stronger in that it is being kept more up-to-date, by it being informed of Snap and guaranteeing that storage will call Sync. Perhaps this leads to stronger invariant checking -- is that what you mean by ensuring there are no gaps and "inclined to be strict here"?
It seems to me that admittedTracker/LogTracker, even in this more up-to-date world can still lag RawNode, specifically in the Snap case. Is my understanding accurate?

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

Previously, sumeerbhola wrote…

I'll add a draft PR on top with the integration and fine-tune the methods.

sounds good

There was no intention for the exact API match with your PR

That's totally fine, in that it doesn't need to match. What I would like is for it to match the purpose it needs to fulfill. Extra code is not worth the maintenance burden -- we can add it when it becomes necessary.

Rationale for Stable/NextIndex is to replace the similar methods in RaftNode, at least in some intermediate version of my integration code

I think I am starting to see the reason for the slight difference in approach here. The admittedTracker had a stable LogMark that could be ahead or behind the RawNode, and even the ahead case was best-effort in that it happened when sending the MsgAppResp. Perhaps what you have in mind is something stronger in that it is being kept more up-to-date, by it being informed of Snap and guaranteeing that storage will call Sync. Perhaps this leads to stronger invariant checking -- is that what you mean by ensuring there are no gaps and "inclined to be strict here"?
It seems to me that admittedTracker/LogTracker, even in this more up-to-date world can still lag RawNode, specifically in the Snap case. Is my understanding accurate?

The LogTracker will be the most up-to-date for both entries and Snap case, since it will learn about storage syncs first. For log entry writes, it will learn from OnLogSync callbacks, before the local response is stepped to raft. For snapshot writes, it will learn from an equivalent OnSnapSync callback (hooked into raft ready handler after applySnapshot), which similarly happens before we inform raft.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

Perhaps this leads to stronger invariant checking -- is that what you mean by ensuring there are no gaps and "inclined to be strict here"?

Yes. Example:

  • The log is at (term=10, index=100).
  • We receive a snapshot for (term=15, index=200).
  • When it is synced, we could have called Sync(15, 200), but we haven't registered Append at index 200, so this is "in the future" / has a gap. Not nice for the invariant that Sync always comes after Append.
  • We could workaround by registering Append(0, {term: 15, index: 200}) before calling Sync. But this would regress stable index to 0. Alternatively, we could register Append(100, {term: 15, index: 200}), but there is no guarantee that entries <= 100 are all consistent with term 15, so we can't safely leave the stable index at 100.

So a stronger/cleaner option seems to be to factor out an explicit call for snapshot syncs.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Perhaps this leads to stronger invariant checking -- is that what you mean by ensuring there are no gaps and "inclined to be strict here"?

Yes. Example:

  • The log is at (term=10, index=100).
  • We receive a snapshot for (term=15, index=200).
  • When it is synced, we could have called Sync(15, 200), but we haven't registered Append at index 200, so this is "in the future" / has a gap. Not nice for the invariant that Sync always comes after Append.
  • We could workaround by registering Append(0, {term: 15, index: 200}) before calling Sync. But this would regress stable index to 0. Alternatively, we could register Append(100, {term: 15, index: 200}), but there is no guarantee that entries <= 100 are all consistent with term 15, so we can't safely leave the stable index at 100.

So a stronger/cleaner option seems to be to factor out an explicit call for snapshot syncs.

There are cases when stable update will be "lagging behind" in a sense that raft will learn about its regression first, and only during the ready handing the LogTracker will follow. Specifically, when a higher-term append or snapshot is below stable.

It's unclear which should the the source of truth (the RawNode or LogTracker), I think it depends on the caller. Need to think more about these cases.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

There are cases when stable update will be "lagging behind" in a sense that raft will learn about its regression first, and only during the ready handing the LogTracker will follow. Specifically, when a higher-term append or snapshot is below stable.

It's unclear which should the the source of truth (the RawNode or LogTracker), I think it depends on the caller. Need to think more about these cases.

@sumeerbhola Here is the integration commit that feeds the needed signals into LogTracker. Barring the datadriven test, the change looks simple.

About the LogTracker vs RawNode staleness. Once waitingForAdmissionState is integrated into LogTracker, we will have the nice property that all the state in LogTracker is consistent with the latest log write popped out of Ready. The only way LogTracker can lag behind RawNode is in not knowing that the LogMark.Term has been bumped (which potentially regresses stable index and admitted indices). This is fine though: when the term is bumped (there is a new leader), we don't have any updates to send to the new leader until we observe the first write for it. With the next Ready, the LogTracker will become up-to-date and will be able to send admitteds to the new leader.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

There are cases when stable update will be "lagging behind" in a sense that raft will learn about its regression first, and only during the ready handing the LogTracker will follow. Specifically, when a higher-term append or snapshot is below stable.

Yes, I was thinking of that case too.
I am slightly wary of turning this into a distraction. The purpose of LogTracker in the context of replication AC is narrow, and we don't want to overthink it.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 18 at r4 (raw file):

Previously, sumeerbhola wrote…

There are cases when stable update will be "lagging behind" in a sense that raft will learn about its regression first, and only during the ready handing the LogTracker will follow. Specifically, when a higher-term append or snapshot is below stable.

Yes, I was thinking of that case too.
I am slightly wary of turning this into a distraction. The purpose of LogTracker in the context of replication AC is narrow, and we don't want to overthink it.

What part is a distraction? Since we have this tracker component outside raft, it's inevitable that it falls out of sync with it. So we need to think in what way (described above: with steady leader we're more up-to-date than raft; with leader change and log regression we're one update behind) and how to handle it. Luckily here, it looks fine to me, since we know the Term in the tracker and can detect this case exactly: we should be sending admitted indices to a Term leader only if the Term in the tracker matches (meaning that the tracker is consistent with that leader's log).

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 5 files at r4, 1 of 7 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 33 at r5 (raw file):

	// Entries in (stable, last.Index] are still in the write queue.
	stable uint64
	// TODO(pav-kv): track log entry admissions too.

btw, instead of integrating this immediately into Processor I suspect it would be better to first unify this with waitingForAdmissionState. There are overlaps in that waitingForAdmissionState is doing a half-hearted job of dropping entries when the term advances (it only drops for priorities which the new term is appending, but if we want full consistency should do it for all).


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 49 at r5 (raw file):

// NextIndex returns the index following the last observed log index.
func (l *LogTracker) NextIndex() uint64 {
	return l.last.Index + 1

let's remove this if it is not needed.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 59 at r5 (raw file):

		// no gaps. Gaps can only appear when the log is cleared in response to
		// storing a snapshot, which is handled by the Snap method.
		return false

If all these false return values are simply going to result in panics in test builds (as in the integration), I think the panics should be embedded here and there shouldn't be a return value. Then the panic message can actually print out the specifics of the invariant that was violated. And for unit testing these paths we can catch the panic.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 60 at r5 (raw file):

		// storing a snapshot, which is handled by the Snap method.
		return false
	} else if to.Term > l.last.Term {

The invariant checking is distributed and was hard for me to reason about. Some of it is in this if-block and then some in the a later if-block. I see why this is structured this way, but would be easier with invariant comments, unnesting else-if blocks and reordering for the common case first. Something like:

	if to.Term < l.last.Term || after > l.last.Index {
		// Does not happen. Log writes are always ordered by (term, index), and have
		// no gaps. Gaps can only appear when the log is cleared in response to
		// storing a snapshot, which is handled by the Snap method.
		return false
	}
	// INVARIANT: t.Term >= l.last.Term && after <= l.last.Index
	if to.Term == l.last.Term {
		// Common case.
		if after != l.last.Index {
			// Does not happen. Log writes under the same leader don't have gaps.
			return false
		}
		l.last.Index = to.Index
		return true
	}
	// INVARIANT: t.Term > l.last.Term && after <= l.last.Index.
	l.last = to
	// NB: the effective stable index can potentially regress here, when the
	// leader term goes up. This corresponds to a situation when a new leader
	// overwrites a suffix of the log.
	l.stable = min(l.stable, after)
	return true


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 83 at r5 (raw file):

// completed writes can be skipped or invoked in any order, e.g. due to delivery
// concurrency. The tracker keeps the latest (in the logical sense) stable mark.
func (l *LogTracker) Sync(stable LogMark) bool {

nit: Can you rename these LogSynced and SnapSynced.Sync suggests a directive, and Snap suggests a snapshot may have arrived.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 33 at r5 (raw file):

Previously, sumeerbhola wrote…

btw, instead of integrating this immediately into Processor I suspect it would be better to first unify this with waitingForAdmissionState. There are overlaps in that waitingForAdmissionState is doing a half-hearted job of dropping entries when the term advances (it only drops for priorities which the new term is appending, but if we want full consistency should do it for all).

SGTM, I'll integrate with admissions first and update this PR.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 33 at r5 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

SGTM, I'll integrate with admissions first and update this PR.

Done. Working on the tests right now. I chose to inline waitingForAdmissionState because most of the term/regression checks are already done at the Append function level, so the Register call is very lightweight after all the term checks and truncations are done.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 49 at r5 (raw file):

Previously, sumeerbhola wrote…

let's remove this if it is not needed.

Done.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 60 at r5 (raw file):

Previously, sumeerbhola wrote…

The invariant checking is distributed and was hard for me to reason about. Some of it is in this if-block and then some in the a later if-block. I see why this is structured this way, but would be easier with invariant comments, unnesting else-if blocks and reordering for the common case first. Something like:

	if to.Term < l.last.Term || after > l.last.Index {
		// Does not happen. Log writes are always ordered by (term, index), and have
		// no gaps. Gaps can only appear when the log is cleared in response to
		// storing a snapshot, which is handled by the Snap method.
		return false
	}
	// INVARIANT: t.Term >= l.last.Term && after <= l.last.Index
	if to.Term == l.last.Term {
		// Common case.
		if after != l.last.Index {
			// Does not happen. Log writes under the same leader don't have gaps.
			return false
		}
		l.last.Index = to.Index
		return true
	}
	// INVARIANT: t.Term > l.last.Term && after <= l.last.Index.
	l.last = to
	// NB: the effective stable index can potentially regress here, when the
	// leader term goes up. This corresponds to a situation when a new leader
	// overwrites a suffix of the log.
	l.stable = min(l.stable, after)
	return true

Done.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 83 at r5 (raw file):

Previously, sumeerbhola wrote…

nit: Can you rename these LogSynced and SnapSynced.Sync suggests a directive, and Snap suggests a snapshot may have arrived.

Done.

@pav-kv pav-kv force-pushed the log-storage-tracker branch 3 times, most recently from 0fff10a to 019cebd Compare August 30, 2024 15:40
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 59 at r5 (raw file):

Previously, sumeerbhola wrote…

If all these false return values are simply going to result in panics in test builds (as in the integration), I think the panics should be embedded here and there shouldn't be a return value. Then the panic message can actually print out the specifics of the invariant that was violated. And for unit testing these paths we can catch the panic.

reminder


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 42 at r8 (raw file):

	stable uint64
	// admitted contains admitted log indices for each priority, in the last.Term
	// log coordinate system. Entries in (admitted[pri], last.Index], if any, are

I think we need the invariant admitted[i] <= stable. Also, the phrasing of this comment should be:
Entries in (admitted[pri], stable] ...
We can't say anything about entries beyond stable.

Oh, I see the code is imposing this invariant on the call to Admitted. Fine.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 107 at r8 (raw file):

		return false
	}
	// Invariant: after.Term > l.last.Term && after.Index <= l.last.Index.

nit: to.Term


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 167 at r8 (raw file):

// can be skipped or invoked in any order, e.g. due to delivery concurrency. The
// tracker keeps the latest (in the logical sense) admitted marks.
func (l *LogTracker) Admit(at LogMark, pri raftpb.Priority) bool {

nit: can you change this to Admitted. The methods requesting admission are named Admit in the existing code. I realize this could be considered a request to LogTracker to "admit" but this is more of a data-structural update so easier to comprehend as Admitted.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 177 at r8 (raw file):

	} else if at.Term == l.last.Term {
		l.admitted[pri] = max(l.admitted[pri], at.Index)
	}

I think the ignoring in the case at.Term != l.last.Term is a problem.

Unlike the ignoring that happens in LogSynced, where we are guaranteed that the next sync callback will be very soon (since we have sent it off to storage), there is no guarantee when the next entry will be admitted. It could be minutes later. We need to return these tokens.

For example, if entry 2 had term 3, and a new leader has called Append, and overwritten starting from 4, but none of those entries were subject to AC, then when we get the Admitted callback for (term=3, index=2) we should move admitted forward. I can see why the code is written this way, since we don't know how far to move admitted forward: we are not keeping track of the to value in Append. Even if we did, it doesn't help us since we have separated Register into a separate method, and we don't know if the caller has iterated past to in calling Register. And since Register is optional, we can't know.

One simple solution is to restore the idea that the member admitted vector indices are upper bounded by stable. We will want to keep a dirty bit for whether admitted has changed with the fuller integration (so we know whether to piggyback), that was done in the prototype. Hence, making the admitted vector the actual value that we return in GetAdmitted is more broadly useful. Then, with that invariant, we can simply advance to the stable index here (something couldn't have become stable without Register being called, if calling Register was needed). Since admitted is a function of the stable index and waiting, it simply needs to be updated whenever any of those inputs change.

@pav-kv pav-kv force-pushed the log-storage-tracker branch 2 times, most recently from 5d3c388 to ee10f49 Compare September 1, 2024 21:07
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 59 at r5 (raw file):

Previously, sumeerbhola wrote…

reminder

On it, just finishing up the algorithm/API first.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 42 at r8 (raw file):

Previously, sumeerbhola wrote…

I think we need the invariant admitted[i] <= stable. Also, the phrasing of this comment should be:
Entries in (admitted[pri], stable] ...
We can't say anything about entries beyond stable.

Oh, I see the code is imposing this invariant on the call to Admitted. Fine.

There can be entries beyond stable in the admission queue, just like there are entries beyond stable that are in the write queue. For example, the last entry can be in the queue. It is right to say that all entries waiting for admission are in (admitted[pri], last] range. They aren't dense though, there can be gaps (both because not all entries are registered, and because each priority is a subset or all entries, and because some entries might have already been admitted), hence "if any".

Might need to clarify this lack of density bit somewhere.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 107 at r8 (raw file):

Previously, sumeerbhola wrote…

nit: to.Term

Done.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 167 at r8 (raw file):

Previously, sumeerbhola wrote…

nit: can you change this to Admitted. The methods requesting admission are named Admit in the existing code. I realize this could be considered a request to LogTracker to "admit" but this is more of a data-structural update so easier to comprehend as Admitted.

Done. Renamed to LogAdmitted, to be consistent with LogSynced method. The Admitted name is reserved for the method that returns the current admitted vector.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 177 at r8 (raw file):

Previously, sumeerbhola wrote…

I think the ignoring in the case at.Term != l.last.Term is a problem.

Unlike the ignoring that happens in LogSynced, where we are guaranteed that the next sync callback will be very soon (since we have sent it off to storage), there is no guarantee when the next entry will be admitted. It could be minutes later. We need to return these tokens.

For example, if entry 2 had term 3, and a new leader has called Append, and overwritten starting from 4, but none of those entries were subject to AC, then when we get the Admitted callback for (term=3, index=2) we should move admitted forward. I can see why the code is written this way, since we don't know how far to move admitted forward: we are not keeping track of the to value in Append. Even if we did, it doesn't help us since we have separated Register into a separate method, and we don't know if the caller has iterated past to in calling Register. And since Register is optional, we can't know.

One simple solution is to restore the idea that the member admitted vector indices are upper bounded by stable. We will want to keep a dirty bit for whether admitted has changed with the fuller integration (so we know whether to piggyback), that was done in the prototype. Hence, making the admitted vector the actual value that we return in GetAdmitted is more broadly useful. Then, with that invariant, we can simply advance to the stable index here (something couldn't have become stable without Register being called, if calling Register was needed). Since admitted is a function of the stable index and waiting, it simply needs to be updated whenever any of those inputs change.

Done (differently)

Yeah, there is a buglet here. Another possible fix would be to set admitted[pri] to the index of the last entry removed by skip (if none were removed, this would be a no-op), and optionally bump it to waiting[0].Index - 1 or stable.

A better fix (in the updated version of this PR) is to not track admitted vector explicitly, since it is purely a function of stable and waiting slices:

admitted[pri] = min(stable, waiting[pri][0].Index-1)  // if `waiting[pri]` is non-empty
admitted[pri] = stable  // if empty

The invariant checks and the protocol of interaction with this data structure guarantees that in all Admitted() calls the indices will not regress (unless the term changes and rewrites a suffix).

We're going to need the "admitted changed" signal later, in the non-happy path handling. For now I'm adding the happy path in which the admission happens synchronously during the append, and we're always annotating MsgAppResp with the current Admitted() vector. Adding the "admitted changed" will be needed for piggybacking in a follow-up PR, and will be easy - I'll repurpose the bool return values in Append/LogSynced/LogAdmitted/SnapSynced methods to indicate this "dirty bit" to the caller.

@pav-kv pav-kv force-pushed the log-storage-tracker branch 3 times, most recently from 1526cbd to 75ca7e9 Compare September 2, 2024 13:08
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 59 at r5 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

On it, just finishing up the algorithm/API first.

Done.

@pav-kv pav-kv force-pushed the log-storage-tracker branch 3 times, most recently from e49d925 to 90c90ac Compare September 2, 2024 13:48
@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 2, 2024

@sumeerbhola This is now in a shape that can be integrated with the happy path. See integration in b8b0499df3d8495df12006aa4e2b80e024983808 (basic integration) and 47b2b4677b55a01e65c509b4a240a2b22a2aaa3e (wrapping LogTracker with a mutex and "dirty bits" for handling the less happy paths and scheduling the admitted vector delivery).

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 177 at r8 (raw file):
I won't get a chance to look at this until later today, but a couple of high-level comments -- they may not apply but you can consider them.

We're going to need the "admitted changed" signal later, in the non-happy path handling. For now I'm adding the happy path

In general I'm wary of such an incremental approach since it results in significant changes when the full code is developed. It results in more code being written and reviewed than is in the final product. As I said earlier, this may not be applicable here, but worth considering.

repurpose the bool return values in Append/LogSynced/LogAdmitted/SnapSynced methods to indicate this "dirty bit" to the caller.

The case where the dirty bit was primarily useful for is the callback from AC. The callbacks for storage syncs are coarser, and we we were anyway going to append the latest admitted regardless of whether it was dirty or not in the MsgAppResp caused by these storage callbacks. The callback from AC needs to

  • be quick and encounter minimal lock contention
  • should not allocate
    We were using it mark the dirty bit and schedule ready processing (if it was not already scheduled), and in ready processing enqueueing into the piggybacker.
    The piggybacker has a shared mutex across the node, though it is quite narrow. One could argue that it could be acquired in the AC callback, but I slightly worry here. Also, the message enqueued in the piggybacker causes an allocation (conversion from the admitted array to a slice that needs to be placed in the proto message).

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 177 at r8 (raw file):

As I said earlier, this may not be applicable here, but worth considering.

I don't expect LogTracker code to change much now. The way I'm structuring this is that LogTracker contains the core algorithm which is pretty much done (modulo this PR review), and the integration/extensions (like dealing with mutexes/scheduling and dirty bits) will happen in the wrapper: 47b2b4677b55a01e65c509b4a240a2b22a2aaa3e.

Good point re minimizing contention/cost in AdmittedLogEntry path.

Also, the message enqueued in the piggybacker causes an allocation (conversion from the admitted array to a slice that needs to be placed in the proto message).

This bit is optional, the piggybacker can/should remember just the latest admitted vector in allocation-free manner, and convert it to message at the send time. The vector is generated from scratch in Admitted() call and is immutable (and does not allocate since it's a fixed-size slice).

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)


-- commits line 16 at r9:
nit: commit messages should be merged


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 42 at r8 (raw file):

There can be entries beyond stable in the admission queue, just like there are entries beyond stable that are in the write queue. For example, the last entry can be in the queue. It is right to say that all entries waiting for admission are in (admitted[pri], last] range.

I think we were talking about two different things, and are not disagreeing. My interpretation was that if admitted[pri] <= stable was an invariant we were imposing, then entries in (admitted[pri], stable] are waiting for admission (otherwise admitted would have advanced). The entries beyond stable may have been admitted. This handles both the case when (admitted[pri], stable] is an empty interval, and one where it isn't.

The revised comment phrasing is fine.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a small edit: AdmittedVector is now a struct tied with the leader term, for convenience.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)


-- commits line 16 at r9:

Previously, sumeerbhola wrote…

nit: commit messages should be merged

Done.


pkg/kv/kvserver/kvflowcontrol/rac2/log_tracker.go line 42 at r8 (raw file):

Previously, sumeerbhola wrote…

There can be entries beyond stable in the admission queue, just like there are entries beyond stable that are in the write queue. For example, the last entry can be in the queue. It is right to say that all entries waiting for admission are in (admitted[pri], last] range.

I think we were talking about two different things, and are not disagreeing. My interpretation was that if admitted[pri] <= stable was an invariant we were imposing, then entries in (admitted[pri], stable] are waiting for admission (otherwise admitted would have advanced). The entries beyond stable may have been admitted. This handles both the case when (admitted[pri], stable] is an empty interval, and one where it isn't.

The revised comment phrasing is fine.

Ack.

This commit introduces the data structure that observes all raft log
storage writes, syncs and logical admissions, and tracks the
stable/admitted log state correspondingly.

Log writes are ordered by leader term, then by index. Log writes have no
gaps, except when a snapshot clears the log. All writes/syncs must be
registered. Admissions can be registered selectively, for entries that
are subject to RACv2.

Epic: CRDB-37515
Release note: none
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeerbhola I ported waiting_for_admission_state datadriven tests to here, and audited that the behaviour is the same.

There was a bug in waitingForAdmissionState in this command: it should have removed the term 6 entry in response to a term 7 admission.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 3, 2024

bors r=sumeerbhola

@craig craig bot merged commit a1cfb1b into cockroachdb:master Sep 3, 2024
22 of 23 checks passed
@pav-kv pav-kv deleted the log-storage-tracker branch September 3, 2024 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants