Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: revamp async log storage protocol #124440

Open
7 of 9 tasks
pav-kv opened this issue May 20, 2024 · 1 comment
Open
7 of 9 tasks

raft: revamp async log storage protocol #124440

pav-kv opened this issue May 20, 2024 · 1 comment
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@pav-kv
Copy link
Collaborator

pav-kv commented May 20, 2024

The async log storage protocol (etcd-io/raft#8) can be improved in a few aspects:

  • The ABA problem prevention can be achieved by a more straightforward algorithm when we have the "last accepted term" tracking [raft: advance commit index safely #122690].
  • The need for preventing the ABA problem can be eliminated in the first place, if we delegate the unstable log to storage engine synchronously, and only assume flush/fsync is asynchronous [raft: clean unstable log early #122438].
  • We can remove various fields of raftpb.Message used only by this local protocol, to unclutter Message and make it smaller.

To support the new log write protocol, we are missing the notion of "leader term" - the term of the leader with whom our log is consistent. By raft invariants, all writes to, and acknowledgements from log storage are ordered by (leader term, index). Today, we approximate the "leader term" by using the last entry ID, but it complicates the protocol. It is also hard to reuse for Replication Admission Control because it requires remembering the unacknowledged entry IDs, but these are cleared from the unstable data structure as soon as the entries are written.

The plan is to support the "leader term" tracking:


Epic: CRDB-37515
Jira issue: CRDB-38897

@pav-kv pav-kv added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels May 20, 2024
Copy link

blathers-crl bot commented May 20, 2024

cc @cockroachdb/replication

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@pav-kv pav-kv self-assigned this Jul 1, 2024
craig bot pushed a commit that referenced this issue Jul 3, 2024
126474: raft: pass logSlice from maybeAppend to append r=pav-kv a=pav-kv

This PR propagates the appended `logSlice` down the stack from `maybeAppend` to `append`. All appends should use this type for readability and safety. Ultimately, the `logSlice` will be passed from the append method to `unstable.truncateAndAppend` which will implement a few more safety checks and use the `logSlice.term` to track the "last accepted term".

This change helped fixing a bunch of bugs in tests that initialized `RawNode` or `raftLog` incorrectly.

Part of #124440

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Jul 5, 2024
126475: raft: introduce logMark and snapshot type r=nvanbenschoten a=pav-kv

This PR introduces the type-safe `logMark` and `snapshot` types which carry information about the leader term in whose context / coordinate system the log indices are considered. We convert the snapshot `restore` and `commitTo` methods to use the new types.

The follow-up work will utilize the `term` fields for tracking the "[last accepted term](https://github.com/cockroachdb/cockroach/blob/4ee0b103cd3c3adeedf3a478ead5db78ec52e68e/pkg/raft/raft.go#L348-L363)" in the `unstable` structure, which will then feed into the log write protocol, and be used for less conservative commit index advancement decisions.

Part of #124440

126757: crosscluster/logical: add synthetic failure injection r=dt a=dt

First three commits are #126755.

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: David Taylor <[email protected]>
craig bot pushed a commit that referenced this issue Jul 10, 2024
126783: raft: make unstable a logSlice r=nvanbenschoten a=pav-kv

This PR refactors `raft.unstable` data structure to be a type-safe `logSlice`, and equips it with safety checks protecting the log append stack from appends that do not comply with the expected raft behaviours.

One immediate benefit of doing so is that `unstable` now knows the `lastEntryID` at all times, whereas previously it would fall back to fetching it from `Storage` interface when empty.

The second benefit is introducing the `logSlice.term` field, which carries strong semantics: the log is consistent with the leader log at this term. This allows implementing safety checks in append methods, preventing log truncations when not expected. Log truncations in raft can only happen when accepting a higher-term log append.

The `logSlice.term` field replaces `raft.accTerm`.

Part of #124440

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Jul 30, 2024
127424: raft: use last accepted term for log writes r=nvanbenschoten a=pav-kv

This PR modifies the log write protocol to take advantage of the last accepted term tracked in `raftLog`/`unstable`.

Log entry writes are ordered by the last accepted term (`unstable.term`). Entries in the log can be overwritten only if this term is bumped (when we accept a log write from a higher term leader). If `unstable.term` is unchanged when we receive an acknowledgement from log storage, then the acknowledged entries don't have any overwrites in the write queue, and it is safe to remove them from `unstable`.

Part of #124440

Co-authored-by: Pavel Kalinnikov <[email protected]>
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
craig bot pushed a commit that referenced this issue Aug 28, 2024
129790: raft,logstore: add helpers for MsgStorageAppend r=pav-kv a=pav-kv

This PR equips `MsgStorageAppend(Resp)` protocol messages with helpers useful for extracting the `LogMark`. To be used by the code outside `raft` package depending on the `LogMark` ordering of raft storage writes.

Part of #124440

Co-authored-by: Pavel Kalinnikov <[email protected]>
@exalate-issue-sync exalate-issue-sync bot assigned pav-kv and unassigned pav-kv Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

1 participant