Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: add tracing to raft #131850

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

andrewbaptist
Copy link
Collaborator

@andrewbaptist andrewbaptist commented Oct 3, 2024

This commit adds tracing to the various places within raft where state transitions happen. When a message is first proposed, it matches the context of the original message to the raft entry and registers it for tracing. Then it adds to the trace at the key points during the transitions and finally stops tracing once a MsgStorageApplyResp is received past the entry.

Fixes: #104035

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist
Copy link
Collaborator Author

A sample trace for a RF3 request:

     4.035ms      0.031ms                                    event:kv/kvserver/replica_proposal.go:1003 [n2,s2,r70/2:/{Table/104-Max}] need consensus on write batch with op count=1
     4.071ms      0.036ms                                    event:kv/kvserver/replica_raft.go:126 [n2,s2,r70/2:/{Table/104-Max}] evaluated request
     4.096ms      0.025ms                                    event:kv/kvserver/replica_raft.go:171 [n2,s2,r70/2:/{Table/104-Max}] proposing command to write 1 new keys, 1 new values, 0 new intents, write batch size=46 bytes
     4.123ms      0.027ms                                    event:kv/kvserver/replica_raft.go:282 [n2,s2,r70/2:/{Table/104-Max}] acquiring proposal quota (182 bytes)
     4.159ms      0.036ms                                    event:kv/kvserver/replica_raft.go:459 [n2,s2,r70/2:/{Table/104-Max}] submitting proposal to proposal buffer
     4.202ms      0.043ms                                    event:kv/kvserver/replica_proposal_buf.go:574 [n2,s2,r70/2:/{Table/104-Max}] flushing proposal to Raft
     4.307ms      0.105ms                                    event:raft/raft.go:1336 [n2,s2,r70/2:/{Table/104-Max}] 2->None MsgProp Term:0 Log:0/0
     4.360ms      0.053ms                                    event:raft/rawnode.go:445 [n2,s2,r70/2:/{Table/104-Max}] 2->1 MsgApp Term:8 Log:8/21
     4.374ms      0.014ms                                    event:raft/rawnode.go:445 [n2,s2,r70/2:/{Table/104-Max}] 2->3 MsgApp Term:8 Log:8/21
     4.396ms      0.022ms                                    event:raft/rawnode.go:445 [n2,s2,r70/2:/{Table/104-Max}] 2->AppendThread MsgStorageAppend Term:0 Log:8/22
    14.481ms     10.085ms                                    event:raft/rawnode.go:445 [n2,s2,r70/2:/{Table/104-Max}] 2->ApplyThread MsgStorageApply Term:0 Log:0/0
    14.520ms      0.039ms                                        === operation:local proposal _unfinished:1 _verbose:1 node:2 store:2 range:70/2:/{Table/104-Max} raft:
    14.555ms      0.035ms                                        event:kv/kvserver/replica_application_cmd.go:143 [n2,s2,r70/2:/{Table/104-Max},raft] ack-ing replication success to the client; application will continue async w.r.t. the client
    14.589ms      0.034ms                                        event:kv/kvserver/app_batch.go:116 [n2,s2,r70/2:/{Table/104-Max},raft] applying command
    14.668ms     11.268ms                                event:server/node.go:1690 [n2] node sending response

@andrewbaptist andrewbaptist force-pushed the 2024-10-02-raft-encoding branch 2 times, most recently from 74fc59c to 79a0dd5 Compare October 4, 2024 20:57
@andrewbaptist
Copy link
Collaborator Author

@pav-kv I attempted to incorporate all your changes, but hit a snag... After 2ad2bee, we now create tracing spans for all tasks launched from RunAsyncTask. I think this is the wrong thing to do, but it is unclear how to work around this.

We are currently logging all the "right" things, and the tracing spans mostly look good on both the client and the server, but there is way too much logging because of this.

I'm going to take a break now but will think again on Monday how to better work around this.

@andrewbaptist andrewbaptist force-pushed the 2024-10-02-raft-encoding branch 6 times, most recently from 7b4597f to 92dd0bd Compare October 8, 2024 19:55
@andrewbaptist andrewbaptist changed the title 2024 10 02 raft encoding raft: add tracing to raft Oct 8, 2024
@andrewbaptist
Copy link
Collaborator Author

andrewbaptist commented Oct 8, 2024

Outstanding tasks:

  • Handle the upgrade case due to the encoding change
  • Don't persist the encoded bit
  • Split the PRs into more digestible pieces
  • Bench tests

Copy link
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look. Will review again tomorrow.

pkg/kv/kvserver/kvserverpb/raft.proto Outdated Show resolved Hide resolved
pkg/kv/kvserver/raft.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/replica.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/replica_proposal_buf.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/replica_proposal_buf.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we start with a simpler thing?

  • No raftpb.Entry encoding change. Extend RaftMessageRequest instead.
  • As we discussed, trace context integration right now bears little advantage because the trace is not collected/routed anyway. So we can start with only logging, and no per-entry context.
  • The Map approach can be reduced to either tracing only one entry index (as in kvsever: hacky raft proposal tracer #131863). Or we could also say that there can be only one TraceID at a time, and, instead of a map, have a pair of (low, high) log mark scoped to this trace ID (the high mark would be bumped any time we see a newer entry being traced under the same trace ID), and intersect it with all messages until the trace ID is unregistered (e.g. when we crossed the high mark).

The Map in RaftTracer seems hard to use because it requires point deletions. Whereas we need "queue" semantics, to remove a prefix.

@andrewbaptist andrewbaptist force-pushed the 2024-10-02-raft-encoding branch 4 times, most recently from 84dbc3f to 4b1157b Compare October 10, 2024 23:03
@andrewbaptist
Copy link
Collaborator Author

I cleaned up the PR and removed all the raft encoding changes. I left the tracing in place because it seems to be useful in a number of cases. I'm not 100% sure, but I think there is still a problem with the way I'm using tracing that is leaking.

@pav-kv
Copy link
Collaborator

pav-kv commented Oct 10, 2024

@andrewbaptist Thanks, appreciate the commits separation. I'll review tomorrow.

@andrewbaptist andrewbaptist force-pushed the 2024-10-02-raft-encoding branch 7 times, most recently from 1c0dcb5 to 69bd881 Compare October 13, 2024 19:49
@andrewbaptist
Copy link
Collaborator Author

@pav-kv This should be ready for review now. It might be worth adding a flag to diable this if we decide it is too expensive, and the tests need a bit more work, but other than that, it is good. I've added some guards to prevent too many outstanding traces, and they make a decent compromise between losing too much information and impacting performance.

The new PR for changing the ProposalBuffer context is surprising that it didn't show up before, but it happened in one extended test under stress. I think it is a good change regardless.

I'm hoping to at a minimum get the protocol changes in before the branch cut, so if you want I can split that out into a separate PR.

pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
// hasn't changed. In the case of an ABA update, it does not break the
// invariant since some other trace was registered and deregistered, but
// there is still a slot available.
return r.numRegistered.CompareAndSwap(numRegistered, numRegistered+1)
Copy link
Collaborator

@pav-kv pav-kv Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is troublesome for a couple of reasons:

  1. Having a global read-modify-write variable like this numRegistered is an anti-pattern / not cache-efficient etc.

Some notes on this here:

  1. If someone outpaces us and increments the counter (yet it is still below the limit), we will skip this registering unnecessarily.

I don't think we should try super-hard to make this a hard limit. It will be best effort anyway because we can't guarantee consistency across 100k ranges. If there are a few requests racing, it's fine if we slightly overshoot (we can't overshoot by more than the number of raft workers, I think). It's fine if the limit is exceeded temporarily when the limit setting goes down or to 0.

An atomic increment is cheaper, so we should prefer it to CAS. Where possible, we should prefer it to not be a global variable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was OK with missing a registration in a rare case of a race (since it can only happen if there are multiple in-progress attempts and at least one will win). I wasn't too worried about the cost here, as if we are already here we almost always will succeed on this CAS and are going to enable expensive tracing. The checks above will remove the common cases.

I understand the concern about the global variable, but I'm not sure of a good alternative for the 100K range case. Even having 1 trace per range will be too many in that case, and for a KV like workload it is likely to be in that situation. Maybe we can chat at the standup today.

pkg/kv/kvserver/rafttrace/rafttrace.go Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
func (r *RaftTracer) MaybeTrace(m raftpb.Message) []kvserverpb.TracedEntry {
// NB: This check is an optimization to handle the common case where there
// are no registered traces on the store.
if r.numRegistered.Load() == 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we should be checking a local variable here, instead of the global count. Better to avoid global synchronization.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is a way to do this with a local variable. We could store a "replica level count" - but in practice this probably isn't any better. I can look at the performance of this, but I don't expect this will show up since it is only one extra main memory access per message.

pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
@andrewbaptist
Copy link
Collaborator Author

@pav-kv - Some thoughts about the PR - I made a few updates last night based on your review.

  • It would be good to at least get the protobuf changes in place before the branch cut. I debated adding a Term parameter to the protobuf, even if we don't use it for tracing. It might help if we later decide to use it. Thoughts?
  • I plan to add unit tests to RaftTracing directly rather than a system level test. Especially since the registration logic is a little more complex than before.
  • The current approach disables the tracing by default. I'm going to run all the perturbation tests this morning with it enabled (plus kvserver: store.Enqueue shouldn't start verbose tracing #132713) and add some traces and real data to this issue.

Any other thoughts on it?

@andrewbaptist andrewbaptist force-pushed the 2024-10-02-raft-encoding branch 3 times, most recently from d21fd7f to 20c5855 Compare October 16, 2024 16:54
@andrewbaptist andrewbaptist requested review from a team as code owners October 16, 2024 16:54
@andrewbaptist andrewbaptist requested review from xinhaoz, arjunmahishi and aa-joshi and removed request for a team October 16, 2024 16:54
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close. Thanks for all the rounds on this. I think everything is addressed, and I'm going to run another test on my test cluster to look at the logs.

func (r *RaftTracer) MaybeTrace(m raftpb.Message) []kvserverpb.TracedEntry {
// NB: This check is an optimization to handle the common case where there
// are no registered traces on the store.
if r.numRegistered.Load() == 0 {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is a way to do this with a local variable. We could store a "replica level count" - but in practice this probably isn't any better. I can look at the performance of this, but I don't expect this will show up since it is only one extra main memory access per message.

pkg/kv/kvserver/rafttrace/rafttrace.go Show resolved Hide resolved
pkg/kv/kvserver/replica_proposal_buf.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rafttrace/rafttrace.go Outdated Show resolved Hide resolved
@andrewbaptist
Copy link
Collaborator Author

A trace with this enabled looks like this:
image
trace.txt

Searching for the id can be done with a command like this:

baptist_cockroachlabs_com@gceworker-abaptist:~/go/src/github.com/cockroachdb/cockroach$ roachprod ssh $CLUSTER "grep i27814158/e928.f674 logs/cockroach.log"
baptist-test:[1 2 3 4]: grep i27814158/e928.f674 lo... 4/4
   1: 	<err> COMMAND_PROBLEM: exit status 1

   2: 	<ok>
	I241016 19:04:23.721573 308 kv/kvserver/replica_proposal_buf.go:1185 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138708  registering local trace i27814158/e928.f674
	I241016 19:04:23.721682 308 kv/kvserver/replica_raft.go:1998 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138709  4->1 MsgApp Term:13 Log:13/27814157 Range:27814158-27814158
	I241016 19:04:23.721741 308 kv/kvserver/replica_raft.go:1998 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138710  4->2 MsgApp Term:13 Log:13/27814157 Range:27814158-27814158
	I241016 19:04:23.721784 308 kv/kvserver/replica_raft.go:1192 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138711  4->AppendThread MsgStorageAppend Term:0 Log:13/27814158 Range:27814158-27814158
	I241016 19:04:23.722126 319 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138712  4->4 MsgAppResp Term:13 Index:27814158
	I241016 19:04:23.722182 319 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138713  AppendThread->4 MsgStorageAppendResp Log:13/27814158
	I241016 19:04:23.724276 313 kv/kvserver/replica_raft.go:619 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138714  1->4 MsgAppResp Term:13 Index:27814158
	I241016 19:04:23.724415 313 kv/kvserver/replica_raft.go:1224 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138715  4->ApplyThread MsgStorageApply Term:0 Log:0/0 Range:27814158-27814158
	I241016 19:04:23.724554 313 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138716  ApplyThread->4 MsgStorageApplyResp Term:13 Index:27814158
	I241016 19:04:23.724573 313 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n2,s2,r150/4:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 138717  unregistered log index 27814158 from tracing

   3: 	<ok>
	I241016 19:04:23.723070 310 kv/kvserver/replica_raft.go:617 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144689  registering remote trace i27814158/e928.f674
	I241016 19:04:23.723101 310 kv/kvserver/replica_raft.go:619 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144690  4->2 MsgApp Term:13 Log:13/27814157 Range:27814158-27814158
	I241016 19:04:23.723128 310 kv/kvserver/replica_raft.go:1192 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144691  2->AppendThread MsgStorageAppend Term:0 Log:13/27814158 Range:27814158-27814158
	I241016 19:04:23.723329 346 kv/kvserver/replica_raft.go:1998 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144692  2->4 MsgAppResp Term:13 Index:27814158
	I241016 19:04:23.723388 296 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144693  AppendThread->2 MsgStorageAppendResp Log:13/27814158
	I241016 19:04:23.725028 307 kv/kvserver/replica_raft.go:1224 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144694  2->ApplyThread MsgStorageApply Term:0 Log:0/0 Range:27814158-27814158
	I241016 19:04:23.725125 307 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144695  ApplyThread->2 MsgStorageApplyResp Term:13 Index:27814158
	I241016 19:04:23.725142 307 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n3,s3,r150/2:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 144696  unregistered log index 27814158 from tracing

   4: 	<ok>
	I241016 19:04:23.722918 306 kv/kvserver/replica_raft.go:617 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111200  registering remote trace i27814158/e928.f674
	I241016 19:04:23.722955 306 kv/kvserver/replica_raft.go:619 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111201  4->1 MsgApp Term:13 Log:13/27814157 Range:27814158-27814158
	I241016 19:04:23.722981 306 kv/kvserver/replica_raft.go:1192 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111202  1->AppendThread MsgStorageAppend Term:0 Log:13/27814158 Range:27814158-27814158
	I241016 19:04:23.723258 331 kv/kvserver/replica_raft.go:1998 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111203  1->4 MsgAppResp Term:13 Index:27814158
	I241016 19:04:23.723304 311 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111204  AppendThread->1 MsgStorageAppendResp Log:13/27814158
	I241016 19:04:23.724763 309 kv/kvserver/replica_raft.go:1224 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111205  1->ApplyThread MsgStorageApply Term:0 Log:0/0 Range:27814158-27814158
	I241016 19:04:23.724854 309 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111206  ApplyThread->1 MsgStorageApplyResp Term:13 Index:27814158
	I241016 19:04:23.724864 309 kv/kvserver/replica_raft.go:1974 ⋮ [T1,Vsystem,n4,s4,r150/1:‹/Table/106/1/{-91320…-1}›,id=i27814158/e928.f674] 111207  unregistered log index 27814158 from tracing

@andrewbaptist andrewbaptist requested a review from a team as a code owner October 16, 2024 21:13
@andrewbaptist andrewbaptist requested review from herkolategan and srosenberg and removed request for a team October 16, 2024 21:13
Previously the proposal data context could be updated and read outside
of any locking. This normally didn't cause any problems as it was
infrequenty read and only updated once, however it was incorrect. It
needs to be protected by an atomic reference.

Epic: none

Release note: None
Adds tracing fields to allow tracing a Raft message over the wire.

Fixes: cockroachdb#104035

Release note: None
Previously describeTarget was only accessible from within the raft
package. Making it public helps with debuggability of Raft Entries
outside of Raft.

Fixes: cockroachdb#104035

Release note: None
The trace and span ids can be used for out of band tracing. We don't
want those to be redacted.

Epic: none

Release note: None
The raft types are enums that can't contain unsafe data. Making them
SafeValue allows using them more easily for logging without a
risk of them being redacted in important lines.

Epic: none

Release note: None
This change adds a tracing library for raft messages. Specifically it
allows registering an entry either directly or due to a remote raft
message and then possibly tracing the message at various key points.
When the message is finally unregistered, it will dump the trace to the
log. Additionally, on a local message, it will add the trace to the
context that is passed in.

Fixes: cockroachdb#104035

Release note: None
This commit adds tracing to the various places within raft where state
transitions happen. When a message is first proposed, it matches the
context of the original message to the raft entry and registers it for
tracing. Then it adds to the trace at the key points during the
transitions and finally stops tracing once a `MsgStorageApplyResp` is
received past the entry.

Fixes: cockroachdb#104035

Release note: None
Since raft tracing is not enabled by default, the perturbation tests
enable it. This can be removed if we later make raft tracing enabled.

Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: add tracing through Raft
3 participants