-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: add tracing to raft #131850
base: master
Are you sure you want to change the base?
raft: add tracing to raft #131850
Conversation
A sample trace for a RF3 request:
|
74fc59c
to
79a0dd5
Compare
@pav-kv I attempted to incorporate all your changes, but hit a snag... After 2ad2bee, we now create tracing spans for all tasks launched from We are currently logging all the "right" things, and the tracing spans mostly look good on both the client and the server, but there is way too much logging because of this. I'm going to take a break now but will think again on Monday how to better work around this. |
7b4597f
to
92dd0bd
Compare
Outstanding tasks:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look. Will review again tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we start with a simpler thing?
- No
raftpb.Entry
encoding change. ExtendRaftMessageRequest
instead. - As we discussed, trace context integration right now bears little advantage because the trace is not collected/routed anyway. So we can start with only logging, and no per-entry context.
- The
Map
approach can be reduced to either tracing only one entry index (as in kvsever: hacky raft proposal tracer #131863). Or we could also say that there can be only oneTraceID
at a time, and, instead of a map, have a pair of(low, high)
log mark scoped to this trace ID (thehigh
mark would be bumped any time we see a newer entry being traced under the same trace ID), and intersect it with all messages until the trace ID is unregistered (e.g. when we crossed thehigh
mark).
The Map
in RaftTracer
seems hard to use because it requires point deletions. Whereas we need "queue" semantics, to remove a prefix.
84dbc3f
to
4b1157b
Compare
I cleaned up the PR and removed all the raft encoding changes. I left the tracing in place because it seems to be useful in a number of cases. I'm not 100% sure, but I think there is still a problem with the way I'm using tracing that is leaking. |
@andrewbaptist Thanks, appreciate the commits separation. I'll review tomorrow. |
1c0dcb5
to
69bd881
Compare
@pav-kv This should be ready for review now. It might be worth adding a flag to diable this if we decide it is too expensive, and the tests need a bit more work, but other than that, it is good. I've added some guards to prevent too many outstanding traces, and they make a decent compromise between losing too much information and impacting performance. The new PR for changing the ProposalBuffer context is surprising that it didn't show up before, but it happened in one extended test under stress. I think it is a good change regardless. I'm hoping to at a minimum get the protocol changes in before the branch cut, so if you want I can split that out into a separate PR. |
// hasn't changed. In the case of an ABA update, it does not break the | ||
// invariant since some other trace was registered and deregistered, but | ||
// there is still a slot available. | ||
return r.numRegistered.CompareAndSwap(numRegistered, numRegistered+1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is troublesome for a couple of reasons:
- Having a global read-modify-write variable like this
numRegistered
is an anti-pattern / not cache-efficient etc.
Some notes on this here:
- https://www.1024cores.net/home/lock-free-algorithms/scalability-prerequisites
- https://www.1024cores.net/home/lock-free-algorithms/false-sharing---false
- If someone outpaces us and increments the counter (yet it is still below the limit), we will skip this registering unnecessarily.
I don't think we should try super-hard to make this a hard limit. It will be best effort anyway because we can't guarantee consistency across 100k ranges. If there are a few requests racing, it's fine if we slightly overshoot (we can't overshoot by more than the number of raft workers, I think). It's fine if the limit is exceeded temporarily when the limit setting goes down or to 0.
An atomic increment is cheaper, so we should prefer it to CAS. Where possible, we should prefer it to not be a global variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was OK with missing a registration in a rare case of a race (since it can only happen if there are multiple in-progress attempts and at least one will win). I wasn't too worried about the cost here, as if we are already here we almost always will succeed on this CAS and are going to enable expensive tracing. The checks above will remove the common cases.
I understand the concern about the global variable, but I'm not sure of a good alternative for the 100K range case. Even having 1 trace per range will be too many in that case, and for a KV like workload it is likely to be in that situation. Maybe we can chat at the standup today.
func (r *RaftTracer) MaybeTrace(m raftpb.Message) []kvserverpb.TracedEntry { | ||
// NB: This check is an optimization to handle the common case where there | ||
// are no registered traces on the store. | ||
if r.numRegistered.Load() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we should be checking a local variable here, instead of the global count. Better to avoid global synchronization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is a way to do this with a local variable. We could store a "replica level count" - but in practice this probably isn't any better. I can look at the performance of this, but I don't expect this will show up since it is only one extra main memory access per message.
@pav-kv - Some thoughts about the PR - I made a few updates last night based on your review.
Any other thoughts on it? |
d21fd7f
to
20c5855
Compare
20c5855
to
144722b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting close. Thanks for all the rounds on this. I think everything is addressed, and I'm going to run another test on my test cluster to look at the logs.
func (r *RaftTracer) MaybeTrace(m raftpb.Message) []kvserverpb.TracedEntry { | ||
// NB: This check is an optimization to handle the common case where there | ||
// are no registered traces on the store. | ||
if r.numRegistered.Load() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is a way to do this with a local variable. We could store a "replica level count" - but in practice this probably isn't any better. I can look at the performance of this, but I don't expect this will show up since it is only one extra main memory access per message.
144722b
to
cc7b257
Compare
A trace with this enabled looks like this: Searching for the id can be done with a command like this:
|
Previously the proposal data context could be updated and read outside of any locking. This normally didn't cause any problems as it was infrequenty read and only updated once, however it was incorrect. It needs to be protected by an atomic reference. Epic: none Release note: None
Adds tracing fields to allow tracing a Raft message over the wire. Fixes: cockroachdb#104035 Release note: None
Previously describeTarget was only accessible from within the raft package. Making it public helps with debuggability of Raft Entries outside of Raft. Fixes: cockroachdb#104035 Release note: None
The trace and span ids can be used for out of band tracing. We don't want those to be redacted. Epic: none Release note: None
The raft types are enums that can't contain unsafe data. Making them SafeValue allows using them more easily for logging without a risk of them being redacted in important lines. Epic: none Release note: None
This change adds a tracing library for raft messages. Specifically it allows registering an entry either directly or due to a remote raft message and then possibly tracing the message at various key points. When the message is finally unregistered, it will dump the trace to the log. Additionally, on a local message, it will add the trace to the context that is passed in. Fixes: cockroachdb#104035 Release note: None
This commit adds tracing to the various places within raft where state transitions happen. When a message is first proposed, it matches the context of the original message to the raft entry and registers it for tracing. Then it adds to the trace at the key points during the transitions and finally stops tracing once a `MsgStorageApplyResp` is received past the entry. Fixes: cockroachdb#104035 Release note: None
Since raft tracing is not enabled by default, the perturbation tests enable it. This can be removed if we later make raft tracing enabled. Epic: none Release note: None
4842eff
to
0c7e4fa
Compare
This commit adds tracing to the various places within raft where state transitions happen. When a message is first proposed, it matches the context of the original message to the raft entry and registers it for tracing. Then it adds to the trace at the key points during the transitions and finally stops tracing once a
MsgStorageApplyResp
is received past the entry.Fixes: #104035
Release note: None